kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v6 00/92] VM introspection
@ 2019-08-09 15:59 Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem) Adalbert Lazăr
                   ` (94 more replies)
  0 siblings, 95 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

The KVM introspection subsystem provides a facility for applications running
on the host or in a separate VM, to control the execution of other VM-s
(pause, resume, shutdown), query the state of the vCPUs (GPRs, MSRs etc.),
alter the page access bits in the shadow page tables (only for the hardware
backed ones, eg. Intel's EPT) and receive notifications when events of
interest have taken place (shadow page table level faults, key MSR writes,
hypercalls etc.). Some notifications can be responded to with an action
(like preventing an MSR from being written), others are mere informative
(like breakpoint events which can be used for execution tracing).
With few exceptions, all events are optional. An application using this
subsystem will explicitly register for them.

The use case that gave way for the creation of this subsystem is to monitor
the guest OS and as such the ABI/API is highly influenced by how the guest
software (kernel, applications) sees the world. For example, some events
provide information specific for the host CPU architecture
(eg. MSR_IA32_SYSENTER_EIP) merely because its leveraged by guest software
to implement a critical feature (fast system calls).

At the moment, the target audience for KVMI are security software authors
that wish to perform forensics on newly discovered threats (exploits) or
to implement another layer of security like preventing a large set of
kernel rootkits simply by "locking" the kernel image in the shadow page
tables (ie. enforce .text r-x, .rodata rw- etc.). It's the latter case that
made KVMI a separate subsystem, even though many of these features are
available in the device manager (eg. QEMU). The ability to build a security
application that does not interfere (in terms of performance) with the
guest software asks for a specialized interface that is designed for minimum
overhead.

This patch series is based on 5.0-rc7,
commit de3ccd26fafc ("KVM: MMU: record maximum physical address width in kvm_mmu_extended_role").

The previous RFC (v5) can be read here:

	https://www.spinics.net/lists/kvm/msg179441.html

Thanks to Samuel Laurén and Mathieu Tarral, the previous version has
been integrated and tested with libVMI.

	KVM-VMI: https://github.com/KVM-VMI/kvm-vmi
	Kernel:  https://github.com/KVM-VMI/kvm/tree/kvmi
	QEMU:    https://github.com/KVM-VMI/qemu/tree/kvmi
	         (not all patches, but enough to work)
	libVMI:  https://github.com/KVM-VMI/libvmi/tree/kvmi

Thanks to Weijiang Yang, the previous version has been integrated and
tested with the SPP patch series.

	https://github.com/adlazar/kvm/tree/kvmi-v5-spp

Quickstart:

	https://github.com/adlazar/kvm/blob/kvmi-v5-spp/tools/kvm/kvmi/README

I hope this version will be merged into KVM-VMI project too.

Patches 1-20: unroll a big part of the KVM introspection subsystem,
sent in one patch in the previous versions.

Patches 21-24: extend the current page tracking code.

Patches 25-33: make use of page tracking to support the
KVMI_SET_PAGE_ACCESS introspection command and the KVMI_EVENT_PF event
(on EPT violations caused by the tracking settings).

Patches 34-42: include the SPP feature (Enable Sub-page
Write Protection Support), already sent to KVM list:

	https://lore.kernel.org/lkml/20190717133751.12910-1-weijiang.yang@intel.com/

Patches 43-46: add the commands needed to use SPP.

Patches 47-63: unroll almost all the rest of the introspection code.

Patches 64-67: add single-stepping, mostly as a way to overcome the
unimplemented instructions, but also as a feature for the introspection
tool.

Patches 68-70: cover more cases related to EPT violations.

Patches 71-73: add the remote mapping feature, allowing the introspection
tool to map into its address space a page from guest memory.

Patches 74: add a fix to hypercall emulation.

Patches 75-76: disable some features/optimizations when the introspection
code is present.

Patches 77-78: add trace functions for the introspection code and change
some related to interrupts/exceptions injection.

Patches 79-92: new instruction for the x86 emulator, including cmpxchg
fixes.

To do:

  - run stress tests with SPP enabled
  - add introspection support for alternate EPT views (almost done)
  - add introspection support for virtualization exceptions #VE (almost done)
  - add KVM tests

Changes since v5:

  - small changes to the protocol, but enough to make it backward
    incompatible with v5
  - fix CR3 interception (thanks to Mathieu Tarral for reporting the issue)
  - add SPP support (thanks to Weijiang Yang)
  - add two more ioctls in order to let userspace/QEMU control
    the commands/events allowed for introspection
  - extend the breakpoint event with the instruction length
  - complete the descriptor table registers interception
  - add new instructions to the x86 emulator
  - move arch dependent code to arch/x86/kvm/
  - lots of fixes, especially on page tracking, single-stepping, exception
    injection and remote memory mapping
  - the guests are much more stable (on pair with our introspection
    products using Xen)
  - speed improvements (the penalty on web browsing actions is 50% lower,
    at least)


Adalbert Lazăr (25):
  kvm: introspection: add basic ioctls (hook/unhook)
  kvm: introspection: add permission access ioctls
  kvm: introspection: add the read/dispatch message function
  kvm: introspection: add KVMI_GET_VERSION
  kvm: introspection: add KVMI_CONTROL_CMD_RESPONSE
  kvm: introspection: honor the reply option when handling the
    KVMI_GET_VERSION command
  kvm: introspection: add KVMI_CHECK_COMMAND and KVMI_CHECK_EVENT
  kvm: introspection: add KVMI_CONTROL_VM_EVENTS
  kvm: introspection: add a jobs list to every introspected vCPU
  kvm: introspection: make the vCPU wait even when its jobs list is
    empty
  kvm: introspection: add KVMI_EVENT_UNHOOK
  kvm: x86: intercept the write access on sidt and other emulated
    instructions
  kvm: introspection: add KVMI_CONTROL_SPP
  kvm: introspection: extend the internal database of tracked pages with
    write_bitmap info
  kvm: introspection: add KVMI_GET_PAGE_WRITE_BITMAP
  kvm: introspection: add KVMI_SET_PAGE_WRITE_BITMAP
  kvm: add kvm_vcpu_kick_and_wait()
  kvm: introspection: add KVMI_PAUSE_VCPU and KVMI_EVENT_PAUSE_VCPU
  kvm: x86: add kvm_arch_vcpu_set_guest_debug()
  kvm: introspection: add custom input when single-stepping a vCPU
  kvm: x86: keep the page protected if tracked by the introspection tool
  kvm: x86: filter out access rights only when tracked by the
    introspection tool
  kvm: x86: disable gpa_available optimization in
    emulator_read_write_onepage()
  kvm: x86: disable EPT A/D bits if introspection is present
  kvm: introspection: add trace functions

Marian Rotariu (1):
  kvm: introspection: add KVMI_GET_CPUID

Mihai Donțu (47):
  kvm: introduce KVMI (VM introspection subsystem)
  kvm: introspection: add KVMI_GET_GUEST_INFO
  kvm: introspection: handle introspection commands before returning to
    guest
  kvm: introspection: handle vCPU related introspection commands
  kvm: introspection: handle events and event replies
  kvm: introspection: introduce event actions
  kvm: introspection: add KVMI_GET_VCPU_INFO
  kvm: page track: add track_create_slot() callback
  kvm: x86: provide all page tracking hooks with the guest virtual
    address
  kvm: page track: add support for preread, prewrite and preexec
  kvm: x86: wire in the preread/prewrite/preexec page trackers
  kvm: x86: add kvm_mmu_nested_pagefault()
  kvm: introspection: use page track
  kvm: x86: consult the page tracking from kvm_mmu_get_page() and
    __direct_map()
  kvm: introspection: add KVMI_CONTROL_EVENTS
  kvm: x86: add kvm_spt_fault()
  kvm: introspection: add KVMI_EVENT_PF
  kvm: introspection: add KVMI_GET_PAGE_ACCESS
  kvm: introspection: add KVMI_SET_PAGE_ACCESS
  kvm: introspection: add KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL
  kvm: introspection: add KVMI_GET_REGISTERS
  kvm: introspection: add KVMI_SET_REGISTERS
  kvm: introspection: add KVMI_INJECT_EXCEPTION + KVMI_EVENT_TRAP
  kvm: introspection: add KVMI_CONTROL_CR and KVMI_EVENT_CR
  kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR
  kvm: introspection: add KVMI_GET_XSAVE
  kvm: introspection: add KVMI_GET_MTRR_TYPE
  kvm: introspection: add KVMI_EVENT_XSETBV
  kvm: introspection: add KVMI_EVENT_BREAKPOINT
  kvm: introspection: add KVMI_EVENT_HYPERCALL
  kvm: introspection: use single stepping on unimplemented instructions
  kvm: x86: emulate a guest page table walk on SPT violations due to A/D
    bit updates
  kvm: x86: do not unconditionally patch the hypercall instruction
    during emulation
  kvm: x86: emulate movsd xmm, m64
  kvm: x86: emulate movss xmm, m32
  kvm: x86: emulate movq xmm, m64
  kvm: x86: emulate movq r, xmm
  kvm: x86: emulate movd xmm, m32
  kvm: x86: enable the half part of movss, movsd, movups
  kvm: x86: emulate lfence
  kvm: x86: emulate xorpd xmm2/m128, xmm1
  kvm: x86: emulate xorps xmm/m128, xmm
  kvm: x86: emulate fst/fstp m64fp
  kvm: x86: make lock cmpxchg r, r/m atomic
  kvm: x86: emulate lock cmpxchg8b atomically
  kvm: x86: emulate lock cmpxchg16b m128
  kvm: x86: fallback to the single-step on multipage CMPXCHG emulation

Mircea Cîrjaliu (5):
  kvm: introspection: add vCPU related data
  kvm: introspection: add KVMI_EVENT_CREATE_VCPU
  mm: add support for remote mapping
  kvm: introspection: add memory map/unmap support on the guest side
  kvm: introspection: use remote mapping

Nicușor Cîțu (5):
  kvm: x86: block any attempt to disable MSR interception if tracked by
    introspection
  kvm: introspection: add KVMI_EVENT_DESCRIPTOR
  kvm: introspection: add single-stepping
  kvm: introspection: add KVMI_EVENT_SINGLESTEP
  kvm: x86: add tracepoints for interrupt and exception injections

Yang Weijiang (9):
  Documentation: Introduce EPT based Subpage Protection
  KVM: VMX: Add control flags for SPP enabling
  KVM: VMX: Implement functions for SPPT paging setup
  KVM: VMX: Introduce SPP access bitmap and operation functions
  KVM: VMX: Add init/set/get functions for SPP
  KVM: VMX: Introduce SPP user-space IOCTLs
  KVM: VMX: Handle SPP induced vmexit and page fault
  KVM: MMU: Enable Lazy mode SPPT setup
  KVM: MMU: Handle host memory remapping and reclaim

 Documentation/virtual/kvm/api.txt        |  104 ++
 Documentation/virtual/kvm/hypercalls.txt |   68 +-
 Documentation/virtual/kvm/kvmi.rst       | 1640 +++++++++++++++++
 Documentation/virtual/kvm/spp_kvm.txt    |  173 ++
 arch/x86/Kconfig                         |    9 +
 arch/x86/include/asm/cpufeatures.h       |    1 +
 arch/x86/include/asm/kvm_emulate.h       |    3 +-
 arch/x86/include/asm/kvm_host.h          |   52 +-
 arch/x86/include/asm/kvm_page_track.h    |   33 +-
 arch/x86/include/asm/kvmi_guest.h        |   10 +
 arch/x86/include/asm/kvmi_host.h         |   51 +
 arch/x86/include/asm/vmx.h               |   12 +
 arch/x86/include/uapi/asm/kvmi.h         |  124 ++
 arch/x86/include/uapi/asm/vmx.h          |    2 +
 arch/x86/kernel/Makefile                 |    1 +
 arch/x86/kernel/cpu/intel.c              |    4 +
 arch/x86/kernel/kvmi_mem_guest.c         |   26 +
 arch/x86/kvm/Kconfig                     |    7 +
 arch/x86/kvm/Makefile                    |    1 +
 arch/x86/kvm/emulate.c                   |  315 +++-
 arch/x86/kvm/kvmi.c                      | 1141 ++++++++++++
 arch/x86/kvm/mmu.c                       |  645 ++++++-
 arch/x86/kvm/mmu.h                       |    5 +
 arch/x86/kvm/page_track.c                |  147 +-
 arch/x86/kvm/svm.c                       |  168 +-
 arch/x86/kvm/trace.h                     |  118 +-
 arch/x86/kvm/vmx/capabilities.h          |    5 +
 arch/x86/kvm/vmx/vmx.c                   |  347 +++-
 arch/x86/kvm/vmx/vmx.h                   |    2 +
 arch/x86/kvm/x86.c                       |  601 ++++++-
 drivers/gpu/drm/i915/gvt/kvmgt.c         |    2 +-
 include/linux/kvm_host.h                 |   26 +
 include/linux/kvmi.h                     |   68 +
 include/linux/page-flags.h               |    9 +-
 include/linux/remote_mapping.h           |  167 ++
 include/linux/swait.h                    |   11 +
 include/trace/events/kvmi.h              |  680 +++++++
 include/uapi/linux/kvm.h                 |   34 +
 include/uapi/linux/kvm_para.h            |   10 +-
 include/uapi/linux/kvmi.h                |  286 +++
 include/uapi/linux/remote_mapping.h      |   18 +
 kernel/signal.c                          |    1 +
 mm/Kconfig                               |    8 +
 mm/Makefile                              |    1 +
 mm/memory-failure.c                      |   69 +-
 mm/migrate.c                             |    9 +-
 mm/remote_mapping.c                      | 1834 +++++++++++++++++++
 mm/rmap.c                                |   13 +-
 mm/vmscan.c                              |    3 +-
 virt/kvm/kvm_main.c                      |   70 +-
 virt/kvm/kvmi.c                          | 2054 ++++++++++++++++++++++
 virt/kvm/kvmi_int.h                      |  311 ++++
 virt/kvm/kvmi_mem.c                      |  324 ++++
 virt/kvm/kvmi_mem_guest.c                |  651 +++++++
 virt/kvm/kvmi_msg.c                      | 1236 +++++++++++++
 55 files changed, 13485 insertions(+), 225 deletions(-)
 create mode 100644 Documentation/virtual/kvm/kvmi.rst
 create mode 100644 Documentation/virtual/kvm/spp_kvm.txt
 create mode 100644 arch/x86/include/asm/kvmi_guest.h
 create mode 100644 arch/x86/include/asm/kvmi_host.h
 create mode 100644 arch/x86/include/uapi/asm/kvmi.h
 create mode 100644 arch/x86/kernel/kvmi_mem_guest.c
 create mode 100644 arch/x86/kvm/kvmi.c
 create mode 100644 include/linux/kvmi.h
 create mode 100644 include/linux/remote_mapping.h
 create mode 100644 include/trace/events/kvmi.h
 create mode 100644 include/uapi/linux/kvmi.h
 create mode 100644 include/uapi/linux/remote_mapping.h
 create mode 100644 mm/remote_mapping.c
 create mode 100644 virt/kvm/kvmi.c
 create mode 100644 virt/kvm/kvmi_int.h
 create mode 100644 virt/kvm/kvmi_mem.c
 create mode 100644 virt/kvm/kvmi_mem_guest.c
 create mode 100644 virt/kvm/kvmi_msg.c


^ permalink raw reply	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem)
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-12 20:20   ` Sean Christopherson
  2019-08-09 15:59 ` [RFC PATCH v6 02/92] kvm: introspection: add basic ioctls (hook/unhook) Adalbert Lazăr
                   ` (93 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Mircea Cîrjaliu

From: Mihai Donțu <mdontu@bitdefender.com>

Besides the pointer to the new structure, the patch adds to the kvm
structure a reference counter (the new object will be used by the thread
receiving introspection commands/events) and a completion variable
(to signal that the VM can be hooked by the introspection tool).

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 75 ++++++++++++++++++++++++++++++
 arch/x86/kvm/Kconfig               |  7 +++
 arch/x86/kvm/Makefile              |  1 +
 include/linux/kvm_host.h           |  4 ++
 include/linux/kvmi.h               | 23 +++++++++
 include/uapi/linux/kvmi.h          | 68 +++++++++++++++++++++++++++
 virt/kvm/kvm_main.c                | 10 +++-
 virt/kvm/kvmi.c                    | 64 +++++++++++++++++++++++++
 virt/kvm/kvmi_int.h                | 12 +++++
 9 files changed, 263 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/virtual/kvm/kvmi.rst
 create mode 100644 include/linux/kvmi.h
 create mode 100644 include/uapi/linux/kvmi.h
 create mode 100644 virt/kvm/kvmi.c
 create mode 100644 virt/kvm/kvmi_int.h

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
new file mode 100644
index 000000000000..d54caf8d974f
--- /dev/null
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -0,0 +1,75 @@
+=========================================================
+KVMI - The kernel virtual machine introspection subsystem
+=========================================================
+
+The KVM introspection subsystem provides a facility for applications running
+on the host or in a separate VM, to control the execution of other VM-s
+(pause, resume, shutdown), query the state of the vCPUs (GPRs, MSRs etc.),
+alter the page access bits in the shadow page tables (only for the hardware
+backed ones, eg. Intel's EPT) and receive notifications when events of
+interest have taken place (shadow page table level faults, key MSR writes,
+hypercalls etc.). Some notifications can be responded to with an action
+(like preventing an MSR from being written), others are mere informative
+(like breakpoint events which can be used for execution tracing).
+With few exceptions, all events are optional. An application using this
+subsystem will explicitly register for them.
+
+The use case that gave way for the creation of this subsystem is to monitor
+the guest OS and as such the ABI/API is highly influenced by how the guest
+software (kernel, applications) sees the world. For example, some events
+provide information specific for the host CPU architecture
+(eg. MSR_IA32_SYSENTER_EIP) merely because its leveraged by guest software
+to implement a critical feature (fast system calls).
+
+At the moment, the target audience for KVMI are security software authors
+that wish to perform forensics on newly discovered threats (exploits) or
+to implement another layer of security like preventing a large set of
+kernel rootkits simply by "locking" the kernel image in the shadow page
+tables (ie. enforce .text r-x, .rodata rw- etc.). It's the latter case that
+made KVMI a separate subsystem, even though many of these features are
+available in the device manager (eg. QEMU). The ability to build a security
+application that does not interfere (in terms of performance) with the
+guest software asks for a specialized interface that is designed for minimum
+overhead.
+
+API/ABI
+=======
+
+This chapter describes the VMI interface used to monitor and control local
+guests from a user application.
+
+Overview
+--------
+
+The interface is socket based, one connection for every VM. One end is in the
+host kernel while the other is held by the user application (introspection
+tool).
+
+The initial connection is established by an application running on the host
+(eg. QEMU) that connects to the introspection tool and after a handshake the
+socket is passed to the host kernel making all further communication take
+place between it and the introspection tool. The initiating party (QEMU) can
+close its end so that any potential exploits cannot take a hold of it.
+
+The socket protocol allows for commands and events to be multiplexed over
+the same connection. As such, it is possible for the introspection tool to
+receive an event while waiting for the result of a command. Also, it can
+send a command while the host kernel is waiting for a reply to an event.
+
+The kernel side of the socket communication is blocking and will wait for
+an answer from its peer indefinitely or until the guest is powered off
+(killed), restarted or the peer goes away, at which point it will wake
+up and properly cleanup as if the introspection subsystem has never been
+used on that guest. Obviously, whether the guest can really continue
+normal execution depends on whether the introspection tool has made any
+modifications that require an active KVMI channel.
+
+Memory access safety
+--------------------
+
+The KVMI API gives access to the entire guest physical address space but
+provides no information on which parts of it are system RAM and which are
+device-specific memory (DMA, emulated MMIO, reserved by a passthrough
+device etc.). It is up to the user to determine, using the guest operating
+system data structures, the areas that are safe to access (code, stack, heap
+etc.).
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 72fa955f4a15..f70a6a1b6814 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,13 @@ config KVM_MMU_AUDIT
 	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
 	 auditing of KVM MMU events at runtime.
 
+config KVM_INTROSPECTION
+	bool "VM Introspection"
+	depends on KVM && (KVM_INTEL || KVM_AMD)
+	help
+	 This option enables functions to control the execution of VM-s, query
+	 the state of the vCPU-s (GPR-s, MSR-s etc.).
+
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
 source "drivers/vhost/Kconfig"
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 31ecf7a76d5a..312597bd47c7 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -7,6 +7,7 @@ KVM := ../../../virt/kvm
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
 				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
+kvm-$(CONFIG_KVM_INTROSPECTION) += $(KVM)/kvmi.o
 
 kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c38cc5eb7e73..582b0187f5a4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -455,6 +455,10 @@ struct kvm {
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
+
+	struct completion kvmi_completed;
+	refcount_t kvmi_ref;
+	void *kvmi;
 };
 
 #define kvm_err(fmt, ...) \
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
new file mode 100644
index 000000000000..e36de3f9f3de
--- /dev/null
+++ b/include/linux/kvmi.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVMI_H__
+#define __KVMI_H__
+
+#define kvmi_is_present() IS_ENABLED(CONFIG_KVM_INTROSPECTION)
+
+#ifdef CONFIG_KVM_INTROSPECTION
+
+int kvmi_init(void);
+void kvmi_uninit(void);
+void kvmi_create_vm(struct kvm *kvm);
+void kvmi_destroy_vm(struct kvm *kvm);
+
+#else
+
+static inline int kvmi_init(void) { return 0; }
+static inline void kvmi_uninit(void) { }
+static inline void kvmi_create_vm(struct kvm *kvm) { }
+static inline void kvmi_destroy_vm(struct kvm *kvm) { }
+
+#endif /* CONFIG_KVM_INTROSPECTION */
+
+#endif
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
new file mode 100644
index 000000000000..dbf63ad0862f
--- /dev/null
+++ b/include/uapi/linux/kvmi.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI__LINUX_KVMI_H
+#define _UAPI__LINUX_KVMI_H
+
+/*
+ * KVMI structures and definitions
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+#define KVMI_VERSION 0x00000001
+
+enum {
+	KVMI_EVENT_REPLY           = 0,
+	KVMI_EVENT                 = 1,
+
+	KVMI_FIRST_COMMAND         = 2,
+
+	KVMI_GET_VERSION           = 2,
+	KVMI_CHECK_COMMAND         = 3,
+	KVMI_CHECK_EVENT           = 4,
+	KVMI_GET_GUEST_INFO        = 5,
+	KVMI_GET_VCPU_INFO         = 6,
+	KVMI_PAUSE_VCPU            = 7,
+	KVMI_CONTROL_VM_EVENTS     = 8,
+	KVMI_CONTROL_EVENTS        = 9,
+	KVMI_CONTROL_CR            = 10,
+	KVMI_CONTROL_MSR           = 11,
+	KVMI_CONTROL_VE            = 12,
+	KVMI_GET_REGISTERS         = 13,
+	KVMI_SET_REGISTERS         = 14,
+	KVMI_GET_CPUID             = 15,
+	KVMI_GET_XSAVE             = 16,
+	KVMI_READ_PHYSICAL         = 17,
+	KVMI_WRITE_PHYSICAL        = 18,
+	KVMI_INJECT_EXCEPTION      = 19,
+	KVMI_GET_PAGE_ACCESS       = 20,
+	KVMI_SET_PAGE_ACCESS       = 21,
+	KVMI_GET_MAP_TOKEN         = 22,
+	KVMI_GET_MTRR_TYPE         = 23,
+	KVMI_CONTROL_SPP           = 24,
+	KVMI_GET_PAGE_WRITE_BITMAP = 25,
+	KVMI_SET_PAGE_WRITE_BITMAP = 26,
+	KVMI_CONTROL_CMD_RESPONSE  = 27,
+
+	KVMI_NEXT_AVAILABLE_COMMAND,
+
+};
+
+enum {
+	KVMI_EVENT_UNHOOK      = 0,
+	KVMI_EVENT_CR	       = 1,
+	KVMI_EVENT_MSR	       = 2,
+	KVMI_EVENT_XSETBV      = 3,
+	KVMI_EVENT_BREAKPOINT  = 4,
+	KVMI_EVENT_HYPERCALL   = 5,
+	KVMI_EVENT_PF	       = 6,
+	KVMI_EVENT_TRAP	       = 7,
+	KVMI_EVENT_DESCRIPTOR  = 8,
+	KVMI_EVENT_CREATE_VCPU = 9,
+	KVMI_EVENT_PAUSE_VCPU  = 10,
+	KVMI_EVENT_SINGLESTEP  = 11,
+
+	KVMI_NUM_EVENTS
+};
+
+#endif /* _UAPI__LINUX_KVMI_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 585845203db8..90e432d225ab 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -51,6 +51,7 @@
 #include <linux/slab.h>
 #include <linux/sort.h>
 #include <linux/bsearch.h>
+#include <linux/kvmi.h>
 
 #include <asm/processor.h>
 #include <asm/io.h>
@@ -680,6 +681,8 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	if (r)
 		goto out_err;
 
+	kvmi_create_vm(kvm);
+
 	spin_lock(&kvm_lock);
 	list_add(&kvm->vm_list, &vm_list);
 	spin_unlock(&kvm_lock);
@@ -725,6 +728,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	int i;
 	struct mm_struct *mm = kvm->mm;
 
+	kvmi_destroy_vm(kvm);
 	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
 	kvm_destroy_vm_debugfs(kvm);
 	kvm_arch_sync_events(kvm);
@@ -1556,7 +1560,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 * Whoever called remap_pfn_range is also going to call e.g.
 	 * unmap_mapping_range before the underlying pages are freed,
 	 * causing a call to our MMU notifier.
-	 */ 
+	 */
 	kvm_get_pfn(pfn);
 
 	*p_pfn = pfn;
@@ -4204,6 +4208,9 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	r = kvm_vfio_ops_init();
 	WARN_ON(r);
 
+	r = kvmi_init();
+	WARN_ON(r);
+
 	return 0;
 
 out_unreg:
@@ -4229,6 +4236,7 @@ EXPORT_SYMBOL_GPL(kvm_init);
 
 void kvm_exit(void)
 {
+	kvmi_uninit();
 	debugfs_remove_recursive(kvm_debugfs_dir);
 	misc_deregister(&kvm_dev);
 	kmem_cache_destroy(kvm_vcpu_cache);
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
new file mode 100644
index 000000000000..20638743bd03
--- /dev/null
+++ b/virt/kvm/kvmi.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection
+ *
+ * Copyright (C) 2017-2019 Bitdefender S.R.L.
+ *
+ */
+#include <uapi/linux/kvmi.h>
+#include "kvmi_int.h"
+
+int kvmi_init(void)
+{
+	return 0;
+}
+
+void kvmi_uninit(void)
+{
+}
+
+struct kvmi * __must_check kvmi_get(struct kvm *kvm)
+{
+	if (refcount_inc_not_zero(&kvm->kvmi_ref))
+		return kvm->kvmi;
+
+	return NULL;
+}
+
+static void kvmi_destroy(struct kvm *kvm)
+{
+}
+
+static void kvmi_release(struct kvm *kvm)
+{
+	kvmi_destroy(kvm);
+
+	complete(&kvm->kvmi_completed);
+}
+
+/* This function may be called from atomic context and must not sleep */
+void kvmi_put(struct kvm *kvm)
+{
+	if (refcount_dec_and_test(&kvm->kvmi_ref))
+		kvmi_release(kvm);
+}
+
+void kvmi_create_vm(struct kvm *kvm)
+{
+	init_completion(&kvm->kvmi_completed);
+	complete(&kvm->kvmi_completed);
+}
+
+void kvmi_destroy_vm(struct kvm *kvm)
+{
+	struct kvmi *ikvm;
+
+	ikvm = kvmi_get(kvm);
+	if (!ikvm)
+		return;
+
+	kvmi_put(kvm);
+
+	/* wait for introspection resources to be released */
+	wait_for_completion_killable(&kvm->kvmi_completed);
+}
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
new file mode 100644
index 000000000000..ac23ad6fc4df
--- /dev/null
+++ b/virt/kvm/kvmi_int.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVMI_INT_H__
+#define __KVMI_INT_H__
+
+#include <linux/kvm_host.h>
+
+#define IKVM(kvm) ((struct kvmi *)((kvm)->kvmi))
+
+struct kvmi {
+};
+
+#endif

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 02/92] kvm: introspection: add basic ioctls (hook/unhook)
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem) Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-13  8:44   ` Paolo Bonzini
  2019-08-09 15:59 ` [RFC PATCH v6 03/92] kvm: introspection: add permission access ioctls Adalbert Lazăr
                   ` (92 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Mircea Cîrjaliu

The connection of the introspection socket with the introspection tool
is initialized by userspace/QEMU. Once the handshake is done, the file
descriptor is passed to KVMi using the KVM_INTROSPECTION_HOOK ioctl. A
new thread will be created to handle/dispatch all introspection commands
or replies to introspection events. This thread will finish when the
socket is closed by userspace (eg. when the guest is restarted) or by
the introspection tool. The uuid member of struct kvm_introspection is
used to show the guest id with the error messages.

On certain actions from userspace (pause, suspend, migrate, etc.) the
KVM_INTROSPECTION_UNHOOK ioctl is used to notify the introspection tool
to remove its hooks (eg. breakpoints).

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/api.txt  |  50 ++++++++++
 Documentation/virtual/kvm/kvmi.rst |  65 +++++++++++++
 arch/x86/kvm/Makefile              |   2 +-
 arch/x86/kvm/x86.c                 |   7 ++
 include/linux/kvmi.h               |   4 +
 include/uapi/linux/kvm.h           |  11 +++
 virt/kvm/kvm_main.c                |   8 ++
 virt/kvm/kvmi.c                    | 145 +++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h                |  31 ++++++
 virt/kvm/kvmi_msg.c                |  42 +++++++++
 10 files changed, 364 insertions(+), 1 deletion(-)
 create mode 100644 virt/kvm/kvmi_msg.c

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 356156f5c52d..28d4429f9ae9 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3857,6 +3857,56 @@ number of valid entries in the 'entries' array, which is then filled.
 'index' and 'flags' fields in 'struct kvm_cpuid_entry2' are currently reserved,
 userspace should not expect to get any particular value there.
 
+4.996 KVM_INTROSPECTION_HOOK
+
+Capability: KVM_CAP_INTROSPECTION
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_introspection (in)
+Returns: 0 on success, a negative value on error
+
+This ioctl is used to enable the introspection of the current VM.
+
+struct kvm_introspection {
+	__s32 fd;
+	__u32 padding;
+	__u8 uuid[16];
+};
+
+fd is the file handle of a socket connected to the introspection tool,
+
+padding must be zero (it might be used in the future),
+
+uuid is used for debug and error messages.
+
+It can fail with -EFAULT if:
+ - memory allocation failed
+ - this VM is already introspected
+ - the file handle doesn't correspond to an active socket
+
+It will fail with -EINVAL if padding is not zero.
+
+The KVMI version can be retrieved using the KVM_CAP_INTROSPECTION of
+the KVM_CHECK_EXTENSION ioctl() at run-time.
+
+4.997 KVM_INTROSPECTION_UNHOOK
+
+Capability: KVM_CAP_INTROSPECTION
+Architectures: x86
+Type: vm ioctl
+Parameters: none
+Returns: 0 on success, a negative value on error
+
+This ioctl is used to disable the introspection of the current VM.
+It is useful when the VM is paused/suspended/migrated.
+
+It can fail with -EFAULT if:
+  - the introspection is not enabled
+  - the socket (passed with KVM_INTROSPECTION_HOOK) had an error
+
+If the ioctl is successful, the userspace should give the introspection
+tool a chance to unhook the VM.
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index d54caf8d974f..47b7c36d334a 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -64,6 +64,71 @@ used on that guest. Obviously, whether the guest can really continue
 normal execution depends on whether the introspection tool has made any
 modifications that require an active KVMI channel.
 
+Handshake
+---------
+
+Although this falls out of the scope of the introspection subsystem, below
+is a proposal of a handshake that can be used by implementors.
+
+Based on the system administration policies, the management tool
+(eg. libvirt) starts device managers (eg. QEMU) with some extra arguments:
+what introspector could monitor/control that specific guest (and how to
+connect to) and what introspection commands/events are allowed.
+
+The device manager will connect to the introspection tool and wait for a
+cryptographic hash of a cookie that should be known by both peers. If the
+hash is correct (the destination has been "authenticated"), the device
+manager will send another cryptographic hash and random salt. The peer
+recomputes the hash of the cookie bytes including the salt and if they match,
+the device manager has been "authenticated" too. This is a rather crude
+system that makes it difficult for device manager exploits to trick the
+introspection tool into believing its working OK.
+
+The cookie would normally be generated by a management tool (eg. libvirt)
+and make it available to the device manager and to a properly authenticated
+client. It is the job of a third party to retrieve the cookie from the
+management application and pass it over a secure channel to the introspection
+tool.
+
+Once the basic "authentication" has taken place, the introspection tool
+can receive information on the guest (its UUID) and other flags (endianness
+or features supported by the host kernel).
+
+In the end, the device manager will pass the file handle (plus the allowed
+commands/events) to KVM, and forget about it. It will be notified by
+KVM when the introspection tool closes the file handle (in case of
+errors), and should reinitiate the handshake.
+
+Unhooking
+---------
+
+During a VMI session it is possible for the guest to be patched and for
+some of these patches to "talk" with the introspection tool. It thus
+becomes necessary to remove them before the guest is suspended, moved
+(migrated) or a snapshot with memory is created.
+
+The actions are normally performed by the device manager. In the case
+of QEMU, it will use the *KVM_INTROSPECTION_UNHOOK* ioctl to trigger
+the *KVMI_EVENT_UNHOOK* event and wait for a limited amount of time (a
+few seconds) for a confirmation from the introspection tool
+that is OK to proceed.
+
+Live migrations
+---------------
+
+Before the live migration takes place, the introspection tool has to be
+notified and have a chance to unhook (see **Unhooking**).
+
+The QEMU instance on the receiving end, if configured for KVMI, will need to
+establish a connection to the introspection tool after the migration has
+completed.
+
+Obviously, this creates a window in which the guest is not introspected. The
+user will need to be aware of this detail. Future introspection
+technologies can choose not to disconnect and instead transfer the necessary
+context to the introspection tool at the migration destination via a separate
+channel.
+
 Memory access safety
 --------------------
 
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 312597bd47c7..0963e475dbe9 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -7,7 +7,7 @@ KVM := ../../../virt/kvm
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
 				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
-kvm-$(CONFIG_KVM_INTROSPECTION) += $(KVM)/kvmi.o
+kvm-$(CONFIG_KVM_INTROSPECTION) += $(KVM)/kvmi.o $(KVM)/kvmi_msg.o
 
 kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 941f932373d0..0163e1ad1aaa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -20,6 +20,8 @@
  */
 
 #include <linux/kvm_host.h>
+#include <uapi/linux/kvmi.h>
+#include <linux/kvmi.h>
 #include "irq.h"
 #include "mmu.h"
 #include "i8254.h"
@@ -3083,6 +3085,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = kvm_x86_ops->get_nested_state ?
 			kvm_x86_ops->get_nested_state(NULL, 0, 0) : 0;
 		break;
+#ifdef CONFIG_KVM_INTROSPECTION
+	case KVM_CAP_INTROSPECTION:
+		r = KVMI_VERSION;
+		break;
+#endif
 	default:
 		break;
 	}
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
index e36de3f9f3de..4ca9280e4419 100644
--- a/include/linux/kvmi.h
+++ b/include/linux/kvmi.h
@@ -10,6 +10,10 @@ int kvmi_init(void);
 void kvmi_uninit(void);
 void kvmi_create_vm(struct kvm *kvm);
 void kvmi_destroy_vm(struct kvm *kvm);
+int kvmi_ioctl_hook(struct kvm *kvm, void __user *argp);
+int kvmi_ioctl_command(struct kvm *kvm, void __user *argp);
+int kvmi_ioctl_event(struct kvm *kvm, void __user *argp);
+int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset);
 
 #else
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6d4ea4b6c922..bae37bf37338 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -989,6 +989,8 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
 #define KVM_CAP_HYPERV_CPUID 167
 
+#define KVM_CAP_INTROSPECTION 999
+
 #ifdef KVM_CAP_IRQ_ROUTING
 
 struct kvm_irq_routing_irqchip {
@@ -1520,6 +1522,15 @@ struct kvm_sev_dbg {
 	__u32 len;
 };
 
+struct kvm_introspection {
+	__s32 fd;
+	__u32 padding;
+	__u8 uuid[16];
+};
+#define KVM_INTROSPECTION_HOOK    _IOW(KVMIO, 0xff, struct kvm_introspection)
+#define KVM_INTROSPECTION_UNHOOK  _IO(KVMIO, 0xfe)
+/* write true on force-reset, false otherwise */
+
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
 #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 90e432d225ab..09a930ac007d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3266,6 +3266,14 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;
+#ifdef CONFIG_KVM_INTROSPECTION
+	case KVM_INTROSPECTION_HOOK:
+		r = kvmi_ioctl_hook(kvm, argp);
+		break;
+	case KVM_INTROSPECTION_UNHOOK:
+		r = kvmi_ioctl_unhook(kvm, arg);
+		break;
+#endif /* CONFIG_KVM_INTROSPECTION */
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 20638743bd03..591f6ee22135 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -7,6 +7,8 @@
  */
 #include <uapi/linux/kvmi.h>
 #include "kvmi_int.h"
+#include <linux/kthread.h>
+#include <linux/bitmap.h>
 
 int kvmi_init(void)
 {
@@ -17,6 +19,22 @@ void kvmi_uninit(void)
 {
 }
 
+static bool alloc_kvmi(struct kvm *kvm, const struct kvm_introspection *qemu)
+{
+	struct kvmi *ikvm;
+
+	ikvm = kzalloc(sizeof(*ikvm), GFP_KERNEL);
+	if (!ikvm)
+		return false;
+
+	memcpy(&ikvm->uuid, &qemu->uuid, sizeof(ikvm->uuid));
+
+	ikvm->kvm = kvm;
+	kvm->kvmi = ikvm;
+
+	return true;
+}
+
 struct kvmi * __must_check kvmi_get(struct kvm *kvm)
 {
 	if (refcount_inc_not_zero(&kvm->kvmi_ref))
@@ -27,10 +45,13 @@ struct kvmi * __must_check kvmi_get(struct kvm *kvm)
 
 static void kvmi_destroy(struct kvm *kvm)
 {
+	kfree(kvm->kvmi);
+	kvm->kvmi = NULL;
 }
 
 static void kvmi_release(struct kvm *kvm)
 {
+	kvmi_sock_put(IKVM(kvm));
 	kvmi_destroy(kvm);
 
 	complete(&kvm->kvmi_completed);
@@ -43,6 +64,111 @@ void kvmi_put(struct kvm *kvm)
 		kvmi_release(kvm);
 }
 
+static void kvmi_end_introspection(struct kvmi *ikvm)
+{
+	struct kvm *kvm = ikvm->kvm;
+
+	/* Signal QEMU which is waiting for POLLHUP. */
+	kvmi_sock_shutdown(ikvm);
+
+	/*
+	 * At this moment the socket is shut down, no more commands will come
+	 * from the introspector, and the only way into the introspection is
+	 * thru the event handlers. Make sure the introspection ends.
+	 */
+	kvmi_put(kvm);
+}
+
+static int kvmi_recv(void *arg)
+{
+	struct kvmi *ikvm = arg;
+
+	kvmi_info(ikvm, "Hooking VM\n");
+
+	while (kvmi_msg_process(ikvm))
+		;
+
+	kvmi_info(ikvm, "Unhooking VM\n");
+
+	kvmi_end_introspection(ikvm);
+
+	return 0;
+}
+
+int kvmi_hook(struct kvm *kvm, const struct kvm_introspection *qemu)
+{
+	struct kvmi *ikvm;
+	int err = 0;
+
+	/* wait for the previous introspection to finish */
+	err = wait_for_completion_killable(&kvm->kvmi_completed);
+	if (err)
+		return err;
+
+	/* ensure no VCPU hotplug happens until we set the reference */
+	mutex_lock(&kvm->lock);
+
+	if (!alloc_kvmi(kvm, qemu)) {
+		mutex_unlock(&kvm->lock);
+		return -ENOMEM;
+	}
+	ikvm = IKVM(kvm);
+
+	/* interact with other kernel components after structure allocation */
+	if (!kvmi_sock_get(ikvm, qemu->fd)) {
+		err = -EINVAL;
+		goto err_alloc;
+	}
+
+	/*
+	 * Make sure all the KVM/KVMI structures are linked and no pointer
+	 * is read as NULL after the reference count has been set.
+	 */
+	smp_mb__before_atomic();
+	refcount_set(&kvm->kvmi_ref, 1);
+
+	mutex_unlock(&kvm->lock);
+
+	ikvm->recv = kthread_run(kvmi_recv, ikvm, "kvmi-recv");
+	if (IS_ERR(ikvm->recv)) {
+		kvmi_err(ikvm, "Unable to create receiver thread!\n");
+		err = PTR_ERR(ikvm->recv);
+		goto err_recv;
+	}
+
+	return 0;
+
+err_recv:
+	/*
+	 * introspection has oficially started since reference count has been
+	 * set (and some event handlers may have already acquired it), but
+	 * without the receiver thread; we must emulate its shutdown behavior
+	 */
+	kvmi_end_introspection(ikvm);
+
+	return err;
+
+err_alloc:
+	kvmi_release(kvm);
+
+	mutex_unlock(&kvm->lock);
+
+	return err;
+}
+
+int kvmi_ioctl_hook(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_introspection i;
+
+	if (copy_from_user(&i, argp, sizeof(i)))
+		return -EFAULT;
+
+	if (i.padding)
+		return -EINVAL;
+
+	return kvmi_hook(kvm, &i);
+}
+
 void kvmi_create_vm(struct kvm *kvm)
 {
 	init_completion(&kvm->kvmi_completed);
@@ -57,8 +183,27 @@ void kvmi_destroy_vm(struct kvm *kvm)
 	if (!ikvm)
 		return;
 
+	/* trigger socket shutdown - kvmi_recv() will start shutdown process */
+	kvmi_sock_shutdown(ikvm);
+
 	kvmi_put(kvm);
 
 	/* wait for introspection resources to be released */
 	wait_for_completion_killable(&kvm->kvmi_completed);
 }
+
+int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset)
+{
+	struct kvmi *ikvm;
+	int err = 0;
+
+	ikvm = kvmi_get(kvm);
+	if (!ikvm)
+		return -EFAULT;
+
+	kvm_info("TODO: %s force_reset %d", __func__, force_reset);
+
+	kvmi_put(kvm);
+
+	return err;
+}
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index ac23ad6fc4df..9bc5205c8714 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -2,11 +2,42 @@
 #ifndef __KVMI_INT_H__
 #define __KVMI_INT_H__
 
+#include <linux/types.h>
 #include <linux/kvm_host.h>
 
+#include <uapi/linux/kvmi.h>
+
+#define kvmi_debug(ikvm, fmt, ...) \
+	kvm_debug("%pU " fmt, &ikvm->uuid, ## __VA_ARGS__)
+#define kvmi_info(ikvm, fmt, ...) \
+	kvm_info("%pU " fmt, &ikvm->uuid, ## __VA_ARGS__)
+#define kvmi_warn(ikvm, fmt, ...) \
+	kvm_info("%pU WARNING: " fmt, &ikvm->uuid, ## __VA_ARGS__)
+#define kvmi_warn_once(ikvm, fmt, ...) ({                     \
+		static bool __section(.data.once) __warned;   \
+		if (!__warned) {                              \
+			__warned = true;                      \
+			kvmi_warn(ikvm, fmt, ## __VA_ARGS__); \
+		}                                             \
+	})
+#define kvmi_err(ikvm, fmt, ...) \
+	kvm_info("%pU ERROR: " fmt, &ikvm->uuid, ## __VA_ARGS__)
+
 #define IKVM(kvm) ((struct kvmi *)((kvm)->kvmi))
 
 struct kvmi {
+	struct kvm *kvm;
+
+	struct socket *sock;
+	struct task_struct *recv;
+
+	uuid_t uuid;
 };
 
+/* kvmi_msg.c */
+bool kvmi_sock_get(struct kvmi *ikvm, int fd);
+void kvmi_sock_shutdown(struct kvmi *ikvm);
+void kvmi_sock_put(struct kvmi *ikvm);
+bool kvmi_msg_process(struct kvmi *ikvm);
+
 #endif
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
new file mode 100644
index 000000000000..4de012eafb6d
--- /dev/null
+++ b/virt/kvm/kvmi_msg.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection
+ *
+ * Copyright (C) 2017-2019 Bitdefender S.R.L.
+ *
+ */
+#include <linux/net.h>
+#include "kvmi_int.h"
+
+bool kvmi_sock_get(struct kvmi *ikvm, int fd)
+{
+	struct socket *sock;
+	int r;
+
+	sock = sockfd_lookup(fd, &r);
+	if (!sock) {
+		kvmi_err(ikvm, "Invalid file handle: %d\n", fd);
+		return false;
+	}
+
+	ikvm->sock = sock;
+
+	return true;
+}
+
+void kvmi_sock_put(struct kvmi *ikvm)
+{
+	if (ikvm->sock)
+		sockfd_put(ikvm->sock);
+}
+
+void kvmi_sock_shutdown(struct kvmi *ikvm)
+{
+	kernel_sock_shutdown(ikvm->sock, SHUT_RDWR);
+}
+
+bool kvmi_msg_process(struct kvmi *ikvm)
+{
+	kvmi_info(ikvm, "TODO: %s", __func__);
+	return false;
+}

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 03/92] kvm: introspection: add permission access ioctls
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem) Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 02/92] kvm: introspection: add basic ioctls (hook/unhook) Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 04/92] kvm: introspection: add the read/dispatch message function Adalbert Lazăr
                   ` (91 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

KVM_INTROSPECTION_COMMAND and KVM_INTROSPECTION_EVENTS should be used
by userspace/QEMU to allow access to specific (or all) introspection
commands and events.

By default, all introspection events and almost all introspection commands
are disallowed. There are a couple of commands that are always allowed
(those querying the introspection capabilities).

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/api.txt | 56 +++++++++++++++++++-
 include/uapi/linux/kvm.h          |  6 +++
 virt/kvm/kvm_main.c               |  6 +++
 virt/kvm/kvmi.c                   | 85 +++++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h               | 51 +++++++++++++++++++
 5 files changed, 203 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 28d4429f9ae9..ea3135d365c7 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3889,7 +3889,61 @@ It will fail with -EINVAL if padding is not zero.
 The KVMI version can be retrieved using the KVM_CAP_INTROSPECTION of
 the KVM_CHECK_EXTENSION ioctl() at run-time.
 
-4.997 KVM_INTROSPECTION_UNHOOK
+4.997 KVM_INTROSPECTION_COMMAND
+
+Capability: KVM_CAP_INTROSPECTION
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_introspection_feature (in)
+Returns: 0 on success, a negative value on error
+
+This ioctl is used to allow or disallow introspection commands
+for the current VM. By default, almost all commands are disallowed
+except for those used to query the API.
+
+struct kvm_introspection_feature {
+	__u32 allow;
+	__s32 id;
+};
+
+If allow is 1, the command specified by id is allowed. If allow is 0,
+the command is disallowed.
+
+Unless set to -1 (meaning all commands), id must be a command ID
+(e.g. KVMI_GET_VERSION, KVMI_GET_GUEST_INFO etc.)
+
+Errors:
+
+  -EINVAL if the command is unknown
+  -EPERM if the command can't be disallowed (e.g. KVMI_GET_VERSION)
+
+4.998 KVM_INTROSPECTION_EVENT
+
+Capability: KVM_CAP_INTROSPECTION
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_introspection_feature (in)
+Returns: 0 on success, a negative value on error
+
+This ioctl is used to allow or disallow introspection events
+for the current VM. By default, all events are disallowed.
+
+struct kvm_introspection_feature {
+	__u32 allow;
+	__s32 id;
+};
+
+If allow is 1, the event specified by id is allowed. If allow is 0,
+the event is disallowed.
+
+Unless set to -1 (meaning all event), id must be a event ID
+(e.g. KVMI_EVENT_UNHOOK, KVMI_EVENT_CR, etc.)
+
+Errors:
+
+  -EINVAL if the event is unknown
+
+4.999 KVM_INTROSPECTION_UNHOOK
 
 Capability: KVM_CAP_INTROSPECTION
 Architectures: x86
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index bae37bf37338..2ff05fd123e3 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1527,9 +1527,15 @@ struct kvm_introspection {
 	__u32 padding;
 	__u8 uuid[16];
 };
+struct kvm_introspection_feature {
+	__u32 allow;
+	__s32 id;
+};
 #define KVM_INTROSPECTION_HOOK    _IOW(KVMIO, 0xff, struct kvm_introspection)
 #define KVM_INTROSPECTION_UNHOOK  _IO(KVMIO, 0xfe)
 /* write true on force-reset, false otherwise */
+#define KVM_INTROSPECTION_COMMAND _IOW(KVMIO, 0xfd, struct kvm_introspection_feature)
+#define KVM_INTROSPECTION_EVENT   _IOW(KVMIO, 0xfc, struct kvm_introspection_feature)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 09a930ac007d..8399b826f2d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3270,6 +3270,12 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_INTROSPECTION_HOOK:
 		r = kvmi_ioctl_hook(kvm, argp);
 		break;
+	case KVM_INTROSPECTION_COMMAND:
+		r = kvmi_ioctl_command(kvm, argp);
+		break;
+	case KVM_INTROSPECTION_EVENT:
+		r = kvmi_ioctl_event(kvm, argp);
+		break;
 	case KVM_INTROSPECTION_UNHOOK:
 		r = kvmi_ioctl_unhook(kvm, arg);
 		break;
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 591f6ee22135..dc64f975998f 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -169,6 +169,91 @@ int kvmi_ioctl_hook(struct kvm *kvm, void __user *argp)
 	return kvmi_hook(kvm, &i);
 }
 
+static int kvmi_ioctl_get_feature(void __user *argp, bool *allow, int *id,
+				  unsigned long *bitmask)
+{
+	struct kvm_introspection_feature feat;
+	int all_bits = -1;
+
+	if (copy_from_user(&feat, argp, sizeof(feat)))
+		return -EFAULT;
+
+	if (feat.id < 0 && feat.id != all_bits)
+		return -EINVAL;
+
+	*allow = !!(feat.allow & 1);
+	*id = feat.id;
+	*bitmask = *id == all_bits ? -1 : BIT(feat.id);
+
+	return 0;
+}
+
+static int kvmi_ioctl_feature(struct kvm *kvm,
+			      bool allow, unsigned long *requested,
+			      size_t off_dest, unsigned int nbits)
+{
+	unsigned long *dest;
+	struct kvmi *ikvm;
+
+	if (bitmap_empty(requested, nbits))
+		return -EINVAL;
+
+	ikvm = kvmi_get(kvm);
+	if (!ikvm)
+		return -EFAULT;
+
+	dest = (unsigned long *)((char *)ikvm + off_dest);
+
+	if (allow)
+		bitmap_or(dest, dest, requested, nbits);
+	else
+		bitmap_andnot(dest, dest, requested, nbits);
+
+	kvmi_put(kvm);
+
+	return 0;
+}
+
+int kvmi_ioctl_event(struct kvm *kvm, void __user *argp)
+{
+	DECLARE_BITMAP(requested, KVMI_NUM_EVENTS);
+	DECLARE_BITMAP(known, KVMI_NUM_EVENTS);
+	bool allow;
+	int err;
+	int id;
+
+	err = kvmi_ioctl_get_feature(argp, &allow, &id, requested);
+	if (err)
+		return err;
+
+	bitmap_from_u64(known, KVMI_KNOWN_EVENTS);
+	bitmap_and(requested, requested, known, KVMI_NUM_EVENTS);
+
+	return kvmi_ioctl_feature(kvm, allow, requested,
+				  offsetof(struct kvmi, event_allow_mask),
+				  KVMI_NUM_EVENTS);
+}
+
+int kvmi_ioctl_command(struct kvm *kvm, void __user *argp)
+{
+	DECLARE_BITMAP(requested, KVMI_NUM_COMMANDS);
+	DECLARE_BITMAP(known, KVMI_NUM_COMMANDS);
+	bool allow;
+	int err;
+	int id;
+
+	err = kvmi_ioctl_get_feature(argp, &allow, &id, requested);
+	if (err)
+		return err;
+
+	bitmap_from_u64(known, KVMI_KNOWN_COMMANDS);
+	bitmap_and(requested, requested, known, KVMI_NUM_COMMANDS);
+
+	return kvmi_ioctl_feature(kvm, allow, requested,
+				  offsetof(struct kvmi, cmd_allow_mask),
+				  KVMI_NUM_COMMANDS);
+}
+
 void kvmi_create_vm(struct kvm *kvm)
 {
 	init_completion(&kvm->kvmi_completed);
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 9bc5205c8714..bd8b539e917a 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -23,6 +23,54 @@
 #define kvmi_err(ikvm, fmt, ...) \
 	kvm_info("%pU ERROR: " fmt, &ikvm->uuid, ## __VA_ARGS__)
 
+#define KVMI_KNOWN_VCPU_EVENTS ( \
+		BIT(KVMI_EVENT_CR) | \
+		BIT(KVMI_EVENT_MSR) | \
+		BIT(KVMI_EVENT_XSETBV) | \
+		BIT(KVMI_EVENT_BREAKPOINT) | \
+		BIT(KVMI_EVENT_HYPERCALL) | \
+		BIT(KVMI_EVENT_PF) | \
+		BIT(KVMI_EVENT_TRAP) | \
+		BIT(KVMI_EVENT_DESCRIPTOR) | \
+		BIT(KVMI_EVENT_PAUSE_VCPU) | \
+		BIT(KVMI_EVENT_SINGLESTEP))
+
+#define KVMI_KNOWN_VM_EVENTS ( \
+		BIT(KVMI_EVENT_CREATE_VCPU) | \
+		BIT(KVMI_EVENT_UNHOOK))
+
+#define KVMI_KNOWN_EVENTS (KVMI_KNOWN_VCPU_EVENTS | KVMI_KNOWN_VM_EVENTS)
+
+#define KVMI_KNOWN_COMMANDS ( \
+		BIT(KVMI_GET_VERSION) | \
+		BIT(KVMI_CHECK_COMMAND) | \
+		BIT(KVMI_CHECK_EVENT) | \
+		BIT(KVMI_GET_GUEST_INFO) | \
+		BIT(KVMI_PAUSE_VCPU) | \
+		BIT(KVMI_CONTROL_VM_EVENTS) | \
+		BIT(KVMI_CONTROL_EVENTS) | \
+		BIT(KVMI_CONTROL_CR) | \
+		BIT(KVMI_CONTROL_MSR) | \
+		BIT(KVMI_CONTROL_VE) | \
+		BIT(KVMI_GET_REGISTERS) | \
+		BIT(KVMI_SET_REGISTERS) | \
+		BIT(KVMI_GET_CPUID) | \
+		BIT(KVMI_GET_XSAVE) | \
+		BIT(KVMI_READ_PHYSICAL) | \
+		BIT(KVMI_WRITE_PHYSICAL) | \
+		BIT(KVMI_INJECT_EXCEPTION) | \
+		BIT(KVMI_GET_PAGE_ACCESS) | \
+		BIT(KVMI_SET_PAGE_ACCESS) | \
+		BIT(KVMI_GET_MAP_TOKEN) | \
+		BIT(KVMI_CONTROL_SPP) | \
+		BIT(KVMI_GET_PAGE_WRITE_BITMAP) | \
+		BIT(KVMI_SET_PAGE_WRITE_BITMAP) | \
+		BIT(KVMI_GET_MTRR_TYPE) | \
+		BIT(KVMI_CONTROL_CMD_RESPONSE) | \
+		BIT(KVMI_GET_VCPU_INFO))
+
+#define KVMI_NUM_COMMANDS KVMI_NEXT_AVAILABLE_COMMAND
+
 #define IKVM(kvm) ((struct kvmi *)((kvm)->kvmi))
 
 struct kvmi {
@@ -32,6 +80,9 @@ struct kvmi {
 	struct task_struct *recv;
 
 	uuid_t uuid;
+
+	DECLARE_BITMAP(cmd_allow_mask, KVMI_NUM_COMMANDS);
+	DECLARE_BITMAP(event_allow_mask, KVMI_NUM_EVENTS);
 };
 
 /* kvmi_msg.c */

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 04/92] kvm: introspection: add the read/dispatch message function
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (2 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 03/92] kvm: introspection: add permission access ioctls Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 05/92] kvm: introspection: add KVMI_GET_VERSION Adalbert Lazăr
                   ` (90 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

Based on the common header used by all messages (struct kvmi_msg_hdr),
the worker will read/validate all messages, execute the VM introspection
commands (eg. KVMI_GET_GUEST_INFO) and dispatch to vCPUs the vCPU
introspection commands (eg. KVMI_GET_REGISTERS) and the replies to
vCPU events. The vCPU threads will reply to vCPU introspection commands
without the help of the receiving worker.

Because of the command header (struct kvmi_error_code) used in any
command reply, this worker could respond to any unsupported/disallowed
command with an error code.

This thread will end when the socket is closed (signaled by userspace/QEMU
or the introspection tool) or on the first API error (eg. wrong message
size).

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst |  86 +++++++++++
 include/uapi/linux/kvmi.h          |  13 ++
 virt/kvm/kvmi.c                    |  43 +++++-
 virt/kvm/kvmi_int.h                |   7 +
 virt/kvm/kvmi_msg.c                | 240 ++++++++++++++++++++++++++++-
 5 files changed, 386 insertions(+), 3 deletions(-)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 47b7c36d334a..1d4a1dcd7d2f 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -64,6 +64,85 @@ used on that guest. Obviously, whether the guest can really continue
 normal execution depends on whether the introspection tool has made any
 modifications that require an active KVMI channel.
 
+All messages (commands or events) have a common header::
+
+	struct kvmi_msg_hdr {
+		__u16 id;
+		__u16 size;
+		__u32 seq;
+	};
+
+The replies have the same header, with the sequence number (``seq``)
+and message id (``id``) matching the command/event.
+
+After ``kvmi_msg_hdr``, ``id`` specific data of ``size`` bytes will
+follow.
+
+The message header and its data must be sent with one ``sendmsg()`` call
+to the socket. This simplifies the receiver loop and avoids
+the reconstruction of messages on the other side.
+
+The wire protocol uses the host native byte-order. The introspection tool
+must check this during the handshake and do the necessary conversion.
+
+A command reply begins with::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	}
+
+followed by the command specific data if the error code ``err`` is zero.
+
+The error code -KVM_EOPNOTSUPP is returned for unsupported commands.
+
+The error code -KVM_EPERM is returned for disallowed commands (see **Hooking**).
+
+The error code is related to the message processing, including unsupported
+commands. For all the other errors (incomplete messages, wrong sequence
+numbers, socket errors etc.) the socket will be closed. The device
+manager should reconnect.
+
+While all commands will have a reply as soon as possible, the replies
+to events will probably be delayed until a set of (new) commands will
+complete::
+
+   Host kernel               Tool
+   -----------               ----
+   event 1 ->
+                             <- command 1
+   command 1 reply ->
+                             <- command 2
+   command 2 reply ->
+                             <- event 1 reply
+
+If both ends send a message at the same time::
+
+   Host kernel               Tool
+   -----------               ----
+   event X ->                <- command X
+
+the host kernel will reply to 'command X', regardless of the receive time
+(before or after the 'event X' was sent).
+
+As it can be seen below, the wire protocol specifies occasional padding. This
+is to permit working with the data by directly using C structures or to round
+the structure size to a multiple of 8 bytes (64bit) to improve the copy
+operations that happen during ``recvmsg()`` or ``sendmsg()``. The members
+should have the native alignment of the host (4 bytes on x86). All padding
+must be initialized with zero otherwise the respective commands will fail
+with -KVM_EINVAL.
+
+To describe the commands/events, we reuse some conventions from api.txt:
+
+  - Architectures: which instruction set architectures provide this command/event
+
+  - Versions: which versions provide this command/event
+
+  - Parameters: incoming message data
+
+  - Returns: outgoing/reply message data
+
 Handshake
 ---------
 
@@ -99,6 +178,13 @@ commands/events) to KVM, and forget about it. It will be notified by
 KVM when the introspection tool closes the file handle (in case of
 errors), and should reinitiate the handshake.
 
+Once the file handle reaches KVM, the introspection tool should use
+the *KVMI_GET_VERSION* command to get the API version and/or
+the *KVMI_CHECK_COMMAND* and *KVMI_CHECK_EVENTS* commands to see which
+commands/events are allowed for this guest. The error code -KVM_EPERM
+will be returned if the introspection tool uses a command or enables an
+event which is disallowed.
+
 Unhooking
 ---------
 
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index dbf63ad0862f..6c7600ed4564 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -65,4 +65,17 @@ enum {
 	KVMI_NUM_EVENTS
 };
 
+#define KVMI_MSG_SIZE (4096 - sizeof(struct kvmi_msg_hdr))
+
+struct kvmi_msg_hdr {
+	__u16 id;
+	__u16 size;
+	__u32 seq;
+};
+
+struct kvmi_error_code {
+	__s32 err;
+	__u32 padding;
+};
+
 #endif /* _UAPI__LINUX_KVMI_H */
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index dc64f975998f..afa31748d7f4 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -10,13 +10,54 @@
 #include <linux/kthread.h>
 #include <linux/bitmap.h>
 
-int kvmi_init(void)
+static struct kmem_cache *msg_cache;
+
+void *kvmi_msg_alloc(void)
+{
+	return kmem_cache_zalloc(msg_cache, GFP_KERNEL);
+}
+
+void *kvmi_msg_alloc_check(size_t size)
+{
+	if (size > KVMI_MSG_SIZE_ALLOC)
+		return NULL;
+	return kvmi_msg_alloc();
+}
+
+void kvmi_msg_free(void *addr)
+{
+	if (addr)
+		kmem_cache_free(msg_cache, addr);
+}
+
+static void kvmi_cache_destroy(void)
 {
+	kmem_cache_destroy(msg_cache);
+	msg_cache = NULL;
+}
+
+static int kvmi_cache_create(void)
+{
+	msg_cache = kmem_cache_create("kvmi_msg", KVMI_MSG_SIZE_ALLOC,
+				      4096, SLAB_ACCOUNT, NULL);
+
+	if (!msg_cache) {
+		kvmi_cache_destroy();
+
+		return -1;
+	}
+
 	return 0;
 }
 
+int kvmi_init(void)
+{
+	return kvmi_cache_create();
+}
+
 void kvmi_uninit(void)
 {
+	kvmi_cache_destroy();
 }
 
 static bool alloc_kvmi(struct kvm *kvm, const struct kvm_introspection *qemu)
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index bd8b539e917a..76119a4b69d8 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -23,6 +23,8 @@
 #define kvmi_err(ikvm, fmt, ...) \
 	kvm_info("%pU ERROR: " fmt, &ikvm->uuid, ## __VA_ARGS__)
 
+#define KVMI_MSG_SIZE_ALLOC (sizeof(struct kvmi_msg_hdr) + KVMI_MSG_SIZE)
+
 #define KVMI_KNOWN_VCPU_EVENTS ( \
 		BIT(KVMI_EVENT_CR) | \
 		BIT(KVMI_EVENT_MSR) | \
@@ -91,4 +93,9 @@ void kvmi_sock_shutdown(struct kvmi *ikvm);
 void kvmi_sock_put(struct kvmi *ikvm);
 bool kvmi_msg_process(struct kvmi *ikvm);
 
+/* kvmi.c */
+void *kvmi_msg_alloc(void);
+void *kvmi_msg_alloc_check(size_t size);
+void kvmi_msg_free(void *addr);
+
 #endif
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 4de012eafb6d..af6bc47dc031 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -8,6 +8,19 @@
 #include <linux/net.h>
 #include "kvmi_int.h"
 
+static const char *const msg_IDs[] = {
+};
+
+static bool is_known_message(u16 id)
+{
+	return id < ARRAY_SIZE(msg_IDs) && msg_IDs[id];
+}
+
+static const char *id2str(u16 id)
+{
+	return is_known_message(id) ? msg_IDs[id] : "unknown";
+}
+
 bool kvmi_sock_get(struct kvmi *ikvm, int fd)
 {
 	struct socket *sock;
@@ -35,8 +48,231 @@ void kvmi_sock_shutdown(struct kvmi *ikvm)
 	kernel_sock_shutdown(ikvm->sock, SHUT_RDWR);
 }
 
+static int kvmi_sock_read(struct kvmi *ikvm, void *buf, size_t size)
+{
+	struct kvec i = {
+		.iov_base = buf,
+		.iov_len = size,
+	};
+	struct msghdr m = { };
+	int rc;
+
+	rc = kernel_recvmsg(ikvm->sock, &m, &i, 1, size, MSG_WAITALL);
+
+	if (rc > 0)
+		print_hex_dump_debug("read: ", DUMP_PREFIX_NONE, 32, 1,
+					buf, rc, false);
+
+	if (unlikely(rc != size)) {
+		if (rc >= 0)
+			rc = -EPIPE;
+		else
+			kvmi_err(ikvm, "kernel_recvmsg: %d\n", rc);
+		return rc;
+	}
+
+	return 0;
+}
+
+static int kvmi_sock_write(struct kvmi *ikvm, struct kvec *i, size_t n,
+			   size_t size)
+{
+	struct msghdr m = { };
+	int rc, k;
+
+	rc = kernel_sendmsg(ikvm->sock, &m, i, n, size);
+
+	if (rc > 0)
+		for (k = 0; k < n; k++)
+			print_hex_dump_debug("write: ", DUMP_PREFIX_NONE, 32, 1,
+					i[k].iov_base, i[k].iov_len, false);
+
+	if (unlikely(rc != size)) {
+		kvmi_err(ikvm, "kernel_sendmsg: %d\n", rc);
+		if (rc >= 0)
+			rc = -EPIPE;
+		return rc;
+	}
+
+	return 0;
+}
+
+static int kvmi_msg_reply(struct kvmi *ikvm,
+			  const struct kvmi_msg_hdr *msg, int err,
+			  const void *rpl, size_t rpl_size)
+{
+	struct kvmi_error_code ec;
+	struct kvmi_msg_hdr h;
+	struct kvec vec[3] = {
+		{ .iov_base = &h, .iov_len = sizeof(h) },
+		{ .iov_base = &ec, .iov_len = sizeof(ec) },
+		{ .iov_base = (void *)rpl, .iov_len = rpl_size },
+	};
+	size_t size = sizeof(h) + sizeof(ec) + (err ? 0 : rpl_size);
+	size_t n = err ? ARRAY_SIZE(vec) - 1 : ARRAY_SIZE(vec);
+
+	memset(&h, 0, sizeof(h));
+	h.id = msg->id;
+	h.seq = msg->seq;
+	h.size = size - sizeof(h);
+
+	memset(&ec, 0, sizeof(ec));
+	ec.err = err;
+
+	return kvmi_sock_write(ikvm, vec, n, size);
+}
+
+static int kvmi_msg_vm_reply(struct kvmi *ikvm,
+			     const struct kvmi_msg_hdr *msg, int err,
+			     const void *rpl, size_t rpl_size)
+{
+	return kvmi_msg_reply(ikvm, msg, err, rpl, rpl_size);
+}
+
+static bool is_command_allowed(struct kvmi *ikvm, int id)
+{
+	return test_bit(id, ikvm->cmd_allow_mask);
+}
+
+/*
+ * These commands are executed on the receiving thread/worker.
+ */
+static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
+			    const void *) = {
+};
+
+static bool is_vm_message(u16 id)
+{
+	return id < ARRAY_SIZE(msg_vm) && !!msg_vm[id];
+}
+
+static bool is_unsupported_message(u16 id)
+{
+	bool supported;
+
+	supported = is_known_message(id) && is_vm_message(id);
+
+	return !supported;
+}
+
+static int kvmi_consume_bytes(struct kvmi *ikvm, size_t bytes)
+{
+	size_t to_read;
+	u8 buf[1024];
+	int err = 0;
+
+	while (bytes && !err) {
+		to_read = min(bytes, sizeof(buf));
+
+		err = kvmi_sock_read(ikvm, buf, to_read);
+
+		bytes -= to_read;
+	}
+
+	return err;
+}
+
+static struct kvmi_msg_hdr *kvmi_msg_recv(struct kvmi *ikvm, bool *unsupported)
+{
+	struct kvmi_msg_hdr *msg;
+	int err;
+
+	*unsupported = false;
+
+	msg = kvmi_msg_alloc();
+	if (!msg)
+		goto out_err;
+
+	err = kvmi_sock_read(ikvm, msg, sizeof(*msg));
+	if (err)
+		goto out_err;
+
+	if (msg->size > KVMI_MSG_SIZE)
+		goto out_err_msg;
+
+	if (is_unsupported_message(msg->id)) {
+		if (msg->size && kvmi_consume_bytes(ikvm, msg->size) < 0)
+			goto out_err_msg;
+
+		*unsupported = true;
+		return msg;
+	}
+
+	if (msg->size && kvmi_sock_read(ikvm, msg + 1, msg->size) < 0)
+		goto out_err_msg;
+
+	return msg;
+
+out_err_msg:
+	kvmi_err(ikvm, "%s id %u (%s) size %u\n",
+		 __func__, msg->id, id2str(msg->id), msg->size);
+
+out_err:
+	kvmi_msg_free(msg);
+
+	return NULL;
+}
+
+static int kvmi_msg_dispatch_vm_cmd(struct kvmi *ikvm,
+				    const struct kvmi_msg_hdr *msg)
+{
+	return msg_vm[msg->id](ikvm, msg, msg + 1);
+}
+
+static int kvmi_msg_dispatch(struct kvmi *ikvm,
+			     struct kvmi_msg_hdr *msg, bool *queued)
+{
+	int err;
+
+	err = kvmi_msg_dispatch_vm_cmd(ikvm, msg);
+
+	if (err)
+		kvmi_err(ikvm, "%s: msg id: %u (%s), err: %d\n", __func__,
+			 msg->id, id2str(msg->id), err);
+
+	return err;
+}
+
+static bool is_message_allowed(struct kvmi *ikvm, __u16 id)
+{
+	if (id == KVMI_EVENT_REPLY)
+		return true;
+
+	/*
+	 * Some commands (eg.pause) request events that might be
+	 * disallowed. The command is allowed here, but the function
+	 * handling the command will return -KVM_EPERM if the event
+	 * is disallowed.
+	 */
+	return is_command_allowed(ikvm, id);
+}
+
 bool kvmi_msg_process(struct kvmi *ikvm)
 {
-	kvmi_info(ikvm, "TODO: %s", __func__);
-	return false;
+	struct kvmi_msg_hdr *msg;
+	bool queued = false;
+	bool unsupported;
+	int err = -1;
+
+	msg = kvmi_msg_recv(ikvm, &unsupported);
+	if (!msg)
+		goto out;
+
+	if (unsupported) {
+		err = kvmi_msg_vm_reply(ikvm, msg, -KVM_EOPNOTSUPP, NULL, 0);
+		goto out;
+	}
+
+	if (!is_message_allowed(ikvm, msg->id)) {
+		err = kvmi_msg_vm_reply(ikvm, msg, -KVM_EPERM, NULL, 0);
+		goto out;
+	}
+
+	err = kvmi_msg_dispatch(ikvm, msg, &queued);
+
+out:
+	if (!queued)
+		kvmi_msg_free(msg);
+
+	return err == 0;
 }

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 05/92] kvm: introspection: add KVMI_GET_VERSION
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (3 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 04/92] kvm: introspection: add the read/dispatch message function Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 06/92] kvm: introspection: add KVMI_CONTROL_CMD_RESPONSE Adalbert Lazăr
                   ` (89 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

This command should be used by the introspection tool to identify the
commands/events supported by the KVMi subsystem and, most important,
what messages must be used for event replies. The kernel side will accept
smaller or bigger command messages, but it can be more strict with bigger
event reply messages.

The command is always allowed and any attempt from userspace to disallow it
through KVM_INTROSPECTION_COMMAND will get -EPERM (unless userspace choose
to disable all commands, using id=-1, in which case KVMI_GET_VERSION is
quietly allowed, without an error).

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 28 ++++++++++++++++++++++++++++
 include/uapi/linux/kvmi.h          |  5 +++++
 virt/kvm/kvmi.c                    | 14 ++++++++++++++
 virt/kvm/kvmi_msg.c                | 13 +++++++++++++
 4 files changed, 60 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 1d4a1dcd7d2f..0f296e3c4244 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -224,3 +224,31 @@ device-specific memory (DMA, emulated MMIO, reserved by a passthrough
 device etc.). It is up to the user to determine, using the guest operating
 system data structures, the areas that are safe to access (code, stack, heap
 etc.).
+
+Commands
+--------
+
+The following C structures are meant to be used directly when communicating
+over the wire. The peer that detects any size mismatch should simply close
+the connection and report the error.
+
+1. KVMI_GET_VERSION
+-------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: none
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_version_reply {
+		__u32 version;
+		__u32 padding;
+	};
+
+Returns the introspection API version.
+
+This command is always allowed and successful (if the introspection is
+built in kernel).
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index 6c7600ed4564..9574ba0b9565 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -78,4 +78,9 @@ struct kvmi_error_code {
 	__u32 padding;
 };
 
+struct kvmi_get_version_reply {
+	__u32 version;
+	__u32 padding;
+};
+
 #endif /* _UAPI__LINUX_KVMI_H */
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index afa31748d7f4..d5b6af21564e 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -68,6 +68,8 @@ static bool alloc_kvmi(struct kvm *kvm, const struct kvm_introspection *qemu)
 	if (!ikvm)
 		return false;
 
+	set_bit(KVMI_GET_VERSION, ikvm->cmd_allow_mask);
+
 	memcpy(&ikvm->uuid, &qemu->uuid, sizeof(ikvm->uuid));
 
 	ikvm->kvm = kvm;
@@ -290,6 +292,18 @@ int kvmi_ioctl_command(struct kvm *kvm, void __user *argp)
 	bitmap_from_u64(known, KVMI_KNOWN_COMMANDS);
 	bitmap_and(requested, requested, known, KVMI_NUM_COMMANDS);
 
+	if (!allow) {
+		DECLARE_BITMAP(always_allowed, KVMI_NUM_COMMANDS);
+
+		if (id == KVMI_GET_VERSION)
+			return -EPERM;
+
+		set_bit(KVMI_GET_VERSION, always_allowed);
+
+		bitmap_andnot(requested, requested, always_allowed,
+			      KVMI_NUM_COMMANDS);
+	}
+
 	return kvmi_ioctl_feature(kvm, allow, requested,
 				  offsetof(struct kvmi, cmd_allow_mask),
 				  KVMI_NUM_COMMANDS);
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index af6bc47dc031..6fe04de29f7e 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -9,6 +9,7 @@
 #include "kvmi_int.h"
 
 static const char *const msg_IDs[] = {
+	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
 };
 
 static bool is_known_message(u16 id)
@@ -129,6 +130,17 @@ static int kvmi_msg_vm_reply(struct kvmi *ikvm,
 	return kvmi_msg_reply(ikvm, msg, err, rpl, rpl_size);
 }
 
+static int handle_get_version(struct kvmi *ikvm,
+			      const struct kvmi_msg_hdr *msg, const void *req)
+{
+	struct kvmi_get_version_reply rpl;
+
+	memset(&rpl, 0, sizeof(rpl));
+	rpl.version = KVMI_VERSION;
+
+	return kvmi_msg_vm_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
+}
+
 static bool is_command_allowed(struct kvmi *ikvm, int id)
 {
 	return test_bit(id, ikvm->cmd_allow_mask);
@@ -139,6 +151,7 @@ static bool is_command_allowed(struct kvmi *ikvm, int id)
  */
 static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 			    const void *) = {
+	[KVMI_GET_VERSION]           = handle_get_version,
 };
 
 static bool is_vm_message(u16 id)

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 06/92] kvm: introspection: add KVMI_CONTROL_CMD_RESPONSE
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (4 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 05/92] kvm: introspection: add KVMI_GET_VERSION Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-13  9:15   ` Paolo Bonzini
  2019-08-09 15:59 ` [RFC PATCH v6 07/92] kvm: introspection: honor the reply option when handling the KVMI_GET_VERSION command Adalbert Lazăr
                   ` (88 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

This command enables/disables the command replies. It is useful when
the introspection tool send multiple messages with one write() call and
doesn't have to wait for a reply.

IIRC, the speed improvment seen during UnixBench tests in a VM
introspected through vsock (the introspection tool was running in a
different VM) was around 5-10%.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 50 ++++++++++++++++++++++++++
 include/uapi/linux/kvmi.h          |  7 ++++
 virt/kvm/kvmi_int.h                |  2 ++
 virt/kvm/kvmi_msg.c                | 57 ++++++++++++++++++++++++++++++
 4 files changed, 116 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 0f296e3c4244..82de474d512b 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -252,3 +252,53 @@ Returns the introspection API version.
 
 This command is always allowed and successful (if the introspection is
 built in kernel).
+
+2. KVMI_CONTROL_CMD_RESPONSE
+----------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_control_cmd_response {
+		__u8 enable;
+		__u8 now;
+		__u16 padding1;
+		__u32 padding2;
+	};
+
+:Returns:
+
+::
+	struct kvmi_error_code
+
+Enables or disables the command replies. By default, all commands need
+a reply.
+
+If `now` is 1, the command reply is enabled/disabled (according to
+`enable`) starting with the current command. For example, `enable=0`
+and `now=1` means that the reply is disabled for this command too,
+while `enable=0` and `now=0` means that a reply will be send for this
+command, but not for the next ones (until enabled back with another
+*KVMI_CONTROL_CMD_RESPONSE*).
+
+This command is used by the introspection tool to disable the replies
+for commands returning an error code only (eg. *KVMI_SET_REGISTERS*)
+when an error is less likely to happen. For example, the following
+commands can be used to reply to an event with a single `write()` call:
+
+	KVMI_CONTROL_CMD_RESPONSE enable=0 now=1
+	KVMI_SET_REGISTERS vcpu=N
+	KVMI_EVENT_REPLY   vcpu=N
+	KVMI_CONTROL_CMD_RESPONSE enable=1 now=0
+
+While the command reply is disabled:
+
+* the socket will be closed on any command for which the reply should
+  contain more than just an error code (eg. *KVMI_GET_REGISTERS*)
+
+* the reply status is ignored for any unsupported/unknown or disallowed
+  commands (and ``struct kvmi_error_code`` will be sent with -KVM_EOPNOTSUPP
+  or -KVM_PERM).
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index 9574ba0b9565..a1ab39c5b8e0 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -83,4 +83,11 @@ struct kvmi_get_version_reply {
 	__u32 padding;
 };
 
+struct kvmi_control_cmd_response {
+	__u8 enable;
+	__u8 now;
+	__u16 padding1;
+	__u32 padding2;
+};
+
 #endif /* _UAPI__LINUX_KVMI_H */
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 76119a4b69d8..157f765fb34d 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -85,6 +85,8 @@ struct kvmi {
 
 	DECLARE_BITMAP(cmd_allow_mask, KVMI_NUM_COMMANDS);
 	DECLARE_BITMAP(event_allow_mask, KVMI_NUM_EVENTS);
+
+	bool cmd_reply_disabled;
 };
 
 /* kvmi_msg.c */
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 6fe04de29f7e..ea5c7e23669a 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -9,6 +9,7 @@
 #include "kvmi_int.h"
 
 static const char *const msg_IDs[] = {
+	[KVMI_CONTROL_CMD_RESPONSE]  = "KVMI_CONTROL_CMD_RESPONSE",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
 };
 
@@ -130,6 +131,36 @@ static int kvmi_msg_vm_reply(struct kvmi *ikvm,
 	return kvmi_msg_reply(ikvm, msg, err, rpl, rpl_size);
 }
 
+static bool kvmi_validate_no_reply(struct kvmi *ikvm,
+				   const struct kvmi_msg_hdr *msg,
+				   size_t rpl_size, int err)
+{
+	if (rpl_size) {
+		kvmi_err(ikvm, "Reply disabled for command %d", msg->id);
+		return false;
+	}
+
+	if (err)
+		kvmi_warn(ikvm, "Error code %d discarded for message id %d\n",
+			  err, msg->id);
+
+	return true;
+}
+
+static int kvmi_msg_vm_maybe_reply(struct kvmi *ikvm,
+				   const struct kvmi_msg_hdr *msg,
+				   int err, const void *rpl,
+				   size_t rpl_size)
+{
+	if (ikvm->cmd_reply_disabled) {
+		if (!kvmi_validate_no_reply(ikvm, msg, rpl_size, err))
+			return -KVM_EINVAL;
+		return 0;
+	}
+
+	return kvmi_msg_vm_reply(ikvm, msg, err, rpl, rpl_size);
+}
+
 static int handle_get_version(struct kvmi *ikvm,
 			      const struct kvmi_msg_hdr *msg, const void *req)
 {
@@ -146,11 +177,37 @@ static bool is_command_allowed(struct kvmi *ikvm, int id)
 	return test_bit(id, ikvm->cmd_allow_mask);
 }
 
+static int handle_control_cmd_response(struct kvmi *ikvm,
+					const struct kvmi_msg_hdr *msg,
+					const void *_req)
+{
+	const struct kvmi_control_cmd_response *req = _req;
+	bool disabled, now;
+	int err;
+
+	if (req->padding1 || req->padding2)
+		return -KVM_EINVAL;
+
+	disabled = !req->enable;
+	now = (req->now == 1);
+
+	if (now)
+		ikvm->cmd_reply_disabled = disabled;
+
+	err = kvmi_msg_vm_maybe_reply(ikvm, msg, 0, NULL, 0);
+
+	if (!now)
+		ikvm->cmd_reply_disabled = disabled;
+
+	return err;
+}
+
 /*
  * These commands are executed on the receiving thread/worker.
  */
 static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 			    const void *) = {
+	[KVMI_CONTROL_CMD_RESPONSE]  = handle_control_cmd_response,
 	[KVMI_GET_VERSION]           = handle_get_version,
 };
 

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 07/92] kvm: introspection: honor the reply option when handling the KVMI_GET_VERSION command
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (5 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 06/92] kvm: introspection: add KVMI_CONTROL_CMD_RESPONSE Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-13  9:16   ` Paolo Bonzini
  2019-08-09 15:59 ` [RFC PATCH v6 08/92] kvm: introspection: add KVMI_CHECK_COMMAND and KVMI_CHECK_EVENT Adalbert Lazăr
                   ` (87 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

Obviously, the KVMI_GET_VERSION command must not be used when the command
reply is disabled by a previous KVMI_CONTROL_CMD_RESPONSE command.

This commit changes the code path in order to check the reply option
(enabled/disabled) before trying to reply to this command. If the command
reply is disabled it will return an error to the caller. In the end, the
receiving worker will finish and the introspection socket will be closed.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 virt/kvm/kvmi_msg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index ea5c7e23669a..2237a6ed25f6 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -169,7 +169,7 @@ static int handle_get_version(struct kvmi *ikvm,
 	memset(&rpl, 0, sizeof(rpl));
 	rpl.version = KVMI_VERSION;
 
-	return kvmi_msg_vm_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
+	return kvmi_msg_vm_maybe_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
 }
 
 static bool is_command_allowed(struct kvmi *ikvm, int id)

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 08/92] kvm: introspection: add KVMI_CHECK_COMMAND and KVMI_CHECK_EVENT
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (6 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 07/92] kvm: introspection: honor the reply option when handling the KVMI_GET_VERSION command Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 09/92] kvm: introspection: add KVMI_GET_GUEST_INFO Adalbert Lazăr
                   ` (86 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

These commands can be used by the introspection tool to check what
introspection commands and events are supported (by KVMi) and allowed
(by userspace/QEMU).

The introspection tool will get one of the following error codes:
  * -KVM_EOPNOTSUPP (unsupported command/event)
  * -KVM_PERM (disallowed command/event)
  * -KVM_EINVAL (the padding space, used for future extensions,
                 is not zero)
  * 0 (the command/event is supported and allowed)

These commands can be seen as an alternative method to KVMI_GET_VERSION
in checking if the introspection supports a specific command/event.

As with the KVMI_GET_VERSION command, these commands can never be
disallowed by userspace/QEMU.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 60 ++++++++++++++++++++++++++++++
 include/uapi/linux/kvmi.h          | 12 ++++++
 virt/kvm/kvmi.c                    |  8 +++-
 virt/kvm/kvmi_msg.c                | 38 +++++++++++++++++++
 4 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 82de474d512b..61cf69aa5d07 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -302,3 +302,63 @@ While the command reply is disabled:
 * the reply status is ignored for any unsupported/unknown or disallowed
   commands (and ``struct kvmi_error_code`` will be sent with -KVM_EOPNOTSUPP
   or -KVM_PERM).
+
+3. KVMI_CHECK_COMMAND
+---------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_check_command {
+		__u16 id;
+		__u16 padding1;
+		__u32 padding2;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+
+Checks if the command specified by ``id`` is allowed.
+
+This command is always allowed.
+
+:Errors:
+
+* -KVM_PERM - the command specified by ``id`` is disallowed
+* -KVM_EINVAL - padding is not zero
+
+4. KVMI_CHECK_EVENT
+-------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_check_event {
+		__u16 id;
+		__u16 padding1;
+		__u32 padding2;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+
+Checks if the event specified by ``id`` is allowed.
+
+This command is always allowed.
+
+:Errors:
+
+* -KVM_PERM - the event specified by ``id`` is disallowed
+* -KVM_EINVAL - padding is not zero
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index a1ab39c5b8e0..7390303371c9 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -90,4 +90,16 @@ struct kvmi_control_cmd_response {
 	__u32 padding2;
 };
 
+struct kvmi_check_command {
+	__u16 id;
+	__u16 padding1;
+	__u32 padding2;
+};
+
+struct kvmi_check_event {
+	__u16 id;
+	__u16 padding1;
+	__u32 padding2;
+};
+
 #endif /* _UAPI__LINUX_KVMI_H */
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index d5b6af21564e..dc1bb8326763 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -69,6 +69,8 @@ static bool alloc_kvmi(struct kvm *kvm, const struct kvm_introspection *qemu)
 		return false;
 
 	set_bit(KVMI_GET_VERSION, ikvm->cmd_allow_mask);
+	set_bit(KVMI_CHECK_COMMAND, ikvm->cmd_allow_mask);
+	set_bit(KVMI_CHECK_EVENT, ikvm->cmd_allow_mask);
 
 	memcpy(&ikvm->uuid, &qemu->uuid, sizeof(ikvm->uuid));
 
@@ -295,10 +297,14 @@ int kvmi_ioctl_command(struct kvm *kvm, void __user *argp)
 	if (!allow) {
 		DECLARE_BITMAP(always_allowed, KVMI_NUM_COMMANDS);
 
-		if (id == KVMI_GET_VERSION)
+		if (id == KVMI_GET_VERSION
+				|| id == KVMI_CHECK_COMMAND
+				|| id == KVMI_CHECK_EVENT)
 			return -EPERM;
 
 		set_bit(KVMI_GET_VERSION, always_allowed);
+		set_bit(KVMI_CHECK_COMMAND, always_allowed);
+		set_bit(KVMI_CHECK_EVENT, always_allowed);
 
 		bitmap_andnot(requested, requested, always_allowed,
 			      KVMI_NUM_COMMANDS);
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 2237a6ed25f6..e24996611e3a 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -9,6 +9,8 @@
 #include "kvmi_int.h"
 
 static const char *const msg_IDs[] = {
+	[KVMI_CHECK_COMMAND]         = "KVMI_CHECK_COMMAND",
+	[KVMI_CHECK_EVENT]           = "KVMI_CHECK_EVENT",
 	[KVMI_CONTROL_CMD_RESPONSE]  = "KVMI_CONTROL_CMD_RESPONSE",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
 };
@@ -177,6 +179,40 @@ static bool is_command_allowed(struct kvmi *ikvm, int id)
 	return test_bit(id, ikvm->cmd_allow_mask);
 }
 
+static int handle_check_command(struct kvmi *ikvm,
+				const struct kvmi_msg_hdr *msg,
+				const void *_req)
+{
+	const struct kvmi_check_command *req = _req;
+	int ec = 0;
+
+	if (req->padding1 || req->padding2)
+		ec = -KVM_EINVAL;
+	else if (!is_command_allowed(ikvm, req->id))
+		ec = -KVM_EPERM;
+
+	return kvmi_msg_vm_maybe_reply(ikvm, msg, ec, NULL, 0);
+}
+
+static bool is_event_allowed(struct kvmi *ikvm, int id)
+{
+	return test_bit(id, ikvm->event_allow_mask);
+}
+
+static int handle_check_event(struct kvmi *ikvm,
+			      const struct kvmi_msg_hdr *msg, const void *_req)
+{
+	const struct kvmi_check_event *req = _req;
+	int ec = 0;
+
+	if (req->padding1 || req->padding2)
+		ec = -KVM_EINVAL;
+	else if (!is_event_allowed(ikvm, req->id))
+		ec = -KVM_EPERM;
+
+	return kvmi_msg_vm_maybe_reply(ikvm, msg, ec, NULL, 0);
+}
+
 static int handle_control_cmd_response(struct kvmi *ikvm,
 					const struct kvmi_msg_hdr *msg,
 					const void *_req)
@@ -207,6 +243,8 @@ static int handle_control_cmd_response(struct kvmi *ikvm,
  */
 static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 			    const void *) = {
+	[KVMI_CHECK_COMMAND]         = handle_check_command,
+	[KVMI_CHECK_EVENT]           = handle_check_event,
 	[KVMI_CONTROL_CMD_RESPONSE]  = handle_control_cmd_response,
 	[KVMI_GET_VERSION]           = handle_get_version,
 };

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 09/92] kvm: introspection: add KVMI_GET_GUEST_INFO
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (7 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 08/92] kvm: introspection: add KVMI_CHECK_COMMAND and KVMI_CHECK_EVENT Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 10/92] kvm: introspection: add KVMI_CONTROL_VM_EVENTS Adalbert Lazăr
                   ` (85 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

For now, this command returns only the number of online vCPUs.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 18 ++++++++++++++++++
 include/uapi/linux/kvmi.h          |  5 +++++
 virt/kvm/kvmi_msg.c                | 14 ++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 61cf69aa5d07..2fbe7c28e4f1 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -362,3 +362,21 @@ This command is always allowed.
 
 * -KVM_PERM - the event specified by ``id`` is disallowed
 * -KVM_EINVAL - padding is not zero
+
+5. KVMI_GET_GUEST_INFO
+----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:: none
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_guest_info_reply {
+		__u32 vcpu_count;
+		__u32 padding[3];
+	};
+
+Returns the number of online vCPUs.
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index 7390303371c9..367c8ec28f75 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -102,4 +102,9 @@ struct kvmi_check_event {
 	__u32 padding2;
 };
 
+struct kvmi_get_guest_info_reply {
+	__u32 vcpu_count;
+	__u32 padding[3];
+};
+
 #endif /* _UAPI__LINUX_KVMI_H */
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index e24996611e3a..cf8a120b0eae 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -12,6 +12,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_CHECK_COMMAND]         = "KVMI_CHECK_COMMAND",
 	[KVMI_CHECK_EVENT]           = "KVMI_CHECK_EVENT",
 	[KVMI_CONTROL_CMD_RESPONSE]  = "KVMI_CONTROL_CMD_RESPONSE",
+	[KVMI_GET_GUEST_INFO]        = "KVMI_GET_GUEST_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
 };
 
@@ -213,6 +214,18 @@ static int handle_check_event(struct kvmi *ikvm,
 	return kvmi_msg_vm_maybe_reply(ikvm, msg, ec, NULL, 0);
 }
 
+static int handle_get_guest_info(struct kvmi *ikvm,
+				 const struct kvmi_msg_hdr *msg,
+				 const void *req)
+{
+	struct kvmi_get_guest_info_reply rpl;
+
+	memset(&rpl, 0, sizeof(rpl));
+	rpl.vcpu_count = atomic_read(&ikvm->kvm->online_vcpus);
+
+	return kvmi_msg_vm_maybe_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
+}
+
 static int handle_control_cmd_response(struct kvmi *ikvm,
 					const struct kvmi_msg_hdr *msg,
 					const void *_req)
@@ -246,6 +259,7 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 	[KVMI_CHECK_COMMAND]         = handle_check_command,
 	[KVMI_CHECK_EVENT]           = handle_check_event,
 	[KVMI_CONTROL_CMD_RESPONSE]  = handle_control_cmd_response,
+	[KVMI_GET_GUEST_INFO]        = handle_get_guest_info,
 	[KVMI_GET_VERSION]           = handle_get_version,
 };
 

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 10/92] kvm: introspection: add KVMI_CONTROL_VM_EVENTS
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (8 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 09/92] kvm: introspection: add KVMI_GET_GUEST_INFO Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 11/92] kvm: introspection: add vCPU related data Adalbert Lazăr
                   ` (84 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

No introspection event (neither VM event, nor vCPU event) will be sent
to the introspection tool unless enabled/requested.

This command enables/disables VM events. For now, these events are:

  * KVMI_EVENT_UNHOOK
  * KVMI_EVENT_CREATE_VCPU

The first event is initiated by userspace/QEMU in order to give the
introspection tool a chance to remove its hooks in the event of
pause/suspend/migrate.

The second event is actually a vCPU event, added to cover the case when
the introspection tool has paused all vCPUs and userspace hotplugs (and
starts) another one. The event is controlled by this command because its
status (enabled/disabled) is kept in the VM related structures (as opposed
to vCPU related structures). I didn't had a better idea. Not to mention
that, the vCPU events are controlled with commands like "enable/disable
event X for vCPU Y" and Y is _unknown_ for X=KVMI_EVENT_CREATE_VCPU.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 39 ++++++++++++++++++++++++++++++
 include/uapi/linux/kvmi.h          |  7 ++++++
 virt/kvm/kvmi.c                    | 11 +++++++++
 virt/kvm/kvmi_int.h                |  3 +++
 virt/kvm/kvmi_msg.c                | 23 ++++++++++++++++++
 5 files changed, 83 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 2fbe7c28e4f1..a660def20b23 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -380,3 +380,42 @@ This command is always allowed.
 	};
 
 Returns the number of online vCPUs.
+
+6. KVMI_CONTROL_VM_EVENTS
+-------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_control_vm_events {
+		__u16 event_id;
+		__u8 enable;
+		__u8 padding1;
+		__u32 padding2;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Enables/disables VM introspection events. This command can be used with
+the following events::
+
+	KVMI_EVENT_CREATE_VCPU
+	KVMI_EVENT_UNHOOK
+
+When an event is enabled, the introspection tool is notified and,
+in almost all cases, it must reply with: continue, retry, crash, etc.
+(see **Events** below).
+
+:Errors:
+
+* -KVM_EINVAL - the event ID is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EPERM - the access is restricted by the host
+
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index 367c8ec28f75..ff35faabb7ed 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -107,4 +107,11 @@ struct kvmi_get_guest_info_reply {
 	__u32 padding[3];
 };
 
+struct kvmi_control_vm_events {
+	__u16 event_id;
+	__u8 enable;
+	__u8 padding1;
+	__u32 padding2;
+};
+
 #endif /* _UAPI__LINUX_KVMI_H */
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index dc1bb8326763..961e6cc13fb6 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -338,6 +338,17 @@ void kvmi_destroy_vm(struct kvm *kvm)
 	wait_for_completion_killable(&kvm->kvmi_completed);
 }
 
+int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
+			       bool enable)
+{
+	if (enable)
+		set_bit(event_id, ikvm->vm_ev_mask);
+	else
+		clear_bit(event_id, ikvm->vm_ev_mask);
+
+	return 0;
+}
+
 int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset)
 {
 	struct kvmi *ikvm;
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 157f765fb34d..84ba43bd9a9d 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -85,6 +85,7 @@ struct kvmi {
 
 	DECLARE_BITMAP(cmd_allow_mask, KVMI_NUM_COMMANDS);
 	DECLARE_BITMAP(event_allow_mask, KVMI_NUM_EVENTS);
+	DECLARE_BITMAP(vm_ev_mask, KVMI_NUM_EVENTS);
 
 	bool cmd_reply_disabled;
 };
@@ -99,5 +100,7 @@ bool kvmi_msg_process(struct kvmi *ikvm);
 void *kvmi_msg_alloc(void);
 void *kvmi_msg_alloc_check(size_t size);
 void kvmi_msg_free(void *addr);
+int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
+			       bool enable);
 
 #endif
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index cf8a120b0eae..a55c9e35be36 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -12,6 +12,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_CHECK_COMMAND]         = "KVMI_CHECK_COMMAND",
 	[KVMI_CHECK_EVENT]           = "KVMI_CHECK_EVENT",
 	[KVMI_CONTROL_CMD_RESPONSE]  = "KVMI_CONTROL_CMD_RESPONSE",
+	[KVMI_CONTROL_VM_EVENTS]     = "KVMI_CONTROL_VM_EVENTS",
 	[KVMI_GET_GUEST_INFO]        = "KVMI_GET_GUEST_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
 };
@@ -226,6 +227,27 @@ static int handle_get_guest_info(struct kvmi *ikvm,
 	return kvmi_msg_vm_maybe_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
 }
 
+static int handle_control_vm_events(struct kvmi *ikvm,
+				    const struct kvmi_msg_hdr *msg,
+				    const void *_req)
+{
+	const unsigned long known_events = KVMI_KNOWN_VM_EVENTS;
+	const struct kvmi_control_vm_events *req = _req;
+	int ec;
+
+	if (req->padding1 || req->padding2)
+		ec = -KVM_EINVAL;
+	else if (!test_bit(req->event_id, &known_events))
+		ec = -KVM_EINVAL;
+	else if (!is_event_allowed(ikvm, req->event_id))
+		ec = -KVM_EPERM;
+	else
+		ec = kvmi_cmd_control_vm_events(ikvm, req->event_id,
+						req->enable);
+
+	return kvmi_msg_vm_maybe_reply(ikvm, msg, ec, NULL, 0);
+}
+
 static int handle_control_cmd_response(struct kvmi *ikvm,
 					const struct kvmi_msg_hdr *msg,
 					const void *_req)
@@ -259,6 +281,7 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 	[KVMI_CHECK_COMMAND]         = handle_check_command,
 	[KVMI_CHECK_EVENT]           = handle_check_event,
 	[KVMI_CONTROL_CMD_RESPONSE]  = handle_control_cmd_response,
+	[KVMI_CONTROL_VM_EVENTS]     = handle_control_vm_events,
 	[KVMI_GET_GUEST_INFO]        = handle_get_guest_info,
 	[KVMI_GET_VERSION]           = handle_get_version,
 };

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 11/92] kvm: introspection: add vCPU related data
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (9 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 10/92] kvm: introspection: add KVMI_CONTROL_VM_EVENTS Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 12/92] kvm: introspection: add a jobs list to every introspected vCPU Adalbert Lazăr
                   ` (83 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Mircea Cîrjaliu

From: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>

An opaque pointer is added to struct kvm_vcpu, pointing to its
coresponding introspection structure, allocated (a) when the introspection
socket is connected or (b) when the vCPU is hotpluged and deallocated
when the introspection socket is disconnected.

Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/kvm_host.h |  1 +
 include/linux/kvmi.h     |  4 +++
 virt/kvm/kvm_main.c      |  8 +++++
 virt/kvm/kvmi.c          | 73 +++++++++++++++++++++++++++++++++++++++-
 virt/kvm/kvmi_int.h      |  5 +++
 5 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 582b0187f5a4..1ec04384fad3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -275,6 +275,7 @@ struct kvm_vcpu {
 	bool preempted;
 	struct kvm_vcpu_arch arch;
 	struct dentry *debugfs_dentry;
+	void *kvmi;
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
index 4ca9280e4419..e8d25d7da751 100644
--- a/include/linux/kvmi.h
+++ b/include/linux/kvmi.h
@@ -14,6 +14,8 @@ int kvmi_ioctl_hook(struct kvm *kvm, void __user *argp);
 int kvmi_ioctl_command(struct kvm *kvm, void __user *argp);
 int kvmi_ioctl_event(struct kvm *kvm, void __user *argp);
 int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset);
+int kvmi_vcpu_init(struct kvm_vcpu *vcpu);
+void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu);
 
 #else
 
@@ -21,6 +23,8 @@ static inline int kvmi_init(void) { return 0; }
 static inline void kvmi_uninit(void) { }
 static inline void kvmi_create_vm(struct kvm *kvm) { }
 static inline void kvmi_destroy_vm(struct kvm *kvm) { }
+static inline int kvmi_vcpu_init(struct kvm_vcpu *vcpu) { return 0; }
+static inline void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu) { }
 
 #endif /* CONFIG_KVM_INTROSPECTION */
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8399b826f2d2..94f15f393e37 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -316,6 +316,13 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	r = kvm_arch_vcpu_init(vcpu);
 	if (r < 0)
 		goto fail_free_run;
+
+	r = kvmi_vcpu_init(vcpu);
+	if (r < 0) {
+		kvm_arch_vcpu_uninit(vcpu);
+		goto fail_free_run;
+	}
+
 	return 0;
 
 fail_free_run:
@@ -333,6 +340,7 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
 	 * descriptors are already gone.
 	 */
 	put_pid(rcu_dereference_protected(vcpu->pid, 1));
+	kvmi_vcpu_uninit(vcpu);
 	kvm_arch_vcpu_uninit(vcpu);
 	free_page((unsigned long)vcpu->run);
 }
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 961e6cc13fb6..860574039221 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -80,6 +80,19 @@ static bool alloc_kvmi(struct kvm *kvm, const struct kvm_introspection *qemu)
 	return true;
 }
 
+static bool alloc_ivcpu(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu;
+
+	ivcpu = kzalloc(sizeof(*ivcpu), GFP_KERNEL);
+	if (!ivcpu)
+		return false;
+
+	vcpu->kvmi = ivcpu;
+
+	return true;
+}
+
 struct kvmi * __must_check kvmi_get(struct kvm *kvm)
 {
 	if (refcount_inc_not_zero(&kvm->kvmi_ref))
@@ -90,8 +103,16 @@ struct kvmi * __must_check kvmi_get(struct kvm *kvm)
 
 static void kvmi_destroy(struct kvm *kvm)
 {
+	struct kvm_vcpu *vcpu;
+	int i;
+
 	kfree(kvm->kvmi);
 	kvm->kvmi = NULL;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		kfree(vcpu->kvmi);
+		vcpu->kvmi = NULL;
+	}
 }
 
 static void kvmi_release(struct kvm *kvm)
@@ -109,6 +130,48 @@ void kvmi_put(struct kvm *kvm)
 		kvmi_release(kvm);
 }
 
+/*
+ * VCPU hotplug - this function will likely be called before VCPU will start
+ * executing code
+ */
+int kvmi_vcpu_init(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm;
+	int ret = 0;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return 0;
+
+	if (!alloc_ivcpu(vcpu)) {
+		kvmi_err(ikvm, "Unable to alloc ivcpu for vcpu_id %u\n",
+			 vcpu->vcpu_id);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+out:
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
+/*
+ * VCPU hotplug - this function will likely be called after VCPU will stop
+ * executing code
+ */
+void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Under certain circumstances (errors in creating the VCPU, hotplug?)
+	 * this function may be reached with the KVMI member still allocated.
+	 * This VCPU won't be reachable by the introspection engine, so no
+	 * protection is necessary when de-allocating.
+	 */
+	kfree(vcpu->kvmi);
+	vcpu->kvmi = NULL;
+}
+
 static void kvmi_end_introspection(struct kvmi *ikvm)
 {
 	struct kvm *kvm = ikvm->kvm;
@@ -142,8 +205,9 @@ static int kvmi_recv(void *arg)
 
 int kvmi_hook(struct kvm *kvm, const struct kvm_introspection *qemu)
 {
+	struct kvm_vcpu *vcpu;
 	struct kvmi *ikvm;
-	int err = 0;
+	int i, err = 0;
 
 	/* wait for the previous introspection to finish */
 	err = wait_for_completion_killable(&kvm->kvmi_completed);
@@ -159,6 +223,13 @@ int kvmi_hook(struct kvm *kvm, const struct kvm_introspection *qemu)
 	}
 	ikvm = IKVM(kvm);
 
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (!alloc_ivcpu(vcpu)) {
+			err = -ENOMEM;
+			goto err_alloc;
+		}
+	}
+
 	/* interact with other kernel components after structure allocation */
 	if (!kvmi_sock_get(ikvm, qemu->fd)) {
 		err = -EINVAL;
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 84ba43bd9a9d..8739a3435893 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -23,6 +23,8 @@
 #define kvmi_err(ikvm, fmt, ...) \
 	kvm_info("%pU ERROR: " fmt, &ikvm->uuid, ## __VA_ARGS__)
 
+#define IVCPU(vcpu) ((struct kvmi_vcpu *)((vcpu)->kvmi))
+
 #define KVMI_MSG_SIZE_ALLOC (sizeof(struct kvmi_msg_hdr) + KVMI_MSG_SIZE)
 
 #define KVMI_KNOWN_VCPU_EVENTS ( \
@@ -73,6 +75,9 @@
 
 #define KVMI_NUM_COMMANDS KVMI_NEXT_AVAILABLE_COMMAND
 
+struct kvmi_vcpu {
+};
+
 #define IKVM(kvm) ((struct kvmi *)((kvm)->kvmi))
 
 struct kvmi {

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 12/92] kvm: introspection: add a jobs list to every introspected vCPU
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (10 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 11/92] kvm: introspection: add vCPU related data Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 13/92] kvm: introspection: make the vCPU wait even when its jobs list is empty Adalbert Lazăr
                   ` (82 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu

Every vCPU has a lock-protected list in which (mostly) the receiving
worker places the jobs to be done by the vCPU once it is kicked
(KVM_REQ_INTROSPECTION) out of guest.

A job is defined by a "do" function, a pointer (context) and a "free"
function.

Co-developed-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |   1 +
 virt/kvm/kvmi.c                 | 102 +++++++++++++++++++++++++++++++-
 virt/kvm/kvmi_int.h             |   9 +++
 3 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 180373360e34..67ed934ca124 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -78,6 +78,7 @@
 #define KVM_REQ_HV_STIMER		KVM_ARCH_REQ(22)
 #define KVM_REQ_LOAD_EOI_EXITMAP	KVM_ARCH_REQ(23)
 #define KVM_REQ_GET_VMCS12_PAGES	KVM_ARCH_REQ(24)
+#define KVM_REQ_INTROSPECTION		KVM_ARCH_REQ(25)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 860574039221..07ebd1c629b0 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -11,6 +11,9 @@
 #include <linux/bitmap.h>
 
 static struct kmem_cache *msg_cache;
+static struct kmem_cache *job_cache;
+
+static void kvmi_abort_events(struct kvm *kvm);
 
 void *kvmi_msg_alloc(void)
 {
@@ -34,14 +37,19 @@ static void kvmi_cache_destroy(void)
 {
 	kmem_cache_destroy(msg_cache);
 	msg_cache = NULL;
+	kmem_cache_destroy(job_cache);
+	job_cache = NULL;
 }
 
 static int kvmi_cache_create(void)
 {
+	job_cache = kmem_cache_create("kvmi_job",
+				      sizeof(struct kvmi_job),
+				      0, SLAB_ACCOUNT, NULL);
 	msg_cache = kmem_cache_create("kvmi_msg", KVMI_MSG_SIZE_ALLOC,
 				      4096, SLAB_ACCOUNT, NULL);
 
-	if (!msg_cache) {
+	if (!msg_cache || !job_cache) {
 		kvmi_cache_destroy();
 
 		return -1;
@@ -80,6 +88,53 @@ static bool alloc_kvmi(struct kvm *kvm, const struct kvm_introspection *qemu)
 	return true;
 }
 
+static int __kvmi_add_job(struct kvm_vcpu *vcpu,
+			  void (*fct)(struct kvm_vcpu *vcpu, void *ctx),
+			  void *ctx, void (*free_fct)(void *ctx))
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvmi_job *job;
+
+	job = kmem_cache_zalloc(job_cache, GFP_KERNEL);
+	if (unlikely(!job))
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&job->link);
+	job->fct = fct;
+	job->ctx = ctx;
+	job->free_fct = free_fct;
+
+	spin_lock(&ivcpu->job_lock);
+	list_add_tail(&job->link, &ivcpu->job_list);
+	spin_unlock(&ivcpu->job_lock);
+
+	return 0;
+}
+
+int kvmi_add_job(struct kvm_vcpu *vcpu,
+		 void (*fct)(struct kvm_vcpu *vcpu, void *ctx),
+		 void *ctx, void (*free_fct)(void *ctx))
+{
+	int err;
+
+	err = __kvmi_add_job(vcpu, fct, ctx, free_fct);
+
+	if (!err) {
+		kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
+		kvm_vcpu_kick(vcpu);
+	}
+
+	return err;
+}
+
+static void kvmi_free_job(struct kvmi_job *job)
+{
+	if (job->free_fct)
+		job->free_fct(job->ctx);
+
+	kmem_cache_free(job_cache, job);
+}
+
 static bool alloc_ivcpu(struct kvm_vcpu *vcpu)
 {
 	struct kvmi_vcpu *ivcpu;
@@ -88,6 +143,9 @@ static bool alloc_ivcpu(struct kvm_vcpu *vcpu)
 	if (!ivcpu)
 		return false;
 
+	INIT_LIST_HEAD(&ivcpu->job_list);
+	spin_lock_init(&ivcpu->job_lock);
+
 	vcpu->kvmi = ivcpu;
 
 	return true;
@@ -101,6 +159,27 @@ struct kvmi * __must_check kvmi_get(struct kvm *kvm)
 	return NULL;
 }
 
+static void kvmi_clear_vcpu_jobs(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	struct kvmi_job *cur, *next;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+		if (!ivcpu)
+			continue;
+
+		spin_lock(&ivcpu->job_lock);
+		list_for_each_entry_safe(cur, next, &ivcpu->job_list, link) {
+			list_del(&cur->link);
+			kvmi_free_job(cur);
+		}
+		spin_unlock(&ivcpu->job_lock);
+	}
+}
+
 static void kvmi_destroy(struct kvm *kvm)
 {
 	struct kvm_vcpu *vcpu;
@@ -118,6 +197,7 @@ static void kvmi_destroy(struct kvm *kvm)
 static void kvmi_release(struct kvm *kvm)
 {
 	kvmi_sock_put(IKVM(kvm));
+	kvmi_clear_vcpu_jobs(kvm);
 	kvmi_destroy(kvm);
 
 	complete(&kvm->kvmi_completed);
@@ -179,6 +259,13 @@ static void kvmi_end_introspection(struct kvmi *ikvm)
 	/* Signal QEMU which is waiting for POLLHUP. */
 	kvmi_sock_shutdown(ikvm);
 
+	/*
+	 * Trigger all the VCPUs out of waiting for replies. Although the
+	 * introspection is still enabled, sending additional events will
+	 * fail because the socket is shut down. Waiting will not be possible.
+	 */
+	kvmi_abort_events(kvm);
+
 	/*
 	 * At this moment the socket is shut down, no more commands will come
 	 * from the introspector, and the only way into the introspection is
@@ -420,6 +507,19 @@ int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
 	return 0;
 }
 
+static void kvmi_job_abort(struct kvm_vcpu *vcpu, void *ctx)
+{
+}
+
+static void kvmi_abort_events(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvmi_add_job(vcpu, kvmi_job_abort, NULL, NULL);
+}
+
 int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset)
 {
 	struct kvmi *ikvm;
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 8739a3435893..97f91a568096 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -75,7 +75,16 @@
 
 #define KVMI_NUM_COMMANDS KVMI_NEXT_AVAILABLE_COMMAND
 
+struct kvmi_job {
+	struct list_head link;
+	void *ctx;
+	void (*fct)(struct kvm_vcpu *vcpu, void *ctx);
+	void (*free_fct)(void *ctx);
+};
+
 struct kvmi_vcpu {
+	struct list_head job_list;
+	spinlock_t job_lock;
 };
 
 #define IKVM(kvm) ((struct kvmi *)((kvm)->kvmi))

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 13/92] kvm: introspection: make the vCPU wait even when its jobs list is empty
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (11 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 12/92] kvm: introspection: add a jobs list to every introspected vCPU Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-13  8:43   ` Paolo Bonzini
  2019-08-09 15:59 ` [RFC PATCH v6 14/92] kvm: introspection: handle introspection commands before returning to guest Adalbert Lazăr
                   ` (81 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

Usually, the vCPU thread will run the functions from its jobs list
(unless the thread is SIGKILL-ed) and continue to guest when the
list is empty. But, there are cases when it has to wait for something
(e.g. another vCPU runs in single-step mode, or the current vCPU waits
for an event reply from the introspection tool).

In these cases, it will append a "wait job" into its own list, which
will do (a) nothing if the list is not empty or it doesn't have to wait
any longer or (b) wait (in the same wake-queue used by KVM) until it
is kicked. It should be OK if the receiving worker appends a new job in
the same time.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/swait.h | 11 ++++++
 virt/kvm/kvmi.c       | 80 +++++++++++++++++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h   |  2 ++
 3 files changed, 93 insertions(+)

diff --git a/include/linux/swait.h b/include/linux/swait.h
index 73e06e9986d4..2486625e7fb4 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -297,4 +297,15 @@ do {									\
 	__ret;								\
 })
 
+#define __swait_event_killable(wq, condition)				\
+	___swait_event(wq, condition, TASK_KILLABLE, 0,	schedule())	\
+
+#define swait_event_killable(wq, condition)				\
+({									\
+	int __ret = 0;							\
+	if (!(condition))						\
+		__ret = __swait_event_killable(wq, condition);		\
+	__ret;								\
+})
+
 #endif /* _LINUX_SWAIT_H */
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 07ebd1c629b0..3c884dc0e38c 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -135,6 +135,19 @@ static void kvmi_free_job(struct kvmi_job *job)
 	kmem_cache_free(job_cache, job);
 }
 
+static struct kvmi_job *kvmi_pull_job(struct kvmi_vcpu *ivcpu)
+{
+	struct kvmi_job *job = NULL;
+
+	spin_lock(&ivcpu->job_lock);
+	job = list_first_entry_or_null(&ivcpu->job_list, typeof(*job), link);
+	if (job)
+		list_del(&job->link);
+	spin_unlock(&ivcpu->job_lock);
+
+	return job;
+}
+
 static bool alloc_ivcpu(struct kvm_vcpu *vcpu)
 {
 	struct kvmi_vcpu *ivcpu;
@@ -496,6 +509,73 @@ void kvmi_destroy_vm(struct kvm *kvm)
 	wait_for_completion_killable(&kvm->kvmi_completed);
 }
 
+void kvmi_run_jobs(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvmi_job *job;
+
+	while ((job = kvmi_pull_job(ivcpu))) {
+		job->fct(vcpu, job->ctx);
+		kvmi_free_job(job);
+	}
+}
+
+static bool done_waiting(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	return !list_empty(&ivcpu->job_list);
+}
+
+static void kvmi_job_wait(struct kvm_vcpu *vcpu, void *ctx)
+{
+	struct swait_queue_head *wq = kvm_arch_vcpu_wq(vcpu);
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	int err;
+
+	err = swait_event_killable(*wq, done_waiting(vcpu));
+
+	if (err)
+		ivcpu->killed = true;
+}
+
+int kvmi_run_jobs_and_wait(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	int err = 0;
+
+	for (;;) {
+		kvmi_run_jobs(vcpu);
+
+		if (ivcpu->killed) {
+			err = -1;
+			break;
+		}
+
+		kvmi_add_job(vcpu, kvmi_job_wait, NULL, NULL);
+	}
+
+	return err;
+}
+
+void kvmi_handle_requests(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return;
+
+	for (;;) {
+		int err = kvmi_run_jobs_and_wait(vcpu);
+
+		if (err)
+			break;
+	}
+
+	kvmi_put(vcpu->kvm);
+}
+
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
 			       bool enable)
 {
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 97f91a568096..47418e9a86f6 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -85,6 +85,8 @@ struct kvmi_job {
 struct kvmi_vcpu {
 	struct list_head job_list;
 	spinlock_t job_lock;
+
+	bool killed;
 };
 
 #define IKVM(kvm) ((struct kvmi *)((kvm)->kvmi))

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 14/92] kvm: introspection: handle introspection commands before returning to guest
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (12 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 13/92] kvm: introspection: make the vCPU wait even when its jobs list is empty Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-13  8:26   ` Paolo Bonzini
  2019-08-09 15:59 ` [RFC PATCH v6 15/92] kvm: introspection: handle vCPU related introspection commands Adalbert Lazăr
                   ` (80 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Mircea Cîrjaliu

From: Mihai Donțu <mdontu@bitdefender.com>

The introspection requests (KVM_REQ_INTROSPECTION) are checked by any
introspected vCPU in two places:

  * on its way to guest - vcpu_enter_guest()
  * when halted - kvm_vcpu_block()

In kvm_vcpu_block(), we check to see if there are any introspection
requests during the swait loop, handle them outside of swait loop and
start swait again.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/x86.c   |  3 +++
 include/linux/kvmi.h |  2 ++
 virt/kvm/kvm_main.c  | 28 ++++++++++++++++++++++------
 3 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0163e1ad1aaa..adbdb1ceb618 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7742,6 +7742,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		 */
 		if (kvm_check_request(KVM_REQ_HV_STIMER, vcpu))
 			kvm_hv_process_stimers(vcpu);
+
+		if (kvm_check_request(KVM_REQ_INTROSPECTION, vcpu))
+			kvmi_handle_requests(vcpu);
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
index e8d25d7da751..ae5de1905b55 100644
--- a/include/linux/kvmi.h
+++ b/include/linux/kvmi.h
@@ -16,6 +16,7 @@ int kvmi_ioctl_event(struct kvm *kvm, void __user *argp);
 int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset);
 int kvmi_vcpu_init(struct kvm_vcpu *vcpu);
 void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu);
+void kvmi_handle_requests(struct kvm_vcpu *vcpu);
 
 #else
 
@@ -25,6 +26,7 @@ static inline void kvmi_create_vm(struct kvm *kvm) { }
 static inline void kvmi_destroy_vm(struct kvm *kvm) { }
 static inline int kvmi_vcpu_init(struct kvm_vcpu *vcpu) { return 0; }
 static inline void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu) { }
+static inline void kvmi_handle_requests(struct kvm_vcpu *vcpu) { }
 
 #endif /* CONFIG_KVM_INTROSPECTION */
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 94f15f393e37..2e11069b9565 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2282,16 +2282,32 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 	kvm_arch_vcpu_blocking(vcpu);
 
 	for (;;) {
-		prepare_to_swait_exclusive(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
+		bool do_kvmi_work = false;
 
-		if (kvm_vcpu_check_block(vcpu) < 0)
-			break;
+		for (;;) {
+			prepare_to_swait_exclusive(&vcpu->wq, &wait,
+						   TASK_INTERRUPTIBLE);
+
+			if (kvm_vcpu_check_block(vcpu) < 0)
+				break;
+
+			waited = true;
+			schedule();
+
+			if (kvm_check_request(KVM_REQ_INTROSPECTION, vcpu)) {
+				do_kvmi_work = true;
+				break;
+			}
+		}
 
-		waited = true;
-		schedule();
+		finish_swait(&vcpu->wq, &wait);
+
+		if (do_kvmi_work)
+			kvmi_handle_requests(vcpu);
+		else
+			break;
 	}
 
-	finish_swait(&vcpu->wq, &wait);
 	cur = ktime_get();
 
 	kvm_arch_vcpu_unblocking(vcpu);

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 15/92] kvm: introspection: handle vCPU related introspection commands
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (13 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 14/92] kvm: introspection: handle introspection commands before returning to guest Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 16/92] kvm: introspection: handle events and event replies Adalbert Lazăr
                   ` (79 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu

From: Mihai Donțu <mdontu@bitdefender.com>

Following the common structure used for all messages (kvmi_msg_hdr), all
vCPU related commands have another common structure (kvmi_vcpu_hdr). This
allows the receiving worker to validate and dispatch the message to the
proper vCPU (adding the handling function to its jobs list).

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst |   8 ++
 include/uapi/linux/kvm_para.h      |   4 +-
 include/uapi/linux/kvmi.h          |   6 ++
 virt/kvm/kvmi_int.h                |   3 +
 virt/kvm/kvmi_msg.c                | 159 ++++++++++++++++++++++++++++-
 5 files changed, 177 insertions(+), 3 deletions(-)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index a660def20b23..7f3c4f8fce63 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -232,6 +232,14 @@ The following C structures are meant to be used directly when communicating
 over the wire. The peer that detects any size mismatch should simply close
 the connection and report the error.
 
+The commands related to vCPU-s start with::
+
+	struct kvmi_vcpu_hdr {
+		__u16 vcpu;
+		__u16 padding1;
+		__u32 padding2;
+	}
+
 1. KVMI_GET_VERSION
 -------------------
 
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 6c0ce49931e5..54c0e20f5b64 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -10,13 +10,15 @@
  * - kvm_para_available
  */
 
-/* Return values for hypercalls */
+/* Return values for hypercalls and VM introspection */
 #define KVM_ENOSYS		1000
 #define KVM_EFAULT		EFAULT
 #define KVM_EINVAL		EINVAL
 #define KVM_E2BIG		E2BIG
 #define KVM_EPERM		EPERM
 #define KVM_EOPNOTSUPP		95
+#define KVM_EAGAIN		11
+#define KVM_ENOMEM		ENOMEM
 
 #define KVM_HC_VAPIC_POLL_IRQ		1
 #define KVM_HC_MMU_OP			2
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index ff35faabb7ed..29452da818e3 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -114,4 +114,10 @@ struct kvmi_control_vm_events {
 	__u32 padding2;
 };
 
+struct kvmi_vcpu_hdr {
+	__u16 vcpu;
+	__u16 padding1;
+	__u32 padding2;
+};
+
 #endif /* _UAPI__LINUX_KVMI_H */
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 47418e9a86f6..33ea05cb99af 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -118,5 +118,8 @@ void *kvmi_msg_alloc_check(size_t size);
 void kvmi_msg_free(void *addr);
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
 			       bool enable);
+int kvmi_add_job(struct kvm_vcpu *vcpu,
+		 void (*fct)(struct kvm_vcpu *vcpu, void *ctx),
+		 void *ctx, void (*free_fct)(void *ctx));
 
 #endif
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index a55c9e35be36..2728e6870d47 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -8,6 +8,18 @@
 #include <linux/net.h>
 #include "kvmi_int.h"
 
+typedef int (*vcpu_reply_fct)(struct kvm_vcpu *vcpu,
+			      const struct kvmi_msg_hdr *msg, int err,
+			      const void *rpl, size_t rpl_size);
+
+struct kvmi_vcpu_cmd {
+	vcpu_reply_fct reply_cb;
+	struct {
+		struct kvmi_msg_hdr hdr;
+		struct kvmi_vcpu_hdr cmd;
+	} *msg;
+};
+
 static const char *const msg_IDs[] = {
 	[KVMI_CHECK_COMMAND]         = "KVMI_CHECK_COMMAND",
 	[KVMI_CHECK_EVENT]           = "KVMI_CHECK_EVENT",
@@ -165,6 +177,23 @@ static int kvmi_msg_vm_maybe_reply(struct kvmi *ikvm,
 	return kvmi_msg_vm_reply(ikvm, msg, err, rpl, rpl_size);
 }
 
+int kvmi_msg_vcpu_reply(struct kvm_vcpu *vcpu,
+			const struct kvmi_msg_hdr *msg, int err,
+			const void *rpl, size_t rpl_size)
+{
+	return kvmi_msg_reply(IKVM(vcpu->kvm), msg, err, rpl, rpl_size);
+}
+
+int kvmi_msg_vcpu_drop_reply(struct kvm_vcpu *vcpu,
+			      const struct kvmi_msg_hdr *msg, int err,
+			      const void *rpl, size_t rpl_size)
+{
+	if (!kvmi_validate_no_reply(IKVM(vcpu->kvm), msg, rpl_size, err))
+		return -KVM_EINVAL;
+
+	return 0;
+}
+
 static int handle_get_version(struct kvmi *ikvm,
 			      const struct kvmi_msg_hdr *msg, const void *req)
 {
@@ -248,6 +277,23 @@ static int handle_control_vm_events(struct kvmi *ikvm,
 	return kvmi_msg_vm_maybe_reply(ikvm, msg, ec, NULL, 0);
 }
 
+static int kvmi_get_vcpu(struct kvmi *ikvm, unsigned int vcpu_idx,
+			 struct kvm_vcpu **dest)
+{
+	struct kvm *kvm = ikvm->kvm;
+	struct kvm_vcpu *vcpu;
+
+	if (vcpu_idx >= atomic_read(&kvm->online_vcpus))
+		return -KVM_EINVAL;
+
+	vcpu = kvm_get_vcpu(kvm, vcpu_idx);
+	if (!vcpu)
+		return -KVM_EINVAL;
+
+	*dest = vcpu;
+	return 0;
+}
+
 static int handle_control_cmd_response(struct kvmi *ikvm,
 					const struct kvmi_msg_hdr *msg,
 					const void *_req)
@@ -273,6 +319,11 @@ static int handle_control_cmd_response(struct kvmi *ikvm,
 	return err;
 }
 
+static bool invalid_vcpu_hdr(const struct kvmi_vcpu_hdr *hdr)
+{
+	return hdr->padding1 || hdr->padding2;
+}
+
 /*
  * These commands are executed on the receiving thread/worker.
  */
@@ -286,16 +337,66 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 	[KVMI_GET_VERSION]           = handle_get_version,
 };
 
+/*
+ * These commands are executed on the vCPU thread. The receiving thread
+ * passes the messages using a newly allocated 'struct kvmi_vcpu_cmd'
+ * and signals the vCPU to handle the command (which includes
+ * sending back the reply).
+ */
+static int(*const msg_vcpu[])(struct kvm_vcpu *,
+			      const struct kvmi_msg_hdr *, const void *,
+			      vcpu_reply_fct) = {
+};
+
+static void kvmi_job_vcpu_cmd(struct kvm_vcpu *vcpu, void *_ctx)
+{
+	const struct kvmi_vcpu_cmd *ctx = _ctx;
+	size_t id = ctx->msg->hdr.id;
+	int err;
+
+	err = msg_vcpu[id](vcpu, &ctx->msg->hdr, ctx->msg + 1, ctx->reply_cb);
+
+	if (err) {
+		struct kvmi *ikvm = IKVM(vcpu->kvm);
+
+		kvmi_err(ikvm,
+			 "%s: cmd id: %zu (%s), err: %d\n", __func__,
+			 id, id2str(id), err);
+		kvmi_sock_shutdown(ikvm);
+	}
+}
+
+static void kvmi_free_ctx(void *_ctx)
+{
+	const struct kvmi_vcpu_cmd *ctx = _ctx;
+
+	kvmi_msg_free(ctx->msg);
+	kfree(ctx);
+}
+
+static int kvmi_msg_queue_to_vcpu(struct kvm_vcpu *vcpu,
+				  const struct kvmi_vcpu_cmd *cmd)
+{
+	return kvmi_add_job(vcpu, kvmi_job_vcpu_cmd, (void *)cmd,
+			    kvmi_free_ctx);
+}
+
 static bool is_vm_message(u16 id)
 {
 	return id < ARRAY_SIZE(msg_vm) && !!msg_vm[id];
 }
 
+static bool is_vcpu_message(u16 id)
+{
+	return id < ARRAY_SIZE(msg_vcpu) && !!msg_vcpu[id];
+}
+
 static bool is_unsupported_message(u16 id)
 {
 	bool supported;
 
-	supported = is_known_message(id) && is_vm_message(id);
+	supported = is_known_message(id) &&
+			(is_vm_message(id) || is_vcpu_message(id));
 
 	return !supported;
 }
@@ -364,12 +465,66 @@ static int kvmi_msg_dispatch_vm_cmd(struct kvmi *ikvm,
 	return msg_vm[msg->id](ikvm, msg, msg + 1);
 }
 
+static int kvmi_msg_dispatch_vcpu_job(struct kvmi *ikvm,
+				      struct kvmi_vcpu_cmd *job,
+				      bool *queued)
+{
+	struct kvmi_msg_hdr *hdr = &job->msg->hdr;
+	struct kvmi_vcpu_hdr *cmd = &job->msg->cmd;
+	struct kvm_vcpu *vcpu = NULL;
+	int err;
+
+	if (invalid_vcpu_hdr(cmd))
+		return -KVM_EINVAL;
+
+	err = kvmi_get_vcpu(ikvm, cmd->vcpu, &vcpu);
+
+	if (!err && vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)
+		err = -KVM_EAGAIN;
+
+	if (err)
+		return kvmi_msg_vm_maybe_reply(ikvm, hdr, err, NULL, 0);
+
+	err = kvmi_msg_queue_to_vcpu(vcpu, job);
+	if (!err)
+		*queued = true;
+
+	return err;
+}
+
+static int kvmi_msg_dispatch_vcpu_msg(struct kvmi *ikvm,
+				      struct kvmi_msg_hdr *msg,
+				      bool *queued)
+{
+	struct kvmi_vcpu_cmd *job_msg;
+	int err;
+
+	job_msg = kzalloc(sizeof(*job_msg), GFP_KERNEL);
+	if (!job_msg)
+		return -KVM_ENOMEM;
+
+	job_msg->reply_cb = ikvm->cmd_reply_disabled
+				? kvmi_msg_vcpu_drop_reply
+				: kvmi_msg_vcpu_reply;
+	job_msg->msg = (void *)msg;
+
+	err = kvmi_msg_dispatch_vcpu_job(ikvm, job_msg, queued);
+
+	if (!*queued)
+		kfree(job_msg);
+
+	return err;
+}
+
 static int kvmi_msg_dispatch(struct kvmi *ikvm,
 			     struct kvmi_msg_hdr *msg, bool *queued)
 {
 	int err;
 
-	err = kvmi_msg_dispatch_vm_cmd(ikvm, msg);
+	if (is_vcpu_message(msg->id))
+		err = kvmi_msg_dispatch_vcpu_msg(ikvm, msg, queued);
+	else
+		err = kvmi_msg_dispatch_vm_cmd(ikvm, msg);
 
 	if (err)
 		kvmi_err(ikvm, "%s: msg id: %u (%s), err: %d\n", __func__,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 16/92] kvm: introspection: handle events and event replies
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (14 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 15/92] kvm: introspection: handle vCPU related introspection commands Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-13  8:55   ` Paolo Bonzini
  2019-08-09 15:59 ` [RFC PATCH v6 17/92] kvm: introspection: introduce event actions Adalbert Lazăr
                   ` (78 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

All events are sent by the vCPU thread, which will handle any
introspection command while waiting for the reply.

The event reply messages contain a common strucure (kvmi_vcpu_hdr), as
any vCPU related command, which allows the receiving worker to dispatch
the reply as it does with any other introspection command sent for a
specific vCPU.

The kernel side will gracefully handle commands coming from an
introspection tool compiled with older or newer versions of KVMI API.
However, it will only accept smaller replies (coming from older versions),
but not the bigger/newer ones (this should make the kernel code simpler).

TODO: Not quite true. An event reply has a common part (kvmi_event_reply)
and an event specific part (eg. the new value for MSR x). If the common
part is smaller, the event will be rejected.

The code from handle_event_reply():

	common = sizeof(struct kvmi_vcpu_hdr) + sizeof(*reply);
	if (unlikely(msg->size < common))
		goto out;

should be changed to

	min_common = sizeof(struct kvmi_vcpu_hdr) + offsetof(reply...)
	if (unlikely(msg->size < min_common))
		goto out;

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst |  56 +++++++++++++
 arch/x86/include/uapi/asm/kvmi.h   |  29 +++++++
 arch/x86/kvm/Makefile              |   2 +-
 arch/x86/kvm/kvmi.c                |  92 ++++++++++++++++++++
 arch/x86/kvm/x86.c                 |  10 +++
 include/linux/kvm_host.h           |   3 +
 include/uapi/linux/kvmi.h          |  16 ++++
 virt/kvm/kvmi.c                    |  15 ++++
 virt/kvm/kvmi_int.h                |  16 ++++
 virt/kvm/kvmi_msg.c                | 129 +++++++++++++++++++++++++++++
 10 files changed, 367 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/uapi/asm/kvmi.h
 create mode 100644 arch/x86/kvm/kvmi.c

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 7f3c4f8fce63..e7d9a3816e00 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -427,3 +427,59 @@ in almost all cases, it must reply with: continue, retry, crash, etc.
 * -KVM_EINVAL - padding is not zero
 * -KVM_EPERM - the access is restricted by the host
 
+Events
+======
+
+All vCPU events are sent using the *KVMI_EVENT* message id. No event
+will be sent unless explicitly enabled with a *KVMI_CONTROL_EVENTS*
+or a *KVMI_CONTROL_VM_EVENTS* command or requested, as it is the case
+with the *KVMI_EVENT_PAUSE_VCPU* event (see **KVMI_PAUSE_VCPU**).
+
+There is one VM event, *KVMI_EVENT_UNHOOK*, which doesn't have a reply,
+but shares the kvmi_event structure, for consistency with the vCPU events.
+
+The message data begins with a common structure, having the size of the
+structure, the vCPU index and the event id::
+
+	struct kvmi_event {
+		__u16 size;
+		__u16 vcpu;
+		__u8 event;
+		__u8 padding[3];
+		struct kvmi_event_arch arch;
+	}
+
+On x86 the structure looks like this::
+
+	struct kvmi_event_arch {
+		__u8 mode;
+		__u8 padding[7];
+		struct kvm_regs regs;
+		struct kvm_sregs sregs;
+		struct {
+			__u64 sysenter_cs;
+			__u64 sysenter_esp;
+			__u64 sysenter_eip;
+			__u64 efer;
+			__u64 star;
+			__u64 lstar;
+			__u64 cstar;
+			__u64 pat;
+			__u64 shadow_gs;
+		} msrs;
+	};
+
+It contains information about the vCPU state at the time of the event.
+
+The reply to events have the *KVMI_EVENT_REPLY* message id and begins
+with two common structures::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_event_reply {
+		__u8 action;
+		__u8 event;
+		__u16 padding1;
+		__u32 padding2;
+	};
+
+Specific data can follow these common structures.
diff --git a/arch/x86/include/uapi/asm/kvmi.h b/arch/x86/include/uapi/asm/kvmi.h
new file mode 100644
index 000000000000..551f9ed1ed9c
--- /dev/null
+++ b/arch/x86/include/uapi/asm/kvmi.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_X86_KVMI_H
+#define _UAPI_ASM_X86_KVMI_H
+
+/*
+ * KVM introspection - x86 specific structures and definitions
+ */
+
+#include <asm/kvm.h>
+
+struct kvmi_event_arch {
+	__u8 mode;		/* 2, 4 or 8 */
+	__u8 padding[7];
+	struct kvm_regs regs;
+	struct kvm_sregs sregs;
+	struct {
+		__u64 sysenter_cs;
+		__u64 sysenter_esp;
+		__u64 sysenter_eip;
+		__u64 efer;
+		__u64 star;
+		__u64 lstar;
+		__u64 cstar;
+		__u64 pat;
+		__u64 shadow_gs;
+	} msrs;
+};
+
+#endif /* _UAPI_ASM_X86_KVMI_H */
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 0963e475dbe9..673cf37c0747 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -7,7 +7,7 @@ KVM := ../../../virt/kvm
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
 				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
-kvm-$(CONFIG_KVM_INTROSPECTION) += $(KVM)/kvmi.o $(KVM)/kvmi_msg.o
+kvm-$(CONFIG_KVM_INTROSPECTION) += $(KVM)/kvmi.o $(KVM)/kvmi_msg.o kvmi.o
 
 kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
new file mode 100644
index 000000000000..9aecca551673
--- /dev/null
+++ b/arch/x86/kvm/kvmi.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection - x86
+ *
+ * Copyright (C) 2019 Bitdefender S.R.L.
+ */
+#include "x86.h"
+#include "../../../virt/kvm/kvmi_int.h"
+
+/*
+ * TODO: this can be done from userspace.
+ *   - all these registers are sent with struct kvmi_event_arch
+ *   - userspace can request MSR_EFER with KVMI_GET_REGISTERS
+ */
+static unsigned int kvmi_vcpu_mode(const struct kvm_vcpu *vcpu,
+				   const struct kvm_sregs *sregs)
+{
+	unsigned int mode = 0;
+
+	if (is_long_mode((struct kvm_vcpu *) vcpu)) {
+		if (sregs->cs.l)
+			mode = 8;
+		else if (!sregs->cs.db)
+			mode = 2;
+		else
+			mode = 4;
+	} else if (sregs->cr0 & X86_CR0_PE) {
+		if (!sregs->cs.db)
+			mode = 2;
+		else
+			mode = 4;
+	} else if (!sregs->cs.db) {
+		mode = 2;
+	} else {
+		mode = 4;
+	}
+
+	return mode;
+}
+
+static void kvmi_get_msrs(struct kvm_vcpu *vcpu, struct kvmi_event_arch *event)
+{
+	struct msr_data msr;
+
+	msr.host_initiated = true;
+
+	msr.index = MSR_IA32_SYSENTER_CS;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.sysenter_cs = msr.data;
+
+	msr.index = MSR_IA32_SYSENTER_ESP;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.sysenter_esp = msr.data;
+
+	msr.index = MSR_IA32_SYSENTER_EIP;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.sysenter_eip = msr.data;
+
+	msr.index = MSR_EFER;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.efer = msr.data;
+
+	msr.index = MSR_STAR;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.star = msr.data;
+
+	msr.index = MSR_LSTAR;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.lstar = msr.data;
+
+	msr.index = MSR_CSTAR;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.cstar = msr.data;
+
+	msr.index = MSR_IA32_CR_PAT;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.pat = msr.data;
+
+	msr.index = MSR_KERNEL_GS_BASE;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.shadow_gs = msr.data;
+}
+
+void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev)
+{
+	struct kvmi_event_arch *event = &ev->arch;
+
+	kvm_arch_vcpu_get_regs(vcpu, &event->regs);
+	kvm_arch_vcpu_get_sregs(vcpu, &event->sregs);
+	ev->arch.mode = kvmi_vcpu_mode(vcpu, &event->sregs);
+	kvmi_get_msrs(vcpu, event);
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index adbdb1ceb618..30cf0d162aa8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8244,6 +8244,11 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	return 0;
 }
 
+void kvm_arch_vcpu_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
+{
+	__get_regs(vcpu, regs);
+}
+
 static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
 	vcpu->arch.emulate_regs_need_sync_from_vcpu = true;
@@ -8339,6 +8344,11 @@ int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+void kvm_arch_vcpu_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+{
+	__get_sregs(vcpu, sregs);
+}
+
 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
 				    struct kvm_mp_state *mp_state)
 {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1ec04384fad3..e876921938b6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -788,9 +788,12 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
 				    struct kvm_translation *tr);
 
 int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
+void kvm_arch_vcpu_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
 int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
 int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 				  struct kvm_sregs *sregs);
+void kvm_arch_vcpu_get_sregs(struct kvm_vcpu *vcpu,
+				  struct kvm_sregs *sregs);
 int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 				  struct kvm_sregs *sregs);
 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index 29452da818e3..dda2ae352611 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -8,6 +8,7 @@
 
 #include <linux/kernel.h>
 #include <linux/types.h>
+#include <asm/kvmi.h>
 
 #define KVMI_VERSION 0x00000001
 
@@ -120,4 +121,19 @@ struct kvmi_vcpu_hdr {
 	__u32 padding2;
 };
 
+struct kvmi_event {
+	__u16 size;
+	__u16 vcpu;
+	__u8 event;
+	__u8 padding[3];
+	struct kvmi_event_arch arch;
+};
+
+struct kvmi_event_reply {
+	__u8 action;
+	__u8 event;
+	__u16 padding1;
+	__u32 padding2;
+};
+
 #endif /* _UAPI__LINUX_KVMI_H */
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 3c884dc0e38c..3cc7bb035796 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -76,6 +76,8 @@ static bool alloc_kvmi(struct kvm *kvm, const struct kvm_introspection *qemu)
 	if (!ikvm)
 		return false;
 
+	atomic_set(&ikvm->ev_seq, 0);
+
 	set_bit(KVMI_GET_VERSION, ikvm->cmd_allow_mask);
 	set_bit(KVMI_CHECK_COMMAND, ikvm->cmd_allow_mask);
 	set_bit(KVMI_CHECK_EVENT, ikvm->cmd_allow_mask);
@@ -520,10 +522,20 @@ void kvmi_run_jobs(struct kvm_vcpu *vcpu)
 	}
 }
 
+static bool need_to_wait(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	return ivcpu->reply_waiting;
+}
+
 static bool done_waiting(struct kvm_vcpu *vcpu)
 {
 	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
 
+	if (!need_to_wait(vcpu))
+		return true;
+
 	return !list_empty(&ivcpu->job_list);
 }
 
@@ -552,6 +564,9 @@ int kvmi_run_jobs_and_wait(struct kvm_vcpu *vcpu)
 			break;
 		}
 
+		if (!need_to_wait(vcpu))
+			break;
+
 		kvmi_add_job(vcpu, kvmi_job_wait, NULL, NULL);
 	}
 
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 33ea05cb99af..70c8ca0343a3 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -82,7 +82,18 @@ struct kvmi_job {
 	void (*free_fct)(void *ctx);
 };
 
+struct kvmi_vcpu_reply {
+	int error;
+	int action;
+	u32 seq;
+	void *data;
+	size_t size;
+};
+
 struct kvmi_vcpu {
+	bool reply_waiting;
+	struct kvmi_vcpu_reply reply;
+
 	struct list_head job_list;
 	spinlock_t job_lock;
 
@@ -96,6 +107,7 @@ struct kvmi {
 
 	struct socket *sock;
 	struct task_struct *recv;
+	atomic_t ev_seq;
 
 	uuid_t uuid;
 
@@ -118,8 +130,12 @@ void *kvmi_msg_alloc_check(size_t size);
 void kvmi_msg_free(void *addr);
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
 			       bool enable);
+int kvmi_run_jobs_and_wait(struct kvm_vcpu *vcpu);
 int kvmi_add_job(struct kvm_vcpu *vcpu,
 		 void (*fct)(struct kvm_vcpu *vcpu, void *ctx),
 		 void *ctx, void (*free_fct)(void *ctx));
 
+/* arch */
+void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev);
+
 #endif
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 2728e6870d47..536034e1bea7 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -25,6 +25,8 @@ static const char *const msg_IDs[] = {
 	[KVMI_CHECK_EVENT]           = "KVMI_CHECK_EVENT",
 	[KVMI_CONTROL_CMD_RESPONSE]  = "KVMI_CONTROL_CMD_RESPONSE",
 	[KVMI_CONTROL_VM_EVENTS]     = "KVMI_CONTROL_VM_EVENTS",
+	[KVMI_EVENT]                 = "KVMI_EVENT",
+	[KVMI_EVENT_REPLY]           = "KVMI_EVENT_REPLY",
 	[KVMI_GET_GUEST_INFO]        = "KVMI_GET_GUEST_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
 };
@@ -337,6 +339,57 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 	[KVMI_GET_VERSION]           = handle_get_version,
 };
 
+static int handle_event_reply(struct kvm_vcpu *vcpu,
+			      const struct kvmi_msg_hdr *msg, const void *rpl,
+			      vcpu_reply_fct reply_cb)
+{
+	const struct kvmi_event_reply *reply = rpl;
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvmi *ikvm = IKVM(vcpu->kvm);
+	struct kvmi_vcpu_reply *expected = &ivcpu->reply;
+	size_t useful, received, common;
+
+	if (unlikely(msg->seq != expected->seq))
+		goto out;
+
+	common = sizeof(struct kvmi_vcpu_hdr) + sizeof(*reply);
+	if (unlikely(msg->size < common))
+		goto out;
+
+	if (unlikely(reply->padding1 || reply->padding2))
+		goto out;
+
+	received = msg->size - common;
+	/* Don't accept newer/bigger structures */
+	if (unlikely(received > expected->size))
+		goto out;
+
+	useful = min(received, expected->size);
+	if (useful)
+		memcpy(expected->data, reply + 1, useful);
+
+	if (useful < expected->size)
+		memset((char *)expected->data + useful, 0,
+			expected->size - useful);
+
+	expected->action = reply->action;
+	expected->error = 0;
+
+out:
+
+	if (unlikely(expected->error))
+		kvmi_err(ikvm, "Invalid event %d/%d reply seq %x/%x size %u min %zu expected %zu padding %u,%u\n",
+			 reply->event, reply->action,
+			 msg->seq, expected->seq,
+			 msg->size, common,
+			 common + expected->size,
+			 reply->padding1,
+			 reply->padding2);
+
+	ivcpu->reply_waiting = false;
+	return expected->error;
+}
+
 /*
  * These commands are executed on the vCPU thread. The receiving thread
  * passes the messages using a newly allocated 'struct kvmi_vcpu_cmd'
@@ -346,6 +399,7 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 static int(*const msg_vcpu[])(struct kvm_vcpu *,
 			      const struct kvmi_msg_hdr *, const void *,
 			      vcpu_reply_fct) = {
+	[KVMI_EVENT_REPLY]      = handle_event_reply,
 };
 
 static void kvmi_job_vcpu_cmd(struct kvm_vcpu *vcpu, void *_ctx)
@@ -576,3 +630,78 @@ bool kvmi_msg_process(struct kvmi *ikvm)
 
 	return err == 0;
 }
+
+static void kvmi_setup_event_common(struct kvmi_event *ev, u32 ev_id,
+				    unsigned short vcpu_idx)
+{
+	memset(ev, 0, sizeof(*ev));
+
+	ev->vcpu = vcpu_idx;
+	ev->event = ev_id;
+	ev->size = sizeof(*ev);
+}
+
+static void kvmi_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev,
+			     u32 ev_id)
+{
+	kvmi_setup_event_common(ev, ev_id, kvm_vcpu_get_idx(vcpu));
+	kvmi_arch_setup_event(vcpu, ev);
+}
+
+static inline u32 new_seq(struct kvmi *ikvm)
+{
+	return atomic_inc_return(&ikvm->ev_seq);
+}
+
+int kvmi_send_event(struct kvm_vcpu *vcpu, u32 ev_id,
+		    void *ev, size_t ev_size,
+		    void *rpl, size_t rpl_size, int *action)
+{
+	struct kvmi_msg_hdr hdr;
+	struct kvmi_event common;
+	struct kvec vec[] = {
+		{.iov_base = &hdr,	.iov_len = sizeof(hdr)	 },
+		{.iov_base = &common,	.iov_len = sizeof(common)},
+		{.iov_base = ev,	.iov_len = ev_size	 },
+	};
+	size_t msg_size = sizeof(hdr) + sizeof(common) + ev_size;
+	size_t n = ev_size ? ARRAY_SIZE(vec) : ARRAY_SIZE(vec)-1;
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvmi *ikvm = IKVM(vcpu->kvm);
+	int err;
+
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.id = KVMI_EVENT;
+	hdr.seq = new_seq(ikvm);
+	hdr.size = msg_size - sizeof(hdr);
+
+	kvmi_setup_event(vcpu, &common, ev_id);
+
+	memset(&ivcpu->reply, 0, sizeof(ivcpu->reply));
+
+	ivcpu->reply.seq = hdr.seq;
+	ivcpu->reply.data = rpl;
+	ivcpu->reply.size = rpl_size;
+	ivcpu->reply.error = -EINTR;
+
+	err = kvmi_sock_write(ikvm, vec, n, msg_size);
+	if (err)
+		goto out;
+
+	ivcpu->reply_waiting = true;
+	err = kvmi_run_jobs_and_wait(vcpu);
+	if (err)
+		goto out;
+
+	err = ivcpu->reply.error;
+	if (err)
+		goto out;
+
+	*action = ivcpu->reply.action;
+
+out:
+	if (err)
+		kvmi_sock_shutdown(ikvm);
+	return err;
+}
+

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 17/92] kvm: introspection: introduce event actions
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (15 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 16/92] kvm: introspection: handle events and event replies Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 18/92] kvm: introspection: add KVMI_EVENT_UNHOOK Adalbert Lazăr
                   ` (77 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

All vCPU event replies contains the action requested by the introspection
tool, which can be one of the following:

  * KVMI_EVENT_ACTION_CONTINUE
  * KVMI_EVENT_ACTION_RETRY
  * KVMI_EVENT_ACTION_CRASH

The CONTINUE action can be seen as "continue with the old KVM code
path", while the RETRY action as "re-enter guest".

Note: KVMI_EVENT_UNHOOK, a VM event, doesn't have/need a reply.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 10 ++++++++
 include/uapi/linux/kvmi.h          |  4 +++
 kernel/signal.c                    |  1 +
 virt/kvm/kvmi.c                    | 40 ++++++++++++++++++++++++++++++
 4 files changed, 55 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index e7d9a3816e00..1ea4be0d5a45 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -482,4 +482,14 @@ with two common structures::
 		__u32 padding2;
 	};
 
+All events accept the KVMI_EVENT_ACTION_CRASH action, which stops the
+guest ungracefully but as soon as possible.
+
+Most of the events accept the KVMI_EVENT_ACTION_CONTINUE action, which
+lets the instruction that caused the event to continue (unless specified
+otherwise).
+
+Some of the events accept the KVMI_EVENT_ACTION_RETRY action, to continue
+by re-entering the guest.
+
 Specific data can follow these common structures.
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index dda2ae352611..ccf2239b5db4 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -66,6 +66,10 @@ enum {
 	KVMI_NUM_EVENTS
 };
 
+#define KVMI_EVENT_ACTION_CONTINUE      0
+#define KVMI_EVENT_ACTION_RETRY         1
+#define KVMI_EVENT_ACTION_CRASH         2
+
 #define KVMI_MSG_SIZE (4096 - sizeof(struct kvmi_msg_hdr))
 
 struct kvmi_msg_hdr {
diff --git a/kernel/signal.c b/kernel/signal.c
index 57b7771e20d7..9befbfaaa710 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1413,6 +1413,7 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
 		 */
 	}
 }
+EXPORT_SYMBOL(kill_pid_info);
 
 static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid)
 {
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 3cc7bb035796..0d3560b74f2d 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -511,6 +511,46 @@ void kvmi_destroy_vm(struct kvm *kvm)
 	wait_for_completion_killable(&kvm->kvmi_completed);
 }
 
+static int kvmi_vcpu_kill(int sig, struct kvm_vcpu *vcpu)
+{
+	int err = -ESRCH;
+	struct pid *pid;
+	struct kernel_siginfo siginfo[1] = {};
+
+	rcu_read_lock();
+	pid = rcu_dereference(vcpu->pid);
+	if (pid)
+		err = kill_pid_info(sig, siginfo, pid);
+	rcu_read_unlock();
+
+	return err;
+}
+
+static void kvmi_vm_shutdown(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvmi_vcpu_kill(SIGTERM, vcpu);
+}
+
+void kvmi_handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action,
+				      const char *str)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CRASH:
+		kvmi_vm_shutdown(kvm);
+		break;
+
+	default:
+		kvmi_err(IKVM(kvm), "Unsupported action %d for event %s\n",
+			 action, str);
+	}
+}
+
 void kvmi_run_jobs(struct kvm_vcpu *vcpu)
 {
 	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 18/92] kvm: introspection: add KVMI_EVENT_UNHOOK
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (16 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 17/92] kvm: introspection: introduce event actions Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 19/92] kvm: introspection: add KVMI_EVENT_CREATE_VCPU Adalbert Lazăr
                   ` (76 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

In certain situations (when the guest has to be paused, suspended,
migrated, etc.), userspace/QEMU will use the KVM_INTROSPECTION_UNHOOK
ioctl in order to trigger the KVMI_EVENT_UNHOOK. If the event is sent
successfully (the VM has an active introspection channel), userspace
should delay the action (pause/suspend/...) to give the introspection
tool the chance to remove its hooks (eg. breakpoints). Once a timeout
is reached or the introspection tool has closed the socket, QEMU should
continue with the planned action.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 20 ++++++++++++++++++
 virt/kvm/kvmi.c                    | 34 +++++++++++++++++++++++++++++-
 virt/kvm/kvmi_int.h                |  1 +
 virt/kvm/kvmi_msg.c                | 20 ++++++++++++++++++
 4 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 1ea4be0d5a45..28e1a1c80551 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -493,3 +493,23 @@ Some of the events accept the KVMI_EVENT_ACTION_RETRY action, to continue
 by re-entering the guest.
 
 Specific data can follow these common structures.
+
+1. KVMI_EVENT_UNHOOK
+--------------------
+
+:Architecture: all
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+
+:Returns: none
+
+This event is sent when the device manager (ie. QEMU) has to
+pause/stop/migrate the guest (see **Unhooking**) and the introspection
+has been enabled for this event (see **KVMI_CONTROL_VM_EVENTS**).
+The introspection tool has a chance to unhook and close the KVMI channel
+(signaling that the operation can proceed).
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 0d3560b74f2d..7eda49bf65c4 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -644,6 +644,9 @@ int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
 
 static void kvmi_job_abort(struct kvm_vcpu *vcpu, void *ctx)
 {
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	ivcpu->reply_waiting = false;
 }
 
 static void kvmi_abort_events(struct kvm *kvm)
@@ -655,6 +658,34 @@ static void kvmi_abort_events(struct kvm *kvm)
 		kvmi_add_job(vcpu, kvmi_job_abort, NULL, NULL);
 }
 
+static bool __kvmi_unhook_event(struct kvmi *ikvm)
+{
+	int err;
+
+	if (!test_bit(KVMI_EVENT_UNHOOK, ikvm->vm_ev_mask))
+		return false;
+
+	err = kvmi_msg_send_unhook(ikvm);
+
+	return !err;
+}
+
+static bool kvmi_unhook_event(struct kvm *kvm)
+{
+	struct kvmi *ikvm;
+	bool ret = true;
+
+	ikvm = kvmi_get(kvm);
+	if (!ikvm)
+		return false;
+
+	ret = __kvmi_unhook_event(ikvm);
+
+	kvmi_put(kvm);
+
+	return ret;
+}
+
 int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset)
 {
 	struct kvmi *ikvm;
@@ -664,7 +695,8 @@ int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset)
 	if (!ikvm)
 		return -EFAULT;
 
-	kvm_info("TODO: %s force_reset %d", __func__, force_reset);
+	if (!force_reset && !kvmi_unhook_event(kvm))
+		err = -ENOENT;
 
 	kvmi_put(kvm);
 
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 70c8ca0343a3..9750a9b9902b 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -123,6 +123,7 @@ bool kvmi_sock_get(struct kvmi *ikvm, int fd);
 void kvmi_sock_shutdown(struct kvmi *ikvm);
 void kvmi_sock_put(struct kvmi *ikvm);
 bool kvmi_msg_process(struct kvmi *ikvm);
+int kvmi_msg_send_unhook(struct kvmi *ikvm);
 
 /* kvmi.c */
 void *kvmi_msg_alloc(void);
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 536034e1bea7..0c7c1e968007 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -705,3 +705,23 @@ int kvmi_send_event(struct kvm_vcpu *vcpu, u32 ev_id,
 	return err;
 }
 
+int kvmi_msg_send_unhook(struct kvmi *ikvm)
+{
+	struct kvmi_msg_hdr hdr;
+	struct kvmi_event common;
+	struct kvec vec[] = {
+		{.iov_base = &hdr,	.iov_len = sizeof(hdr)	 },
+		{.iov_base = &common,	.iov_len = sizeof(common)},
+	};
+	size_t msg_size = sizeof(hdr) + sizeof(common);
+	size_t n = ARRAY_SIZE(vec);
+
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.id = KVMI_EVENT;
+	hdr.seq = new_seq(ikvm);
+	hdr.size = msg_size - sizeof(hdr);
+
+	kvmi_setup_event_common(&common, KVMI_EVENT_UNHOOK, 0);
+
+	return kvmi_sock_write(ikvm, vec, n, msg_size);
+}

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 19/92] kvm: introspection: add KVMI_EVENT_CREATE_VCPU
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (17 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 18/92] kvm: introspection: add KVMI_EVENT_UNHOOK Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 20/92] kvm: introspection: add KVMI_GET_VCPU_INFO Adalbert Lazăr
                   ` (75 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Mircea Cîrjaliu

From: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>

This event is sent when a vCPU is ready to be introspected.

Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 23 +++++++++++++++
 virt/kvm/kvmi.c                    | 47 ++++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h                |  1 +
 virt/kvm/kvmi_msg.c                | 12 ++++++++
 4 files changed, 83 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 28e1a1c80551..b29cd1b80b4f 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -513,3 +513,26 @@ pause/stop/migrate the guest (see **Unhooking**) and the introspection
 has been enabled for this event (see **KVMI_CONTROL_VM_EVENTS**).
 The introspection tool has a chance to unhook and close the KVMI channel
 (signaling that the operation can proceed).
+
+2. KVMI_EVENT_CREATE_VCPU
+-------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+
+:Returns:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_event_reply;
+
+This event is sent when a new vCPU is created and the introspection has
+been enabled for this event (see *KVMI_CONTROL_VM_EVENTS*).
+
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 7eda49bf65c4..d0d9adf5b6ed 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -13,6 +13,7 @@
 static struct kmem_cache *msg_cache;
 static struct kmem_cache *job_cache;
 
+static bool kvmi_create_vcpu_event(struct kvm_vcpu *vcpu);
 static void kvmi_abort_events(struct kvm *kvm);
 
 void *kvmi_msg_alloc(void)
@@ -150,6 +151,11 @@ static struct kvmi_job *kvmi_pull_job(struct kvmi_vcpu *ivcpu)
 	return job;
 }
 
+static void kvmi_job_create_vcpu(struct kvm_vcpu *vcpu, void *ctx)
+{
+	kvmi_create_vcpu_event(vcpu);
+}
+
 static bool alloc_ivcpu(struct kvm_vcpu *vcpu)
 {
 	struct kvmi_vcpu *ivcpu;
@@ -245,6 +251,9 @@ int kvmi_vcpu_init(struct kvm_vcpu *vcpu)
 		goto out;
 	}
 
+	if (kvmi_add_job(vcpu, kvmi_job_create_vcpu, NULL, NULL))
+		ret = -ENOMEM;
+
 out:
 	kvmi_put(vcpu->kvm);
 
@@ -330,6 +339,10 @@ int kvmi_hook(struct kvm *kvm, const struct kvm_introspection *qemu)
 			err = -ENOMEM;
 			goto err_alloc;
 		}
+		if (kvmi_add_job(vcpu, kvmi_job_create_vcpu, NULL, NULL)) {
+			err = -ENOMEM;
+			goto err_alloc;
+		}
 	}
 
 	/* interact with other kernel components after structure allocation */
@@ -551,6 +564,40 @@ void kvmi_handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action,
 	}
 }
 
+static bool __kvmi_create_vcpu_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+	bool ret = false;
+
+	action = kvmi_msg_send_create_vcpu(vcpu);
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		ret = true;
+		break;
+	default:
+		kvmi_handle_common_event_actions(vcpu, action, "CREATE");
+	}
+
+	return ret;
+}
+
+static bool kvmi_create_vcpu_event(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm;
+	bool ret = true;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return true;
+
+	if (test_bit(KVMI_EVENT_CREATE_VCPU, ikvm->vm_ev_mask))
+		ret = __kvmi_create_vcpu_event(vcpu);
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
 void kvmi_run_jobs(struct kvm_vcpu *vcpu)
 {
 	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 9750a9b9902b..c21f0fd5e16c 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -123,6 +123,7 @@ bool kvmi_sock_get(struct kvmi *ikvm, int fd);
 void kvmi_sock_shutdown(struct kvmi *ikvm);
 void kvmi_sock_put(struct kvmi *ikvm);
 bool kvmi_msg_process(struct kvmi *ikvm);
+u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu);
 int kvmi_msg_send_unhook(struct kvmi *ikvm);
 
 /* kvmi.c */
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 0c7c1e968007..8e8af572a4f4 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -725,3 +725,15 @@ int kvmi_msg_send_unhook(struct kvmi *ikvm)
 
 	return kvmi_sock_write(ikvm, vec, n, msg_size);
 }
+
+u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu)
+{
+	int err, action;
+
+	err = kvmi_send_event(vcpu, KVMI_EVENT_CREATE_VCPU, NULL, 0,
+			      NULL, 0, &action);
+	if (err)
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return action;
+}

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 20/92] kvm: introspection: add KVMI_GET_VCPU_INFO
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (18 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 19/92] kvm: introspection: add KVMI_EVENT_CREATE_VCPU Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 21/92] kvm: page track: add track_create_slot() callback Adalbert Lazăr
                   ` (74 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

For now, this command returns the TSC frequency (in HZ) for the specified
vCPU if available (otherwise it returns zero).

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 29 +++++++++++++++++++++++++++++
 arch/x86/kvm/kvmi.c                | 12 ++++++++++++
 include/uapi/linux/kvmi.h          |  4 ++++
 virt/kvm/kvmi_int.h                |  2 ++
 virt/kvm/kvmi_msg.c                | 14 ++++++++++++++
 5 files changed, 61 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index b29cd1b80b4f..71897338e85a 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -427,6 +427,35 @@ in almost all cases, it must reply with: continue, retry, crash, etc.
 * -KVM_EINVAL - padding is not zero
 * -KVM_EPERM - the access is restricted by the host
 
+7. KVMI_GET_VCPU_INFO
+---------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_vcpu_hdr;
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_vcpu_info_reply {
+		__u64 tsc_speed;
+	};
+
+Returns the TSC frequency (in HZ) for the specified vCPU if available
+(otherwise it returns zero).
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+
 Events
 ======
 
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 9aecca551673..97c72cdc6fb0 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -90,3 +90,15 @@ void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev)
 	ev->arch.mode = kvmi_vcpu_mode(vcpu, &event->sregs);
 	kvmi_get_msrs(vcpu, event);
 }
+
+int kvmi_arch_cmd_get_vcpu_info(struct kvm_vcpu *vcpu,
+				struct kvmi_get_vcpu_info_reply *rpl)
+{
+	if (kvm_has_tsc_control)
+		rpl->tsc_speed = 1000ul * vcpu->arch.virtual_tsc_khz;
+	else
+		rpl->tsc_speed = 0;
+
+	return 0;
+}
+
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index ccf2239b5db4..aa5bc909e278 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -112,6 +112,10 @@ struct kvmi_get_guest_info_reply {
 	__u32 padding[3];
 };
 
+struct kvmi_get_vcpu_info_reply {
+	__u64 tsc_speed;
+};
+
 struct kvmi_control_vm_events {
 	__u16 event_id;
 	__u8 enable;
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index c21f0fd5e16c..7cff91bc1acc 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -139,5 +139,7 @@ int kvmi_add_job(struct kvm_vcpu *vcpu,
 
 /* arch */
 void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev);
+int kvmi_arch_cmd_get_vcpu_info(struct kvm_vcpu *vcpu,
+				struct kvmi_get_vcpu_info_reply *rpl);
 
 #endif
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 8e8af572a4f4..3372d8c7e74f 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -28,6 +28,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_EVENT]                 = "KVMI_EVENT",
 	[KVMI_EVENT_REPLY]           = "KVMI_EVENT_REPLY",
 	[KVMI_GET_GUEST_INFO]        = "KVMI_GET_GUEST_INFO",
+	[KVMI_GET_VCPU_INFO]         = "KVMI_GET_VCPU_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
 };
 
@@ -390,6 +391,18 @@ static int handle_event_reply(struct kvm_vcpu *vcpu,
 	return expected->error;
 }
 
+static int handle_get_vcpu_info(struct kvm_vcpu *vcpu,
+				const struct kvmi_msg_hdr *msg,
+				const void *req, vcpu_reply_fct reply_cb)
+{
+	struct kvmi_get_vcpu_info_reply rpl;
+
+	memset(&rpl, 0, sizeof(rpl));
+	kvmi_arch_cmd_get_vcpu_info(vcpu, &rpl);
+
+	return reply_cb(vcpu, msg, 0, &rpl, sizeof(rpl));
+}
+
 /*
  * These commands are executed on the vCPU thread. The receiving thread
  * passes the messages using a newly allocated 'struct kvmi_vcpu_cmd'
@@ -400,6 +413,7 @@ static int(*const msg_vcpu[])(struct kvm_vcpu *,
 			      const struct kvmi_msg_hdr *, const void *,
 			      vcpu_reply_fct) = {
 	[KVMI_EVENT_REPLY]      = handle_event_reply,
+	[KVMI_GET_VCPU_INFO]    = handle_get_vcpu_info,
 };
 
 static void kvmi_job_vcpu_cmd(struct kvm_vcpu *vcpu, void *_ctx)

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 21/92] kvm: page track: add track_create_slot() callback
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (19 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 20/92] kvm: introspection: add KVMI_GET_VCPU_INFO Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 22/92] kvm: x86: provide all page tracking hooks with the guest virtual address Adalbert Lazăr
                   ` (73 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Xiao Guangrong

From: Mihai Donțu <mdontu@bitdefender.com>

This is used to add page access notifications as soon as a slot appears.

CC: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_page_track.h |  5 ++++-
 arch/x86/kvm/page_track.c             | 18 ++++++++++++++++--
 arch/x86/kvm/x86.c                    |  2 +-
 3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index 172f9749dbb2..18a94d180485 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -34,6 +34,9 @@ struct kvm_page_track_notifier_node {
 	 */
 	void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 			    int bytes, struct kvm_page_track_notifier_node *node);
+	void (*track_create_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
+				  unsigned long npages,
+				  struct kvm_page_track_notifier_node *node);
 	/*
 	 * It is called when memory slot is being moved or removed
 	 * users can drop write-protection for the pages in that memory slot
@@ -51,7 +54,7 @@ void kvm_page_track_cleanup(struct kvm *kvm);
 
 void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
 				 struct kvm_memory_slot *dont);
-int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+int kvm_page_track_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 				  unsigned long npages);
 
 void kvm_slot_page_track_add_page(struct kvm *kvm,
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 3052a59a3065..db5b906876bb 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -34,10 +34,13 @@ void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
 		}
 }
 
-int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+int kvm_page_track_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 				  unsigned long npages)
 {
-	int  i;
+	struct kvm_page_track_notifier_head *head;
+	struct kvm_page_track_notifier_node *n;
+	int idx;
+	int i;
 
 	for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
 		slot->arch.gfn_track[i] =
@@ -47,6 +50,17 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
 			goto track_free;
 	}
 
+	head = &kvm->arch.track_notifier_head;
+
+	if (hlist_empty(&head->track_notifier_list))
+		return 0;
+
+	idx = srcu_read_lock(&head->track_srcu);
+	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
+		if (n->track_create_slot)
+			n->track_create_slot(kvm, slot, npages, n);
+	srcu_read_unlock(&head->track_srcu, idx);
+
 	return 0;
 
 track_free:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 30cf0d162aa8..f66db9473ea3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9350,7 +9350,7 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 		}
 	}
 
-	if (kvm_page_track_create_memslot(slot, npages))
+	if (kvm_page_track_create_memslot(kvm, slot, npages))
 		goto out_free;
 
 	return 0;

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 22/92] kvm: x86: provide all page tracking hooks with the guest virtual address
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (20 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 21/92] kvm: page track: add track_create_slot() callback Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 23/92] kvm: page track: add support for preread, prewrite and preexec Adalbert Lazăr
                   ` (72 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This is needed because the emulator calls the page tracking code
irrespective of the current VMEXIT reason or available information.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h       |  2 +-
 arch/x86/include/asm/kvm_page_track.h |  9 +++++----
 arch/x86/kvm/mmu.c                    |  2 +-
 arch/x86/kvm/page_track.c             |  6 +++---
 arch/x86/kvm/x86.c                    | 16 ++++++++--------
 drivers/gpu/drm/i915/gvt/kvmgt.c      |  2 +-
 6 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 67ed934ca124..2d6bde6fa59f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1263,7 +1263,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
 int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3);
 bool pdptrs_changed(struct kvm_vcpu *vcpu);
 
-int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
+int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			  const void *val, int bytes);
 
 struct kvm_irq_mask_notifier {
diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index 18a94d180485..0492a85f3a44 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -32,8 +32,9 @@ struct kvm_page_track_notifier_node {
 	 * @bytes: the written length.
 	 * @node: this node
 	 */
-	void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
-			    int bytes, struct kvm_page_track_notifier_node *node);
+	void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			    const u8 *new, int bytes,
+			    struct kvm_page_track_notifier_node *node);
 	void (*track_create_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
 				  unsigned long npages,
 				  struct kvm_page_track_notifier_node *node);
@@ -72,7 +73,7 @@ kvm_page_track_register_notifier(struct kvm *kvm,
 void
 kvm_page_track_unregister_notifier(struct kvm *kvm,
 				   struct kvm_page_track_notifier_node *n);
-void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
-			  int bytes);
+void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			  const u8 *new, int bytes);
 void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
 #endif
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f2d1d230d5b8..9898d863b6b6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5222,7 +5222,7 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
 	return spte;
 }
 
-static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
+static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			      const u8 *new, int bytes,
 			      struct kvm_page_track_notifier_node *node)
 {
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index db5b906876bb..ff7defb4a1d2 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -236,8 +236,8 @@ EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier);
  * The node should figure out if the written page is the one that node is
  * interested in by itself.
  */
-void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
-			  int bytes)
+void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			  const u8 *new, int bytes)
 {
 	struct kvm_page_track_notifier_head *head;
 	struct kvm_page_track_notifier_node *n;
@@ -251,7 +251,7 @@ void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 	idx = srcu_read_lock(&head->track_srcu);
 	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
 		if (n->track_write)
-			n->track_write(vcpu, gpa, new, bytes, n);
+			n->track_write(vcpu, gpa, gva, new, bytes, n);
 	srcu_read_unlock(&head->track_srcu, idx);
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f66db9473ea3..d3d159986243 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5281,7 +5281,7 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 	return vcpu_is_mmio_gpa(vcpu, gva, *gpa, write);
 }
 
-int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
+int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			const void *val, int bytes)
 {
 	int ret;
@@ -5289,14 +5289,14 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 	ret = kvm_vcpu_write_guest(vcpu, gpa, val, bytes);
 	if (ret < 0)
 		return 0;
-	kvm_page_track_write(vcpu, gpa, val, bytes);
+	kvm_page_track_write(vcpu, gpa, gva, val, bytes);
 	return 1;
 }
 
 struct read_write_emulator_ops {
 	int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val,
 				  int bytes);
-	int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa,
+	int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 				  void *val, int bytes);
 	int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
 			       int bytes, void *val);
@@ -5317,16 +5317,16 @@ static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes)
 	return 0;
 }
 
-static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
+static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			void *val, int bytes)
 {
 	return !kvm_vcpu_read_guest(vcpu, gpa, val, bytes);
 }
 
-static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
+static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			 void *val, int bytes)
 {
-	return emulator_write_phys(vcpu, gpa, val, bytes);
+	return emulator_write_phys(vcpu, gpa, gva, val, bytes);
 }
 
 static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val)
@@ -5395,7 +5395,7 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
 			return X86EMUL_PROPAGATE_FAULT;
 	}
 
-	if (!ret && ops->read_write_emulate(vcpu, gpa, val, bytes))
+	if (!ret && ops->read_write_emulate(vcpu, gpa, addr, val, bytes))
 		return X86EMUL_CONTINUE;
 
 	/*
@@ -5556,7 +5556,7 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 		return X86EMUL_CMPXCHG_FAILED;
 
 	kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
-	kvm_page_track_write(vcpu, gpa, new, bytes);
+	kvm_page_track_write(vcpu, gpa, addr, new, bytes);
 
 	return X86EMUL_CONTINUE;
 
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index dd3dfd00f4e6..4bd2cdf79f86 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1550,7 +1550,7 @@ static int kvmgt_page_track_remove(unsigned long handle, u64 gfn)
 	return 0;
 }
 
-static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
+static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 		const u8 *val, int len,
 		struct kvm_page_track_notifier_node *node)
 {

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 23/92] kvm: page track: add support for preread, prewrite and preexec
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (21 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 22/92] kvm: x86: provide all page tracking hooks with the guest virtual address Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 24/92] kvm: x86: wire in the preread/prewrite/preexec page trackers Adalbert Lazăr
                   ` (71 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Xiao Guangrong, Sean Christopherson

From: Mihai Donțu <mdontu@bitdefender.com>

These callbacks return a boolean value. If false, the emulation should
stop and the instruction should be reexecuted in guest. The preread
callback can return the bytes needed by the read operation.

CC: Xiao Guangrong <guangrong.xiao@gmail.com>
CC: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_page_track.h |  19 +++-
 arch/x86/kvm/mmu.c                    |  81 +++++++++++++++++
 arch/x86/kvm/mmu.h                    |   4 +
 arch/x86/kvm/page_track.c             | 123 ++++++++++++++++++++++++--
 4 files changed, 217 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index 0492a85f3a44..a431e5e1e5cb 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -3,7 +3,10 @@
 #define _ASM_X86_KVM_PAGE_TRACK_H
 
 enum kvm_page_track_mode {
+	KVM_PAGE_TRACK_PREREAD,
+	KVM_PAGE_TRACK_PREWRITE,
 	KVM_PAGE_TRACK_WRITE,
+	KVM_PAGE_TRACK_PREEXEC,
 	KVM_PAGE_TRACK_MAX,
 };
 
@@ -22,6 +25,13 @@ struct kvm_page_track_notifier_head {
 struct kvm_page_track_notifier_node {
 	struct hlist_node node;
 
+	bool (*track_preread)(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			      u8 *new, int bytes,
+			      struct kvm_page_track_notifier_node *node,
+			      bool *data_ready);
+	bool (*track_prewrite)(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			       const u8 *new, int bytes,
+			       struct kvm_page_track_notifier_node *node);
 	/*
 	 * It is called when guest is writing the write-tracked page
 	 * and write emulation is finished at that time.
@@ -35,12 +45,14 @@ struct kvm_page_track_notifier_node {
 	void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			    const u8 *new, int bytes,
 			    struct kvm_page_track_notifier_node *node);
+	bool (*track_preexec)(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			      struct kvm_page_track_notifier_node *node);
 	void (*track_create_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
 				  unsigned long npages,
 				  struct kvm_page_track_notifier_node *node);
 	/*
 	 * It is called when memory slot is being moved or removed
-	 * users can drop write-protection for the pages in that memory slot
+	 * users can drop active protection for the pages in that memory slot
 	 *
 	 * @kvm: the kvm where memory slot being moved or removed
 	 * @slot: the memory slot being moved or removed
@@ -73,7 +85,12 @@ kvm_page_track_register_notifier(struct kvm *kvm,
 void
 kvm_page_track_unregister_notifier(struct kvm *kvm,
 				   struct kvm_page_track_notifier_node *n);
+bool kvm_page_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			    u8 *new, int bytes, bool *data_ready);
+bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			     const u8 *new, int bytes);
 void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			  const u8 *new, int bytes);
+bool kvm_page_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva);
 void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
 #endif
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9898d863b6b6..a86b165cf6dd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1523,6 +1523,31 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
 	return mmu_spte_update(sptep, spte);
 }
 
+static bool spte_read_protect(u64 *sptep)
+{
+	u64 spte = *sptep;
+	bool exec_only_supported = (shadow_present_mask == 0ull);
+
+	rmap_printk("rmap_read_protect: spte %p %llx\n", sptep, *sptep);
+
+	WARN_ON_ONCE(!exec_only_supported);
+
+	spte = spte & ~(PT_WRITABLE_MASK | PT_PRESENT_MASK);
+
+	return mmu_spte_update(sptep, spte);
+}
+
+static bool spte_exec_protect(u64 *sptep)
+{
+	u64 spte = *sptep;
+
+	rmap_printk("rmap_exec_protect: spte %p %llx\n", sptep, *sptep);
+
+	spte = spte & ~PT_USER_MASK;
+
+	return mmu_spte_update(sptep, spte);
+}
+
 static bool __rmap_write_protect(struct kvm *kvm,
 				 struct kvm_rmap_head *rmap_head,
 				 bool pt_protect)
@@ -1537,6 +1562,32 @@ static bool __rmap_write_protect(struct kvm *kvm,
 	return flush;
 }
 
+static bool __rmap_read_protect(struct kvm *kvm,
+				struct kvm_rmap_head *rmap_head)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	bool flush = false;
+
+	for_each_rmap_spte(rmap_head, &iter, sptep)
+		flush |= spte_read_protect(sptep);
+
+	return flush;
+}
+
+static bool __rmap_exec_protect(struct kvm *kvm,
+				struct kvm_rmap_head *rmap_head)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	bool flush = false;
+
+	for_each_rmap_spte(rmap_head, &iter, sptep)
+		flush |= spte_exec_protect(sptep);
+
+	return flush;
+}
+
 static bool spte_clear_dirty(u64 *sptep)
 {
 	u64 spte = *sptep;
@@ -1707,6 +1758,36 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	return write_protected;
 }
 
+bool kvm_mmu_slot_gfn_read_protect(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, u64 gfn)
+{
+	struct kvm_rmap_head *rmap_head;
+	int i;
+	bool read_protected = false;
+
+	for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
+		rmap_head = __gfn_to_rmap(gfn, i, slot);
+		read_protected |= __rmap_read_protect(kvm, rmap_head);
+	}
+
+	return read_protected;
+}
+
+bool kvm_mmu_slot_gfn_exec_protect(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, u64 gfn)
+{
+	struct kvm_rmap_head *rmap_head;
+	int i;
+	bool exec_protected = false;
+
+	for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
+		rmap_head = __gfn_to_rmap(gfn, i, slot);
+		exec_protected |= __rmap_exec_protect(kvm, rmap_head);
+	}
+
+	return exec_protected;
+}
+
 static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
 {
 	struct kvm_memory_slot *slot;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index c7b333147c4a..45948dabe0b6 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -210,5 +210,9 @@ void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    struct kvm_memory_slot *slot, u64 gfn);
+bool kvm_mmu_slot_gfn_read_protect(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, u64 gfn);
+bool kvm_mmu_slot_gfn_exec_protect(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, u64 gfn);
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 #endif
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index ff7defb4a1d2..fc792939a05c 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -1,5 +1,5 @@
 /*
- * Support KVM gust page tracking
+ * Support KVM guest page tracking
  *
  * This feature allows us to track page access in guest. Currently, only
  * write access is tracked.
@@ -101,7 +101,7 @@ static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
  * @kvm: the guest instance we are interested in.
  * @slot: the @gfn belongs to.
  * @gfn: the guest page.
- * @mode: tracking mode, currently only write track is supported.
+ * @mode: tracking mode.
  */
 void kvm_slot_page_track_add_page(struct kvm *kvm,
 				  struct kvm_memory_slot *slot, gfn_t gfn,
@@ -119,9 +119,16 @@ void kvm_slot_page_track_add_page(struct kvm *kvm,
 	 */
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
 
-	if (mode == KVM_PAGE_TRACK_WRITE)
+	if (mode == KVM_PAGE_TRACK_PREWRITE || mode == KVM_PAGE_TRACK_WRITE) {
 		if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn))
 			kvm_flush_remote_tlbs(kvm);
+	} else if (mode == KVM_PAGE_TRACK_PREREAD) {
+		if (kvm_mmu_slot_gfn_read_protect(kvm, slot, gfn))
+			kvm_flush_remote_tlbs(kvm);
+	} else if (mode == KVM_PAGE_TRACK_PREEXEC) {
+		if (kvm_mmu_slot_gfn_exec_protect(kvm, slot, gfn))
+			kvm_flush_remote_tlbs(kvm);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_slot_page_track_add_page);
 
@@ -136,7 +143,7 @@ EXPORT_SYMBOL_GPL(kvm_slot_page_track_add_page);
  * @kvm: the guest instance we are interested in.
  * @slot: the @gfn belongs to.
  * @gfn: the guest page.
- * @mode: tracking mode, currently only write track is supported.
+ * @mode: tracking mode.
  */
 void kvm_slot_page_track_remove_page(struct kvm *kvm,
 				     struct kvm_memory_slot *slot, gfn_t gfn,
@@ -229,12 +236,81 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
 }
 EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier);
 
+/*
+ * Notify the node that a read access is about to happen. Returning false
+ * doesn't stop the other nodes from being called, but it will stop
+ * the emulation.
+ *
+ * The node should figure out if the written page is the one that the node
+ * is interested in by itself.
+ *
+ * The nodes will always be in conflict if they track the same page:
+ * - accepting a read won't guarantee that the next node will not override
+ *   the data (filling new/bytes and setting data_ready)
+ * - filling new/bytes with custom data won't guarantee that the next node
+ *   will not override that
+ */
+bool kvm_page_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			    u8 *new, int bytes, bool *data_ready)
+{
+	struct kvm_page_track_notifier_head *head;
+	struct kvm_page_track_notifier_node *n;
+	int idx;
+	bool ret = true;
+
+	*data_ready = false;
+
+	head = &vcpu->kvm->arch.track_notifier_head;
+
+	if (hlist_empty(&head->track_notifier_list))
+		return ret;
+
+	idx = srcu_read_lock(&head->track_srcu);
+	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
+		if (n->track_preread)
+			if (!n->track_preread(vcpu, gpa, gva, new, bytes, n,
+					       data_ready))
+				ret = false;
+	srcu_read_unlock(&head->track_srcu, idx);
+	return ret;
+}
+
+/*
+ * Notify the node that a write access is about to happen. Returning false
+ * doesn't stop the other nodes from being called, but it will stop
+ * the emulation.
+ *
+ * The node should figure out if the written page is the one that the node
+ * is interested in by itself.
+ */
+bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			     const u8 *new, int bytes)
+{
+	struct kvm_page_track_notifier_head *head;
+	struct kvm_page_track_notifier_node *n;
+	int idx;
+	bool ret = true;
+
+	head = &vcpu->kvm->arch.track_notifier_head;
+
+	if (hlist_empty(&head->track_notifier_list))
+		return ret;
+
+	idx = srcu_read_lock(&head->track_srcu);
+	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
+		if (n->track_prewrite)
+			if (!n->track_prewrite(vcpu, gpa, gva, new, bytes, n))
+				ret = false;
+	srcu_read_unlock(&head->track_srcu, idx);
+	return ret;
+}
+
 /*
  * Notify the node that write access is intercepted and write emulation is
  * finished at this time.
  *
- * The node should figure out if the written page is the one that node is
- * interested in by itself.
+ * The node should figure out if the written page is the one that the node
+ * is interested in by itself.
  */
 void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			  const u8 *new, int bytes)
@@ -255,12 +331,41 @@ void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 	srcu_read_unlock(&head->track_srcu, idx);
 }
 
+/*
+ * Notify the node that an instruction is about to be executed.
+ * Returning false doesn't stop the other nodes from being called,
+ * but it will stop the emulation with X86EMUL_RETRY_INSTR.
+ *
+ * The node should figure out if the written page is the one that the node
+ * is interested in by itself.
+ */
+bool kvm_page_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva)
+{
+	struct kvm_page_track_notifier_head *head;
+	struct kvm_page_track_notifier_node *n;
+	int idx;
+	bool ret = true;
+
+	head = &vcpu->kvm->arch.track_notifier_head;
+
+	if (hlist_empty(&head->track_notifier_list))
+		return ret;
+
+	idx = srcu_read_lock(&head->track_srcu);
+	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
+		if (n->track_preexec)
+			if (!n->track_preexec(vcpu, gpa, gva, n))
+				ret = false;
+	srcu_read_unlock(&head->track_srcu, idx);
+	return ret;
+}
+
 /*
  * Notify the node that memory slot is being removed or moved so that it can
- * drop write-protection for the pages in the memory slot.
+ * drop active protection for the pages in the memory slot.
  *
- * The node should figure out it has any write-protected pages in this slot
- * by itself.
+ * The node should figure out if the written page is the one that the node
+ * is interested in by itself.
  */
 void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 24/92] kvm: x86: wire in the preread/prewrite/preexec page trackers
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (22 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 23/92] kvm: page track: add support for preread, prewrite and preexec Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 25/92] kvm: x86: intercept the write access on sidt and other emulated instructions Adalbert Lazăr
                   ` (70 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Sean Christopherson, Joerg Roedel

From: Mihai Donțu <mdontu@bitdefender.com>

These are needed by the introspection subsystem.

CC: Sean Christopherson <sean.j.christopherson@intel.com>
CC: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_emulate.h |  1 +
 arch/x86/kvm/emulate.c             | 10 +++++-
 arch/x86/kvm/mmu.c                 | 37 ++++++++++++++-------
 arch/x86/kvm/x86.c                 | 52 ++++++++++++++++++++++++------
 4 files changed, 79 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 93c4bf598fb0..97cb592687cb 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -444,6 +444,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
 #define EMULATION_OK 0
 #define EMULATION_RESTART 1
 #define EMULATION_INTERCEPTED 2
+#define EMULATION_RETRY_INSTR 3
 void init_decode_cache(struct x86_emulate_ctxt *ctxt);
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
 int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c338984c850d..34431cf31f74 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -5366,7 +5366,12 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 					ctxt->memopp->addr.mem.ea + ctxt->_eip);
 
 done:
-	return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
+	if (rc == X86EMUL_RETRY_INSTR)
+		return EMULATION_RETRY_INSTR;
+	else if (rc == X86EMUL_CONTINUE)
+		return EMULATION_OK;
+	else
+		return EMULATION_FAILED;
 }
 
 bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt)
@@ -5736,6 +5741,9 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
 	if (rc == X86EMUL_INTERCEPTED)
 		return EMULATION_INTERCEPTED;
 
+	if (rc == X86EMUL_RETRY_INSTR)
+		return EMULATION_RETRY_INSTR;
+
 	if (rc == X86EMUL_CONTINUE)
 		writeback_registers(ctxt);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a86b165cf6dd..ff053f17b8c2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1111,9 +1111,13 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	slot = __gfn_to_memslot(slots, gfn);
 
 	/* the non-leaf shadow pages are keeping readonly. */
-	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
-		return kvm_slot_page_track_add_page(kvm, slot, gfn,
-						    KVM_PAGE_TRACK_WRITE);
+	if (sp->role.level > PT_PAGE_TABLE_LEVEL) {
+		kvm_slot_page_track_add_page(kvm, slot, gfn,
+					     KVM_PAGE_TRACK_PREWRITE);
+		kvm_slot_page_track_add_page(kvm, slot, gfn,
+					     KVM_PAGE_TRACK_WRITE);
+		return;
+	}
 
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
 }
@@ -1128,9 +1132,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	gfn = sp->gfn;
 	slots = kvm_memslots_for_spte_role(kvm, sp->role);
 	slot = __gfn_to_memslot(slots, gfn);
-	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
-		return kvm_slot_page_track_remove_page(kvm, slot, gfn,
-						       KVM_PAGE_TRACK_WRITE);
+	if (sp->role.level > PT_PAGE_TABLE_LEVEL) {
+		kvm_slot_page_track_remove_page(kvm, slot, gfn,
+						KVM_PAGE_TRACK_PREWRITE);
+		kvm_slot_page_track_remove_page(kvm, slot, gfn,
+						KVM_PAGE_TRACK_WRITE);
+		return;
+	}
 
 	kvm_mmu_gfn_allow_lpage(slot, gfn);
 }
@@ -2884,7 +2892,8 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 {
 	struct kvm_mmu_page *sp;
 
-	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
+	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) ||
+	    kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
 		return true;
 
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
@@ -4006,15 +4015,21 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
 	if (unlikely(error_code & PFERR_RSVD_MASK))
 		return false;
 
-	if (!(error_code & PFERR_PRESENT_MASK) ||
-	      !(error_code & PFERR_WRITE_MASK))
+	if (!(error_code & PFERR_PRESENT_MASK))
 		return false;
 
 	/*
-	 * guest is writing the page which is write tracked which can
+	 * guest is reading/writing/fetching the page which is
+	 * read/write/execute tracked which can
 	 * not be fixed by page fault handler.
 	 */
-	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
+	if (((error_code & PFERR_USER_MASK)
+		&& kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREREAD))
+	    || ((error_code & PFERR_WRITE_MASK)
+		&& (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE)
+		 || kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE)))
+	    || ((error_code & PFERR_FETCH_MASK)
+		&& kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREEXEC)))
 		return true;
 
 	return false;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d3d159986243..7aef002be551 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5065,6 +5065,7 @@ static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes,
 {
 	void *data = val;
 	int r = X86EMUL_CONTINUE;
+	bool data_ready;
 
 	while (bytes) {
 		gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, access,
@@ -5075,6 +5076,13 @@ static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes,
 
 		if (gpa == UNMAPPED_GVA)
 			return X86EMUL_PROPAGATE_FAULT;
+		if (!kvm_page_track_preread(vcpu, gpa, addr, data, toread,
+						&data_ready))
+			return X86EMUL_RETRY_INSTR;
+		if (data_ready) {
+			WARN_ON(toread > bytes); /* TODO */
+			return X86EMUL_CONTINUE;
+		}
 		ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, data,
 					       offset, toread);
 		if (ret < 0) {
@@ -5106,6 +5114,9 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
 	if (unlikely(gpa == UNMAPPED_GVA))
 		return X86EMUL_PROPAGATE_FAULT;
 
+	if (!kvm_page_track_preexec(vcpu, gpa, addr))
+		return X86EMUL_RETRY_INSTR;
+
 	offset = addr & (PAGE_SIZE-1);
 	if (WARN_ON(offset + bytes > PAGE_SIZE))
 		bytes = (unsigned)PAGE_SIZE - offset;
@@ -5284,13 +5295,26 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			const void *val, int bytes)
 {
-	int ret;
-
-	ret = kvm_vcpu_write_guest(vcpu, gpa, val, bytes);
-	if (ret < 0)
-		return 0;
+	if (!kvm_page_track_prewrite(vcpu, gpa, gva, val, bytes))
+		return X86EMUL_RETRY_INSTR;
+	if (kvm_vcpu_write_guest(vcpu, gpa, val, bytes) < 0)
+		return X86EMUL_UNHANDLEABLE;
 	kvm_page_track_write(vcpu, gpa, gva, val, bytes);
-	return 1;
+	return X86EMUL_CONTINUE;
+}
+
+static int emulator_read_phys(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			      void *val, int bytes)
+{
+	bool data_ready;
+
+	if (!kvm_page_track_preread(vcpu, gpa, gva, val, bytes, &data_ready))
+		return X86EMUL_RETRY_INSTR;
+	if (data_ready)
+		return X86EMUL_CONTINUE;
+	if (kvm_vcpu_read_guest(vcpu, gpa, val, bytes) < 0)
+		return X86EMUL_UNHANDLEABLE;
+	return X86EMUL_CONTINUE;
 }
 
 struct read_write_emulator_ops {
@@ -5320,7 +5344,7 @@ static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes)
 static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			void *val, int bytes)
 {
-	return !kvm_vcpu_read_guest(vcpu, gpa, val, bytes);
+	return emulator_read_phys(vcpu, gpa, gva, val, bytes);
 }
 
 static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
@@ -5395,8 +5419,11 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
 			return X86EMUL_PROPAGATE_FAULT;
 	}
 
-	if (!ret && ops->read_write_emulate(vcpu, gpa, addr, val, bytes))
-		return X86EMUL_CONTINUE;
+	if (!ret) {
+		ret = ops->read_write_emulate(vcpu, gpa, addr, val, bytes);
+		if (ret == X86EMUL_CONTINUE || ret == X86EMUL_RETRY_INSTR)
+			return ret;
+	}
 
 	/*
 	 * Is this MMIO handled locally?
@@ -5531,6 +5558,9 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	if (is_error_page(page))
 		goto emul_write;
 
+	if (!kvm_page_track_prewrite(vcpu, gpa, addr, new, bytes))
+		return X86EMUL_RETRY_INSTR;
+
 	kaddr = kmap_atomic(page);
 	kaddr += offset_in_page(gpa);
 	switch (bytes) {
@@ -6416,6 +6446,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 
 		trace_kvm_emulate_insn_start(vcpu);
 		++vcpu->stat.insn_emulation;
+		if (r == EMULATION_RETRY_INSTR)
+			return EMULATE_DONE;
 		if (r != EMULATION_OK)  {
 			if (emulation_type & EMULTYPE_TRAP_UD)
 				return EMULATE_FAIL;
@@ -6457,6 +6489,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 
 	r = x86_emulate_insn(ctxt);
 
+	if (r == EMULATION_RETRY_INSTR)
+		return EMULATE_DONE;
 	if (r == EMULATION_INTERCEPTED)
 		return EMULATE_DONE;
 

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 25/92] kvm: x86: intercept the write access on sidt and other emulated instructions
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (23 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 24/92] kvm: x86: wire in the preread/prewrite/preexec page trackers Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 26/92] kvm: x86: add kvm_mmu_nested_pagefault() Adalbert Lazăr
                   ` (69 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Joerg Roedel

This is needed for the introspection subsystem to track the changes to
descriptor table registers.

CC: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/x86.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7aef002be551..c28e2a20dec2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5185,11 +5185,14 @@ static int kvm_write_guest_virt_helper(gva_t addr, void *val, unsigned int bytes
 
 		if (gpa == UNMAPPED_GVA)
 			return X86EMUL_PROPAGATE_FAULT;
+		if (!kvm_page_track_prewrite(vcpu, gpa, addr, data, towrite))
+			return X86EMUL_RETRY_INSTR;
 		ret = kvm_vcpu_write_guest(vcpu, gpa, data, towrite);
 		if (ret < 0) {
 			r = X86EMUL_IO_NEEDED;
 			goto out;
 		}
+		kvm_page_track_write(vcpu, gpa, addr, data, towrite);
 
 		bytes -= towrite;
 		data += towrite;

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 26/92] kvm: x86: add kvm_mmu_nested_pagefault()
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (24 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 25/92] kvm: x86: intercept the write access on sidt and other emulated instructions Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-13  8:12   ` Paolo Bonzini
  2019-08-09 15:59 ` [RFC PATCH v6 27/92] kvm: introspection: use page track Adalbert Lazăr
                   ` (68 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu

From: Mihai Donțu <mdontu@bitdefender.com>

This is needed to filter #PF introspection events.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h | 4 ++++
 arch/x86/kvm/mmu.c              | 5 +++++
 arch/x86/kvm/svm.c              | 7 +++++++
 arch/x86/kvm/vmx/vmx.c          | 9 +++++++++
 4 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2d6bde6fa59f..7da1137a2b82 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1004,6 +1004,8 @@ struct kvm_x86_ops {
 	bool (*has_emulated_msr)(int index);
 	void (*cpuid_update)(struct kvm_vcpu *vcpu);
 
+	bool (*nested_pagefault)(struct kvm_vcpu *vcpu);
+
 	struct kvm *(*vm_alloc)(void);
 	void (*vm_free)(struct kvm *);
 	int (*vm_init)(struct kvm *kvm);
@@ -1593,4 +1595,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 #define put_smstate(type, buf, offset, val)                      \
 	*(type *)((buf) + (offset) - 0x7e00) = val
 
+bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ff053f17b8c2..9eaf6cc776a9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -6169,3 +6169,8 @@ void kvm_mmu_module_exit(void)
 	unregister_shrinker(&mmu_shrinker);
 	mmu_audit_disable();
 }
+
+bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu)
+{
+	return kvm_x86_ops->nested_pagefault(vcpu);
+}
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f13a3a24d360..3c099c56099c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -7098,6 +7098,11 @@ static int nested_enable_evmcs(struct kvm_vcpu *vcpu,
 	return -ENODEV;
 }
 
+static bool svm_nested_pagefault(struct kvm_vcpu *vcpu)
+{
+	return false;
+}
+
 static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -7109,6 +7114,8 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr,
 	.has_emulated_msr = svm_has_emulated_msr,
 
+	.nested_pagefault = svm_nested_pagefault,
+
 	.vcpu_create = svm_create_vcpu,
 	.vcpu_free = svm_free_vcpu,
 	.vcpu_reset = svm_vcpu_reset,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 30a6bcd735ec..e10ee8fd1c67 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7682,6 +7682,13 @@ static __exit void hardware_unsetup(void)
 	free_kvm_area();
 }
 
+static bool vmx_nested_pagefault(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->arch.exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)
+		return false;
+	return true;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -7693,6 +7700,8 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_accelerated_tpr = report_flexpriority,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
+	.nested_pagefault = vmx_nested_pagefault,
+
 	.vm_init = vmx_vm_init,
 	.vm_alloc = vmx_vm_alloc,
 	.vm_free = vmx_vm_free,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 27/92] kvm: introspection: use page track
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (25 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 26/92] kvm: x86: add kvm_mmu_nested_pagefault() Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-13  9:06   ` Paolo Bonzini
  2019-08-09 15:59 ` [RFC PATCH v6 28/92] kvm: x86: consult the page tracking from kvm_mmu_get_page() and __direct_map() Adalbert Lazăr
                   ` (67 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu, Marian Rotariu

From: Mihai Donțu <mdontu@bitdefender.com>

From preread, prewrite and preexec callbacks we will send the
KVMI_EVENT_PF events caused by access rights enforced by the introspection
tool.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Co-developed-by: Marian Rotariu <marian.c.rotariu@gmail.com>
Signed-off-by: Marian Rotariu <marian.c.rotariu@gmail.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvmi_host.h |  12 ++
 arch/x86/kvm/kvmi.c              |  45 +++++
 include/uapi/linux/kvmi.h        |   4 +
 virt/kvm/kvmi.c                  | 293 ++++++++++++++++++++++++++++++-
 virt/kvm/kvmi_int.h              |  21 +++
 5 files changed, 374 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/kvmi_host.h

diff --git a/arch/x86/include/asm/kvmi_host.h b/arch/x86/include/asm/kvmi_host.h
new file mode 100644
index 000000000000..7ab6dd71a0c2
--- /dev/null
+++ b/arch/x86/include/asm/kvmi_host.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_KVMI_HOST_H
+#define _ASM_X86_KVMI_HOST_H
+
+#include <asm/kvm_host.h>
+#include <asm/kvm_page_track.h>
+
+struct kvmi_arch_mem_access {
+	unsigned long active[KVM_PAGE_TRACK_MAX][BITS_TO_LONGS(KVM_MEM_SLOTS_NUM)];
+};
+
+#endif /* _ASM_X86_KVMI_HOST_H */
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 97c72cdc6fb0..d7b9201582b4 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -91,6 +91,12 @@ void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev)
 	kvmi_get_msrs(vcpu, event);
 }
 
+bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			u8 access)
+{
+	return KVMI_EVENT_ACTION_CONTINUE; /* TODO */
+}
+
 int kvmi_arch_cmd_get_vcpu_info(struct kvm_vcpu *vcpu,
 				struct kvmi_get_vcpu_info_reply *rpl)
 {
@@ -102,3 +108,42 @@ int kvmi_arch_cmd_get_vcpu_info(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+static const struct {
+	unsigned int allow_bit;
+	enum kvm_page_track_mode track_mode;
+} track_modes[] = {
+	{ KVMI_PAGE_ACCESS_R, KVM_PAGE_TRACK_PREREAD },
+	{ KVMI_PAGE_ACCESS_W, KVM_PAGE_TRACK_PREWRITE },
+	{ KVMI_PAGE_ACCESS_X, KVM_PAGE_TRACK_PREEXEC },
+};
+
+void kvmi_arch_update_page_tracking(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    struct kvmi_mem_access *m)
+{
+	struct kvmi_arch_mem_access *arch = &m->arch;
+	int i;
+
+	if (!slot) {
+		slot = gfn_to_memslot(kvm, m->gfn);
+		if (!slot)
+			return;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(track_modes); i++) {
+		unsigned int allow_bit = track_modes[i].allow_bit;
+		enum kvm_page_track_mode mode = track_modes[i].track_mode;
+		bool slot_tracked = test_bit(slot->id, arch->active[mode]);
+
+		if (m->access & allow_bit) {
+			if (slot_tracked) {
+				kvm_slot_page_track_remove_page(kvm, slot,
+								m->gfn, mode);
+				clear_bit(slot->id, arch->active[mode]);
+			}
+		} else if (!slot_tracked) {
+			kvm_slot_page_track_add_page(kvm, slot, m->gfn, mode);
+			set_bit(slot->id, arch->active[mode]);
+		}
+	}
+}
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index aa5bc909e278..c56e676ddb2b 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -70,6 +70,10 @@ enum {
 #define KVMI_EVENT_ACTION_RETRY         1
 #define KVMI_EVENT_ACTION_CRASH         2
 
+#define KVMI_PAGE_ACCESS_R (1 << 0)
+#define KVMI_PAGE_ACCESS_W (1 << 1)
+#define KVMI_PAGE_ACCESS_X (1 << 2)
+
 #define KVMI_MSG_SIZE (4096 - sizeof(struct kvmi_msg_hdr))
 
 struct kvmi_msg_hdr {
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index d0d9adf5b6ed..5cbc82b284f4 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -11,10 +11,27 @@
 #include <linux/bitmap.h>
 
 static struct kmem_cache *msg_cache;
+static struct kmem_cache *radix_cache;
 static struct kmem_cache *job_cache;
 
 static bool kvmi_create_vcpu_event(struct kvm_vcpu *vcpu);
 static void kvmi_abort_events(struct kvm *kvm);
+static bool kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+	u8 *new, int bytes, struct kvm_page_track_notifier_node *node,
+	bool *data_ready);
+static bool kvmi_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+	const u8 *new, int bytes, struct kvm_page_track_notifier_node *node);
+static bool kvmi_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+	struct kvm_page_track_notifier_node *node);
+static void kvmi_track_create_slot(struct kvm *kvm,
+	struct kvm_memory_slot *slot, unsigned long npages,
+	struct kvm_page_track_notifier_node *node);
+static void kvmi_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
+	struct kvm_page_track_notifier_node *node);
+
+static const u8 full_access  =	KVMI_PAGE_ACCESS_R |
+				KVMI_PAGE_ACCESS_W |
+				KVMI_PAGE_ACCESS_X;
 
 void *kvmi_msg_alloc(void)
 {
@@ -34,23 +51,96 @@ void kvmi_msg_free(void *addr)
 		kmem_cache_free(msg_cache, addr);
 }
 
+static struct kvmi_mem_access *__kvmi_get_gfn_access(struct kvmi *ikvm,
+						     const gfn_t gfn)
+{
+	return radix_tree_lookup(&ikvm->access_tree, gfn);
+}
+
+static int kvmi_get_gfn_access(struct kvmi *ikvm, const gfn_t gfn,
+			       u8 *access)
+{
+	struct kvmi_mem_access *m;
+
+	*access = full_access;
+
+	read_lock(&ikvm->access_tree_lock);
+	m = __kvmi_get_gfn_access(ikvm, gfn);
+	if (m)
+		*access = m->access;
+	read_unlock(&ikvm->access_tree_lock);
+
+	return m ? 0 : -1;
+}
+
+static bool kvmi_restricted_access(struct kvmi *ikvm, gpa_t gpa, u8 access)
+{
+	u8 allowed_access;
+	int err;
+
+	err = kvmi_get_gfn_access(ikvm, gpa_to_gfn(gpa), &allowed_access);
+
+	if (err)
+		return false;
+
+	/*
+	 * We want to be notified only for violations involving access
+	 * bits that we've specifically cleared
+	 */
+	if ((~allowed_access) & access)
+		return true;
+
+	return false;
+}
+
+static void kvmi_clear_mem_access(struct kvm *kvm)
+{
+	void **slot;
+	struct radix_tree_iter iter;
+	struct kvmi *ikvm = IKVM(kvm);
+	int idx;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+	write_lock(&ikvm->access_tree_lock);
+
+	radix_tree_for_each_slot(slot, &ikvm->access_tree, &iter, 0) {
+		struct kvmi_mem_access *m = *slot;
+
+		m->access = full_access;
+		kvmi_arch_update_page_tracking(kvm, NULL, m);
+
+		radix_tree_iter_delete(&ikvm->access_tree, &iter, slot);
+		kmem_cache_free(radix_cache, m);
+	}
+
+	write_unlock(&ikvm->access_tree_lock);
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
 static void kvmi_cache_destroy(void)
 {
 	kmem_cache_destroy(msg_cache);
 	msg_cache = NULL;
+	kmem_cache_destroy(radix_cache);
+	radix_cache = NULL;
 	kmem_cache_destroy(job_cache);
 	job_cache = NULL;
 }
 
 static int kvmi_cache_create(void)
 {
+	radix_cache = kmem_cache_create("kvmi_radix_tree",
+					sizeof(struct kvmi_mem_access),
+					0, SLAB_ACCOUNT, NULL);
 	job_cache = kmem_cache_create("kvmi_job",
 				      sizeof(struct kvmi_job),
 				      0, SLAB_ACCOUNT, NULL);
 	msg_cache = kmem_cache_create("kvmi_msg", KVMI_MSG_SIZE_ALLOC,
 				      4096, SLAB_ACCOUNT, NULL);
 
-	if (!msg_cache || !job_cache) {
+	if (!msg_cache || !radix_cache || !job_cache) {
 		kvmi_cache_destroy();
 
 		return -1;
@@ -77,6 +167,10 @@ static bool alloc_kvmi(struct kvm *kvm, const struct kvm_introspection *qemu)
 	if (!ikvm)
 		return false;
 
+	/* see comments of radix_tree_preload() - no direct reclaim */
+	INIT_RADIX_TREE(&ikvm->access_tree, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM);
+	rwlock_init(&ikvm->access_tree_lock);
+
 	atomic_set(&ikvm->ev_seq, 0);
 
 	set_bit(KVMI_GET_VERSION, ikvm->cmd_allow_mask);
@@ -85,6 +179,12 @@ static bool alloc_kvmi(struct kvm *kvm, const struct kvm_introspection *qemu)
 
 	memcpy(&ikvm->uuid, &qemu->uuid, sizeof(ikvm->uuid));
 
+	ikvm->kptn_node.track_preread = kvmi_track_preread;
+	ikvm->kptn_node.track_prewrite = kvmi_track_prewrite;
+	ikvm->kptn_node.track_preexec = kvmi_track_preexec;
+	ikvm->kptn_node.track_create_slot = kvmi_track_create_slot;
+	ikvm->kptn_node.track_flush_slot = kvmi_track_flush_slot;
+
 	ikvm->kvm = kvm;
 	kvm->kvmi = ikvm;
 
@@ -276,6 +376,179 @@ void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu)
 	vcpu->kvmi = NULL;
 }
 
+static bool is_pf_of_interest(struct kvm_vcpu *vcpu, gpa_t gpa, u8 access)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	if (kvm_mmu_nested_pagefault(vcpu))
+		return false;
+
+	/* Have we shown interest in this page? */
+	return kvmi_restricted_access(IKVM(kvm), gpa, access);
+}
+
+static bool __kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+	u8 *new, int bytes, struct kvm_page_track_notifier_node *node,
+	bool *data_ready)
+{
+	bool ret;
+
+	if (!is_pf_of_interest(vcpu, gpa, KVMI_PAGE_ACCESS_R))
+		return true;
+
+	ret = kvmi_arch_pf_event(vcpu, gpa, gva, KVMI_PAGE_ACCESS_R);
+
+	return ret;
+}
+
+static bool kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+	u8 *new, int bytes, struct kvm_page_track_notifier_node *node,
+	bool *data_ready)
+{
+	struct kvmi *ikvm;
+	bool ret = true;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return true;
+
+	if (is_event_enabled(vcpu, KVMI_EVENT_PF))
+		ret = __kvmi_track_preread(vcpu, gpa, gva, new, bytes, node,
+					   data_ready);
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
+static bool __kvmi_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+	const u8 *new, int bytes,
+	struct kvm_page_track_notifier_node *node)
+{
+	if (!is_pf_of_interest(vcpu, gpa, KVMI_PAGE_ACCESS_W))
+		return true;
+
+	return kvmi_arch_pf_event(vcpu, gpa, gva, KVMI_PAGE_ACCESS_W);
+}
+
+static bool kvmi_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+	const u8 *new, int bytes,
+	struct kvm_page_track_notifier_node *node)
+{
+	struct kvmi *ikvm;
+	bool ret = true;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return true;
+
+	if (is_event_enabled(vcpu, KVMI_EVENT_PF))
+		ret = __kvmi_track_prewrite(vcpu, gpa, gva, new, bytes, node);
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
+static bool __kvmi_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+	struct kvm_page_track_notifier_node *node)
+{
+	if (!is_pf_of_interest(vcpu, gpa, KVMI_PAGE_ACCESS_X))
+		return true;
+
+	return kvmi_arch_pf_event(vcpu, gpa, gva, KVMI_PAGE_ACCESS_X);
+}
+
+static bool kvmi_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+	struct kvm_page_track_notifier_node *node)
+{
+	struct kvmi *ikvm;
+	bool ret = true;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return true;
+
+	if (is_event_enabled(vcpu, KVMI_EVENT_PF))
+		ret = __kvmi_track_preexec(vcpu, gpa, gva, node);
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
+static void kvmi_track_create_slot(struct kvm *kvm,
+	struct kvm_memory_slot *slot,
+	unsigned long npages,
+	struct kvm_page_track_notifier_node *node)
+{
+	struct kvmi *ikvm;
+	gfn_t start = slot->base_gfn;
+	const gfn_t end = start + npages;
+	int idx;
+
+	ikvm = kvmi_get(kvm);
+	if (!ikvm)
+		return;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+	read_lock(&ikvm->access_tree_lock);
+
+	while (start < end) {
+		struct kvmi_mem_access *m;
+
+		m = __kvmi_get_gfn_access(ikvm, start);
+		if (m)
+			kvmi_arch_update_page_tracking(kvm, slot, m);
+		start++;
+	}
+
+	read_unlock(&ikvm->access_tree_lock);
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	kvmi_put(kvm);
+}
+
+static void kvmi_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
+	struct kvm_page_track_notifier_node *node)
+{
+	struct kvmi *ikvm;
+	gfn_t start = slot->base_gfn;
+	const gfn_t end = start + slot->npages;
+	int idx;
+
+	ikvm = kvmi_get(kvm);
+	if (!ikvm)
+		return;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+	write_lock(&ikvm->access_tree_lock);
+
+	while (start < end) {
+		struct kvmi_mem_access *m;
+
+		m = __kvmi_get_gfn_access(ikvm, start);
+		if (m) {
+			u8 prev_access = m->access;
+
+			m->access = full_access;
+			kvmi_arch_update_page_tracking(kvm, slot, m);
+			m->access = prev_access;
+		}
+
+		start++;
+	}
+
+	write_unlock(&ikvm->access_tree_lock);
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	kvmi_put(kvm);
+}
+
 static void kvmi_end_introspection(struct kvmi *ikvm)
 {
 	struct kvm *kvm = ikvm->kvm;
@@ -290,6 +563,22 @@ static void kvmi_end_introspection(struct kvmi *ikvm)
 	 */
 	kvmi_abort_events(kvm);
 
+	/*
+	 * This may sleep on synchronize_srcu() so it's not allowed to be
+	 * called under kvmi_put().
+	 * Also synchronize_srcu() may deadlock on (page tracking) read-side
+	 * regions that are waiting for reply to events, so must be called
+	 * after kvmi_abort_events().
+	 */
+	kvm_page_track_unregister_notifier(kvm, &ikvm->kptn_node);
+
+	/*
+	 * This function uses kvm->mmu_lock so it's not allowed to be
+	 * called under kvmi_put(). It can reach a deadlock if called
+	 * from kvm_mmu_load -> kvmi_tracked_gfn -> kvmi_put.
+	 */
+	kvmi_clear_mem_access(kvm);
+
 	/*
 	 * At this moment the socket is shut down, no more commands will come
 	 * from the introspector, and the only way into the introspection is
@@ -351,6 +640,8 @@ int kvmi_hook(struct kvm *kvm, const struct kvm_introspection *qemu)
 		goto err_alloc;
 	}
 
+	kvm_page_track_register_notifier(kvm, &ikvm->kptn_node);
+
 	/*
 	 * Make sure all the KVM/KVMI structures are linked and no pointer
 	 * is read as NULL after the reference count has been set.
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 7cff91bc1acc..d798908d0f70 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -6,6 +6,7 @@
 #include <linux/kvm_host.h>
 
 #include <uapi/linux/kvmi.h>
+#include <asm/kvmi_host.h>
 
 #define kvmi_debug(ikvm, fmt, ...) \
 	kvm_debug("%pU " fmt, &ikvm->uuid, ## __VA_ARGS__)
@@ -104,6 +105,10 @@ struct kvmi_vcpu {
 
 struct kvmi {
 	struct kvm *kvm;
+	struct kvm_page_track_notifier_node kptn_node;
+
+	struct radix_tree_root access_tree;
+	rwlock_t access_tree_lock;
 
 	struct socket *sock;
 	struct task_struct *recv;
@@ -118,6 +123,17 @@ struct kvmi {
 	bool cmd_reply_disabled;
 };
 
+struct kvmi_mem_access {
+	gfn_t gfn;
+	u8 access;
+	struct kvmi_arch_mem_access arch;
+};
+
+static inline bool is_event_enabled(struct kvm_vcpu *vcpu, int event)
+{
+	return false; /* TODO */
+}
+
 /* kvmi_msg.c */
 bool kvmi_sock_get(struct kvmi *ikvm, int fd);
 void kvmi_sock_shutdown(struct kvmi *ikvm);
@@ -138,7 +154,12 @@ int kvmi_add_job(struct kvm_vcpu *vcpu,
 		 void *ctx, void (*free_fct)(void *ctx));
 
 /* arch */
+void kvmi_arch_update_page_tracking(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    struct kvmi_mem_access *m);
 void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev);
+bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
+			u8 access);
 int kvmi_arch_cmd_get_vcpu_info(struct kvm_vcpu *vcpu,
 				struct kvmi_get_vcpu_info_reply *rpl);
 

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 28/92] kvm: x86: consult the page tracking from kvm_mmu_get_page() and __direct_map()
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (26 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 27/92] kvm: introspection: use page track Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 29/92] kvm: introspection: add KVMI_CONTROL_EVENTS Adalbert Lazăr
                   ` (66 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Sean Christopherson

From: Mihai Donțu <mdontu@bitdefender.com>

KVM doesn't normally need to keep track that closely to page access bits,
however for the introspection subsystem this is essential.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://marc.info/?l=kvm&m=149804987417131&w=2
CC: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/mmu.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9eaf6cc776a9..810e3e5bd575 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2491,6 +2491,20 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sp);
 }
 
+static unsigned int kvm_mmu_page_track_acc(struct kvm_vcpu *vcpu, gfn_t gfn,
+					   unsigned int acc)
+{
+	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREREAD))
+		acc &= ~ACC_USER_MASK;
+	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) ||
+	    kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
+		acc &= ~ACC_WRITE_MASK;
+	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREEXEC))
+		acc &= ~ACC_EXEC_MASK;
+
+	return acc;
+}
+
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
@@ -2511,7 +2525,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	role.direct = direct;
 	if (role.direct)
 		role.cr4_pae = 0;
-	role.access = access;
+	role.access = kvm_mmu_page_track_acc(vcpu, gfn, access);
 	if (!vcpu->arch.mmu->direct_map
 	    && vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) {
 		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
@@ -3234,7 +3248,10 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
 
 	for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
 		if (iterator.level == level) {
-			emulate = mmu_set_spte(vcpu, iterator.sptep, ACC_ALL,
+			unsigned int acc = kvm_mmu_page_track_acc(vcpu, gfn,
+								  ACC_ALL);
+
+			emulate = mmu_set_spte(vcpu, iterator.sptep, acc,
 					       write, level, gfn, pfn, prefault,
 					       map_writable);
 			direct_pte_prefetch(vcpu, iterator.sptep);

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 29/92] kvm: introspection: add KVMI_CONTROL_EVENTS
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (27 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 28/92] kvm: x86: consult the page tracking from kvm_mmu_get_page() and __direct_map() Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 30/92] kvm: x86: add kvm_spt_fault() Adalbert Lazăr
                   ` (65 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This command enables/disables vCPU introspection events.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 53 ++++++++++++++++++++++++++++++
 include/uapi/linux/kvmi.h          |  7 ++++
 virt/kvm/kvmi.c                    | 13 ++++++++
 virt/kvm/kvmi_int.h                |  6 +++-
 virt/kvm/kvmi_msg.c                | 24 ++++++++++++++
 5 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 71897338e85a..957641802cac 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -456,6 +456,59 @@ Returns the TSC frequency (in HZ) for the specified vCPU if available
 * -KVM_EINVAL - padding is not zero
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 
+8. KVMI_CONTROL_EVENTS
+----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_control_events {
+		__u16 event_id;
+		__u8 enable;
+		__u8 padding1;
+		__u32 padding2;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Enables/disables vCPU introspection events. This command can be used with
+the following events::
+
+	KVMI_EVENT_CR
+	KVMI_EVENT_MSR
+	KVMI_EVENT_XSETBV
+	KVMI_EVENT_BREAKPOINT
+	KVMI_EVENT_HYPERCALL
+	KVMI_EVENT_PF
+	KVMI_EVENT_TRAP
+	KVMI_EVENT_DESCRIPTOR
+	KVMI_EVENT_SINGLESTEP
+
+When an event is enabled, the introspection tool is notified and it
+must reply with: continue, retry, crash, etc. (see **Events** below).
+
+The *KVMI_EVENT_PAUSE_VCPU* event is always allowed,
+because it is triggered by the *KVMI_PAUSE_VCPU* command.
+The *KVMI_EVENT_CREATE_VCPU* and *KVMI_EVENT_UNHOOK* events are controlled
+by the *KVMI_CONTROL_VM_EVENTS* command.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the event ID is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EPERM - the access is restricted by the host
+* -KVM_EOPNOTSUPP - one the events can't be intercepted in the current setup
+
 Events
 ======
 
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index c56e676ddb2b..934c0610140a 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -120,6 +120,13 @@ struct kvmi_get_vcpu_info_reply {
 	__u64 tsc_speed;
 };
 
+struct kvmi_control_events {
+	__u16 event_id;
+	__u8 enable;
+	__u8 padding1;
+	__u32 padding2;
+};
+
 struct kvmi_control_vm_events {
 	__u16 event_id;
 	__u8 enable;
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 5cbc82b284f4..14963474617e 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -969,6 +969,19 @@ void kvmi_handle_requests(struct kvm_vcpu *vcpu)
 	kvmi_put(vcpu->kvm);
 }
 
+int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
+			    bool enable)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	if (enable)
+		set_bit(event_id, ivcpu->ev_mask);
+	else
+		clear_bit(event_id, ivcpu->ev_mask);
+
+	return 0;
+}
+
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
 			       bool enable)
 {
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index d798908d0f70..c0044cae8089 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -95,6 +95,8 @@ struct kvmi_vcpu {
 	bool reply_waiting;
 	struct kvmi_vcpu_reply reply;
 
+	DECLARE_BITMAP(ev_mask, KVMI_NUM_EVENTS);
+
 	struct list_head job_list;
 	spinlock_t job_lock;
 
@@ -131,7 +133,7 @@ struct kvmi_mem_access {
 
 static inline bool is_event_enabled(struct kvm_vcpu *vcpu, int event)
 {
-	return false; /* TODO */
+	return test_bit(event, IVCPU(vcpu)->ev_mask);
 }
 
 /* kvmi_msg.c */
@@ -146,6 +148,8 @@ int kvmi_msg_send_unhook(struct kvmi *ikvm);
 void *kvmi_msg_alloc(void);
 void *kvmi_msg_alloc_check(size_t size);
 void kvmi_msg_free(void *addr);
+int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
+			    bool enable);
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
 			       bool enable);
 int kvmi_run_jobs_and_wait(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 3372d8c7e74f..a3c67af8674e 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -24,6 +24,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_CHECK_COMMAND]         = "KVMI_CHECK_COMMAND",
 	[KVMI_CHECK_EVENT]           = "KVMI_CHECK_EVENT",
 	[KVMI_CONTROL_CMD_RESPONSE]  = "KVMI_CONTROL_CMD_RESPONSE",
+	[KVMI_CONTROL_EVENTS]        = "KVMI_CONTROL_EVENTS",
 	[KVMI_CONTROL_VM_EVENTS]     = "KVMI_CONTROL_VM_EVENTS",
 	[KVMI_EVENT]                 = "KVMI_EVENT",
 	[KVMI_EVENT_REPLY]           = "KVMI_EVENT_REPLY",
@@ -403,6 +404,28 @@ static int handle_get_vcpu_info(struct kvm_vcpu *vcpu,
 	return reply_cb(vcpu, msg, 0, &rpl, sizeof(rpl));
 }
 
+static int handle_control_events(struct kvm_vcpu *vcpu,
+				 const struct kvmi_msg_hdr *msg,
+				 const void *_req,
+				 vcpu_reply_fct reply_cb)
+{
+	unsigned long known_events = KVMI_KNOWN_VCPU_EVENTS;
+	const struct kvmi_control_events *req = _req;
+	struct kvmi *ikvm = IKVM(vcpu->kvm);
+	int ec;
+
+	if (req->padding1 || req->padding2)
+		ec = -KVM_EINVAL;
+	else if (!test_bit(req->event_id, &known_events))
+		ec = -KVM_EINVAL;
+	else if (!is_event_allowed(ikvm, req->event_id))
+		ec = -KVM_EPERM;
+	else
+		ec = kvmi_cmd_control_events(vcpu, req->event_id, req->enable);
+
+	return reply_cb(vcpu, msg, ec, NULL, 0);
+}
+
 /*
  * These commands are executed on the vCPU thread. The receiving thread
  * passes the messages using a newly allocated 'struct kvmi_vcpu_cmd'
@@ -412,6 +435,7 @@ static int handle_get_vcpu_info(struct kvm_vcpu *vcpu,
 static int(*const msg_vcpu[])(struct kvm_vcpu *,
 			      const struct kvmi_msg_hdr *, const void *,
 			      vcpu_reply_fct) = {
+	[KVMI_CONTROL_EVENTS]   = handle_control_events,
 	[KVMI_EVENT_REPLY]      = handle_event_reply,
 	[KVMI_GET_VCPU_INFO]    = handle_get_vcpu_info,
 };

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 30/92] kvm: x86: add kvm_spt_fault()
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (28 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 29/92] kvm: introspection: add KVMI_CONTROL_EVENTS Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 31/92] kvm: introspection: add KVMI_EVENT_PF Adalbert Lazăr
                   ` (64 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This is needed to filter #PF introspection events, when not caused by
EPT/NPT fault. One such case is when handle_ud() calls the emulator
which failes to fetch the opcodes from stack (which is hooked rw-)
which leads to a page fault event.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/svm.c              | 8 ++++++++
 arch/x86/kvm/vmx/vmx.c          | 8 ++++++++
 arch/x86/kvm/x86.c              | 6 ++++++
 4 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7da1137a2b82..f1b3d89a0430 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1005,6 +1005,7 @@ struct kvm_x86_ops {
 	void (*cpuid_update)(struct kvm_vcpu *vcpu);
 
 	bool (*nested_pagefault)(struct kvm_vcpu *vcpu);
+	bool (*spt_fault)(struct kvm_vcpu *vcpu);
 
 	struct kvm *(*vm_alloc)(void);
 	void (*vm_free)(struct kvm *);
@@ -1596,5 +1597,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 	*(type *)((buf) + (offset) - 0x7e00) = val
 
 bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu);
+bool kvm_spt_fault(struct kvm_vcpu *vcpu);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 3c099c56099c..6b533698c73d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -7103,6 +7103,13 @@ static bool svm_nested_pagefault(struct kvm_vcpu *vcpu)
 	return false;
 }
 
+static bool svm_spt_fault(struct kvm_vcpu *vcpu)
+{
+	const struct vcpu_svm *svm = to_svm(vcpu);
+
+	return (svm->vmcb->control.exit_code == SVM_EXIT_NPF);
+}
+
 static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -7115,6 +7122,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.has_emulated_msr = svm_has_emulated_msr,
 
 	.nested_pagefault = svm_nested_pagefault,
+	.spt_fault = svm_spt_fault,
 
 	.vcpu_create = svm_create_vcpu,
 	.vcpu_free = svm_free_vcpu,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e10ee8fd1c67..97cfd5a316f3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7689,6 +7689,13 @@ static bool vmx_nested_pagefault(struct kvm_vcpu *vcpu)
 	return true;
 }
 
+static bool vmx_spt_fault(struct kvm_vcpu *vcpu)
+{
+	const struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	return (vmx->exit_reason == EXIT_REASON_EPT_VIOLATION);
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -7701,6 +7708,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.has_emulated_msr = vmx_has_emulated_msr,
 
 	.nested_pagefault = vmx_nested_pagefault,
+	.spt_fault = vmx_spt_fault,
 
 	.vm_init = vmx_vm_init,
 	.vm_alloc = vmx_vm_alloc,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c28e2a20dec2..257c4a262acd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9884,6 +9884,12 @@ bool kvm_vector_hashing_enabled(void)
 }
 EXPORT_SYMBOL_GPL(kvm_vector_hashing_enabled);
 
+bool kvm_spt_fault(struct kvm_vcpu *vcpu)
+{
+	return kvm_x86_ops->spt_fault(vcpu);
+}
+EXPORT_SYMBOL(kvm_spt_fault);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 31/92] kvm: introspection: add KVMI_EVENT_PF
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (29 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 30/92] kvm: x86: add kvm_spt_fault() Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 32/92] kvm: introspection: add KVMI_GET_PAGE_ACCESS Adalbert Lazăr
                   ` (63 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu

From: Mihai Donțu <mdontu@bitdefender.com>

This event is sent when a #PF occurs due to a failed permission check
in the shadow page tables, for a page in which the introspection tool
has shown interest.

The introspection tool can respond to a KVMI_EVENT_PF event with custom
input for the current instruction. This input is used to trick the guest
software into believing it has read certain data, in order to hide the
content of certain memory areas (eg. hide injected code from integrity
checkers).

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst |  63 ++++++++++++++++++
 arch/x86/kvm/kvmi.c                |  38 ++++++++++-
 arch/x86/kvm/x86.c                 |   7 +-
 include/linux/kvmi.h               |   4 ++
 include/uapi/linux/kvmi.h          |  18 +++++
 virt/kvm/kvmi.c                    | 103 +++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h                |  13 ++++
 virt/kvm/kvmi_msg.c                |  55 +++++++++++++++
 8 files changed, 298 insertions(+), 3 deletions(-)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 957641802cac..0fc51b57b1e8 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -618,3 +618,66 @@ The introspection tool has a chance to unhook and close the KVMI channel
 This event is sent when a new vCPU is created and the introspection has
 been enabled for this event (see *KVMI_CONTROL_VM_EVENTS*).
 
+3. KVMI_EVENT_PF
+----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH, RETRY
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_pf {
+		__u64 gva;
+		__u64 gpa;
+		__u8 access;
+		__u8 padding1;
+		__u16 view;
+		__u32 padding2;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_event_reply;
+	struct kvmi_event_pf_reply {
+		__u64 ctx_addr;
+		__u32 ctx_size;
+		__u8 singlestep;
+		__u8 rep_complete;
+		__u16 padding;
+		__u8 ctx_data[256];
+	};
+
+This event is sent when a hypervisor page fault occurs due to a failed
+permission check in the shadow page tables, the introspection has
+been enabled for this event (see *KVMI_CONTROL_EVENTS*) and the event was
+generated for a page in which the introspector has shown interest
+(ie. has previously touched it by adjusting the spte permissions).
+
+The shadow page tables can be used by the introspection tool to guarantee
+the purpose of code areas inside the guest (code, rodata, stack, heap
+etc.) Each attempt at an operation unfitting for a certain memory
+range (eg. execute code in heap) triggers a page fault and gives the
+introspection tool the chance to audit the code attempting the operation.
+
+``kvmi_event``, guest virtual address (or 0xffffffff/UNMAPPED_GVA),
+guest physical address, access flags (eg. KVMI_PAGE_ACCESS_R) and the
+EPT view are sent to the introspector.
+
+The *CONTINUE* action will continue the page fault handling via emulation
+(with custom input if ``ctx_size`` > 0). The use of custom input is
+to trick the guest software into believing it has read certain data,
+in order to hide the content of certain memory areas (eg. hide injected
+code from integrity checkers). If ``rep_complete`` is not zero, the REP
+prefixed instruction should be emulated just once (or at least no other
+*KVMI_EVENT_PF* event should be sent for the current instruction).
+
+The *RETRY* action is used by the introspector to retry the execution of
+the current instruction. Either using single-step (if ``singlestep`` is
+not zero) or return to guest (if the introspector changed the instruction
+pointer or the page restrictions).
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index d7b9201582b4..121819f9c487 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -94,7 +94,43 @@ void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev)
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access)
 {
-	return KVMI_EVENT_ACTION_CONTINUE; /* TODO */
+	struct kvmi_vcpu *ivcpu;
+	u32 ctx_size;
+	u64 ctx_addr;
+	u32 action;
+	bool singlestep_ignored;
+	bool ret = false;
+
+	if (!kvm_spt_fault(vcpu))
+		/* We are only interested in EPT/NPT violations */
+		return true;
+
+	ivcpu = IVCPU(vcpu);
+	ctx_size = sizeof(ivcpu->ctx_data);
+
+	if (ivcpu->effective_rep_complete)
+		return true;
+
+	action = kvmi_msg_send_pf(vcpu, gpa, gva, access, &singlestep_ignored,
+				  &ivcpu->rep_complete, &ctx_addr,
+				  ivcpu->ctx_data, &ctx_size);
+
+	ivcpu->ctx_size = 0;
+	ivcpu->ctx_addr = 0;
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		ivcpu->ctx_size = ctx_size;
+		ivcpu->ctx_addr = ctx_addr;
+		ret = true;
+		break;
+	case KVMI_EVENT_ACTION_RETRY:
+		break;
+	default:
+		kvmi_handle_common_event_actions(vcpu, action, "PF");
+	}
+
+	return ret;
 }
 
 int kvmi_arch_cmd_get_vcpu_info(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 257c4a262acd..ef6d9dd80086 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6418,6 +6418,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 
 	vcpu->arch.l1tf_flush_l1d = true;
 
+	kvmi_init_emulate(vcpu);
+
 	/*
 	 * Clear write_fault_to_shadow_pgtable here to ensure it is
 	 * never reused.
@@ -6523,9 +6525,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 			writeback = false;
 		r = EMULATE_USER_EXIT;
 		vcpu->arch.complete_userspace_io = complete_emulated_mmio;
-	} else if (r == EMULATION_RESTART)
+	} else if (r == EMULATION_RESTART) {
+		kvmi_activate_rep_complete(vcpu);
 		goto restart;
-	else
+	} else
 		r = EMULATE_DONE;
 
 	if (writeback) {
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
index ae5de1905b55..80c15b9195e4 100644
--- a/include/linux/kvmi.h
+++ b/include/linux/kvmi.h
@@ -17,6 +17,8 @@ int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset);
 int kvmi_vcpu_init(struct kvm_vcpu *vcpu);
 void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu);
 void kvmi_handle_requests(struct kvm_vcpu *vcpu);
+void kvmi_init_emulate(struct kvm_vcpu *vcpu);
+void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu);
 
 #else
 
@@ -27,6 +29,8 @@ static inline void kvmi_destroy_vm(struct kvm *kvm) { }
 static inline int kvmi_vcpu_init(struct kvm_vcpu *vcpu) { return 0; }
 static inline void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu) { }
 static inline void kvmi_handle_requests(struct kvm_vcpu *vcpu) { }
+static inline void kvmi_init_emulate(struct kvm_vcpu *vcpu) { }
+static inline void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu) { }
 
 #endif /* CONFIG_KVM_INTROSPECTION */
 
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index 934c0610140a..40a5c304c26f 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -155,4 +155,22 @@ struct kvmi_event_reply {
 	__u32 padding2;
 };
 
+struct kvmi_event_pf {
+	__u64 gva;
+	__u64 gpa;
+	__u8 access;
+	__u8 padding1;
+	__u16 view;
+	__u32 padding2;
+};
+
+struct kvmi_event_pf_reply {
+	__u64 ctx_addr;
+	__u32 ctx_size;
+	__u8 singlestep;
+	__u8 rep_complete;
+	__u16 padding;
+	__u8 ctx_data[256];
+};
+
 #endif /* _UAPI__LINUX_KVMI_H */
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 14963474617e..0264115a7f4d 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -387,6 +387,52 @@ static bool is_pf_of_interest(struct kvm_vcpu *vcpu, gpa_t gpa, u8 access)
 	return kvmi_restricted_access(IKVM(kvm), gpa, access);
 }
 
+/*
+ * The custom input is defined by a virtual address and size, and all reads
+ * must be within this space. Reads that are completely outside should be
+ * satisfyied using guest memory. Overlapping reads are erroneous.
+ */
+static int use_custom_input(struct kvm_vcpu *vcpu, gva_t gva, u8 *new,
+			    int bytes)
+{
+	unsigned int offset;
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	if (!ivcpu->ctx_size || !bytes)
+		return 0;
+
+	if (bytes < 0 || bytes > ivcpu->ctx_size) {
+		kvmi_warn_once(IKVM(vcpu->kvm),
+			       "invalid range: %d (max: %u)\n",
+			       bytes, ivcpu->ctx_size);
+		return 0;
+	}
+
+	if (gva + bytes <= ivcpu->ctx_addr ||
+	    gva >= ivcpu->ctx_addr + ivcpu->ctx_size)
+		return 0;
+
+	if (gva < ivcpu->ctx_addr && gva + bytes > ivcpu->ctx_addr) {
+		kvmi_warn_once(IKVM(vcpu->kvm),
+			       "read ranges overlap: 0x%lx:%d, 0x%llx:%u\n",
+			       gva, bytes, ivcpu->ctx_addr, ivcpu->ctx_size);
+		return 0;
+	}
+
+	if (gva + bytes > ivcpu->ctx_addr + ivcpu->ctx_size) {
+		kvmi_warn_once(IKVM(vcpu->kvm),
+			       "read ranges overlap: 0x%lx:%d, 0x%llx:%u\n",
+			       gva, bytes, ivcpu->ctx_addr, ivcpu->ctx_size);
+		return 0;
+	}
+
+	offset = gva - ivcpu->ctx_addr;
+
+	memcpy(new, ivcpu->ctx_data + offset, bytes);
+
+	return bytes;
+}
+
 static bool __kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 	u8 *new, int bytes, struct kvm_page_track_notifier_node *node,
 	bool *data_ready)
@@ -396,9 +442,24 @@ static bool __kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 	if (!is_pf_of_interest(vcpu, gpa, KVMI_PAGE_ACCESS_R))
 		return true;
 
+	if (use_custom_input(vcpu, gva, new, bytes))
+		goto out_custom;
+
 	ret = kvmi_arch_pf_event(vcpu, gpa, gva, KVMI_PAGE_ACCESS_R);
 
+	if (ret && use_custom_input(vcpu, gva, new, bytes))
+		goto out_custom;
+
 	return ret;
+
+out_custom:
+	if (*data_ready)
+		kvmi_err(IKVM(vcpu->kvm),
+			"Override custom data from another tracker\n");
+
+	*data_ready = true;
+
+	return true;
 }
 
 static bool kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
@@ -855,6 +916,48 @@ void kvmi_handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action,
 	}
 }
 
+void kvmi_init_emulate(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm;
+	struct kvmi_vcpu *ivcpu;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return;
+
+	ivcpu = IVCPU(vcpu);
+
+	ivcpu->rep_complete = false;
+	ivcpu->effective_rep_complete = false;
+
+	ivcpu->ctx_size = 0;
+	ivcpu->ctx_addr = 0;
+
+	kvmi_put(vcpu->kvm);
+}
+EXPORT_SYMBOL(kvmi_init_emulate);
+
+/*
+ * If the user has requested that events triggered by repetitive
+ * instructions be suppressed after the first cycle, then this
+ * function will effectively activate it. This ensures that we don't
+ * prematurely suppress potential events (second or later) triggerd
+ * by an instruction during a single pass.
+ */
+void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return;
+
+	IVCPU(vcpu)->effective_rep_complete = IVCPU(vcpu)->rep_complete;
+
+	kvmi_put(vcpu->kvm);
+}
+EXPORT_SYMBOL(kvmi_activate_rep_complete);
+
 static bool __kvmi_create_vcpu_event(struct kvm_vcpu *vcpu)
 {
 	u32 action;
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index c0044cae8089..d478d9a2e769 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -26,6 +26,8 @@
 
 #define IVCPU(vcpu) ((struct kvmi_vcpu *)((vcpu)->kvmi))
 
+#define KVMI_CTX_DATA_SIZE FIELD_SIZEOF(struct kvmi_event_pf_reply, ctx_data)
+
 #define KVMI_MSG_SIZE_ALLOC (sizeof(struct kvmi_msg_hdr) + KVMI_MSG_SIZE)
 
 #define KVMI_KNOWN_VCPU_EVENTS ( \
@@ -92,6 +94,12 @@ struct kvmi_vcpu_reply {
 };
 
 struct kvmi_vcpu {
+	u8 ctx_data[KVMI_CTX_DATA_SIZE];
+	u32 ctx_size;
+	u64 ctx_addr;
+	bool rep_complete;
+	bool effective_rep_complete;
+
 	bool reply_waiting;
 	struct kvmi_vcpu_reply reply;
 
@@ -141,6 +149,9 @@ bool kvmi_sock_get(struct kvmi *ikvm, int fd);
 void kvmi_sock_shutdown(struct kvmi *ikvm);
 void kvmi_sock_put(struct kvmi *ikvm);
 bool kvmi_msg_process(struct kvmi *ikvm);
+u32 kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u8 access,
+		     bool *singlestep, bool *rep_complete,
+		     u64 *ctx_addr, u8 *ctx, u32 *ctx_size);
 u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu);
 int kvmi_msg_send_unhook(struct kvmi *ikvm);
 
@@ -156,6 +167,8 @@ int kvmi_run_jobs_and_wait(struct kvm_vcpu *vcpu);
 int kvmi_add_job(struct kvm_vcpu *vcpu,
 		 void (*fct)(struct kvm_vcpu *vcpu, void *ctx),
 		 void *ctx, void (*free_fct)(void *ctx));
+void kvmi_handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action,
+				      const char *str);
 
 /* arch */
 void kvmi_arch_update_page_tracking(struct kvm *kvm,
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index a3c67af8674e..0642356d4e04 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -764,6 +764,61 @@ int kvmi_msg_send_unhook(struct kvmi *ikvm)
 	return kvmi_sock_write(ikvm, vec, n, msg_size);
 }
 
+u32 kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u8 access,
+		     bool *singlestep, bool *rep_complete, u64 *ctx_addr,
+		     u8 *ctx_data, u32 *ctx_size)
+{
+	u32 max_ctx_size = *ctx_size;
+	struct kvmi_event_pf e;
+	struct kvmi_event_pf_reply r;
+	int err, action;
+
+	memset(&e, 0, sizeof(e));
+	e.gpa = gpa;
+	e.gva = gva;
+	e.access = access;
+
+	err = kvmi_send_event(vcpu, KVMI_EVENT_PF, &e, sizeof(e),
+			      &r, sizeof(r), &action);
+	if (err)
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	if (e.padding1 || e.padding2) {
+		struct kvmi *ikvm = IKVM(vcpu->kvm);
+
+		kvmi_err(ikvm, "%s: non zero padding %u,%u\n",
+			__func__, e.padding1, e.padding2);
+		kvmi_sock_shutdown(ikvm);
+		return KVMI_EVENT_ACTION_CONTINUE;
+	}
+
+	*ctx_size = 0;
+
+	if (r.ctx_size > max_ctx_size) {
+		struct kvmi *ikvm = IKVM(vcpu->kvm);
+
+		kvmi_err(ikvm, "%s: ctx_size (recv:%u max:%u)\n",
+				__func__, r.ctx_size, max_ctx_size);
+
+		kvmi_sock_shutdown(ikvm);
+
+		*singlestep = false;
+		*rep_complete = 0;
+
+		return KVMI_EVENT_ACTION_CONTINUE;
+	}
+
+	*singlestep = r.singlestep;
+	*rep_complete = r.rep_complete;
+
+	*ctx_size = min_t(u32, r.ctx_size, sizeof(r.ctx_data));
+	*ctx_addr = r.ctx_addr;
+	if (*ctx_size)
+		memcpy(ctx_data, r.ctx_data, *ctx_size);
+
+	return action;
+}
+
 u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu)
 {
 	int err, action;

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 32/92] kvm: introspection: add KVMI_GET_PAGE_ACCESS
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (30 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 31/92] kvm: introspection: add KVMI_EVENT_PF Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 33/92] kvm: introspection: add KVMI_SET_PAGE_ACCESS Adalbert Lazăr
                   ` (62 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

Returns the spte access bits (rwx) for an array of guest physical
addresses.

It does this by checking the radix tree in which only the spte bits
"enforced" by the introspection tool are saved. This information should
already be known by the tool. Not to mention that the KVMI_EVENT_PF
events are sent only for EPT violation caused by these restrictions.
So, we might drop this command.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 54 ++++++++++++++++++++++++++++++
 arch/x86/kvm/kvmi.c                | 41 +++++++++++++++++++++++
 include/uapi/linux/kvmi.h          | 11 ++++++
 virt/kvm/kvmi.c                    |  9 +++++
 virt/kvm/kvmi_int.h                |  6 ++++
 virt/kvm/kvmi_msg.c                | 17 ++++++++++
 6 files changed, 138 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 0fc51b57b1e8..c27fea73ccfb 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -509,6 +509,60 @@ by the *KVMI_CONTROL_VM_EVENTS* command.
 * -KVM_EPERM - the access is restricted by the host
 * -KVM_EOPNOTSUPP - one the events can't be intercepted in the current setup
 
+9. KVMI_GET_PAGE_ACCESS
+-----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_get_page_access {
+		__u16 view;
+		__u16 count;
+		__u32 padding;
+		__u64 gpa[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_page_access_reply {
+		__u8 access[0];
+	};
+
+Returns the spte access bits (rwx) for an array of ``count`` guest
+physical addresses.
+
+The valid access bits for *KVMI_GET_PAGE_ACCESS* and *KVMI_SET_PAGE_ACCESS*
+are::
+
+	KVMI_PAGE_ACCESS_R
+	KVMI_PAGE_ACCESS_W
+	KVMI_PAGE_ACCESS_X
+
+By default, for any guest physical address, the returned access mode will
+be 'rwx' (all the above bits). If the introspection tool must prevent
+the code execution from a guest page, for example, it should use the
+KVMI_SET_PAGE_ACCESS command to set the 'rw' bits for any guest physical
+addresses contained in that page. Of course, in order to receive
+page fault events when these violations take place, the KVMI_CONTROL_EVENTS
+command must be used to enable this type of event (KVMI_EVENT_PF).
+
+On Intel hardware with multiple EPT views, the ``view`` argument selects the
+EPT view (0 is primary). On all other hardware it must be zero.
+
+:Errors:
+
+* -KVM_EINVAL - the selected SPT view is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EOPNOTSUPP - a SPT view was selected but the hardware doesn't support it
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_ENOMEM - not enough memory to allocate the reply
+
 Events
 ======
 
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 121819f9c487..59cf33127b4b 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -183,3 +183,44 @@ void kvmi_arch_update_page_tracking(struct kvm *kvm,
 		}
 	}
 }
+
+int kvmi_arch_cmd_get_page_access(struct kvmi *ikvm,
+				  const struct kvmi_msg_hdr *msg,
+				  const struct kvmi_get_page_access *req,
+				  struct kvmi_get_page_access_reply **dest,
+				  size_t *dest_size)
+{
+	struct kvmi_get_page_access_reply *rpl = NULL;
+	size_t rpl_size = 0;
+	size_t k, n = req->count;
+	int ec = 0;
+
+	if (req->padding)
+		return -KVM_EINVAL;
+
+	if (msg->size < sizeof(*req) + req->count * sizeof(req->gpa[0]))
+		return -KVM_EINVAL;
+
+	if (req->view != 0)	/* TODO */
+		return -KVM_EOPNOTSUPP;
+
+	rpl_size = sizeof(*rpl) + sizeof(rpl->access[0]) * n;
+	rpl = kvmi_msg_alloc_check(rpl_size);
+	if (!rpl)
+		return -KVM_ENOMEM;
+
+	for (k = 0; k < n && ec == 0; k++)
+		ec = kvmi_cmd_get_page_access(ikvm, req->gpa[k],
+					      &rpl->access[k]);
+
+	if (ec) {
+		kvmi_msg_free(rpl);
+		return ec;
+	}
+
+	*dest = rpl;
+	*dest_size = rpl_size;
+
+	return 0;
+}
+
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index 40a5c304c26f..047436a0bdc0 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -116,6 +116,17 @@ struct kvmi_get_guest_info_reply {
 	__u32 padding[3];
 };
 
+struct kvmi_get_page_access {
+	__u16 view;
+	__u16 count;
+	__u32 padding;
+	__u64 gpa[0];
+};
+
+struct kvmi_get_page_access_reply {
+	__u8 access[0];
+};
+
 struct kvmi_get_vcpu_info_reply {
 	__u64 tsc_speed;
 };
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 0264115a7f4d..20505e4c4b5f 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -1072,6 +1072,15 @@ void kvmi_handle_requests(struct kvm_vcpu *vcpu)
 	kvmi_put(vcpu->kvm);
 }
 
+int kvmi_cmd_get_page_access(struct kvmi *ikvm, u64 gpa, u8 *access)
+{
+	gfn_t gfn = gpa_to_gfn(gpa);
+
+	kvmi_get_gfn_access(ikvm, gfn, access);
+
+	return 0;
+}
+
 int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable)
 {
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index d478d9a2e769..00dc5cf72f88 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -159,6 +159,7 @@ int kvmi_msg_send_unhook(struct kvmi *ikvm);
 void *kvmi_msg_alloc(void);
 void *kvmi_msg_alloc_check(size_t size);
 void kvmi_msg_free(void *addr);
+int kvmi_cmd_get_page_access(struct kvmi *ikvm, u64 gpa, u8 *access);
 int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable);
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
@@ -174,6 +175,11 @@ void kvmi_handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action,
 void kvmi_arch_update_page_tracking(struct kvm *kvm,
 				    struct kvm_memory_slot *slot,
 				    struct kvmi_mem_access *m);
+int kvmi_arch_cmd_get_page_access(struct kvmi *ikvm,
+				  const struct kvmi_msg_hdr *msg,
+				  const struct kvmi_get_page_access *req,
+				  struct kvmi_get_page_access_reply **dest,
+				  size_t *dest_size);
 void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev);
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access);
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 0642356d4e04..09ad17479abb 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -29,6 +29,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_EVENT]                 = "KVMI_EVENT",
 	[KVMI_EVENT_REPLY]           = "KVMI_EVENT_REPLY",
 	[KVMI_GET_GUEST_INFO]        = "KVMI_GET_GUEST_INFO",
+	[KVMI_GET_PAGE_ACCESS]       = "KVMI_GET_PAGE_ACCESS",
 	[KVMI_GET_VCPU_INFO]         = "KVMI_GET_VCPU_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
 };
@@ -323,6 +324,21 @@ static int handle_control_cmd_response(struct kvmi *ikvm,
 	return err;
 }
 
+static int handle_get_page_access(struct kvmi *ikvm,
+				  const struct kvmi_msg_hdr *msg,
+				  const void *req)
+{
+	struct kvmi_get_page_access_reply *rpl = NULL;
+	size_t rpl_size = 0;
+	int err, ec;
+
+	ec = kvmi_arch_cmd_get_page_access(ikvm, msg, req, &rpl, &rpl_size);
+
+	err = kvmi_msg_vm_maybe_reply(ikvm, msg, ec, rpl, rpl_size);
+	kvmi_msg_free(rpl);
+	return err;
+}
+
 static bool invalid_vcpu_hdr(const struct kvmi_vcpu_hdr *hdr)
 {
 	return hdr->padding1 || hdr->padding2;
@@ -338,6 +354,7 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 	[KVMI_CONTROL_CMD_RESPONSE]  = handle_control_cmd_response,
 	[KVMI_CONTROL_VM_EVENTS]     = handle_control_vm_events,
 	[KVMI_GET_GUEST_INFO]        = handle_get_guest_info,
+	[KVMI_GET_PAGE_ACCESS]       = handle_get_page_access,
 	[KVMI_GET_VERSION]           = handle_get_version,
 };
 

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 33/92] kvm: introspection: add KVMI_SET_PAGE_ACCESS
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (31 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 32/92] kvm: introspection: add KVMI_GET_PAGE_ACCESS Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 34/92] Documentation: Introduce EPT based Subpage Protection Adalbert Lazăr
                   ` (61 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This command sets the spte access bits (rwx) for an array of guest
physical addresses (through the page track subsystem).

These pages, with the requested access bits, are also kept in a radix
tree in order to filter out the #PF events which are of no interest to
the introspection tool.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 54 ++++++++++++++++++++++++++
 arch/x86/kvm/kvmi.c                | 36 ++++++++++++++++++
 include/uapi/linux/kvmi.h          | 15 ++++++++
 virt/kvm/kvmi.c                    | 61 ++++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h                |  4 ++
 virt/kvm/kvmi_msg.c                | 13 +++++++
 6 files changed, 183 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index c27fea73ccfb..b64a030507cf 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -563,6 +563,60 @@ EPT view (0 is primary). On all other hardware it must be zero.
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 * -KVM_ENOMEM - not enough memory to allocate the reply
 
+10. KVMI_SET_PAGE_ACCESS
+------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_set_page_access {
+		__u16 view;
+		__u16 count;
+		__u32 padding;
+		struct kvmi_page_access_entry entries[0];
+	};
+
+where::
+
+	struct kvmi_page_access_entry {
+		__u64 gpa;
+		__u8 access;
+		__u8 padding1;
+		__u16 padding2;
+		__u32 padding3;
+	};
+
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Sets the spte access bits (rwx) for an array of ``count`` guest physical
+addresses.
+
+The command will fail with -KVM_EINVAL if any of the specified combination
+of access bits is not supported.
+
+The command will make the changes in order and it will stop on the first
+error. The introspection tool should handle the rollback.
+
+In order to 'forget' an address, all the access bits ('rwx') must be set.
+
+:Errors:
+
+* -KVM_EINVAL - the specified access bits combination is invalid
+* -KVM_EINVAL - the selected SPT view is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EINVAL - the message size is invalid
+* -KVM_EOPNOTSUPP - a SPT view was selected but the hardware doesn't support it
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_ENOMEM - not enough memory to add the page tracking structures
+
 Events
 ======
 
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 59cf33127b4b..3238ef176ad6 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -224,3 +224,39 @@ int kvmi_arch_cmd_get_page_access(struct kvmi *ikvm,
 	return 0;
 }
 
+int kvmi_arch_cmd_set_page_access(struct kvmi *ikvm,
+				  const struct kvmi_msg_hdr *msg,
+				  const struct kvmi_set_page_access *req)
+{
+	const struct kvmi_page_access_entry *entry = req->entries;
+	const struct kvmi_page_access_entry *end = req->entries + req->count;
+	u8 unknown_bits = ~(KVMI_PAGE_ACCESS_R | KVMI_PAGE_ACCESS_W
+			    | KVMI_PAGE_ACCESS_X);
+	int ec = 0;
+
+	if (req->padding)
+		return -KVM_EINVAL;
+
+	if (msg->size < sizeof(*req) + (end - entry) * sizeof(*entry))
+		return -KVM_EINVAL;
+
+	if (req->view != 0)	/* TODO */
+		return -KVM_EOPNOTSUPP;
+
+	for (; entry < end; entry++) {
+		if ((entry->access & unknown_bits) || entry->padding1
+				|| entry->padding2 || entry->padding3)
+			ec = -KVM_EINVAL;
+		else
+			ec = kvmi_cmd_set_page_access(ikvm, entry->gpa,
+						      entry->access);
+		if (ec)
+			kvmi_warn(ikvm, "%s: %llx %x padding %x,%x,%x",
+				  __func__, entry->gpa, entry->access,
+				  entry->padding1, entry->padding2,
+				  entry->padding3);
+	}
+
+	return ec;
+}
+
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index 047436a0bdc0..2ddbb1fea807 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -127,6 +127,21 @@ struct kvmi_get_page_access_reply {
 	__u8 access[0];
 };
 
+struct kvmi_page_access_entry {
+	__u64 gpa;
+	__u8 access;
+	__u8 padding1;
+	__u16 padding2;
+	__u32 padding3;
+};
+
+struct kvmi_set_page_access {
+	__u16 view;
+	__u16 count;
+	__u32 padding;
+	struct kvmi_page_access_entry entries[0];
+};
+
 struct kvmi_get_vcpu_info_reply {
 	__u64 tsc_speed;
 };
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 20505e4c4b5f..4a9a4430a460 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -73,6 +73,57 @@ static int kvmi_get_gfn_access(struct kvmi *ikvm, const gfn_t gfn,
 	return m ? 0 : -1;
 }
 
+static int kvmi_set_gfn_access(struct kvm *kvm, gfn_t gfn, u8 access)
+{
+	struct kvmi_mem_access *m;
+	struct kvmi_mem_access *__m;
+	struct kvmi *ikvm = IKVM(kvm);
+	int err = 0;
+	int idx;
+
+	m = kmem_cache_zalloc(radix_cache, GFP_KERNEL);
+	if (!m)
+		return -KVM_ENOMEM;
+
+	m->gfn = gfn;
+	m->access = access;
+
+	if (radix_tree_preload(GFP_KERNEL)) {
+		err = -KVM_ENOMEM;
+		goto exit;
+	}
+
+	idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+	write_lock(&ikvm->access_tree_lock);
+
+	__m = __kvmi_get_gfn_access(ikvm, gfn);
+	if (__m) {
+		__m->access = access;
+		kvmi_arch_update_page_tracking(kvm, NULL, __m);
+		if (access == full_access) {
+			radix_tree_delete(&ikvm->access_tree, gfn);
+			kmem_cache_free(radix_cache, __m);
+		}
+	} else {
+		radix_tree_insert(&ikvm->access_tree, gfn, m);
+		kvmi_arch_update_page_tracking(kvm, NULL, m);
+		m = NULL;
+	}
+
+	write_unlock(&ikvm->access_tree_lock);
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	radix_tree_preload_end();
+
+exit:
+	if (m)
+		kmem_cache_free(radix_cache, m);
+
+	return err;
+}
+
 static bool kvmi_restricted_access(struct kvmi *ikvm, gpa_t gpa, u8 access)
 {
 	u8 allowed_access;
@@ -1081,6 +1132,16 @@ int kvmi_cmd_get_page_access(struct kvmi *ikvm, u64 gpa, u8 *access)
 	return 0;
 }
 
+int kvmi_cmd_set_page_access(struct kvmi *ikvm, u64 gpa, u8 access)
+{
+	gfn_t gfn = gpa_to_gfn(gpa);
+	u8 ignored_access;
+
+	kvmi_get_gfn_access(ikvm, gfn, &ignored_access);
+
+	return kvmi_set_gfn_access(ikvm->kvm, gfn, access);
+}
+
 int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable)
 {
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 00dc5cf72f88..c54be93349b7 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -160,6 +160,7 @@ void *kvmi_msg_alloc(void);
 void *kvmi_msg_alloc_check(size_t size);
 void kvmi_msg_free(void *addr);
 int kvmi_cmd_get_page_access(struct kvmi *ikvm, u64 gpa, u8 *access);
+int kvmi_cmd_set_page_access(struct kvmi *ikvm, u64 gpa, u8 access);
 int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable);
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
@@ -180,6 +181,9 @@ int kvmi_arch_cmd_get_page_access(struct kvmi *ikvm,
 				  const struct kvmi_get_page_access *req,
 				  struct kvmi_get_page_access_reply **dest,
 				  size_t *dest_size);
+int kvmi_arch_cmd_set_page_access(struct kvmi *ikvm,
+				  const struct kvmi_msg_hdr *msg,
+				  const struct kvmi_set_page_access *req);
 void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev);
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access);
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 09ad17479abb..c150e7bdd440 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -32,6 +32,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_GET_PAGE_ACCESS]       = "KVMI_GET_PAGE_ACCESS",
 	[KVMI_GET_VCPU_INFO]         = "KVMI_GET_VCPU_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
+	[KVMI_SET_PAGE_ACCESS]       = "KVMI_SET_PAGE_ACCESS",
 };
 
 static bool is_known_message(u16 id)
@@ -339,6 +340,17 @@ static int handle_get_page_access(struct kvmi *ikvm,
 	return err;
 }
 
+static int handle_set_page_access(struct kvmi *ikvm,
+				  const struct kvmi_msg_hdr *msg,
+				  const void *req)
+{
+	int ec;
+
+	ec = kvmi_arch_cmd_set_page_access(ikvm, msg, req);
+
+	return kvmi_msg_vm_maybe_reply(ikvm, msg, ec, NULL, 0);
+}
+
 static bool invalid_vcpu_hdr(const struct kvmi_vcpu_hdr *hdr)
 {
 	return hdr->padding1 || hdr->padding2;
@@ -356,6 +368,7 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 	[KVMI_GET_GUEST_INFO]        = handle_get_guest_info,
 	[KVMI_GET_PAGE_ACCESS]       = handle_get_page_access,
 	[KVMI_GET_VERSION]           = handle_get_version,
+	[KVMI_SET_PAGE_ACCESS]       = handle_set_page_access,
 };
 
 static int handle_event_reply(struct kvm_vcpu *vcpu,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 34/92] Documentation: Introduce EPT based Subpage Protection
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (32 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 33/92] kvm: introspection: add KVMI_SET_PAGE_ACCESS Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 35/92] KVM: VMX: Add control flags for SPP enabling Adalbert Lazăr
                   ` (60 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, yi.z.zhang

From: Yang Weijiang <weijiang.yang@intel.com>

Co-developed-by: yi.z.zhang@linux.intel.com
Signed-off-by: yi.z.zhang@linux.intel.com
Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Message-Id: <20190717133751.12910-2-weijiang.yang@intel.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/spp_kvm.txt | 173 ++++++++++++++++++++++++++
 1 file changed, 173 insertions(+)
 create mode 100644 Documentation/virtual/kvm/spp_kvm.txt

diff --git a/Documentation/virtual/kvm/spp_kvm.txt b/Documentation/virtual/kvm/spp_kvm.txt
new file mode 100644
index 000000000000..bdf94922cba9
--- /dev/null
+++ b/Documentation/virtual/kvm/spp_kvm.txt
@@ -0,0 +1,173 @@
+EPT-Based Sub-Page Protection (SPP) for KVM
+====================================================
+
+1.Overview
+  EPT-based Sub-Page Protection(SPP) allows VMM to specify
+  fine-grained(128byte per sub-page) write-protection for guest physical
+  memory. When it's enabled, the CPU enforces write-access permission
+  for the sub-pages within a 4KB page, if corresponding bit is set in
+  permission vector, write to sub-page region is allowed, otherwise,
+  it's prevented with a EPT violation.
+
+2.SPP Operation
+  Sub-Page Protection Table (SPPT) is introduced to manage sub-page
+  write-access permission.
+
+  It is active when:
+  a) large paging is disabled on host side.
+  b) "sub-page write protection" VM-execution control is 1.
+  c) SPP is initialized with KVM_INIT_SPP ioctl successfully.
+  d) Sub-page permissions are set with KVM_SUBPAGES_SET_ACCESS ioctl
+     successfully. see below sections for details.
+
+  __________________________________________________________________________
+
+  How SPP hardware works:
+  __________________________________________________________________________
+
+  Guest write access --> GPA --> Walk EPT --> EPT leaf entry -----|
+  |---------------------------------------------------------------|
+  |-> if VMexec_control.spp && ept_leaf_entry.spp_bit (bit 61)
+       |
+       |-> <false> --> EPT legacy behavior
+       |
+       |
+       |-> <true>  --> if ept_leaf_entry.writable
+                        |
+                        |-> <true>  --> Ignore SPP
+                        |
+                        |-> <false> --> GPA --> Walk SPP 4-level table--|
+                                                                        |
+  |------------<----------get-the-SPPT-point-from-VMCS-filed-----<------|
+  |
+  Walk SPP L4E table
+  |
+  |---> if-entry-misconfiguration ------------>-------|-------<---------|
+   |                                                  |                 |
+  else                                                |                 |
+   |                                                  |                 |
+   |   |------------------SPP VMexit<-----------------|                 |
+   |   |                                                                |
+   |   |-> exit_qualification & sppt_misconfig --> sppt misconfig       |
+   |   |                                                                |
+   |   |-> exit_qualification & sppt_miss --> sppt miss                 |
+   |---|                                                                |
+       |                                                                |
+  walk SPPT L3E--|--> if-entry-misconfiguration------------>------------|
+                 |                                                      |
+                else                                                    |
+                 |                                                      |
+                 |                                                      |
+          walk SPPT L2E --|--> if-entry-misconfiguration-------->-------|
+                          |                                             |
+                         else                                           |
+                          |                                             |
+                          |                                             |
+                   walk SPPT L1E --|-> if-entry-misconfiguration--->----|
+                                   |
+                                 else
+                                   |
+                                   |-> if sub-page writable
+                                   |-> <true>  allow, write access
+                                   |-> <false> disallow, EPT violation
+  ______________________________________________________________________________
+
+3.IOCTL Interfaces
+
+    KVM_INIT_SPP:
+    Allocate storage for sub-page permission vectors and SPPT root page.
+
+    KVM_SUBPAGES_GET_ACCESS:
+    Get sub-page write permission vectors for given continuous guest pages.
+
+    KVM_SUBPAGES_SET_ACCESS
+    Set sub-pages write permission vectors for given continuous guest pages.
+
+    /* for KVM_SUBPAGES_GET_ACCESS and KVM_SUBPAGES_SET_ACCESS */
+    struct kvm_subpage_info {
+       __u64 gfn; /* the first page gfn of the continuous pages */
+       __u64 npages; /* number of 4K pages */
+       __u64 *access_map; /* sub-page write-access bitmap array */
+    };
+
+    #define KVM_SUBPAGES_GET_ACCESS   _IOR(KVMIO,  0x49, __u64)
+    #define KVM_SUBPAGES_SET_ACCESS   _IOW(KVMIO,  0x4a, __u64)
+    #define KVM_INIT_SPP              _IOW(KVMIO,  0x4b, __u64)
+
+4.Set Sub-Page Permission
+
+  * To enable SPP protection, system admin sets sub-page permission via
+    KVM_SUBPAGES_SET_ACCESS ioctl:
+
+    (1) If the target 4KB pages are there, it locates EPT leaf entries
+        via the guest physical addresses, sets the bit 61 of the corresponding
+        entries to enable sub-page protection, then set up SPPT paging structure.
+    (2) otherwise, stores the [gfn,permission] mappings in KVM data structure. When
+        EPT page-fault is generated due to access to target page, it settles
+        EPT entry configuration together with SPPT setup, this is called lazy mode
+        setup.
+
+   The SPPT paging structure format is as below:
+
+   Format of the SPPT L4E, L3E, L2E:
+   | Bit    | Contents                                                                 |
+   | :----- | :------------------------------------------------------------------------|
+   | 0      | Valid entry when set; indicates whether the entry is present             |
+   | 11:1   | Reserved (0)                                                             |
+   | N-1:12 | Physical address of 4KB aligned SPPT LX-1 Table referenced by this entry |
+   | 51:N   | Reserved (0)                                                             |
+   | 63:52  | Reserved (0)                                                             |
+   Note: N is the physical address width supported by the processor. X is the page level
+
+   Format of the SPPT L1E:
+   | Bit   | Contents                                                          |
+   | :---- | :---------------------------------------------------------------- |
+   | 0+2i  | Write permission for i-th 128 byte sub-page region.               |
+   | 1+2i  | Reserved (0).                                                     |
+   Note: 0<=i<=31
+
+5.SPPT-induced VM exit
+
+  * SPPT miss and misconfiguration induced VM exit
+
+    A SPPT missing VM exit occurs when walk the SPPT, there is no SPPT
+    misconfiguration but a paging-structure entry is not
+    present in any of L4E/L3E/L2E entries.
+
+    A SPPT misconfiguration VM exit occurs when reserved bits or unsupported values
+    are set in SPPT entry.
+
+    *NOTE* SPPT miss and SPPT misconfigurations can occur only due to an
+    attempt to write memory with a guest physical address.
+
+  * SPP permission induced VM exit
+    SPP sub-page permission induced violation is reported as EPT violation
+    thesefore causes VM exit.
+
+6.SPPT-induced VM exit handling
+
+  #define EXIT_REASON_SPP                 66
+
+  static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
+    ...
+    [EXIT_REASON_SPP]                     = handle_spp,
+    ...
+  };
+
+  New exit qualification for SPPT-induced vmexits.
+
+  | Bit   | Contents                                                          |
+  | :---- | :---------------------------------------------------------------- |
+  | 10:0  | Reserved (0).                                                     |
+  | 11    | SPPT VM exit type. Set for SPPT Miss, cleared for SPPT Misconfig. |
+  | 12    | NMI unblocking due to IRET                                        |
+  | 63:13 | Reserved (0)                                                      |
+
+  In addition to the exit qualification, guest linear address and guest
+  physical address fields will be reported.
+
+  * SPPT miss and misconfiguration induced VM exit
+    Allocate a physical page for the SPPT and set the entry correctly.
+
+  * SPP permission induced VM exit
+    This kind of VM exit is left to VMI tool to handle.

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 35/92] KVM: VMX: Add control flags for SPP enabling
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (33 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 34/92] Documentation: Introduce EPT based Subpage Protection Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 36/92] KVM: VMX: Implement functions for SPPT paging setup Adalbert Lazăr
                   ` (59 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, He Chen, Zhang Yi

From: Yang Weijiang <weijiang.yang@intel.com>

Check SPP capability in MSR_IA32_VMX_PROCBASED_CTLS2, its 23-bit
indicates SPP support. Mark SPP bit in CPU capabilities bitmap if
it's supported.

Co-developed-by: He Chen <he.chen@linux.intel.com>
Signed-off-by: He Chen <he.chen@linux.intel.com>
Co-developed-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Message-Id: <20190717133751.12910-3-weijiang.yang@intel.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/vmx.h         |  1 +
 arch/x86/kernel/cpu/intel.c        |  4 ++++
 arch/x86/kvm/vmx/capabilities.h    |  5 +++++
 arch/x86/kvm/vmx/vmx.c             | 10 ++++++++++
 5 files changed, 21 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6d6122524711..183b4fd864c6 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -228,6 +228,7 @@
 #define X86_FEATURE_FLEXPRIORITY	( 8*32+ 2) /* Intel FlexPriority */
 #define X86_FEATURE_EPT			( 8*32+ 3) /* Intel Extended Page Table */
 #define X86_FEATURE_VPID		( 8*32+ 4) /* Intel Virtual Processor ID */
+#define X86_FEATURE_SPP			( 8*32+ 5) /* Intel EPT-based Sub-Page Write Protection */
 
 #define X86_FEATURE_VMMCALL		( 8*32+15) /* Prefer VMMCALL to VMCALL */
 #define X86_FEATURE_XENPV		( 8*32+16) /* "" Xen paravirtual guest */
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 4e4133e86484..a2c9e18e0ad7 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -81,6 +81,7 @@
 #define SECONDARY_EXEC_XSAVES			0x00100000
 #define SECONDARY_EXEC_PT_USE_GPA		0x01000000
 #define SECONDARY_EXEC_MODE_BASED_EPT_EXEC	0x00400000
+#define SECONDARY_EXEC_ENABLE_SPP		0x00800000
 #define SECONDARY_EXEC_TSC_SCALING              0x02000000
 
 #define PIN_BASED_EXT_INTR_MASK                 0x00000001
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index fc3c07fe7df5..b55156ce16da 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -476,6 +476,7 @@ static void detect_vmx_virtcap(struct cpuinfo_x86 *c)
 #define X86_VMX_FEATURE_PROC_CTLS2_EPT		0x00000002
 #define X86_VMX_FEATURE_PROC_CTLS2_VPID		0x00000020
 #define x86_VMX_FEATURE_EPT_CAP_AD		0x00200000
+#define X86_VMX_FEATURE_PROC_CTLS2_SPP		0x00800000
 
 	u32 vmx_msr_low, vmx_msr_high, msr_ctl, msr_ctl2;
 	u32 msr_vpid_cap, msr_ept_cap;
@@ -486,6 +487,7 @@ static void detect_vmx_virtcap(struct cpuinfo_x86 *c)
 	clear_cpu_cap(c, X86_FEATURE_EPT);
 	clear_cpu_cap(c, X86_FEATURE_VPID);
 	clear_cpu_cap(c, X86_FEATURE_EPT_AD);
+	clear_cpu_cap(c, X86_FEATURE_SPP);
 
 	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
 	msr_ctl = vmx_msr_high | vmx_msr_low;
@@ -509,6 +511,8 @@ static void detect_vmx_virtcap(struct cpuinfo_x86 *c)
 		}
 		if (msr_ctl2 & X86_VMX_FEATURE_PROC_CTLS2_VPID)
 			set_cpu_cap(c, X86_FEATURE_VPID);
+		if (msr_ctl2 & X86_VMX_FEATURE_PROC_CTLS2_SPP)
+			set_cpu_cap(c, X86_FEATURE_SPP);
 	}
 }
 
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 854e144131c6..8221ecbf6516 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -239,6 +239,11 @@ static inline bool cpu_has_vmx_pml(void)
 	return vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_ENABLE_PML;
 }
 
+static inline bool cpu_has_vmx_ept_spp(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_ENABLE_SPP;
+}
+
 static inline bool vmx_xsaves_supported(void)
 {
 	return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 97cfd5a316f3..f94e3defd9cf 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -114,6 +114,8 @@ static u64 __read_mostly host_xss;
 bool __read_mostly enable_pml = 1;
 module_param_named(pml, enable_pml, bool, S_IRUGO);
 
+static bool __read_mostly spp_supported = 0;
+
 #define MSR_BITMAP_MODE_X2APIC		1
 #define MSR_BITMAP_MODE_X2APIC_APICV	2
 
@@ -2247,6 +2249,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 			SECONDARY_EXEC_RDSEED_EXITING |
 			SECONDARY_EXEC_RDRAND_EXITING |
 			SECONDARY_EXEC_ENABLE_PML |
+			SECONDARY_EXEC_ENABLE_SPP |
 			SECONDARY_EXEC_TSC_SCALING |
 			SECONDARY_EXEC_PT_USE_GPA |
 			SECONDARY_EXEC_PT_CONCEAL_VMX |
@@ -3901,6 +3904,9 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx)
 	if (!enable_pml)
 		exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
 
+	if (!spp_supported)
+		exec_control &= ~SECONDARY_EXEC_ENABLE_SPP;
+
 	if (vmx_xsaves_supported()) {
 		/* Exposing XSAVES only when XSAVE is exposed */
 		bool xsaves_enabled =
@@ -7570,6 +7576,10 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_flexpriority())
 		flexpriority_enabled = 0;
 
+	if (cpu_has_vmx_ept_spp() && enable_ept &&
+	    boot_cpu_has(X86_FEATURE_SPP))
+		spp_supported = 1;
+
 	if (!cpu_has_virtual_nmis())
 		enable_vnmi = 0;
 

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 36/92] KVM: VMX: Implement functions for SPPT paging setup
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (34 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 35/92] KVM: VMX: Add control flags for SPP enabling Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 37/92] KVM: VMX: Introduce SPP access bitmap and operation functions Adalbert Lazăr
                   ` (58 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, He Chen, Zhang Yi

From: Yang Weijiang <weijiang.yang@intel.com>

SPPT is a 4-level paging structure similar to EPT, when SPP is
kicked for target physical page, bit 61 of the corresponding
EPT enty will be flaged, then SPPT is traversed with the gfn to
build up entries, the leaf entry of SPPT contains the access
bitmap for subpages inside the target 4KB physical page, one bit
per 128-byte subpage.

SPPT entries are set up in below cases:
1. the EPT faulted page is SPP protected.
2. SPP mis-config induced vmexit is handled.
3. User configures SPP protected pages via SPP IOCTLs.

Co-developed-by: He Chen <he.chen@linux.intel.com>
Signed-off-by: He Chen <he.chen@linux.intel.com>
Co-developed-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Message-Id: <20190717133751.12910-4-weijiang.yang@intel.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |   7 +-
 arch/x86/kvm/mmu.c              | 207 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu.h              |   1 +
 include/linux/kvm_host.h        |   3 +
 4 files changed, 217 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f1b3d89a0430..c05984f39923 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -271,7 +271,8 @@ union kvm_mmu_page_role {
 		unsigned smap_andnot_wp:1;
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
-		unsigned :6;
+		unsigned spp:1;
+		unsigned reserved:5;
 
 		/*
 		 * This is left at the top of the word so that
@@ -1410,6 +1411,10 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u64 error_code,
 		       void *insn, int insn_len);
+
+int kvm_mmu_setup_spp_structure(struct kvm_vcpu *vcpu,
+				u32 access_map, gfn_t gfn);
+
 void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
 void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
 void kvm_mmu_new_cr3(struct kvm_vcpu *vcpu, gpa_t new_cr3, bool skip_tlb_flush);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 810e3e5bd575..8a6287cd2be4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -206,6 +206,11 @@ static const union kvm_mmu_page_role mmu_base_role_mask = {
 		({ spte = mmu_spte_get_lockless(_walker.sptep); 1; });	\
 	     __shadow_walk_next(&(_walker), spte))
 
+#define for_each_shadow_spp_entry(_vcpu, _addr, _walker)    \
+	for (shadow_spp_walk_init(&(_walker), _vcpu, _addr);	\
+	     shadow_walk_okay(&(_walker));			\
+	     shadow_walk_next(&(_walker)))
+
 static struct kmem_cache *pte_list_desc_cache;
 static struct kmem_cache *mmu_page_header_cache;
 static struct percpu_counter kvm_total_used_mmu_pages;
@@ -505,6 +510,11 @@ static int is_shadow_present_pte(u64 pte)
 	return (pte != 0) && !is_mmio_spte(pte);
 }
 
+static int is_spp_shadow_present(u64 pte)
+{
+	return pte & PT_PRESENT_MASK;
+}
+
 static int is_large_pte(u64 pte)
 {
 	return pte & PT_PAGE_SIZE_MASK;
@@ -524,6 +534,11 @@ static bool is_executable_pte(u64 spte)
 	return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
 }
 
+static bool is_spp_spte(struct kvm_mmu_page *sp)
+{
+	return sp->role.spp;
+}
+
 static kvm_pfn_t spte_to_pfn(u64 pte)
 {
 	return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
@@ -1751,6 +1766,87 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static bool __rmap_open_subpage_bit(struct kvm *kvm,
+				    struct kvm_rmap_head *rmap_head)
+{
+	struct rmap_iterator iter;
+	bool flush = false;
+	u64 *sptep;
+	u64 spte;
+
+	for_each_rmap_spte(rmap_head, &iter, sptep) {
+		/*
+		 * SPP works only when the page is write-protected
+		 * and SPP bit is set in EPT leaf entry.
+		 */
+		flush |= spte_write_protect(sptep, false);
+		spte = *sptep | PT_SPP_MASK;
+		flush |= mmu_spte_update(sptep, spte);
+	}
+
+	return flush;
+}
+
+static int kvm_mmu_open_subpage_write_protect(struct kvm *kvm,
+					      struct kvm_memory_slot *slot,
+					      gfn_t gfn)
+{
+	struct kvm_rmap_head *rmap_head;
+	bool flush = false;
+
+	/*
+	 * SPP is only supported with 4KB level1 memory page, check
+	 * if the page is mapped in EPT leaf entry.
+	 */
+	rmap_head = __gfn_to_rmap(gfn, PT_PAGE_TABLE_LEVEL, slot);
+
+	if (!rmap_head->val)
+		return -EFAULT;
+
+	flush |= __rmap_open_subpage_bit(kvm, rmap_head);
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+
+	return 0;
+}
+
+static bool __rmap_clear_subpage_bit(struct kvm *kvm,
+				     struct kvm_rmap_head *rmap_head)
+{
+	struct rmap_iterator iter;
+	bool flush = false;
+	u64 *sptep;
+	u64 spte;
+
+	for_each_rmap_spte(rmap_head, &iter, sptep) {
+		spte = (*sptep & ~PT_SPP_MASK) | PT_WRITABLE_MASK;
+		flush |= mmu_spte_update(sptep, spte);
+	}
+
+	return flush;
+}
+
+static int kvm_mmu_clear_subpage_write_protect(struct kvm *kvm,
+					       struct kvm_memory_slot *slot,
+					       gfn_t gfn)
+{
+	struct kvm_rmap_head *rmap_head;
+	bool flush = false;
+
+	rmap_head = __gfn_to_rmap(gfn, PT_PAGE_TABLE_LEVEL, slot);
+
+	if (!rmap_head->val)
+		return -EFAULT;
+
+	flush |= __rmap_clear_subpage_bit(kvm, rmap_head);
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+
+	return 0;
+}
+
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    struct kvm_memory_slot *slot, u64 gfn)
 {
@@ -2505,6 +2601,30 @@ static unsigned int kvm_mmu_page_track_acc(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return acc;
 }
 
+struct kvm_mmu_page *kvm_mmu_get_spp_page(struct kvm_vcpu *vcpu,
+						 gfn_t gfn,
+						 unsigned int level)
+
+{
+	struct kvm_mmu_page *sp;
+	union kvm_mmu_page_role role;
+
+	role = vcpu->arch.mmu->mmu_role.base;
+	role.level = level;
+	role.direct = true;
+	role.spp = true;
+
+	sp = kvm_mmu_alloc_page(vcpu, true);
+	sp->gfn = gfn;
+	sp->role = role;
+	hlist_add_head(&sp->hash_link,
+		       &vcpu->kvm->arch.mmu_page_hash
+		       [kvm_page_table_hashfn(gfn)]);
+	clear_page(sp->spt);
+	return sp;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_get_spp_page);
+
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
@@ -2632,6 +2752,16 @@ static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
 				    addr);
 }
 
+static void shadow_spp_walk_init(struct kvm_shadow_walk_iterator *iterator,
+				 struct kvm_vcpu *vcpu, u64 addr)
+{
+	iterator->addr = addr;
+	iterator->shadow_addr = vcpu->arch.mmu->sppt_root;
+
+	/* SPP Table is a 4-level paging structure */
+	iterator->level = 4;
+}
+
 static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
 {
 	if (iterator->level < PT_PAGE_TABLE_LEVEL)
@@ -2682,6 +2812,18 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 		mark_unsync(sptep);
 }
 
+static void link_spp_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
+				 struct kvm_mmu_page *sp)
+{
+	u64 spte;
+
+	spte = __pa(sp->spt) | PT_PRESENT_MASK;
+
+	mmu_spte_set(sptep, spte);
+
+	mmu_page_add_parent_pte(vcpu, sp, sptep);
+}
+
 static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 				   unsigned direct_access)
 {
@@ -4253,6 +4395,71 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	return RET_PF_RETRY;
 }
 
+static u64 format_spp_spte(u32 spp_wp_bitmap)
+{
+	u64 new_spte = 0;
+	int i = 0;
+
+	/*
+	 * One 4K page contains 32 sub-pages, in SPP table L4E, old bits
+	 * are reserved, so we need to transfer u32 subpage write
+	 * protect bitmap to u64 SPP L4E format.
+	 */
+	while (i < 32) {
+		if (spp_wp_bitmap & (1ULL << i))
+			new_spte |= 1ULL << (i * 2);
+
+		i++;
+	}
+
+	return new_spte;
+}
+
+static void mmu_spp_spte_set(u64 *sptep, u64 new_spte)
+{
+	__set_spte(sptep, new_spte);
+}
+
+int kvm_mmu_setup_spp_structure(struct kvm_vcpu *vcpu,
+				u32 access_map, gfn_t gfn)
+{
+	struct kvm_shadow_walk_iterator iter;
+	struct kvm_mmu_page *sp;
+	gfn_t pseudo_gfn;
+	u64 old_spte, spp_spte;
+	int ret = -EFAULT;
+
+	/* direct_map spp start */
+	if (!VALID_PAGE(vcpu->arch.mmu->sppt_root))
+		return -EFAULT;
+
+	for_each_shadow_spp_entry(vcpu, (u64)gfn << PAGE_SHIFT, iter) {
+		if (iter.level == PT_PAGE_TABLE_LEVEL) {
+			spp_spte = format_spp_spte(access_map);
+			old_spte = mmu_spte_get_lockless(iter.sptep);
+			if (old_spte != spp_spte) {
+				mmu_spp_spte_set(iter.sptep, spp_spte);
+				kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
+			}
+
+			ret = 0;
+			break;
+		}
+
+		if (!is_spp_shadow_present(*iter.sptep)) {
+			u64 base_addr = iter.addr;
+
+			base_addr &= PT64_LVL_ADDR_MASK(iter.level);
+			pseudo_gfn = base_addr >> PAGE_SHIFT;
+			sp = kvm_mmu_get_spp_page(vcpu, pseudo_gfn,
+						  iter.level - 1);
+			link_spp_shadow_page(vcpu, iter.sptep, sp);
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_setup_spp_structure);
 static void nonpaging_init_context(struct kvm_vcpu *vcpu,
 				   struct kvm_mmu *context)
 {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 45948dabe0b6..8c34decd6422 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -26,6 +26,7 @@
 #define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT)
 #define PT_PAT_MASK (1ULL << 7)
 #define PT_GLOBAL_MASK (1ULL << 8)
+#define PT_SPP_MASK (1ULL << 61)
 #define PT64_NX_SHIFT 63
 #define PT64_NX_MASK (1ULL << PT64_NX_SHIFT)
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e876921938b6..ca7597e429df 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -832,6 +832,9 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
 
+struct kvm_mmu_page *kvm_mmu_get_spp_page(struct kvm_vcpu *vcpu,
+			gfn_t gfn, unsigned int level);
+
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
  * All architectures that want to use vzalloc currently also

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 37/92] KVM: VMX: Introduce SPP access bitmap and operation functions
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (35 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 36/92] KVM: VMX: Implement functions for SPPT paging setup Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 38/92] KVM: VMX: Add init/set/get functions for SPP Adalbert Lazăr
                   ` (57 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, He Chen, Zhang Yi

From: Yang Weijiang <weijiang.yang@intel.com>

Create access bitmap for SPP subpages, 4KB/128B = 32bits,
for each 4KB physical page, 32bits are required. The bitmap can
be easily accessed with a gfn. The initial access bitmap for each
physical page is 0xFFFFFFFF, meaning SPP is not enabled for the
subpages.

Co-developed-by: He Chen <he.chen@linux.intel.com>
Signed-off-by: He Chen <he.chen@linux.intel.com>
Co-developed-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Message-Id: <20190717133751.12910-5-weijiang.yang@intel.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.c              | 50 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              | 11 ++++++++
 3 files changed, 62 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c05984f39923..f0878631b12a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -790,6 +790,7 @@ struct kvm_lpage_info {
 
 struct kvm_arch_memory_slot {
 	struct kvm_rmap_head *rmap[KVM_NR_PAGE_SIZES];
+	u32 *subpage_wp_info;
 	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 	unsigned short *gfn_track[KVM_PAGE_TRACK_MAX];
 };
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8a6287cd2be4..f2774bbcfeed 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1482,6 +1482,56 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
 	return sptep;
 }
 
+#define FULL_SPP_ACCESS		((u32)((1ULL << 32) - 1))
+
+static int kvm_subpage_create_bitmaps(struct kvm *kvm)
+{
+	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;
+	int i, j, ret;
+	u32 *buff;
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot(memslot, slots) {
+			buff = kvzalloc(memslot->npages*
+				sizeof(*memslot->arch.subpage_wp_info),
+				GFP_KERNEL);
+
+			if (!buff) {
+			      ret = -ENOMEM;
+			      goto out_free;
+			}
+			memslot->arch.subpage_wp_info = buff;
+
+			for(j = 0; j< memslot->npages; j++)
+			      buff[j] = FULL_SPP_ACCESS;
+		}
+	}
+
+	return 0;
+out_free:
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot(memslot, slots) {
+			if (memslot->arch.subpage_wp_info) {
+				kvfree(memslot->arch.subpage_wp_info);
+				memslot->arch.subpage_wp_info = NULL;
+			}
+		}
+	}
+
+	return ret;
+}
+
+static u32 *gfn_to_subpage_wp_info(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	unsigned long idx;
+
+	idx = gfn_to_index(gfn, slot->base_gfn, PT_PAGE_TABLE_LEVEL);
+	return &slot->arch.subpage_wp_info[idx];
+}
+
 #define for_each_rmap_spte(_rmap_head_, _iter_, _spte_)			\
 	for (_spte_ = rmap_get_first(_rmap_head_, _iter_);		\
 	     _spte_; _spte_ = rmap_get_next(_iter_))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ef6d9dd80086..2ac1e0aba1fc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9320,6 +9320,17 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_hv_destroy_vm(kvm);
 }
 
+void kvm_subpage_free_memslot(struct kvm_memory_slot *free,
+			      struct kvm_memory_slot *dont)
+{
+
+	if (!dont || free->arch.subpage_wp_info !=
+	    dont->arch.subpage_wp_info) {
+		kvfree(free->arch.subpage_wp_info);
+		free->arch.subpage_wp_info = NULL;
+	}
+}
+
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
 			   struct kvm_memory_slot *dont)
 {

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 38/92] KVM: VMX: Add init/set/get functions for SPP
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (36 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 37/92] KVM: VMX: Introduce SPP access bitmap and operation functions Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 39/92] KVM: VMX: Introduce SPP user-space IOCTLs Adalbert Lazăr
                   ` (56 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, He Chen, Zhang Yi

From: Yang Weijiang <weijiang.yang@intel.com>

init_spp() must be called before {get, set}_subpage
functions, it creates subpage access bitmaps for memory pages
and issues a KVM request to setup SPPT root pages.

kvm_mmu_set_subpages() is to enable SPP bit in EPT leaf page
and setup corresponding SPPT entries. The mmu_lock
is held before above operation. If it's called in EPT fault and
SPPT mis-config induced handler, mmu_lock is acquired outside
the function, otherwise, it's acquired inside it.

kvm_mmu_get_subpages() is used to query access bitmap for
protected page, it's also used in EPT fault handler to check
whether the fault EPT page is SPP protected as well.

Co-developed-by: He Chen <he.chen@linux.intel.com>
Signed-off-by: He Chen <he.chen@linux.intel.com>
Co-developed-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Message-Id: <20190717133751.12910-6-weijiang.yang@intel.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |  18 ++++
 arch/x86/include/asm/vmx.h      |   2 +
 arch/x86/kvm/mmu.c              | 160 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c          |  48 ++++++++++
 arch/x86/kvm/x86.c              |  57 ++++++++++++
 include/linux/kvm_host.h        |   3 +
 include/uapi/linux/kvm.h        |   9 ++
 7 files changed, 297 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f0878631b12a..7ee6e1ff5ee9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -399,8 +399,13 @@ struct kvm_mmu {
 	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
 	void (*update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			   u64 *spte, const void *pte);
+	int (*get_subpages)(struct kvm *kvm, struct kvm_subpage *spp_info);
+	int (*set_subpages)(struct kvm *kvm, struct kvm_subpage *spp_info);
+	int (*init_spp)(struct kvm *kvm);
+
 	hpa_t root_hpa;
 	gpa_t root_cr3;
+	hpa_t sppt_root;
 	union kvm_mmu_role mmu_role;
 	u8 root_level;
 	u8 shadow_root_level;
@@ -929,6 +934,8 @@ struct kvm_arch {
 
 	bool guest_can_read_msr_platform_info;
 	bool exception_payload_enabled;
+
+	bool spp_active;
 };
 
 struct kvm_vm_stat {
@@ -1202,6 +1209,11 @@ struct kvm_x86_ops {
 	int (*nested_enable_evmcs)(struct kvm_vcpu *vcpu,
 				   uint16_t *vmcs_version);
 	uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu);
+
+	bool (*get_spp_status)(void);
+	int (*get_subpages)(struct kvm *kvm, struct kvm_subpage *spp_info);
+	int (*set_subpages)(struct kvm *kvm, struct kvm_subpage *spp_info);
+	int (*init_spp)(struct kvm *kvm);
 };
 
 struct kvm_arch_async_pf {
@@ -1420,6 +1432,12 @@ void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
 void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
 void kvm_mmu_new_cr3(struct kvm_vcpu *vcpu, gpa_t new_cr3, bool skip_tlb_flush);
 
+int kvm_mmu_get_subpages(struct kvm *kvm, struct kvm_subpage *spp_info,
+			 bool mmu_locked);
+int kvm_mmu_set_subpages(struct kvm *kvm, struct kvm_subpage *spp_info,
+			 bool mmu_locked);
+int kvm_mmu_init_spp(struct kvm *kvm);
+
 void kvm_enable_tdp(void);
 void kvm_disable_tdp(void);
 
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index a2c9e18e0ad7..6cb05ac07453 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -224,6 +224,8 @@ enum vmcs_field {
 	XSS_EXIT_BITMAP_HIGH            = 0x0000202D,
 	ENCLS_EXITING_BITMAP		= 0x0000202E,
 	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
+	SPPT_POINTER			= 0x00002030,
+	SPPT_POINTER_HIGH		= 0x00002031,
 	TSC_MULTIPLIER                  = 0x00002032,
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f2774bbcfeed..38e79210d010 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3846,6 +3846,9 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		    (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
 			mmu_free_root_page(vcpu->kvm, &mmu->root_hpa,
 					   &invalid_list);
+			if (vcpu->kvm->arch.spp_active)
+				mmu_free_root_page(vcpu->kvm, &mmu->sppt_root,
+						   &invalid_list);
 		} else {
 			for (i = 0; i < 4; ++i)
 				if (mmu->pae_root[i] != 0)
@@ -4510,6 +4513,158 @@ int kvm_mmu_setup_spp_structure(struct kvm_vcpu *vcpu,
 	return ret;
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_setup_spp_structure);
+
+int kvm_mmu_init_spp(struct kvm *kvm)
+{
+	int i, ret;
+	struct kvm_vcpu *vcpu;
+	int root_level;
+	struct kvm_mmu_page *ssp_sp;
+
+
+	if (!kvm_x86_ops->get_spp_status())
+	      return -ENODEV;
+
+	if (kvm->arch.spp_active)
+	      return 0;
+
+	ret = kvm_subpage_create_bitmaps(kvm);
+
+	if (ret)
+	      return ret;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		/* prepare caches for SPP setup.*/
+		mmu_topup_memory_caches(vcpu);
+		root_level = vcpu->arch.mmu->shadow_root_level;
+		ssp_sp = kvm_mmu_get_spp_page(vcpu, 0, root_level);
+		++ssp_sp->root_count;
+		vcpu->arch.mmu->sppt_root = __pa(ssp_sp->spt);
+		kvm_make_request(KVM_REQ_LOAD_CR3, vcpu);
+	}
+
+	kvm->arch.spp_active = true;
+	return 0;
+}
+
+int kvm_mmu_get_subpages(struct kvm *kvm, struct kvm_subpage *spp_info,
+			 bool mmu_locked)
+{
+	u32 *access = spp_info->access_map;
+	gfn_t gfn = spp_info->base_gfn;
+	int npages = spp_info->npages;
+	struct kvm_memory_slot *slot;
+	int i;
+	int ret;
+
+	if (!kvm->arch.spp_active)
+	      return -ENODEV;
+
+	if (!mmu_locked)
+	      spin_lock(&kvm->mmu_lock);
+
+	for (i = 0; i < npages; i++, gfn++) {
+		slot = gfn_to_memslot(kvm, gfn);
+		if (!slot) {
+			ret = -EFAULT;
+			goto out_unlock;
+		}
+		access[i] = *gfn_to_subpage_wp_info(slot, gfn);
+	}
+
+	ret = i;
+
+out_unlock:
+	if (!mmu_locked)
+	      spin_unlock(&kvm->mmu_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_get_subpages);
+
+int kvm_mmu_set_subpages(struct kvm *kvm, struct kvm_subpage *spp_info,
+			 bool mmu_locked)
+{
+	u32 *access = spp_info->access_map;
+	gfn_t gfn = spp_info->base_gfn;
+	int npages = spp_info->npages;
+	struct kvm_memory_slot *slot;
+	struct kvm_vcpu *vcpu;
+	struct kvm_rmap_head *rmap_head;
+	int i, k;
+	u32 *wp_map;
+	int ret = -EFAULT;
+
+	if (!kvm->arch.spp_active)
+		return -ENODEV;
+
+	if (!mmu_locked)
+	      spin_lock(&kvm->mmu_lock);
+
+	for (i = 0; i < npages; i++, gfn++) {
+		slot = gfn_to_memslot(kvm, gfn);
+		if (!slot)
+			goto out_unlock;
+
+		/*
+		 * check whether the target 4KB page exists in EPT leaf
+		 * entries.If it's there, we can setup SPP protection now,
+		 * otherwise, need to defer it to EPT page fault handler.
+		 */
+		rmap_head = __gfn_to_rmap(gfn, PT_PAGE_TABLE_LEVEL, slot);
+
+		if (rmap_head->val) {
+			/*
+			 * if all subpages are not writable, open SPP bit in
+			 * EPT leaf entry to enable SPP protection for
+			 * corresponding page.
+			 */
+			if (access[i] != FULL_SPP_ACCESS) {
+				ret = kvm_mmu_open_subpage_write_protect(kvm,
+						slot, gfn);
+
+				if (ret)
+					goto out_err;
+
+				kvm_for_each_vcpu(k, vcpu, kvm)
+					kvm_mmu_setup_spp_structure(vcpu,
+						access[i], gfn);
+			} else {
+				ret = kvm_mmu_clear_subpage_write_protect(kvm,
+						slot, gfn);
+				if (ret)
+					goto out_err;
+			}
+
+		} else
+			pr_info("%s - No ETP entry, gfn = 0x%llx, access = 0x%x.\n", __func__, gfn, access[i]);
+
+		/* if this function is called in tdp_page_fault() or
+		 * spp_handler(), mmu_locked = true, SPP access bitmap
+		 * is being used, otherwise, it's being stored.
+		 */
+		if (!mmu_locked) {
+			wp_map = gfn_to_subpage_wp_info(slot, gfn);
+			*wp_map = access[i];
+		}
+	}
+
+	ret = i;
+out_err:
+	if (ret < 0)
+	      pr_info("SPP-Error, didn't get the gfn:" \
+		      "%llx from EPT leaf.\n"
+		      "Current we don't support SPP on" \
+		      "huge page.\n"
+		      "Please disable huge page and have" \
+		      "another try.\n", gfn);
+out_unlock:
+	if (!mmu_locked)
+	      spin_unlock(&kvm->mmu_lock);
+
+	return ret;
+}
+
 static void nonpaging_init_context(struct kvm_vcpu *vcpu,
 				   struct kvm_mmu *context)
 {
@@ -5207,6 +5362,9 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 	context->get_cr3 = get_cr3;
 	context->get_pdptr = kvm_pdptr_read;
 	context->inject_page_fault = kvm_inject_page_fault;
+	context->get_subpages = kvm_x86_ops->get_subpages;
+	context->set_subpages = kvm_x86_ops->set_subpages;
+	context->init_spp = kvm_x86_ops->init_spp;
 
 	if (!is_paging(vcpu)) {
 		context->nx = false;
@@ -5403,6 +5561,8 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots)
 		uint i;
 
 		vcpu->arch.mmu->root_hpa = INVALID_PAGE;
+		if (!vcpu->kvm->arch.spp_active)
+			vcpu->arch.mmu->sppt_root = INVALID_PAGE;
 
 		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 			vcpu->arch.mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f94e3defd9cf..a50dd2b9d438 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2853,11 +2853,17 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa)
 	return eptp;
 }
 
+static inline u64 construct_spptp(unsigned long root_hpa)
+{
+	return root_hpa & PAGE_MASK;
+}
+
 void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 {
 	struct kvm *kvm = vcpu->kvm;
 	unsigned long guest_cr3;
 	u64 eptp;
+	u64 spptp;
 
 	guest_cr3 = cr3;
 	if (enable_ept) {
@@ -2880,6 +2886,12 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		ept_load_pdptrs(vcpu);
 	}
 
+	if (kvm->arch.spp_active && VALID_PAGE(vcpu->arch.mmu->sppt_root)) {
+		spptp = construct_spptp(vcpu->arch.mmu->sppt_root);
+		vmcs_write64(SPPT_POINTER, spptp);
+		vmx_flush_tlb(vcpu, true);
+	}
+
 	vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
@@ -5743,6 +5755,9 @@ static void dump_vmcs(void)
 		pr_err("PostedIntrVec = 0x%02x\n", vmcs_read16(POSTED_INTR_NV));
 	if ((secondary_exec_control & SECONDARY_EXEC_ENABLE_EPT))
 		pr_err("EPT pointer = 0x%016llx\n", vmcs_read64(EPT_POINTER));
+	if ((secondary_exec_control & SECONDARY_EXEC_ENABLE_SPP))
+		pr_err("SPPT pointer = 0x%016llx\n", vmcs_read64(SPPT_POINTER));
+
 	n = vmcs_read32(CR3_TARGET_COUNT);
 	for (i = 0; i + 1 < n; i += 4)
 		pr_err("CR3 target%u=%016lx target%u=%016lx\n",
@@ -7646,6 +7661,12 @@ static __init int hardware_setup(void)
 		kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
 	}
 
+	if (!spp_supported) {
+		kvm_x86_ops->get_subpages = NULL;
+		kvm_x86_ops->set_subpages = NULL;
+		kvm_x86_ops->init_spp = NULL;
+	}
+
 	if (!cpu_has_vmx_preemption_timer())
 		kvm_x86_ops->request_immediate_exit = __kvm_request_immediate_exit;
 
@@ -7706,6 +7727,28 @@ static bool vmx_spt_fault(struct kvm_vcpu *vcpu)
 	return (vmx->exit_reason == EXIT_REASON_EPT_VIOLATION);
 }
 
+static bool vmx_get_spp_status(void)
+{
+	return spp_supported;
+}
+
+static int vmx_get_subpages(struct kvm *kvm,
+			    struct kvm_subpage *spp_info)
+{
+	return kvm_get_subpages(kvm, spp_info);
+}
+
+static int vmx_set_subpages(struct kvm *kvm,
+			    struct kvm_subpage *spp_info)
+{
+	return kvm_set_subpages(kvm, spp_info);
+}
+
+static int vmx_init_spp(struct kvm *kvm)
+{
+	return kvm_init_spp(kvm);
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -7856,6 +7899,11 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.set_nested_state = NULL,
 	.get_vmcs12_pages = NULL,
 	.nested_enable_evmcs = NULL,
+
+	.get_spp_status = vmx_get_spp_status,
+	.get_subpages = vmx_get_subpages,
+	.set_subpages = vmx_set_subpages,
+	.init_spp = vmx_init_spp,
 };
 
 static void vmx_cleanup_l1d_flush(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2ac1e0aba1fc..b8ae25cb227b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4576,6 +4576,61 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 	return r;
 }
 
+static int kvm_vm_ioctl_get_subpages(struct kvm *kvm,
+				     struct kvm_subpage *spp_info)
+{
+	return kvm_arch_get_subpages(kvm, spp_info);
+}
+
+static int kvm_vm_ioctl_set_subpages(struct kvm *kvm,
+				     struct kvm_subpage *spp_info)
+{
+	return kvm_arch_set_subpages(kvm, spp_info);
+}
+
+static int kvm_vm_ioctl_init_spp(struct kvm *kvm)
+{
+	return kvm_arch_init_spp(kvm);
+}
+
+int kvm_get_subpages(struct kvm *kvm,
+		     struct kvm_subpage *spp_info)
+{
+	int ret;
+
+	mutex_lock(&kvm->slots_lock);
+	ret = kvm_mmu_get_subpages(kvm, spp_info, false);
+	mutex_unlock(&kvm->slots_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_get_subpages);
+
+int kvm_set_subpages(struct kvm *kvm,
+		     struct kvm_subpage *spp_info)
+{
+	int ret;
+
+	mutex_lock(&kvm->slots_lock);
+	ret = kvm_mmu_set_subpages(kvm, spp_info, false);
+	mutex_unlock(&kvm->slots_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_set_subpages);
+
+int kvm_init_spp(struct kvm *kvm)
+{
+	int ret;
+
+	mutex_lock(&kvm->slots_lock);
+	ret = kvm_mmu_init_spp(kvm);
+	mutex_unlock(&kvm->slots_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_init_spp);
+
 long kvm_arch_vm_ioctl(struct file *filp,
 		       unsigned int ioctl, unsigned long arg)
 {
@@ -9352,6 +9407,8 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
 	}
 
 	kvm_page_track_free_memslot(free, dont);
+	if (kvm->arch.spp_active)
+	      kvm_subpage_free_memslot(free, dont);
 }
 
 int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ca7597e429df..0b9a0f546397 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -834,6 +834,9 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
 
 struct kvm_mmu_page *kvm_mmu_get_spp_page(struct kvm_vcpu *vcpu,
 			gfn_t gfn, unsigned int level);
+int kvm_get_subpages(struct kvm *kvm, struct kvm_subpage *spp_info);
+int kvm_set_subpages(struct kvm *kvm, struct kvm_subpage *spp_info);
+int kvm_init_spp(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2ff05fd123e3..ad8f2a3ca72d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -102,6 +102,15 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+/* for KVM_SUBPAGES_GET_ACCESS and KVM_SUBPAGES_SET_ACCESS */
+#define SUBPAGE_MAX_BITMAP   64
+struct kvm_subpage {
+	__u64 base_gfn;
+	__u64 npages;
+	 /* sub-page write-access bitmap array */
+	__u32 access_map[SUBPAGE_MAX_BITMAP];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 39/92] KVM: VMX: Introduce SPP user-space IOCTLs
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (37 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 38/92] KVM: VMX: Add init/set/get functions for SPP Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 40/92] KVM: VMX: Handle SPP induced vmexit and page fault Adalbert Lazăr
                   ` (55 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, He Chen, Zhang Yi

From: Yang Weijiang <weijiang.yang@intel.com>

User application, e.g., QEMU or VMI, must initialize SPP
before gets/sets SPP subpages, the dynamic initialization is to
reduce the extra storage cost if the SPP feature is not not used.

Co-developed-by: He Chen <he.chen@linux.intel.com>
Signed-off-by: He Chen <he.chen@linux.intel.com>
Co-developed-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Message-Id: <20190717133751.12910-7-weijiang.yang@intel.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/x86.c       | 73 ++++++++++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h |  3 ++
 include/uapi/linux/kvm.h |  3 ++
 3 files changed, 79 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b8ae25cb227b..ef29ef7617bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4926,6 +4926,53 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		if (copy_from_user(&hvevfd, argp, sizeof(hvevfd)))
 			goto out;
 		r = kvm_vm_ioctl_hv_eventfd(kvm, &hvevfd);
+	}
+	case KVM_SUBPAGES_GET_ACCESS: {
+		struct kvm_subpage spp_info;
+
+		if (!kvm->arch.spp_active) {
+			r = -ENODEV;
+			goto out;
+		}
+
+		r = -EFAULT;
+		if (copy_from_user(&spp_info, argp, sizeof(spp_info)))
+			goto out;
+
+		r = -EINVAL;
+		if (spp_info.npages == 0 ||
+		    spp_info.npages > SUBPAGE_MAX_BITMAP)
+			goto out;
+
+		r = kvm_vm_ioctl_get_subpages(kvm, &spp_info);
+		if (copy_to_user(argp, &spp_info, sizeof(spp_info))) {
+			r = -EFAULT;
+			goto out;
+		}
+		break;
+	}
+	case KVM_SUBPAGES_SET_ACCESS: {
+		struct kvm_subpage spp_info;
+
+		if (!kvm->arch.spp_active) {
+			r = -ENODEV;
+			goto out;
+		}
+
+		r = -EFAULT;
+		if (copy_from_user(&spp_info, argp, sizeof(spp_info)))
+			goto out;
+
+		r = -EINVAL;
+		if (spp_info.npages == 0 ||
+		    spp_info.npages > SUBPAGE_MAX_BITMAP)
+			goto out;
+
+		r = kvm_vm_ioctl_set_subpages(kvm, &spp_info);
+		break;
+	}
+	case KVM_INIT_SPP: {
+		r = kvm_vm_ioctl_init_spp(kvm);
 		break;
 	}
 	default:
@@ -9906,6 +9953,32 @@ bool kvm_arch_has_irq_bypass(void)
 	return kvm_x86_ops->update_pi_irte != NULL;
 }
 
+int kvm_arch_get_subpages(struct kvm *kvm,
+			  struct kvm_subpage *spp_info)
+{
+	if (!kvm_x86_ops->get_subpages)
+		return -EINVAL;
+
+	return kvm_x86_ops->get_subpages(kvm, spp_info);
+}
+
+int kvm_arch_set_subpages(struct kvm *kvm,
+			  struct kvm_subpage *spp_info)
+{
+	if (!kvm_x86_ops->set_subpages)
+		return -EINVAL;
+
+	return kvm_x86_ops->set_subpages(kvm, spp_info);
+}
+
+int kvm_arch_init_spp(struct kvm *kvm)
+{
+	if (!kvm_x86_ops->init_spp)
+		return -EINVAL;
+
+	return kvm_x86_ops->init_spp(kvm);
+}
+
 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
 				      struct irq_bypass_producer *prod)
 {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0b9a0f546397..ae4106aae16e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -837,6 +837,9 @@ struct kvm_mmu_page *kvm_mmu_get_spp_page(struct kvm_vcpu *vcpu,
 int kvm_get_subpages(struct kvm *kvm, struct kvm_subpage *spp_info);
 int kvm_set_subpages(struct kvm *kvm, struct kvm_subpage *spp_info);
 int kvm_init_spp(struct kvm *kvm);
+int kvm_arch_get_subpages(struct kvm *kvm, struct kvm_subpage *spp_info);
+int kvm_arch_set_subpages(struct kvm *kvm, struct kvm_subpage *spp_info);
+int kvm_arch_init_spp(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index ad8f2a3ca72d..86dd57e67539 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1248,6 +1248,9 @@ struct kvm_vfio_spapr_tce {
 					struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
+#define KVM_SUBPAGES_GET_ACCESS   _IOR(KVMIO,  0x49, __u64)
+#define KVM_SUBPAGES_SET_ACCESS   _IOW(KVMIO,  0x4a, __u64)
+#define KVM_INIT_SPP              _IOW(KVMIO,  0x4b, __u64)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 40/92] KVM: VMX: Handle SPP induced vmexit and page fault
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (38 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 39/92] KVM: VMX: Introduce SPP user-space IOCTLs Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 41/92] KVM: MMU: Enable Lazy mode SPPT setup Adalbert Lazăr
                   ` (54 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, He Chen, Zhang Yi

From: Yang Weijiang <weijiang.yang@intel.com>

If write to subpage is not allowed, EPT violation is generated,
it's propagated to QEMU or VMI to handle.

If the target page is SPP protected, however SPPT missing is
encoutered while traversing with gfn, vmexit is generated so
that KVM can handle the issue. Any SPPT misconfig will be
propagated to QEMU or VMI.

A SPP specific bit(11) is added to exit_qualification and a new
exit reason(66) is introduced for SPP.

Co-developed-by: He Chen <he.chen@linux.intel.com>
Signed-off-by: He Chen <he.chen@linux.intel.com>
Co-developed-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Message-Id: <20190717133751.12910-8-weijiang.yang@intel.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/vmx.h      |  7 ++++
 arch/x86/include/uapi/asm/vmx.h |  2 +
 arch/x86/kvm/mmu.c              | 17 ++++++++
 arch/x86/kvm/vmx/vmx.c          | 71 +++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h        |  5 +++
 5 files changed, 102 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 6cb05ac07453..11ca64ced578 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -547,6 +547,13 @@ struct vmx_msr_entry {
 #define EPT_VIOLATION_EXECUTABLE	(1 << EPT_VIOLATION_EXECUTABLE_BIT)
 #define EPT_VIOLATION_GVA_TRANSLATED	(1 << EPT_VIOLATION_GVA_TRANSLATED_BIT)
 
+/*
+ * Exit Qualifications for SPPT-Induced vmexits
+ */
+#define SPPT_INDUCED_EXIT_TYPE_BIT     11
+#define SPPT_INDUCED_EXIT_TYPE         (1 << SPPT_INDUCED_EXIT_TYPE_BIT)
+#define SPPT_INTR_INFO_UNBLOCK_NMI     INTR_INFO_UNBLOCK_NMI
+
 /*
  * VM-instruction error numbers
  */
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index f0b0c90dd398..ac67622bac5a 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -85,6 +85,7 @@
 #define EXIT_REASON_PML_FULL            62
 #define EXIT_REASON_XSAVES              63
 #define EXIT_REASON_XRSTORS             64
+#define EXIT_REASON_SPP                 66
 
 #define VMX_EXIT_REASONS \
 	{ EXIT_REASON_EXCEPTION_NMI,         "EXCEPTION_NMI" }, \
@@ -141,6 +142,7 @@
 	{ EXIT_REASON_ENCLS,                 "ENCLS" }, \
 	{ EXIT_REASON_RDSEED,                "RDSEED" }, \
 	{ EXIT_REASON_PML_FULL,              "PML_FULL" }, \
+	{ EXIT_REASON_SPP,                   "SPP" }, \
 	{ EXIT_REASON_XSAVES,                "XSAVES" }, \
 	{ EXIT_REASON_XRSTORS,               "XRSTORS" }
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 38e79210d010..d59108a3ebbf 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3692,6 +3692,19 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 		if ((error_code & PFERR_WRITE_MASK) &&
 		    spte_can_locklessly_be_made_writable(spte))
 		{
+			/*
+			 * Record write protect fault caused by
+			 * Sub-page Protection, let VMI decide
+			 * the next step.
+			 */
+			if (spte & PT_SPP_MASK) {
+				fault_handled = true;
+				vcpu->run->exit_reason = KVM_EXIT_SPP;
+				vcpu->run->spp.addr = gva;
+				kvm_skip_emulated_instruction(vcpu);
+				break;
+			}
+
 			new_spte |= PT_WRITABLE_MASK;
 
 			/*
@@ -5880,6 +5893,10 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 		r = vcpu->arch.mmu->page_fault(vcpu, cr2,
 					       lower_32_bits(error_code),
 					       false);
+
+		if (vcpu->run->exit_reason == KVM_EXIT_SPP)
+			return 0;
+
 		WARN_ON(r == RET_PF_INVALID);
 	}
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a50dd2b9d438..5d4b61aaff9a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5335,6 +5335,76 @@ static int handle_monitor(struct kvm_vcpu *vcpu)
 	return handle_nop(vcpu);
 }
 
+static int handle_spp(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification;
+	struct kvm_memory_slot *slot;
+	gpa_t gpa;
+	gfn_t gfn;
+
+	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+	/*
+	 * SPP VM exit happened while executing iret from NMI,
+	 * "blocked by NMI" bit has to be set before next VM entry.
+	 * There are errata that may cause this bit to not be set:
+	 * AAK134, BY25.
+	 */
+	if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
+	    (exit_qualification & SPPT_INTR_INFO_UNBLOCK_NMI))
+		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+			      GUEST_INTR_STATE_NMI);
+
+	vcpu->arch.exit_qualification = exit_qualification;
+	if (exit_qualification & SPPT_INDUCED_EXIT_TYPE) {
+		struct kvm_subpage spp_info = {0};
+		int ret;
+
+		/*
+		 * SPPT missing
+		 * We don't set SPP write access for the corresponding
+		 * GPA, if we haven't setup, we need to construct
+		 * SPP table here.
+		 */
+		pr_info("SPP - SPPT entry missing!\n");
+		gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+		gfn = gpa >> PAGE_SHIFT;
+		slot = gfn_to_memslot(vcpu->kvm, gfn);
+		if (!slot)
+		      return -EFAULT;
+
+		/*
+		 * if the target gfn is not protected, but SPPT is
+		 * traversed now, regard this as some kind of fault.
+		 */
+		spp_info.base_gfn = gfn;
+		spp_info.npages = 1;
+
+		spin_lock(&(vcpu->kvm->mmu_lock));
+		ret = kvm_mmu_get_subpages(vcpu->kvm, &spp_info, true);
+		if (ret == 1) {
+			kvm_mmu_setup_spp_structure(vcpu,
+				spp_info.access_map[0], gfn);
+		}
+		spin_unlock(&(vcpu->kvm->mmu_lock));
+
+		return 1;
+
+	}
+
+	/*
+	 * SPPT Misconfig
+	 * This is probably caused by some mis-configuration in SPPT
+	 * entries, cannot handle it here, escalate the fault to
+	 * emulator.
+	 */
+	WARN_ON(1);
+	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+	vcpu->run->hw.hardware_exit_reason = EXIT_REASON_SPP;
+	pr_alert("SPP - SPPT Misconfiguration!\n");
+	return 0;
+}
+
 static int handle_invpcid(struct kvm_vcpu *vcpu)
 {
 	u32 vmx_instruction_info;
@@ -5538,6 +5608,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_INVVPID]                 = handle_vmx_instruction,
 	[EXIT_REASON_RDRAND]                  = handle_invalid_op,
 	[EXIT_REASON_RDSEED]                  = handle_invalid_op,
+	[EXIT_REASON_SPP]                     = handle_spp,
 	[EXIT_REASON_XSAVES]                  = handle_xsaves,
 	[EXIT_REASON_XRSTORS]                 = handle_xrstors,
 	[EXIT_REASON_PML_FULL]		      = handle_pml_full,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 86dd57e67539..81f08eec9061 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -244,6 +244,7 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_S390_STSI        25
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
+#define KVM_EXIT_SPP              28
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -399,6 +400,10 @@ struct kvm_run {
 		struct {
 			__u8 vector;
 		} eoi;
+		/* KVM_EXIT_SPP */
+		struct {
+			__u64 addr;
+		} spp;
 		/* KVM_EXIT_HYPERV */
 		struct kvm_hyperv_exit hyperv;
 		/* Fix the size of the union. */

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 41/92] KVM: MMU: Enable Lazy mode SPPT setup
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (39 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 40/92] KVM: VMX: Handle SPP induced vmexit and page fault Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 42/92] KVM: MMU: Handle host memory remapping and reclaim Adalbert Lazăr
                   ` (53 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Yang Weijiang <weijiang.yang@intel.com>

If SPP subpages are set while the physical page are not
available in EPT leaf entry, the mapping is first stored
in SPP access bitmap buffer. SPPT setup is deferred to
access to the protected page, in EPT page fault handler,
the SPPT enries are set up.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Message-Id: <20190717133751.12910-9-weijiang.yang@intel.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/mmu.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d59108a3ebbf..24222e3add91 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4400,6 +4400,26 @@ check_hugepage_cache_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
 	return kvm_mtrr_check_gfn_range_consistency(vcpu, gfn, page_num);
 }
 
+static int kvm_enable_spp_protection(struct kvm *kvm, u64 gfn)
+{
+	struct kvm_subpage spp_info = {0};
+	struct kvm_memory_slot *slot;
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!slot)
+		return -EFAULT;
+
+	spp_info.base_gfn = gfn;
+	spp_info.npages = 1;
+
+	if (kvm_mmu_get_subpages(kvm, &spp_info, true) < 0)
+		return -EFAULT;
+
+	if (spp_info.access_map[0] != FULL_SPP_ACCESS)
+		kvm_mmu_set_subpages(kvm, &spp_info, true);
+
+	return 0;
+}
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 			  bool prefault)
 {
@@ -4451,6 +4471,10 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	if (likely(!force_pt_level))
 		transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
 	r = __direct_map(vcpu, write, map_writable, level, gfn, pfn, prefault);
+
+	if (vcpu->kvm->arch.spp_active && level == PT_PAGE_TABLE_LEVEL)
+		kvm_enable_spp_protection(vcpu->kvm, gfn);
+
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
 	return r;

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 42/92] KVM: MMU: Handle host memory remapping and reclaim
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (40 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 41/92] KVM: MMU: Enable Lazy mode SPPT setup Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 43/92] kvm: introspection: add KVMI_CONTROL_SPP Adalbert Lazăr
                   ` (52 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Yang Weijiang <weijiang.yang@intel.com>

Host page swapping/migration may change the translation in
EPT leaf entry, if the target page is SPP protected,
re-enable SPP protection in MMU notifier. If SPPT shadow
page is reclaimed, the level1 pages don't have rmap to clear.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Message-Id: <20190717133751.12910-10-weijiang.yang@intel.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/mmu.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24222e3add91..0b859b1797f6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2004,6 +2004,24 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			new_spte &= ~PT_WRITABLE_MASK;
 			new_spte &= ~SPTE_HOST_WRITEABLE;
 
+			/*
+			 * if it's EPT leaf entry and the physical page is
+			 * SPP protected, then re-enable SPP protection for
+			 * the page.
+			 */
+			if (kvm->arch.spp_active &&
+			    level == PT_PAGE_TABLE_LEVEL) {
+				struct kvm_subpage spp_info = {0};
+				int i;
+
+				spp_info.base_gfn = gfn;
+				spp_info.npages = 1;
+				i = kvm_mmu_get_subpages(kvm, &spp_info, true);
+				if (i == 1 &&
+				    spp_info.access_map[0] != FULL_SPP_ACCESS)
+					new_spte |= PT_SPP_MASK;
+			}
+
 			new_spte = mark_spte_for_access_track(new_spte);
 
 			mmu_spte_clear_track_bits(sptep);
@@ -2905,6 +2923,10 @@ static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 	pte = *spte;
 	if (is_shadow_present_pte(pte)) {
 		if (is_last_spte(pte, sp->role.level)) {
+			/* SPPT leaf entries don't have rmaps*/
+			if (sp->role.level == PT_PAGE_TABLE_LEVEL &&
+			    is_spp_spte(sp))
+				return true;
 			drop_spte(kvm, spte);
 			if (is_large_pte(pte))
 				--kvm->stat.lpages;

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 43/92] kvm: introspection: add KVMI_CONTROL_SPP
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (41 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 42/92] KVM: MMU: Handle host memory remapping and reclaim Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 15:59 ` [RFC PATCH v6 44/92] kvm: introspection: extend the internal database of tracked pages with write_bitmap info Adalbert Lazăr
                   ` (51 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

This command enables/disables subpage protection (SPP) for the current VM.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 33 ++++++++++++++++++++++++++++++
 arch/x86/kvm/kvmi.c                |  4 ++++
 include/uapi/linux/kvmi.h          |  7 +++++++
 virt/kvm/kvmi_int.h                |  6 ++++++
 virt/kvm/kvmi_msg.c                | 33 ++++++++++++++++++++++++++++++
 5 files changed, 83 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index b64a030507cf..c1d12aaa8633 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -617,6 +617,39 @@ In order to 'forget' an address, all the access bits ('rwx') must be set.
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 * -KVM_ENOMEM - not enough memory to add the page tracking structures
 
+11. KVMI_CONTROL_SPP
+--------------------
+
+:Architectures: x86/intel
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_control_spp {
+		__u8 enable;
+		__u8 padding1;
+		__u16 padding2;
+		__u32 padding3;
+	}
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+
+Enables/disables subpage protection (SPP) for the current VM.
+
+If SPP is not enabled, *KVMI_GET_PAGE_WRITE_BITMAP* and
+*KVMI_SET_PAGE_WRITE_BITMAP* commands will fail.
+
+:Errors:
+
+* -KVM_EINVAL - padding is not zero
+* -KVM_EOPNOTSUPP - the hardware doesn't support SPP
+* -KVM_EOPNOTSUPP - the current implementation can't disable SPP
+
 Events
 ======
 
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 3238ef176ad6..01fd218e213c 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -260,3 +260,7 @@ int kvmi_arch_cmd_set_page_access(struct kvmi *ikvm,
 	return ec;
 }
 
+int kvmi_arch_cmd_control_spp(struct kvmi *ikvm)
+{
+	return kvm_arch_init_spp(ikvm->kvm);
+}
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index 2ddbb1fea807..9f2b13718e47 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -142,6 +142,13 @@ struct kvmi_set_page_access {
 	struct kvmi_page_access_entry entries[0];
 };
 
+struct kvmi_control_spp {
+	__u8 enable;
+	__u8 padding1;
+	__u16 padding2;
+	__u32 padding3;
+};
+
 struct kvmi_get_vcpu_info_reply {
 	__u64 tsc_speed;
 };
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index c54be93349b7..3f0c7a03b4a1 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -130,6 +130,11 @@ struct kvmi {
 	DECLARE_BITMAP(event_allow_mask, KVMI_NUM_EVENTS);
 	DECLARE_BITMAP(vm_ev_mask, KVMI_NUM_EVENTS);
 
+	struct {
+		bool initialized;
+		atomic_t enabled;
+	} spp;
+
 	bool cmd_reply_disabled;
 };
 
@@ -184,6 +189,7 @@ int kvmi_arch_cmd_get_page_access(struct kvmi *ikvm,
 int kvmi_arch_cmd_set_page_access(struct kvmi *ikvm,
 				  const struct kvmi_msg_hdr *msg,
 				  const struct kvmi_set_page_access *req);
+int kvmi_arch_cmd_control_spp(struct kvmi *ikvm);
 void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev);
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access);
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index c150e7bdd440..e501a807c8a2 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -25,6 +25,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_CHECK_EVENT]           = "KVMI_CHECK_EVENT",
 	[KVMI_CONTROL_CMD_RESPONSE]  = "KVMI_CONTROL_CMD_RESPONSE",
 	[KVMI_CONTROL_EVENTS]        = "KVMI_CONTROL_EVENTS",
+	[KVMI_CONTROL_SPP]           = "KVMI_CONTROL_SPP",
 	[KVMI_CONTROL_VM_EVENTS]     = "KVMI_CONTROL_VM_EVENTS",
 	[KVMI_EVENT]                 = "KVMI_EVENT",
 	[KVMI_EVENT_REPLY]           = "KVMI_EVENT_REPLY",
@@ -300,6 +301,37 @@ static int kvmi_get_vcpu(struct kvmi *ikvm, unsigned int vcpu_idx,
 	return 0;
 }
 
+static bool enable_spp(struct kvmi *ikvm)
+{
+	if (!ikvm->spp.initialized) {
+		int err = kvmi_arch_cmd_control_spp(ikvm);
+
+		ikvm->spp.initialized = true;
+
+		if (!err)
+			atomic_set(&ikvm->spp.enabled, true);
+	}
+
+	return atomic_read(&ikvm->spp.enabled);
+}
+
+static int handle_control_spp(struct kvmi *ikvm,
+			      const struct kvmi_msg_hdr *msg,
+			      const void *_req)
+{
+	const struct kvmi_control_spp *req = _req;
+	int ec;
+
+	if (req->padding1 || req->padding2 || req->padding3)
+		ec = -KVM_EINVAL;
+	else if (req->enable && enable_spp(ikvm))
+		ec = 0;
+	else
+		ec = -KVM_EOPNOTSUPP;
+
+	return kvmi_msg_vm_maybe_reply(ikvm, msg, ec, NULL, 0);
+}
+
 static int handle_control_cmd_response(struct kvmi *ikvm,
 					const struct kvmi_msg_hdr *msg,
 					const void *_req)
@@ -364,6 +396,7 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 	[KVMI_CHECK_COMMAND]         = handle_check_command,
 	[KVMI_CHECK_EVENT]           = handle_check_event,
 	[KVMI_CONTROL_CMD_RESPONSE]  = handle_control_cmd_response,
+	[KVMI_CONTROL_SPP]           = handle_control_spp,
 	[KVMI_CONTROL_VM_EVENTS]     = handle_control_vm_events,
 	[KVMI_GET_GUEST_INFO]        = handle_get_guest_info,
 	[KVMI_GET_PAGE_ACCESS]       = handle_get_page_access,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 44/92] kvm: introspection: extend the internal database of tracked pages with write_bitmap info
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (42 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 43/92] kvm: introspection: add KVMI_CONTROL_SPP Adalbert Lazăr
@ 2019-08-09 15:59 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 45/92] kvm: introspection: add KVMI_GET_PAGE_WRITE_BITMAP Adalbert Lazăr
                   ` (50 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 15:59 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

This will allow us to use the subpage protection feature.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 virt/kvm/kvmi.c     | 46 +++++++++++++++++++++++++++++++++++++--------
 virt/kvm/kvmi_int.h |  1 +
 2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 4a9a4430a460..e18dfffa25ac 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -32,6 +32,7 @@ static void kvmi_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
 static const u8 full_access  =	KVMI_PAGE_ACCESS_R |
 				KVMI_PAGE_ACCESS_W |
 				KVMI_PAGE_ACCESS_X;
+static const u32 default_write_access_bitmap;
 
 void *kvmi_msg_alloc(void)
 {
@@ -57,23 +58,32 @@ static struct kvmi_mem_access *__kvmi_get_gfn_access(struct kvmi *ikvm,
 	return radix_tree_lookup(&ikvm->access_tree, gfn);
 }
 
+/*
+ * TODO: intercept any SPP change made on pages present in our radix tree.
+ *
+ * bitmap must have the same value as the corresponding SPPT entry.
+ */
 static int kvmi_get_gfn_access(struct kvmi *ikvm, const gfn_t gfn,
-			       u8 *access)
+			       u8 *access, u32 *write_bitmap)
 {
 	struct kvmi_mem_access *m;
 
+	*write_bitmap = default_write_access_bitmap;
 	*access = full_access;
 
 	read_lock(&ikvm->access_tree_lock);
 	m = __kvmi_get_gfn_access(ikvm, gfn);
-	if (m)
+	if (m) {
 		*access = m->access;
+		*write_bitmap = m->write_bitmap;
+	}
 	read_unlock(&ikvm->access_tree_lock);
 
 	return m ? 0 : -1;
 }
 
-static int kvmi_set_gfn_access(struct kvm *kvm, gfn_t gfn, u8 access)
+static int kvmi_set_gfn_access(struct kvm *kvm, gfn_t gfn, u8 access,
+			       u32 write_bitmap)
 {
 	struct kvmi_mem_access *m;
 	struct kvmi_mem_access *__m;
@@ -87,6 +97,7 @@ static int kvmi_set_gfn_access(struct kvm *kvm, gfn_t gfn, u8 access)
 
 	m->gfn = gfn;
 	m->access = access;
+	m->write_bitmap = write_bitmap;
 
 	if (radix_tree_preload(GFP_KERNEL)) {
 		err = -KVM_ENOMEM;
@@ -100,6 +111,7 @@ static int kvmi_set_gfn_access(struct kvm *kvm, gfn_t gfn, u8 access)
 	__m = __kvmi_get_gfn_access(ikvm, gfn);
 	if (__m) {
 		__m->access = access;
+		__m->write_bitmap = write_bitmap;
 		kvmi_arch_update_page_tracking(kvm, NULL, __m);
 		if (access == full_access) {
 			radix_tree_delete(&ikvm->access_tree, gfn);
@@ -124,12 +136,22 @@ static int kvmi_set_gfn_access(struct kvm *kvm, gfn_t gfn, u8 access)
 	return err;
 }
 
+static bool spp_access_allowed(gpa_t gpa, unsigned long bitmap)
+{
+	u32 off = (gpa & ~PAGE_MASK);
+	u32 spp = off / 128;
+
+	return test_bit(spp, &bitmap);
+}
+
 static bool kvmi_restricted_access(struct kvmi *ikvm, gpa_t gpa, u8 access)
 {
+	u32 allowed_bitmap;
 	u8 allowed_access;
 	int err;
 
-	err = kvmi_get_gfn_access(ikvm, gpa_to_gfn(gpa), &allowed_access);
+	err = kvmi_get_gfn_access(ikvm, gpa_to_gfn(gpa), &allowed_access,
+				  &allowed_bitmap);
 
 	if (err)
 		return false;
@@ -138,8 +160,14 @@ static bool kvmi_restricted_access(struct kvmi *ikvm, gpa_t gpa, u8 access)
 	 * We want to be notified only for violations involving access
 	 * bits that we've specifically cleared
 	 */
-	if ((~allowed_access) & access)
+	if ((~allowed_access) & access) {
+		bool write_access = (access & KVMI_PAGE_ACCESS_W);
+
+		if (write_access && spp_access_allowed(gpa, allowed_bitmap))
+			return false;
+
 		return true;
+	}
 
 	return false;
 }
@@ -1126,8 +1154,9 @@ void kvmi_handle_requests(struct kvm_vcpu *vcpu)
 int kvmi_cmd_get_page_access(struct kvmi *ikvm, u64 gpa, u8 *access)
 {
 	gfn_t gfn = gpa_to_gfn(gpa);
+	u32 ignored_write_bitmap;
 
-	kvmi_get_gfn_access(ikvm, gfn, access);
+	kvmi_get_gfn_access(ikvm, gfn, access, &ignored_write_bitmap);
 
 	return 0;
 }
@@ -1136,10 +1165,11 @@ int kvmi_cmd_set_page_access(struct kvmi *ikvm, u64 gpa, u8 access)
 {
 	gfn_t gfn = gpa_to_gfn(gpa);
 	u8 ignored_access;
+	u32 write_bitmap;
 
-	kvmi_get_gfn_access(ikvm, gfn, &ignored_access);
+	kvmi_get_gfn_access(ikvm, gfn, &ignored_access, &write_bitmap);
 
-	return kvmi_set_gfn_access(ikvm->kvm, gfn, access);
+	return kvmi_set_gfn_access(ikvm->kvm, gfn, access, write_bitmap);
 }
 
 int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 3f0c7a03b4a1..d9a10a3b7082 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -141,6 +141,7 @@ struct kvmi {
 struct kvmi_mem_access {
 	gfn_t gfn;
 	u8 access;
+	u32 write_bitmap;
 	struct kvmi_arch_mem_access arch;
 };
 

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 45/92] kvm: introspection: add KVMI_GET_PAGE_WRITE_BITMAP
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (43 preceding siblings ...)
  2019-08-09 15:59 ` [RFC PATCH v6 44/92] kvm: introspection: extend the internal database of tracked pages with write_bitmap info Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 46/92] kvm: introspection: add KVMI_SET_PAGE_WRITE_BITMAP Adalbert Lazăr
                   ` (49 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

This command returns subpage protection (SPP) write bitmaps for an array
of guest physical addresses of 4KB size.

Like the KVMI_GET_PAGE_ACCESS command, it checks only the radix tree,
not the SPP tables.  So, either we change it to check the SPP tables
or we drop it. Given the fact that the KVMI_EVENT_PF events are filter
using the radix tree and that the introspection tool should know what
it tracks, we should choose the later.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 44 ++++++++++++++++++++++++++++++
 arch/x86/kvm/kvmi.c                | 44 ++++++++++++++++++++++++++++++
 include/uapi/linux/kvmi.h          | 11 ++++++++
 virt/kvm/kvmi.c                    | 11 ++++++++
 virt/kvm/kvmi_int.h                | 11 ++++++++
 virt/kvm/kvmi_msg.c                | 18 ++++++++++++
 6 files changed, 139 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index c1d12aaa8633..2ffb92b0fa71 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -650,6 +650,50 @@ If SPP is not enabled, *KVMI_GET_PAGE_WRITE_BITMAP* and
 * -KVM_EOPNOTSUPP - the hardware doesn't support SPP
 * -KVM_EOPNOTSUPP - the current implementation can't disable SPP
 
+12. KVMI_GET_PAGE_WRITE_BITMAP
+------------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_get_page_write_bitmap {
+		__u16 view;
+		__u16 count;
+		__u32 padding;
+		__u64 gpa[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_page_write_bitmap_reply {
+		__u32 bitmap[0];
+	};
+
+Returns subpage protection (SPP) write bitmaps for an array of ``count``
+guest physical addresses of 4KB size.
+
+By default, for any guest physical address, the returned bits will be zero
+(no write access for any subpage if the *KVMI_PAGE_ACCESS_W* flag has been
+cleared for the whole 4KB page - see *KVMI_SET_PAGE_ACCESS*).
+
+On Intel hardware with multiple EPT views, the ``view`` argument selects the
+EPT view (0 is primary). On all other hardware it must be zero.
+
+:Errors:
+
+* -KVM_EINVAL - the selected SPT view is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EOPNOTSUPP - a SPT view was selected but the hardware doesn't support it
+* -KVM_EOPNOTSUPP - the hardware doesn't support SPP or hasn't been enabled
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_ENOMEM - not enough memory to allocate the reply
+
 Events
 ======
 
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 01fd218e213c..356ec79936b3 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -224,6 +224,50 @@ int kvmi_arch_cmd_get_page_access(struct kvmi *ikvm,
 	return 0;
 }
 
+int kvmi_arch_cmd_get_page_write_bitmap(struct kvmi *ikvm,
+					const struct kvmi_msg_hdr *msg,
+					const struct kvmi_get_page_write_bitmap
+					*req,
+					struct kvmi_get_page_write_bitmap_reply
+					**dest, size_t *dest_size)
+{
+	struct kvmi_get_page_write_bitmap_reply *rpl = NULL;
+	size_t rpl_size = 0;
+	u16 k, n = req->count;
+	int ec = 0;
+
+	if (req->padding)
+		return -KVM_EINVAL;
+
+	if (msg->size < sizeof(*req) + req->count * sizeof(req->gpa[0]))
+		return -KVM_EINVAL;
+
+	if (!kvmi_spp_enabled(ikvm))
+		return -KVM_EOPNOTSUPP;
+
+	if (req->view != 0)	/* TODO */
+		return -KVM_EOPNOTSUPP;
+
+	rpl_size = sizeof(*rpl) + sizeof(rpl->bitmap[0]) * n;
+	rpl = kvmi_msg_alloc_check(rpl_size);
+	if (!rpl)
+		return -KVM_ENOMEM;
+
+	for (k = 0; k < n && ec == 0; k++)
+		ec = kvmi_cmd_get_page_write_bitmap(ikvm, req->gpa[k],
+						    &rpl->bitmap[k]);
+
+	if (ec) {
+		kvmi_msg_free(rpl);
+		return ec;
+	}
+
+	*dest = rpl;
+	*dest_size = rpl_size;
+
+	return 0;
+}
+
 int kvmi_arch_cmd_set_page_access(struct kvmi *ikvm,
 				  const struct kvmi_msg_hdr *msg,
 				  const struct kvmi_set_page_access *req)
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index 9f2b13718e47..19a6a50df96b 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -149,6 +149,17 @@ struct kvmi_control_spp {
 	__u32 padding3;
 };
 
+struct kvmi_get_page_write_bitmap {
+	__u16 view;
+	__u16 count;
+	__u32 padding;
+	__u64 gpa[0];
+};
+
+struct kvmi_get_page_write_bitmap_reply {
+	__u32 bitmap[0];
+};
+
 struct kvmi_get_vcpu_info_reply {
 	__u64 tsc_speed;
 };
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index e18dfffa25ac..22e233ca474c 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -1161,6 +1161,17 @@ int kvmi_cmd_get_page_access(struct kvmi *ikvm, u64 gpa, u8 *access)
 	return 0;
 }
 
+int kvmi_cmd_get_page_write_bitmap(struct kvmi *ikvm, u64 gpa,
+				   u32 *write_bitmap)
+{
+	gfn_t gfn = gpa_to_gfn(gpa);
+	u8 ignored_access;
+
+	kvmi_get_gfn_access(ikvm, gfn, &ignored_access, write_bitmap);
+
+	return 0;
+}
+
 int kvmi_cmd_set_page_access(struct kvmi *ikvm, u64 gpa, u8 access)
 {
 	gfn_t gfn = gpa_to_gfn(gpa);
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index d9a10a3b7082..7243c57be27a 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -150,6 +150,11 @@ static inline bool is_event_enabled(struct kvm_vcpu *vcpu, int event)
 	return test_bit(event, IVCPU(vcpu)->ev_mask);
 }
 
+static inline bool kvmi_spp_enabled(struct kvmi *ikvm)
+{
+	return atomic_read(&ikvm->spp.enabled);
+}
+
 /* kvmi_msg.c */
 bool kvmi_sock_get(struct kvmi *ikvm, int fd);
 void kvmi_sock_shutdown(struct kvmi *ikvm);
@@ -167,6 +172,7 @@ void *kvmi_msg_alloc_check(size_t size);
 void kvmi_msg_free(void *addr);
 int kvmi_cmd_get_page_access(struct kvmi *ikvm, u64 gpa, u8 *access);
 int kvmi_cmd_set_page_access(struct kvmi *ikvm, u64 gpa, u8 access);
+int kvmi_cmd_get_page_write_bitmap(struct kvmi *ikvm, u64 gpa, u32 *bitmap);
 int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable);
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
@@ -191,6 +197,11 @@ int kvmi_arch_cmd_set_page_access(struct kvmi *ikvm,
 				  const struct kvmi_msg_hdr *msg,
 				  const struct kvmi_set_page_access *req);
 int kvmi_arch_cmd_control_spp(struct kvmi *ikvm);
+int kvmi_arch_cmd_get_page_write_bitmap(struct kvmi *ikvm,
+					const struct kvmi_msg_hdr *msg,
+					const struct kvmi_get_page_write_bitmap *req,
+					struct kvmi_get_page_write_bitmap_reply **dest,
+					size_t *dest_size);
 void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev);
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access);
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index e501a807c8a2..eb247ac3e037 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -31,6 +31,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_EVENT_REPLY]           = "KVMI_EVENT_REPLY",
 	[KVMI_GET_GUEST_INFO]        = "KVMI_GET_GUEST_INFO",
 	[KVMI_GET_PAGE_ACCESS]       = "KVMI_GET_PAGE_ACCESS",
+	[KVMI_GET_PAGE_WRITE_BITMAP] = "KVMI_GET_PAGE_WRITE_BITMAP",
 	[KVMI_GET_VCPU_INFO]         = "KVMI_GET_VCPU_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
 	[KVMI_SET_PAGE_ACCESS]       = "KVMI_SET_PAGE_ACCESS",
@@ -383,6 +384,22 @@ static int handle_set_page_access(struct kvmi *ikvm,
 	return kvmi_msg_vm_maybe_reply(ikvm, msg, ec, NULL, 0);
 }
 
+static int handle_get_page_write_bitmap(struct kvmi *ikvm,
+					const struct kvmi_msg_hdr *msg,
+					const void *req)
+{
+	struct kvmi_get_page_write_bitmap_reply *rpl = NULL;
+	size_t rpl_size = 0;
+	int err, ec;
+
+	ec = kvmi_arch_cmd_get_page_write_bitmap(ikvm, msg, req, &rpl,
+						 &rpl_size);
+
+	err = kvmi_msg_vm_maybe_reply(ikvm, msg, ec, rpl, rpl_size);
+	kvmi_msg_free(rpl);
+	return err;
+}
+
 static bool invalid_vcpu_hdr(const struct kvmi_vcpu_hdr *hdr)
 {
 	return hdr->padding1 || hdr->padding2;
@@ -400,6 +417,7 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 	[KVMI_CONTROL_VM_EVENTS]     = handle_control_vm_events,
 	[KVMI_GET_GUEST_INFO]        = handle_get_guest_info,
 	[KVMI_GET_PAGE_ACCESS]       = handle_get_page_access,
+	[KVMI_GET_PAGE_WRITE_BITMAP] = handle_get_page_write_bitmap,
 	[KVMI_GET_VERSION]           = handle_get_version,
 	[KVMI_SET_PAGE_ACCESS]       = handle_set_page_access,
 };

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 46/92] kvm: introspection: add KVMI_SET_PAGE_WRITE_BITMAP
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (44 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 45/92] kvm: introspection: add KVMI_GET_PAGE_WRITE_BITMAP Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 47/92] kvm: introspection: add KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL Adalbert Lazăr
                   ` (48 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

This command sets the subpage protection (SPP) write bitmap for an array
of guest physical addresses of 4KB bytes.

Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 66 ++++++++++++++++++++++++++++++
 arch/x86/kvm/kvmi.c                | 30 ++++++++++++++
 include/uapi/linux/kvmi.h          | 13 ++++++
 virt/kvm/kvmi.c                    | 37 +++++++++++++++++
 virt/kvm/kvmi_int.h                |  4 ++
 virt/kvm/kvmi_msg.c                | 13 ++++++
 6 files changed, 163 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 2ffb92b0fa71..69557c63ff94 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -694,6 +694,72 @@ EPT view (0 is primary). On all other hardware it must be zero.
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 * -KVM_ENOMEM - not enough memory to allocate the reply
 
+13. KVMI_SET_PAGE_WRITE_BITMAP
+------------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_set_page_write_bitmap {
+		__u16 view;
+		__u16 count;
+		__u32 padding;
+		struct kvmi_page_write_bitmap_entry entries[0];
+	};
+
+where::
+
+	struct kvmi_page_write_bitmap_entry {
+		__u64 gpa;
+		__u32 bitmap;
+		__u32 padding;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+
+Sets the subpage protection (SPP) write bitmap for an array of ``count``
+guest physical addresses of 4KB bytes.
+
+The command will make the changes starting with the first entry and
+it will stop on the first error. The introspection tool should handle
+the rollback.
+
+While the *KVMI_SET_PAGE_ACCESS* command can be used to write-protect a
+4KB page, this command can write-protect 128-bytes subpages inside of a
+4KB page by setting the corresponding bit to 1 (write allowed) or to 0
+(write disallowed). For example, to allow write access to the A and B
+subpages only, the bitmap must be set to::
+
+	BIT(A) | BIT(B)
+
+A and B must be a number between 0 (first subpage) and 31 (last subpage).
+
+Using this command to set all bits to 1 (allow write access for
+all subpages) will allow write access to the whole 4KB page (like a
+*KVMI_SET_PAGE_ACCESS* command with the *KVMI_PAGE_ACCESS_W* flag set)
+and vice versa.
+
+Using this command to set any bit to 0 will write-protect the whole 4KB
+page (like a *KVMI_SET_PAGE_ACCESS* command with the *KVMI_PAGE_ACCESS_W*
+flag cleared) and allow write access only for subpages with the
+corresponding bit set to 1.
+
+:Errors:
+
+* -KVM_EINVAL - the selected SPT view is invalid
+* -KVM_EOPNOTSUPP - a SPT view was selected but the hardware doesn't support it
+* -KVM_EOPNOTSUPP - the hardware doesn't support SPP or hasn't been enabled
+* -KVM_EINVAL - the write access is already allowed for the whole 4KB page
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_ENOMEM - not enough memory to add the page tracking structures
+
 Events
 ======
 
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 356ec79936b3..fa290fbf1f75 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -304,6 +304,36 @@ int kvmi_arch_cmd_set_page_access(struct kvmi *ikvm,
 	return ec;
 }
 
+int kvmi_arch_cmd_set_page_write_bitmap(struct kvmi *ikvm,
+					const struct kvmi_msg_hdr *msg,
+					const struct kvmi_set_page_write_bitmap
+					*req)
+{
+	u16 k, n = req->count;
+	int ec = 0;
+
+	if (req->padding)
+		return -KVM_EINVAL;
+
+	if (msg->size < sizeof(*req) + req->count * sizeof(req->entries[0]))
+		return -KVM_EINVAL;
+
+	if (!kvmi_spp_enabled(ikvm))
+		return -KVM_EOPNOTSUPP;
+
+	if (req->view != 0)	/* TODO */
+		return -KVM_EOPNOTSUPP;
+
+	for (k = 0; k < n && ec == 0; k++) {
+		u64 gpa = req->entries[k].gpa;
+		u32 bitmap = req->entries[k].bitmap;
+
+		ec = kvmi_cmd_set_page_write_bitmap(ikvm, gpa, bitmap);
+	}
+
+	return ec;
+}
+
 int kvmi_arch_cmd_control_spp(struct kvmi *ikvm)
 {
 	return kvm_arch_init_spp(ikvm->kvm);
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index 19a6a50df96b..0b3139c52a30 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -160,6 +160,19 @@ struct kvmi_get_page_write_bitmap_reply {
 	__u32 bitmap[0];
 };
 
+struct kvmi_page_write_bitmap_entry {
+	__u64 gpa;
+	__u32 bitmap;
+	__u32 padding;
+};
+
+struct kvmi_set_page_write_bitmap {
+	__u16 view;
+	__u16 count;
+	__u32 padding;
+	struct kvmi_page_write_bitmap_entry entries[0];
+};
+
 struct kvmi_get_vcpu_info_reply {
 	__u64 tsc_speed;
 };
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 22e233ca474c..d2bebef98d8d 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -99,6 +99,24 @@ static int kvmi_set_gfn_access(struct kvm *kvm, gfn_t gfn, u8 access,
 	m->access = access;
 	m->write_bitmap = write_bitmap;
 
+	/*
+	 * Only try to set SPP bitmap when the page is writable.
+	 * Be careful, kvm_mmu_set_subpages() will enable page write-protection
+	 * by default when set SPP bitmap. If bitmap contains all 1s, it'll
+	 * make the page writable by default too.
+	 */
+	if (!(access & KVMI_PAGE_ACCESS_W) && kvmi_spp_enabled(ikvm)) {
+		struct kvm_subpage spp_info;
+
+		spp_info.base_gfn = gfn;
+		spp_info.npages = 1;
+		spp_info.access_map[0] = write_bitmap;
+
+		err = kvm_arch_set_subpages(kvm, &spp_info);
+		if (err)
+			goto exit;
+	}
+
 	if (radix_tree_preload(GFP_KERNEL)) {
 		err = -KVM_ENOMEM;
 		goto exit;
@@ -1183,6 +1201,25 @@ int kvmi_cmd_set_page_access(struct kvmi *ikvm, u64 gpa, u8 access)
 	return kvmi_set_gfn_access(ikvm->kvm, gfn, access, write_bitmap);
 }
 
+int kvmi_cmd_set_page_write_bitmap(struct kvmi *ikvm, u64 gpa,
+				   u32 write_bitmap)
+{
+	bool write_allowed_for_all;
+	gfn_t gfn = gpa_to_gfn(gpa);
+	u32 ignored_write_bitmap;
+	u8 access;
+
+	kvmi_get_gfn_access(ikvm, gfn, &access, &ignored_write_bitmap);
+
+	write_allowed_for_all = (write_bitmap == (u32)((1ULL << 32) - 1));
+	if (write_allowed_for_all)
+		access |= KVMI_PAGE_ACCESS_W;
+	else
+		access &= ~KVMI_PAGE_ACCESS_W;
+
+	return kvmi_set_gfn_access(ikvm->kvm, gfn, access, write_bitmap);
+}
+
 int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable)
 {
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 7243c57be27a..18c00dae0f2f 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -173,6 +173,7 @@ void kvmi_msg_free(void *addr);
 int kvmi_cmd_get_page_access(struct kvmi *ikvm, u64 gpa, u8 *access);
 int kvmi_cmd_set_page_access(struct kvmi *ikvm, u64 gpa, u8 access);
 int kvmi_cmd_get_page_write_bitmap(struct kvmi *ikvm, u64 gpa, u32 *bitmap);
+int kvmi_cmd_set_page_write_bitmap(struct kvmi *ikvm, u64 gpa, u32 bitmap);
 int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable);
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
@@ -202,6 +203,9 @@ int kvmi_arch_cmd_get_page_write_bitmap(struct kvmi *ikvm,
 					const struct kvmi_get_page_write_bitmap *req,
 					struct kvmi_get_page_write_bitmap_reply **dest,
 					size_t *dest_size);
+int kvmi_arch_cmd_set_page_write_bitmap(struct kvmi *ikvm,
+					const struct kvmi_msg_hdr *msg,
+					const struct kvmi_set_page_write_bitmap *req);
 void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev);
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access);
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index eb247ac3e037..f9efb52d49c3 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -35,6 +35,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_GET_VCPU_INFO]         = "KVMI_GET_VCPU_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
 	[KVMI_SET_PAGE_ACCESS]       = "KVMI_SET_PAGE_ACCESS",
+	[KVMI_SET_PAGE_WRITE_BITMAP] = "KVMI_SET_PAGE_WRITE_BITMAP",
 };
 
 static bool is_known_message(u16 id)
@@ -400,6 +401,17 @@ static int handle_get_page_write_bitmap(struct kvmi *ikvm,
 	return err;
 }
 
+static int handle_set_page_write_bitmap(struct kvmi *ikvm,
+					const struct kvmi_msg_hdr *msg,
+					const void *req)
+{
+	int ec;
+
+	ec = kvmi_arch_cmd_set_page_write_bitmap(ikvm, msg, req);
+
+	return kvmi_msg_vm_maybe_reply(ikvm, msg, ec, NULL, 0);
+}
+
 static bool invalid_vcpu_hdr(const struct kvmi_vcpu_hdr *hdr)
 {
 	return hdr->padding1 || hdr->padding2;
@@ -420,6 +432,7 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 	[KVMI_GET_PAGE_WRITE_BITMAP] = handle_get_page_write_bitmap,
 	[KVMI_GET_VERSION]           = handle_get_version,
 	[KVMI_SET_PAGE_ACCESS]       = handle_set_page_access,
+	[KVMI_SET_PAGE_WRITE_BITMAP] = handle_set_page_write_bitmap,
 };
 
 static int handle_event_reply(struct kvm_vcpu *vcpu,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 47/92] kvm: introspection: add KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (45 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 46/92] kvm: introspection: add KVMI_SET_PAGE_WRITE_BITMAP Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 48/92] kvm: add kvm_vcpu_kick_and_wait() Adalbert Lazăr
                   ` (47 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

These commands allows the introspection tool to read/write from/to the
guest memory.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst |  60 ++++++++++++++++
 include/uapi/linux/kvmi.h          |  11 +++
 virt/kvm/kvmi.c                    | 107 +++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h                |   7 ++
 virt/kvm/kvmi_msg.c                |  42 +++++++++++
 5 files changed, 227 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 69557c63ff94..eef32107837a 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -760,6 +760,66 @@ corresponding bit set to 1.
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 * -KVM_ENOMEM - not enough memory to add the page tracking structures
 
+14. KVMI_READ_PHYSICAL
+----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_read_physical {
+		__u64 gpa;
+		__u64 size;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	__u8 data[0];
+
+Reads from the guest memory.
+
+Currently, the size must be non-zero and the read must be restricted to
+one page (offset + size <= PAGE_SIZE).
+
+:Errors:
+
+* -KVM_EINVAL - the specified gpa is invalid
+
+15. KVMI_WRITE_PHYSICAL
+-----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_write_physical {
+		__u64 gpa;
+		__u64 size;
+		__u8  data[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Writes into the guest memory.
+
+Currently, the size must be non-zero and the write must be restricted to
+one page (offset + size <= PAGE_SIZE).
+
+:Errors:
+
+* -KVM_EINVAL - the specified gpa is invalid
+
 Events
 ======
 
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index 0b3139c52a30..be3f066f314e 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -191,6 +191,17 @@ struct kvmi_control_vm_events {
 	__u32 padding2;
 };
 
+struct kvmi_read_physical {
+	__u64 gpa;
+	__u64 size;
+};
+
+struct kvmi_write_physical {
+	__u64 gpa;
+	__u64 size;
+	__u8  data[0];
+};
+
 struct kvmi_vcpu_hdr {
 	__u16 vcpu;
 	__u16 padding1;
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index d2bebef98d8d..a84eb150e116 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2017-2019 Bitdefender S.R.L.
  *
  */
+#include <linux/mmu_context.h>
 #include <uapi/linux/kvmi.h>
 #include "kvmi_int.h"
 #include <linux/kthread.h>
@@ -1220,6 +1221,112 @@ int kvmi_cmd_set_page_write_bitmap(struct kvmi *ikvm, u64 gpa,
 	return kvmi_set_gfn_access(ikvm->kvm, gfn, access, write_bitmap);
 }
 
+unsigned long gfn_to_hva_safe(struct kvm *kvm, gfn_t gfn)
+{
+	unsigned long hva;
+	int srcu_idx;
+
+	srcu_idx = srcu_read_lock(&kvm->srcu);
+	hva = gfn_to_hva(kvm, gfn);
+	srcu_read_unlock(&kvm->srcu, srcu_idx);
+
+	return hva;
+}
+
+static long get_user_pages_remote_unlocked(struct mm_struct *mm,
+	unsigned long start,
+	unsigned long nr_pages,
+	unsigned int gup_flags,
+	struct page **pages)
+{
+	long ret;
+	struct task_struct *tsk = NULL;
+	struct vm_area_struct **vmas = NULL;
+	int locked = 1;
+
+	down_read(&mm->mmap_sem);
+	ret = get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
+		pages, vmas, &locked);
+	if (locked)
+		up_read(&mm->mmap_sem);
+	return ret;
+}
+
+static void *get_page_ptr(struct kvm *kvm, gpa_t gpa, struct page **page,
+			  bool write)
+{
+	unsigned int flags = write ? FOLL_WRITE : 0;
+	unsigned long hva;
+
+	*page = NULL;
+
+	hva = gfn_to_hva_safe(kvm, gpa_to_gfn(gpa));
+
+	if (kvm_is_error_hva(hva)) {
+		kvmi_err(IKVM(kvm), "Invalid gpa %llx\n", gpa);
+		return NULL;
+	}
+
+	if (get_user_pages_remote_unlocked(kvm->mm, hva, 1, flags, page) != 1) {
+		kvmi_err(IKVM(kvm),
+			 "Failed to get the page for hva %lx gpa %llx\n",
+			 hva, gpa);
+		return NULL;
+	}
+
+	return kmap_atomic(*page);
+}
+
+static void put_page_ptr(void *ptr, struct page *page)
+{
+	if (ptr)
+		kunmap_atomic(ptr);
+	if (page)
+		put_page(page);
+}
+
+int kvmi_cmd_read_physical(struct kvm *kvm, u64 gpa, u64 size, int(*send)(
+	struct kvmi *, const struct kvmi_msg_hdr *,
+	int err, const void *buf, size_t),
+	const struct kvmi_msg_hdr *ctx)
+{
+	int err, ec = 0;
+	struct page *page = NULL;
+	void *ptr_page = NULL, *ptr = NULL;
+	size_t ptr_size = 0;
+
+	ptr_page = get_page_ptr(kvm, gpa, &page, false);
+	if (!ptr_page) {
+		ec = -KVM_EINVAL;
+		goto out;
+	}
+
+	ptr = ptr_page + (gpa & ~PAGE_MASK);
+	ptr_size = size;
+
+out:
+	err = send(IKVM(kvm), ctx, ec, ptr, ptr_size);
+
+	put_page_ptr(ptr_page, page);
+	return err;
+}
+
+int kvmi_cmd_write_physical(struct kvm *kvm, u64 gpa, u64 size, const void *buf)
+{
+	struct page *page;
+	void *ptr;
+
+	ptr = get_page_ptr(kvm, gpa, &page, true);
+	if (!ptr)
+		return -KVM_EINVAL;
+
+	memcpy(ptr + (gpa & ~PAGE_MASK), buf, size);
+
+	put_page_ptr(ptr, page);
+
+	return 0;
+}
+
 int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable)
 {
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 18c00dae0f2f..7bdff70d4309 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -174,6 +174,13 @@ int kvmi_cmd_get_page_access(struct kvmi *ikvm, u64 gpa, u8 *access);
 int kvmi_cmd_set_page_access(struct kvmi *ikvm, u64 gpa, u8 access);
 int kvmi_cmd_get_page_write_bitmap(struct kvmi *ikvm, u64 gpa, u32 *bitmap);
 int kvmi_cmd_set_page_write_bitmap(struct kvmi *ikvm, u64 gpa, u32 bitmap);
+int kvmi_cmd_read_physical(struct kvm *kvm, u64 gpa, u64 size,
+			   int (*send)(struct kvmi *,
+					const struct kvmi_msg_hdr*,
+					int err, const void *buf, size_t),
+			   const struct kvmi_msg_hdr *ctx);
+int kvmi_cmd_write_physical(struct kvm *kvm, u64 gpa, u64 size,
+			    const void *buf);
 int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable);
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index f9efb52d49c3..9c20a9cfda42 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -34,8 +34,10 @@ static const char *const msg_IDs[] = {
 	[KVMI_GET_PAGE_WRITE_BITMAP] = "KVMI_GET_PAGE_WRITE_BITMAP",
 	[KVMI_GET_VCPU_INFO]         = "KVMI_GET_VCPU_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
+	[KVMI_READ_PHYSICAL]         = "KVMI_READ_PHYSICAL",
 	[KVMI_SET_PAGE_ACCESS]       = "KVMI_SET_PAGE_ACCESS",
 	[KVMI_SET_PAGE_WRITE_BITMAP] = "KVMI_SET_PAGE_WRITE_BITMAP",
+	[KVMI_WRITE_PHYSICAL]        = "KVMI_WRITE_PHYSICAL",
 };
 
 static bool is_known_message(u16 id)
@@ -303,6 +305,44 @@ static int kvmi_get_vcpu(struct kvmi *ikvm, unsigned int vcpu_idx,
 	return 0;
 }
 
+static bool invalid_page_access(u64 gpa, u64 size)
+{
+	u64 off = gpa & ~PAGE_MASK;
+
+	return (size == 0 || size > PAGE_SIZE || off + size > PAGE_SIZE);
+}
+
+static int handle_read_physical(struct kvmi *ikvm,
+				const struct kvmi_msg_hdr *msg,
+				const void *_req)
+{
+	const struct kvmi_read_physical *req = _req;
+
+	if (invalid_page_access(req->gpa, req->size))
+		return -EINVAL;
+
+	return kvmi_cmd_read_physical(ikvm->kvm, req->gpa, req->size,
+				      kvmi_msg_vm_maybe_reply, msg);
+}
+
+static int handle_write_physical(struct kvmi *ikvm,
+				 const struct kvmi_msg_hdr *msg,
+				 const void *_req)
+{
+	const struct kvmi_write_physical *req = _req;
+	int ec;
+
+	if (invalid_page_access(req->gpa, req->size))
+		return -EINVAL;
+
+	if (msg->size < sizeof(*req) + req->size)
+		return -EINVAL;
+
+	ec = kvmi_cmd_write_physical(ikvm->kvm, req->gpa, req->size, req->data);
+
+	return kvmi_msg_vm_maybe_reply(ikvm, msg, ec, NULL, 0);
+}
+
 static bool enable_spp(struct kvmi *ikvm)
 {
 	if (!ikvm->spp.initialized) {
@@ -431,8 +471,10 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 	[KVMI_GET_PAGE_ACCESS]       = handle_get_page_access,
 	[KVMI_GET_PAGE_WRITE_BITMAP] = handle_get_page_write_bitmap,
 	[KVMI_GET_VERSION]           = handle_get_version,
+	[KVMI_READ_PHYSICAL]         = handle_read_physical,
 	[KVMI_SET_PAGE_ACCESS]       = handle_set_page_access,
 	[KVMI_SET_PAGE_WRITE_BITMAP] = handle_set_page_write_bitmap,
+	[KVMI_WRITE_PHYSICAL]        = handle_write_physical,
 };
 
 static int handle_event_reply(struct kvm_vcpu *vcpu,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 48/92] kvm: add kvm_vcpu_kick_and_wait()
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (46 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 47/92] kvm: introspection: add KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 49/92] kvm: introspection: add KVMI_PAUSE_VCPU and KVMI_EVENT_PAUSE_VCPU Adalbert Lazăr
                   ` (46 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

This function is needed for the KVMI_PAUSE_VCPU command. There are
cases when it is easier for the introspection tool if it knows that
the vCPU doesn't run guest code when the command is completed, without
waiting for the KVMI_EVENT_PAUSE_VCPU event.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ae4106aae16e..09bc06747642 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -738,6 +738,7 @@ void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu);
 bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
+void kvm_vcpu_kick_and_wait(struct kvm_vcpu *vcpu);
 int kvm_vcpu_yield_to(struct kvm_vcpu *target);
 void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2e11069b9565..5256d7321d0e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2370,6 +2370,16 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 EXPORT_SYMBOL_GPL(kvm_vcpu_kick);
 #endif /* !CONFIG_S390 */
 
+void kvm_vcpu_kick_and_wait(struct kvm_vcpu *vcpu)
+{
+	if (kvm_vcpu_wake_up(vcpu))
+		return;
+
+	if (kvm_request_needs_ipi(vcpu, KVM_REQUEST_WAIT))
+		smp_call_function_single(vcpu->cpu, ack_flush, NULL, 1);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_kick_and_wait);
+
 int kvm_vcpu_yield_to(struct kvm_vcpu *target)
 {
 	struct pid *pid;

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 49/92] kvm: introspection: add KVMI_PAUSE_VCPU and KVMI_EVENT_PAUSE_VCPU
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (47 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 48/92] kvm: add kvm_vcpu_kick_and_wait() Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 50/92] kvm: introspection: add KVMI_GET_REGISTERS Adalbert Lazăr
                   ` (45 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

This is the only vCPU command handled by the receiving worker.
It increments a pause request counter and kicks the vCPU.

This event is send by the vCPU thread, but has a low priority. It
will be sent after any other vCPU introspection event and when no vCPU
introspection command is queued.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 68 ++++++++++++++++++++++++++++++
 include/uapi/linux/kvm_para.h      |  1 +
 include/uapi/linux/kvmi.h          |  7 +++
 virt/kvm/kvmi.c                    | 65 ++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h                |  4 ++
 virt/kvm/kvmi_msg.c                | 61 +++++++++++++++++++++++++++
 6 files changed, 206 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index eef32107837a..558d3eb6007f 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -820,6 +820,48 @@ one page (offset + size <= PAGE_SIZE).
 
 * -KVM_EINVAL - the specified gpa is invalid
 
+16. KVMI_PAUSE_VCPU
+-------------------
+
+:Architecture: all
+:Versions: >= 1
+:Parameters:
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_pause_vcpu {
+		__u8 wait;
+		__u8 padding1;
+		__u16 padding2;
+		__u32 padding3;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+
+Kicks the vCPU from guest.
+
+If `wait` is 1, the command will wait for vCPU to acknowledge the IPI.
+
+The vCPU will handle the pending commands/events and send the
+*KVMI_EVENT_PAUSE_VCPU* event (one for every successful *KVMI_PAUSE_VCPU*
+command) before returning to guest.
+
+Please note that new vCPUs might by created at any time.
+The introspection tool should use *KVMI_CONTROL_VM_EVENTS* to enable the
+*KVMI_EVENT_CREATE_VCPU* event in order to stop these new vCPUs as well
+(by delaying the event reply).
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY  - the selected vCPU has too many queued *KVMI_EVENT_PAUSE_VCPU* events
+* -KVM_EPERM  - the *KVMI_EVENT_PAUSE_VCPU* event is disallowed (see *KVMI_CONTROL_EVENTS*)
+		and the introspection tool expects a reply.
 Events
 ======
 
@@ -992,3 +1034,29 @@ The *RETRY* action is used by the introspector to retry the execution of
 the current instruction. Either using single-step (if ``singlestep`` is
 not zero) or return to guest (if the introspector changed the instruction
 pointer or the page restrictions).
+
+4. KVMI_EVENT_PAUSE_VCPU
+------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+
+:Returns:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_event_reply;
+
+This event is sent in response to a *KVMI_PAUSE_VCPU* command and
+cannot be disabled via *KVMI_CONTROL_EVENTS*.
+
+This event has a low priority. It will be sent after any other vCPU
+introspection event and when no vCPU introspection command is queued.
+
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 54c0e20f5b64..07e3f2662b36 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -18,6 +18,7 @@
 #define KVM_EPERM		EPERM
 #define KVM_EOPNOTSUPP		95
 #define KVM_EAGAIN		11
+#define KVM_EBUSY		EBUSY
 #define KVM_ENOMEM		ENOMEM
 
 #define KVM_HC_VAPIC_POLL_IRQ		1
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index be3f066f314e..ca9c6b6aeed5 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -177,6 +177,13 @@ struct kvmi_get_vcpu_info_reply {
 	__u64 tsc_speed;
 };
 
+struct kvmi_pause_vcpu {
+	__u8 wait;
+	__u8 padding1;
+	__u16 padding2;
+	__u32 padding3;
+};
+
 struct kvmi_control_events {
 	__u16 event_id;
 	__u8 enable;
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index a84eb150e116..85de2da3eb7b 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -11,6 +11,8 @@
 #include <linux/kthread.h>
 #include <linux/bitmap.h>
 
+#define MAX_PAUSE_REQUESTS 1001
+
 static struct kmem_cache *msg_cache;
 static struct kmem_cache *radix_cache;
 static struct kmem_cache *job_cache;
@@ -1090,6 +1092,39 @@ static bool kvmi_create_vcpu_event(struct kvm_vcpu *vcpu)
 	return ret;
 }
 
+static bool __kvmi_pause_vcpu_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+	bool ret = false;
+
+	action = kvmi_msg_send_pause_vcpu(vcpu);
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		ret = true;
+		break;
+	default:
+		kvmi_handle_common_event_actions(vcpu, action, "PAUSE");
+	}
+
+	return ret;
+}
+
+static bool kvmi_pause_vcpu_event(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm;
+	bool ret = true;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return true;
+
+	ret = __kvmi_pause_vcpu_event(vcpu);
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
 void kvmi_run_jobs(struct kvm_vcpu *vcpu)
 {
 	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
@@ -1154,6 +1189,7 @@ int kvmi_run_jobs_and_wait(struct kvm_vcpu *vcpu)
 
 void kvmi_handle_requests(struct kvm_vcpu *vcpu)
 {
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
 	struct kvmi *ikvm;
 
 	ikvm = kvmi_get(vcpu->kvm);
@@ -1165,6 +1201,12 @@ void kvmi_handle_requests(struct kvm_vcpu *vcpu)
 
 		if (err)
 			break;
+
+		if (!atomic_read(&ivcpu->pause_requests))
+			break;
+
+		atomic_dec(&ivcpu->pause_requests);
+		kvmi_pause_vcpu_event(vcpu);
 	}
 
 	kvmi_put(vcpu->kvm);
@@ -1351,10 +1393,33 @@ int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
 	return 0;
 }
 
+int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu, bool wait)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	unsigned int req = KVM_REQ_INTROSPECTION;
+
+	if (atomic_read(&ivcpu->pause_requests) > MAX_PAUSE_REQUESTS)
+		return -KVM_EBUSY;
+
+	atomic_inc(&ivcpu->pause_requests);
+	kvm_make_request(req, vcpu);
+	if (wait)
+		kvm_vcpu_kick_and_wait(vcpu);
+	else
+		kvm_vcpu_kick(vcpu);
+
+	return 0;
+}
+
 static void kvmi_job_abort(struct kvm_vcpu *vcpu, void *ctx)
 {
 	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
 
+	/*
+	 * The thread that might increment this atomic is stopped
+	 * and this thread is the only one that could decrement it.
+	 */
+	atomic_set(&ivcpu->pause_requests, 0);
 	ivcpu->reply_waiting = false;
 }
 
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 7bdff70d4309..cb3b0ce87bc1 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -100,6 +100,8 @@ struct kvmi_vcpu {
 	bool rep_complete;
 	bool effective_rep_complete;
 
+	atomic_t pause_requests;
+
 	bool reply_waiting;
 	struct kvmi_vcpu_reply reply;
 
@@ -164,6 +166,7 @@ u32 kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u8 access,
 		     bool *singlestep, bool *rep_complete,
 		     u64 *ctx_addr, u8 *ctx, u32 *ctx_size);
 u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu);
+u32 kvmi_msg_send_pause_vcpu(struct kvm_vcpu *vcpu);
 int kvmi_msg_send_unhook(struct kvmi *ikvm);
 
 /* kvmi.c */
@@ -185,6 +188,7 @@ int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable);
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
 			       bool enable);
+int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu, bool wait);
 int kvmi_run_jobs_and_wait(struct kvm_vcpu *vcpu);
 int kvmi_add_job(struct kvm_vcpu *vcpu,
 		 void (*fct)(struct kvm_vcpu *vcpu, void *ctx),
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 9c20a9cfda42..a4446eed354d 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -34,6 +34,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_GET_PAGE_WRITE_BITMAP] = "KVMI_GET_PAGE_WRITE_BITMAP",
 	[KVMI_GET_VCPU_INFO]         = "KVMI_GET_VCPU_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
+	[KVMI_PAUSE_VCPU]            = "KVMI_PAUSE_VCPU",
 	[KVMI_READ_PHYSICAL]         = "KVMI_READ_PHYSICAL",
 	[KVMI_SET_PAGE_ACCESS]       = "KVMI_SET_PAGE_ACCESS",
 	[KVMI_SET_PAGE_WRITE_BITMAP] = "KVMI_SET_PAGE_WRITE_BITMAP",
@@ -457,6 +458,53 @@ static bool invalid_vcpu_hdr(const struct kvmi_vcpu_hdr *hdr)
 	return hdr->padding1 || hdr->padding2;
 }
 
+/*
+ * We handle this vCPU command on the receiving thread to make it easier
+ * for userspace to implement a 'pause VM' command. Usually, this is done
+ * by sending one 'pause vCPU' command for every vCPU. By handling the
+ * command here, the userspace can:
+ *    - optimize, by not requesting a reply for the first N-1 vCPU's
+ *    - consider the VM stopped once it receives the reply
+ *      for the last 'pause vCPU' command
+ */
+static int handle_pause_vcpu(struct kvmi *ikvm,
+			     const struct kvmi_msg_hdr *msg,
+			     const void *_req)
+{
+	const struct kvmi_pause_vcpu *req = _req;
+	const struct kvmi_vcpu_hdr *cmd;
+	struct kvm_vcpu *vcpu = NULL;
+	int err;
+
+	if (req->padding1 || req->padding2 || req->padding3) {
+		err = -KVM_EINVAL;
+		goto reply;
+	}
+
+	cmd = (const struct kvmi_vcpu_hdr *) (msg + 1);
+
+	if (invalid_vcpu_hdr(cmd)) {
+		err = -KVM_EINVAL;
+		goto reply;
+	}
+
+	if (!is_event_allowed(ikvm, KVMI_EVENT_PAUSE_VCPU)) {
+		err = -KVM_EPERM;
+
+		if (ikvm->cmd_reply_disabled)
+			return kvmi_msg_vm_reply(ikvm, msg, err, NULL, 0);
+
+		goto reply;
+	}
+
+	err = kvmi_get_vcpu(ikvm, cmd->vcpu, &vcpu);
+	if (!err)
+		err = kvmi_cmd_pause_vcpu(vcpu, req->wait == 1);
+
+reply:
+	return kvmi_msg_vm_maybe_reply(ikvm, msg, err, NULL, 0);
+}
+
 /*
  * These commands are executed on the receiving thread/worker.
  */
@@ -471,6 +519,7 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 	[KVMI_GET_PAGE_ACCESS]       = handle_get_page_access,
 	[KVMI_GET_PAGE_WRITE_BITMAP] = handle_get_page_write_bitmap,
 	[KVMI_GET_VERSION]           = handle_get_version,
+	[KVMI_PAUSE_VCPU]            = handle_pause_vcpu,
 	[KVMI_READ_PHYSICAL]         = handle_read_physical,
 	[KVMI_SET_PAGE_ACCESS]       = handle_set_page_access,
 	[KVMI_SET_PAGE_WRITE_BITMAP] = handle_set_page_write_bitmap,
@@ -966,3 +1015,15 @@ u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu)
 
 	return action;
 }
+
+u32 kvmi_msg_send_pause_vcpu(struct kvm_vcpu *vcpu)
+{
+	int err, action;
+
+	err = kvmi_send_event(vcpu, KVMI_EVENT_PAUSE_VCPU, NULL, 0,
+			      NULL, 0, &action);
+	if (err)
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return action;
+}

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 50/92] kvm: introspection: add KVMI_GET_REGISTERS
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (48 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 49/92] kvm: introspection: add KVMI_PAUSE_VCPU and KVMI_EVENT_PAUSE_VCPU Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 51/92] kvm: introspection: add KVMI_SET_REGISTERS Adalbert Lazăr
                   ` (44 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This command is used to get kvm_regs and kvm_sregs structures,
plus the list of struct kvm_msrs.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 43 ++++++++++++++++
 arch/x86/include/uapi/asm/kvmi.h   | 15 ++++++
 arch/x86/kvm/kvmi.c                | 78 ++++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h                |  5 ++
 virt/kvm/kvmi_msg.c                | 17 +++++++
 5 files changed, 158 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 558d3eb6007f..edf81e03ca3c 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -862,6 +862,49 @@ The introspection tool should use *KVMI_CONTROL_VM_EVENTS* to enable the
 * -KVM_EBUSY  - the selected vCPU has too many queued *KVMI_EVENT_PAUSE_VCPU* events
 * -KVM_EPERM  - the *KVMI_EVENT_PAUSE_VCPU* event is disallowed (see *KVMI_CONTROL_EVENTS*)
 		and the introspection tool expects a reply.
+
+17. KVMI_GET_REGISTERS
+----------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_get_registers {
+		__u16 nmsrs;
+		__u16 padding1;
+		__u32 padding2;
+		__u32 msrs_idx[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_registers_reply {
+		__u32 mode;
+		__u32 padding;
+		struct kvm_regs regs;
+		struct kvm_sregs sregs;
+		struct kvm_msrs msrs;
+	};
+
+For the given vCPU and the ``nmsrs`` sized array of MSRs registers,
+returns the current vCPU mode (in bytes: 2, 4 or 8), the general purpose
+registers, the special registers and the requested set of MSRs.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - one of the indicated MSR-s is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_ENOMEM - not enough memory to allocate the reply
+
 Events
 ======
 
diff --git a/arch/x86/include/uapi/asm/kvmi.h b/arch/x86/include/uapi/asm/kvmi.h
index 551f9ed1ed9c..98fb27e1273c 100644
--- a/arch/x86/include/uapi/asm/kvmi.h
+++ b/arch/x86/include/uapi/asm/kvmi.h
@@ -26,4 +26,19 @@ struct kvmi_event_arch {
 	} msrs;
 };
 
+struct kvmi_get_registers {
+	__u16 nmsrs;
+	__u16 padding1;
+	__u32 padding2;
+	__u32 msrs_idx[0];
+};
+
+struct kvmi_get_registers_reply {
+	__u32 mode;
+	__u32 padding;
+	struct kvm_regs regs;
+	struct kvm_sregs sregs;
+	struct kvm_msrs msrs;
+};
+
 #endif /* _UAPI_ASM_X86_KVMI_H */
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index fa290fbf1f75..a78771b21d2f 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -7,6 +7,25 @@
 #include "x86.h"
 #include "../../../virt/kvm/kvmi_int.h"
 
+static void *alloc_get_registers_reply(const struct kvmi_msg_hdr *msg,
+				       const struct kvmi_get_registers *req,
+				       size_t *rpl_size)
+{
+	struct kvmi_get_registers_reply *rpl;
+	u16 k, n = req->nmsrs;
+
+	*rpl_size = sizeof(*rpl) + sizeof(rpl->msrs.entries[0]) * n;
+	rpl = kvmi_msg_alloc_check(*rpl_size);
+	if (rpl) {
+		rpl->msrs.nmsrs = n;
+
+		for (k = 0; k < n; k++)
+			rpl->msrs.entries[k].index = req->msrs_idx[k];
+	}
+
+	return rpl;
+}
+
 /*
  * TODO: this can be done from userspace.
  *   - all these registers are sent with struct kvmi_event_arch
@@ -38,6 +57,65 @@ static unsigned int kvmi_vcpu_mode(const struct kvm_vcpu *vcpu,
 	return mode;
 }
 
+static int kvmi_get_registers(struct kvm_vcpu *vcpu, u32 *mode,
+			      struct kvm_regs *regs,
+			      struct kvm_sregs *sregs,
+			      struct kvm_msrs *msrs)
+{
+	struct kvm_msr_entry *msr = msrs->entries;
+	struct kvm_msr_entry *end = msrs->entries + msrs->nmsrs;
+
+	kvm_arch_vcpu_get_regs(vcpu, regs);
+	kvm_arch_vcpu_get_sregs(vcpu, sregs);
+	*mode = kvmi_vcpu_mode(vcpu, sregs);
+
+	for (; msr < end; msr++) {
+		struct msr_data m = {
+			.index = msr->index,
+			.host_initiated = true
+		};
+		int err = kvm_get_msr(vcpu, &m);
+
+		if (err)
+			return -KVM_EINVAL;
+
+		msr->data = m.data;
+	}
+
+	return 0;
+}
+
+int kvmi_arch_cmd_get_registers(struct kvm_vcpu *vcpu,
+				const struct kvmi_msg_hdr *msg,
+				const struct kvmi_get_registers *req,
+				struct kvmi_get_registers_reply **dest,
+				size_t *dest_size)
+{
+	struct kvmi_get_registers_reply *rpl;
+	size_t rpl_size = 0;
+	int err;
+
+	if (req->padding1 || req->padding2)
+		return -KVM_EINVAL;
+
+	if (msg->size < sizeof(struct kvmi_vcpu_hdr)
+			+ sizeof(*req) + req->nmsrs * sizeof(req->msrs_idx[0]))
+		return -KVM_EINVAL;
+
+	rpl = alloc_get_registers_reply(msg, req, &rpl_size);
+	if (!rpl)
+		return -KVM_ENOMEM;
+
+	err = kvmi_get_registers(vcpu, &rpl->mode, &rpl->regs,
+				 &rpl->sregs, &rpl->msrs);
+
+	*dest = rpl;
+	*dest_size = rpl_size;
+
+	return err;
+
+}
+
 static void kvmi_get_msrs(struct kvm_vcpu *vcpu, struct kvmi_event_arch *event)
 {
 	struct msr_data msr;
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index cb3b0ce87bc1..b547809d13ae 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -200,6 +200,11 @@ void kvmi_handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action,
 void kvmi_arch_update_page_tracking(struct kvm *kvm,
 				    struct kvm_memory_slot *slot,
 				    struct kvmi_mem_access *m);
+int kvmi_arch_cmd_get_registers(struct kvm_vcpu *vcpu,
+				const struct kvmi_msg_hdr *msg,
+				const struct kvmi_get_registers *req,
+				struct kvmi_get_registers_reply **dest,
+				size_t *dest_size);
 int kvmi_arch_cmd_get_page_access(struct kvmi *ikvm,
 				  const struct kvmi_msg_hdr *msg,
 				  const struct kvmi_get_page_access *req,
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index a4446eed354d..9ae0622ff09e 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -32,6 +32,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_GET_GUEST_INFO]        = "KVMI_GET_GUEST_INFO",
 	[KVMI_GET_PAGE_ACCESS]       = "KVMI_GET_PAGE_ACCESS",
 	[KVMI_GET_PAGE_WRITE_BITMAP] = "KVMI_GET_PAGE_WRITE_BITMAP",
+	[KVMI_GET_REGISTERS]         = "KVMI_GET_REGISTERS",
 	[KVMI_GET_VCPU_INFO]         = "KVMI_GET_VCPU_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
 	[KVMI_PAUSE_VCPU]            = "KVMI_PAUSE_VCPU",
@@ -589,6 +590,21 @@ static int handle_get_vcpu_info(struct kvm_vcpu *vcpu,
 	return reply_cb(vcpu, msg, 0, &rpl, sizeof(rpl));
 }
 
+static int handle_get_registers(struct kvm_vcpu *vcpu,
+				const struct kvmi_msg_hdr *msg,
+				const void *req, vcpu_reply_fct reply_cb)
+{
+	struct kvmi_get_registers_reply *rpl = NULL;
+	size_t rpl_size = 0;
+	int err, ec;
+
+	ec = kvmi_arch_cmd_get_registers(vcpu, msg, req, &rpl, &rpl_size);
+
+	err = reply_cb(vcpu, msg, ec, rpl, rpl_size);
+	kvmi_msg_free(rpl);
+	return err;
+}
+
 static int handle_control_events(struct kvm_vcpu *vcpu,
 				 const struct kvmi_msg_hdr *msg,
 				 const void *_req,
@@ -622,6 +638,7 @@ static int(*const msg_vcpu[])(struct kvm_vcpu *,
 			      vcpu_reply_fct) = {
 	[KVMI_CONTROL_EVENTS]   = handle_control_events,
 	[KVMI_EVENT_REPLY]      = handle_event_reply,
+	[KVMI_GET_REGISTERS]    = handle_get_registers,
 	[KVMI_GET_VCPU_INFO]    = handle_get_vcpu_info,
 };
 

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 51/92] kvm: introspection: add KVMI_SET_REGISTERS
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (49 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 50/92] kvm: introspection: add KVMI_GET_REGISTERS Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 52/92] kvm: introspection: add KVMI_GET_CPUID Adalbert Lazăr
                   ` (43 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Mircea Cîrjaliu

From: Mihai Donțu <mdontu@bitdefender.com>

This command is allowed only during a vCPU event (an event has been sent
and the vCPU is waiting for the reply). The registers will be set only
when the reply has been received.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 28 +++++++++++++++++++++++++
 arch/x86/kvm/x86.c                 | 33 ++++++++++++++++++++++++++++++
 include/linux/kvm_host.h           |  1 +
 virt/kvm/kvmi.c                    | 25 ++++++++++++++++++++++
 virt/kvm/kvmi_int.h                |  5 +++++
 virt/kvm/kvmi_msg.c                | 16 +++++++++++++++
 6 files changed, 108 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index edf81e03ca3c..b6722d071ab7 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -905,6 +905,34 @@ registers, the special registers and the requested set of MSRs.
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 * -KVM_ENOMEM - not enough memory to allocate the reply
 
+18. KVMI_SET_REGISTERS
+----------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvm_regs;
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Sets the general purpose registers for the given vCPU. The changes become
+visible to other threads accessing the KVM vCPU structure after the event
+currently being handled is replied to.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+
 Events
 ======
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ef29ef7617bf..62d15bbb2332 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8431,6 +8431,39 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	return 0;
 }
 
+/*
+ * Similar to __set_regs() but it does not reset the exceptions
+ */
+void kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
+{
+	vcpu->arch.emulate_regs_need_sync_from_vcpu = true;
+	vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
+
+	kvm_register_write(vcpu, VCPU_REGS_RAX, regs->rax);
+	kvm_register_write(vcpu, VCPU_REGS_RBX, regs->rbx);
+	kvm_register_write(vcpu, VCPU_REGS_RCX, regs->rcx);
+	kvm_register_write(vcpu, VCPU_REGS_RDX, regs->rdx);
+	kvm_register_write(vcpu, VCPU_REGS_RSI, regs->rsi);
+	kvm_register_write(vcpu, VCPU_REGS_RDI, regs->rdi);
+	kvm_register_write(vcpu, VCPU_REGS_RSP, regs->rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RBP, regs->rbp);
+#ifdef CONFIG_X86_64
+	kvm_register_write(vcpu, VCPU_REGS_R8, regs->r8);
+	kvm_register_write(vcpu, VCPU_REGS_R9, regs->r9);
+	kvm_register_write(vcpu, VCPU_REGS_R10, regs->r10);
+	kvm_register_write(vcpu, VCPU_REGS_R11, regs->r11);
+	kvm_register_write(vcpu, VCPU_REGS_R12, regs->r12);
+	kvm_register_write(vcpu, VCPU_REGS_R13, regs->r13);
+	kvm_register_write(vcpu, VCPU_REGS_R14, regs->r14);
+	kvm_register_write(vcpu, VCPU_REGS_R15, regs->r15);
+#endif
+
+	kvm_rip_write(vcpu, regs->rip);
+	kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);
+
+	kvm_make_request(KVM_REQ_EVENT, vcpu);
+}
+
 void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
 {
 	struct kvm_segment cs;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 09bc06747642..c8eb1a4d997f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -791,6 +791,7 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
 void kvm_arch_vcpu_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
 int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
+void kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
 int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 				  struct kvm_sregs *sregs);
 void kvm_arch_vcpu_get_sregs(struct kvm_vcpu *vcpu,
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 85de2da3eb7b..a20891d3a2ce 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -1212,6 +1212,31 @@ void kvmi_handle_requests(struct kvm_vcpu *vcpu)
 	kvmi_put(vcpu->kvm);
 }
 
+void kvmi_post_reply(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	if (ivcpu->have_delayed_regs) {
+		kvm_arch_vcpu_set_regs(vcpu, &ivcpu->delayed_regs);
+		ivcpu->have_delayed_regs = false;
+	}
+}
+
+int kvmi_cmd_set_registers(struct kvm_vcpu *vcpu, const struct kvm_regs *regs)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	if (ivcpu->reply_waiting) {
+		/* defer set registers until we get the reply */
+		memcpy(&ivcpu->delayed_regs, regs, sizeof(ivcpu->delayed_regs));
+		ivcpu->have_delayed_regs = true;
+	} else {
+		kvmi_err(IKVM(vcpu->kvm), "Dropped KVMI_SET_REGISTERS\n");
+	}
+
+	return 0;
+}
+
 int kvmi_cmd_get_page_access(struct kvmi *ikvm, u64 gpa, u8 *access)
 {
 	gfn_t gfn = gpa_to_gfn(gpa);
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index b547809d13ae..7bc3dd1f2298 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -105,6 +105,9 @@ struct kvmi_vcpu {
 	bool reply_waiting;
 	struct kvmi_vcpu_reply reply;
 
+	bool have_delayed_regs;
+	struct kvm_regs delayed_regs;
+
 	DECLARE_BITMAP(ev_mask, KVMI_NUM_EVENTS);
 
 	struct list_head job_list;
@@ -173,6 +176,7 @@ int kvmi_msg_send_unhook(struct kvmi *ikvm);
 void *kvmi_msg_alloc(void);
 void *kvmi_msg_alloc_check(size_t size);
 void kvmi_msg_free(void *addr);
+int kvmi_cmd_set_registers(struct kvm_vcpu *vcpu, const struct kvm_regs *regs);
 int kvmi_cmd_get_page_access(struct kvmi *ikvm, u64 gpa, u8 *access);
 int kvmi_cmd_set_page_access(struct kvmi *ikvm, u64 gpa, u8 access);
 int kvmi_cmd_get_page_write_bitmap(struct kvmi *ikvm, u64 gpa, u32 *bitmap);
@@ -190,6 +194,7 @@ int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
 			       bool enable);
 int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu, bool wait);
 int kvmi_run_jobs_and_wait(struct kvm_vcpu *vcpu);
+void kvmi_post_reply(struct kvm_vcpu *vcpu);
 int kvmi_add_job(struct kvm_vcpu *vcpu,
 		 void (*fct)(struct kvm_vcpu *vcpu, void *ctx),
 		 void *ctx, void (*free_fct)(void *ctx));
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 9ae0622ff09e..355cec70a28d 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -39,6 +39,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_READ_PHYSICAL]         = "KVMI_READ_PHYSICAL",
 	[KVMI_SET_PAGE_ACCESS]       = "KVMI_SET_PAGE_ACCESS",
 	[KVMI_SET_PAGE_WRITE_BITMAP] = "KVMI_SET_PAGE_WRITE_BITMAP",
+	[KVMI_SET_REGISTERS]         = "KVMI_SET_REGISTERS",
 	[KVMI_WRITE_PHYSICAL]        = "KVMI_WRITE_PHYSICAL",
 };
 
@@ -605,6 +606,19 @@ static int handle_get_registers(struct kvm_vcpu *vcpu,
 	return err;
 }
 
+static int handle_set_registers(struct kvm_vcpu *vcpu,
+				const struct kvmi_msg_hdr *msg,
+				const void *_req,
+				vcpu_reply_fct reply_cb)
+{
+	const struct kvm_regs *regs = _req;
+	int err;
+
+	err = kvmi_cmd_set_registers(vcpu, regs);
+
+	return reply_cb(vcpu, msg, err, NULL, 0);
+}
+
 static int handle_control_events(struct kvm_vcpu *vcpu,
 				 const struct kvmi_msg_hdr *msg,
 				 const void *_req,
@@ -640,6 +654,7 @@ static int(*const msg_vcpu[])(struct kvm_vcpu *,
 	[KVMI_EVENT_REPLY]      = handle_event_reply,
 	[KVMI_GET_REGISTERS]    = handle_get_registers,
 	[KVMI_GET_VCPU_INFO]    = handle_get_vcpu_info,
+	[KVMI_SET_REGISTERS]    = handle_set_registers,
 };
 
 static void kvmi_job_vcpu_cmd(struct kvm_vcpu *vcpu, void *_ctx)
@@ -937,6 +952,7 @@ int kvmi_send_event(struct kvm_vcpu *vcpu, u32 ev_id,
 	if (err)
 		goto out;
 
+	kvmi_post_reply(vcpu);
 	*action = ivcpu->reply.action;
 
 out:

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 52/92] kvm: introspection: add KVMI_GET_CPUID
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (50 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 51/92] kvm: introspection: add KVMI_SET_REGISTERS Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 53/92] kvm: introspection: add KVMI_INJECT_EXCEPTION + KVMI_EVENT_TRAP Adalbert Lazăr
                   ` (42 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Marian Rotariu

From: Marian Rotariu <marian.c.rotariu@gmail.com>

This command returns a CPUID leaf (as seen by the guest OS).

Signed-off-by: Marian Rotariu <marian.c.rotariu@gmail.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 36 ++++++++++++++++++++++++++++++
 arch/x86/include/uapi/asm/kvmi.h   | 12 ++++++++++
 arch/x86/kvm/kvmi.c                | 19 ++++++++++++++++
 include/uapi/linux/kvm_para.h      |  1 +
 virt/kvm/kvmi_int.h                |  3 +++
 virt/kvm/kvmi_msg.c                | 16 +++++++++++++
 6 files changed, 87 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index b6722d071ab7..9e15132ed976 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -933,6 +933,42 @@ currently being handled is replied to.
 * -KVM_EINVAL - padding is not zero
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 
+19. KVMI_GET_CPUID
+------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_get_cpuid {
+		__u32 function;
+		__u32 index;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_cpuid_reply {
+		__u32 eax;
+		__u32 ebx;
+		__u32 ecx;
+		__u32 edx;
+	};
+
+Returns a CPUID leaf (as seen by the guest OS).
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_ENOENT - the selected leaf is not present or is invalid
+
 Events
 ======
 
diff --git a/arch/x86/include/uapi/asm/kvmi.h b/arch/x86/include/uapi/asm/kvmi.h
index 98fb27e1273c..fa2719226198 100644
--- a/arch/x86/include/uapi/asm/kvmi.h
+++ b/arch/x86/include/uapi/asm/kvmi.h
@@ -41,4 +41,16 @@ struct kvmi_get_registers_reply {
 	struct kvm_msrs msrs;
 };
 
+struct kvmi_get_cpuid {
+	__u32 function;
+	__u32 index;
+};
+
+struct kvmi_get_cpuid_reply {
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+};
+
 #endif /* _UAPI_ASM_X86_KVMI_H */
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index a78771b21d2f..4615bbe9c0db 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2019 Bitdefender S.R.L.
  */
 #include "x86.h"
+#include "cpuid.h"
 #include "../../../virt/kvm/kvmi_int.h"
 
 static void *alloc_get_registers_reply(const struct kvmi_msg_hdr *msg,
@@ -211,6 +212,24 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 	return ret;
 }
 
+int kvmi_arch_cmd_get_cpuid(struct kvm_vcpu *vcpu,
+			    const struct kvmi_get_cpuid *req,
+			    struct kvmi_get_cpuid_reply *rpl)
+{
+	struct kvm_cpuid_entry2 *e;
+
+	e = kvm_find_cpuid_entry(vcpu, req->function, req->index);
+	if (!e)
+		return -KVM_ENOENT;
+
+	rpl->eax = e->eax;
+	rpl->ebx = e->ebx;
+	rpl->ecx = e->ecx;
+	rpl->edx = e->edx;
+
+	return 0;
+}
+
 int kvmi_arch_cmd_get_vcpu_info(struct kvm_vcpu *vcpu,
 				struct kvmi_get_vcpu_info_reply *rpl)
 {
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 07e3f2662b36..553f168331a4 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -19,6 +19,7 @@
 #define KVM_EOPNOTSUPP		95
 #define KVM_EAGAIN		11
 #define KVM_EBUSY		EBUSY
+#define KVM_ENOENT		ENOENT
 #define KVM_ENOMEM		ENOMEM
 
 #define KVM_HC_VAPIC_POLL_IRQ		1
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 7bc3dd1f2298..22508d147495 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -230,6 +230,9 @@ int kvmi_arch_cmd_set_page_write_bitmap(struct kvmi *ikvm,
 void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev);
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access);
+int kvmi_arch_cmd_get_cpuid(struct kvm_vcpu *vcpu,
+			    const struct kvmi_get_cpuid *req,
+			    struct kvmi_get_cpuid_reply *rpl);
 int kvmi_arch_cmd_get_vcpu_info(struct kvm_vcpu *vcpu,
 				struct kvmi_get_vcpu_info_reply *rpl);
 
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 355cec70a28d..9548042de618 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -29,6 +29,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_CONTROL_VM_EVENTS]     = "KVMI_CONTROL_VM_EVENTS",
 	[KVMI_EVENT]                 = "KVMI_EVENT",
 	[KVMI_EVENT_REPLY]           = "KVMI_EVENT_REPLY",
+	[KVMI_GET_CPUID]             = "KVMI_GET_CPUID",
 	[KVMI_GET_GUEST_INFO]        = "KVMI_GET_GUEST_INFO",
 	[KVMI_GET_PAGE_ACCESS]       = "KVMI_GET_PAGE_ACCESS",
 	[KVMI_GET_PAGE_WRITE_BITMAP] = "KVMI_GET_PAGE_WRITE_BITMAP",
@@ -641,6 +642,20 @@ static int handle_control_events(struct kvm_vcpu *vcpu,
 	return reply_cb(vcpu, msg, ec, NULL, 0);
 }
 
+static int handle_get_cpuid(struct kvm_vcpu *vcpu,
+			    const struct kvmi_msg_hdr *msg,
+			    const void *req, vcpu_reply_fct reply_cb)
+{
+	struct kvmi_get_cpuid_reply rpl;
+	int ec;
+
+	memset(&rpl, 0, sizeof(rpl));
+
+	ec = kvmi_arch_cmd_get_cpuid(vcpu, req, &rpl);
+
+	return reply_cb(vcpu, msg, ec, &rpl, sizeof(rpl));
+}
+
 /*
  * These commands are executed on the vCPU thread. The receiving thread
  * passes the messages using a newly allocated 'struct kvmi_vcpu_cmd'
@@ -652,6 +667,7 @@ static int(*const msg_vcpu[])(struct kvm_vcpu *,
 			      vcpu_reply_fct) = {
 	[KVMI_CONTROL_EVENTS]   = handle_control_events,
 	[KVMI_EVENT_REPLY]      = handle_event_reply,
+	[KVMI_GET_CPUID]        = handle_get_cpuid,
 	[KVMI_GET_REGISTERS]    = handle_get_registers,
 	[KVMI_GET_VCPU_INFO]    = handle_get_vcpu_info,
 	[KVMI_SET_REGISTERS]    = handle_set_registers,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 53/92] kvm: introspection: add KVMI_INJECT_EXCEPTION + KVMI_EVENT_TRAP
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (51 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 52/92] kvm: introspection: add KVMI_GET_CPUID Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 54/92] kvm: introspection: add KVMI_CONTROL_CR and KVMI_EVENT_CR Adalbert Lazăr
                   ` (41 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Joerg Roedel, Nicușor Cîțu

From: Mihai Donțu <mdontu@bitdefender.com>

The KVMI_INJECT_EXCEPTION command is used by the introspection tool to
inject exceptions (eg. get a page from swap). The exception is queued
right before entering the guest. If there is already an event pending
(exception, interrupt or NMI) we notify the introspection tool with the
KVMI_EVENT_TRAP event and abort the injection. The introspecion tool is
expected to try again at a later time.

CC: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst |  71 +++++++++++++++++++
 arch/x86/include/uapi/asm/kvmi.h   |   8 +++
 arch/x86/kvm/kvmi.c                | 108 +++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c                 |  11 +++
 include/linux/kvmi.h               |   4 ++
 include/uapi/linux/kvmi.h          |   8 +++
 virt/kvm/kvmi.c                    |  40 +++++++++++
 virt/kvm/kvmi_int.h                |  16 +++++
 virt/kvm/kvmi_msg.c                |  21 ++++++
 9 files changed, 287 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 9e15132ed976..1eaed7c61148 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -969,6 +969,44 @@ Returns a CPUID leaf (as seen by the guest OS).
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 * -KVM_ENOENT - the selected leaf is not present or is invalid
 
+20. KVMI_INJECT_EXCEPTION
+-------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_inject_exception {
+		__u8 nr;
+		__u8 has_error;
+		__u16 padding;
+		__u32 error_code;
+		__u64 address;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Injects a vCPU exception with or without an error code. In case of page fault
+exception, the guest virtual address has to be specified.
+
+The introspection tool should enable the *KVMI_EVENT_TRAP* event in
+order to be notified if the expection was not delivered.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified exception number is invalid
+* -KVM_EINVAL - the specified address is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+
 Events
 ======
 
@@ -1167,3 +1205,36 @@ cannot be disabled via *KVMI_CONTROL_EVENTS*.
 This event has a low priority. It will be sent after any other vCPU
 introspection event and when no vCPU introspection command is queued.
 
+5. KVMI_EVENT_TRAP
+------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_trap {
+		__u32 vector;
+		__u32 type;
+		__u32 error_code;
+		__u32 padding;
+		__u64 cr2;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_event_reply;
+
+This event is sent if a previous *KVMI_INJECT_EXCEPTION* command has
+been overwritten by an interrupt picked up during guest reentry and the
+introspection has been enabled for this event (see *KVMI_CONTROL_EVENTS*).
+
+``kvmi_event``, exception/interrupt number (vector), exception/interrupt
+type, exception code (``error_code``) and CR2 are sent to the introspector.
+
diff --git a/arch/x86/include/uapi/asm/kvmi.h b/arch/x86/include/uapi/asm/kvmi.h
index fa2719226198..b074ad735e84 100644
--- a/arch/x86/include/uapi/asm/kvmi.h
+++ b/arch/x86/include/uapi/asm/kvmi.h
@@ -26,6 +26,14 @@ struct kvmi_event_arch {
 	} msrs;
 };
 
+struct kvmi_event_trap {
+	__u32 vector;
+	__u32 type;
+	__u32 error_code;
+	__u32 padding;
+	__u64 cr2;
+};
+
 struct kvmi_get_registers {
 	__u16 nmsrs;
 	__u16 padding1;
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 4615bbe9c0db..8c18030d12f4 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -6,6 +6,7 @@
  */
 #include "x86.h"
 #include "cpuid.h"
+#include <asm/vmx.h>
 #include "../../../virt/kvm/kvmi_int.h"
 
 static void *alloc_get_registers_reply(const struct kvmi_msg_hdr *msg,
@@ -212,6 +213,87 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 	return ret;
 }
 
+bool kvmi_arch_queue_exception(struct kvm_vcpu *vcpu)
+{
+	if (!vcpu->arch.exception.injected &&
+	    !vcpu->arch.interrupt.injected &&
+	    !vcpu->arch.nmi_injected) {
+		struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+		struct x86_exception e = {
+			.vector = ivcpu->exception.nr,
+			.error_code_valid = ivcpu->exception.error_code_valid,
+			.error_code = ivcpu->exception.error_code,
+			.address = ivcpu->exception.address,
+		};
+
+		if (e.vector == PF_VECTOR)
+			kvm_inject_page_fault(vcpu, &e);
+		else if (e.error_code_valid)
+			kvm_queue_exception_e(vcpu, e.vector, e.error_code);
+		else
+			kvm_queue_exception(vcpu, e.vector);
+
+		return true;
+	}
+
+	return false;
+}
+
+static u32 kvmi_send_trap(struct kvm_vcpu *vcpu, u32 vector, u32 type,
+			  u32 error_code, u64 cr2)
+{
+	struct kvmi_event_trap e = {
+		.error_code = error_code,
+		.vector = vector,
+		.type = type,
+		.cr2 = cr2
+	};
+	int err, action;
+
+	err = kvmi_send_event(vcpu, KVMI_EVENT_TRAP, &e, sizeof(e),
+			      NULL, 0, &action);
+	if (err)
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return action;
+}
+
+void kvmi_arch_trap_event(struct kvm_vcpu *vcpu)
+{
+	u32 vector, type, err;
+	u32 action;
+
+	if (vcpu->arch.exception.injected) {
+		vector = vcpu->arch.exception.nr;
+		err = vcpu->arch.exception.error_code;
+
+		if (kvm_exception_is_soft(vector))
+			type = INTR_TYPE_SOFT_EXCEPTION;
+		else
+			type = INTR_TYPE_HARD_EXCEPTION;
+	} else if (vcpu->arch.interrupt.injected) {
+		vector = vcpu->arch.interrupt.nr;
+		err = 0;
+
+		if (vcpu->arch.interrupt.soft)
+			type = INTR_TYPE_SOFT_INTR;
+		else
+			type = INTR_TYPE_EXT_INTR;
+	} else {
+		vector = 0;
+		type = 0;
+		err = 0;
+	}
+
+	action = kvmi_send_trap(vcpu, vector, type, err, vcpu->arch.cr2);
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		kvmi_handle_common_event_actions(vcpu, action, "TRAP");
+	}
+}
+
 int kvmi_arch_cmd_get_cpuid(struct kvm_vcpu *vcpu,
 			    const struct kvmi_get_cpuid *req,
 			    struct kvmi_get_cpuid_reply *rpl)
@@ -241,6 +323,32 @@ int kvmi_arch_cmd_get_vcpu_info(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+static bool is_vector_valid(u8 vector)
+{
+	return true;
+}
+
+static bool is_gva_valid(struct kvm_vcpu *vcpu, u64 gva)
+{
+	return true;
+}
+
+int kvmi_arch_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
+				   bool error_code_valid,
+				   u32 error_code, u64 address)
+{
+	if (!(is_vector_valid(vector) && is_gva_valid(vcpu, address)))
+		return -KVM_EINVAL;
+
+	IVCPU(vcpu)->exception.pending = true;
+	IVCPU(vcpu)->exception.nr = vector;
+	IVCPU(vcpu)->exception.error_code = error_code_valid ? error_code : 0;
+	IVCPU(vcpu)->exception.error_code_valid = error_code_valid;
+	IVCPU(vcpu)->exception.address = address;
+
+	return 0;
+}
+
 static const struct {
 	unsigned int allow_bit;
 	enum kvm_page_track_mode track_mode;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 62d15bbb2332..e38c0b95a0e7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7930,6 +7930,17 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		}
 	}
 
+	if (!kvmi_queue_exception(vcpu))
+		kvmi_trap_event(vcpu);
+	else if (vcpu->arch.exception.pending) {
+		kvm_x86_ops->queue_exception(vcpu);
+		if (exception_type(vcpu->arch.exception.nr) == EXCPT_FAULT)
+			__kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) |
+						X86_EFLAGS_RF);
+		vcpu->arch.exception.pending = false;
+		vcpu->arch.exception.injected = true;
+	}
+
 	r = kvm_mmu_reload(vcpu);
 	if (unlikely(r)) {
 		goto cancel_injection;
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
index 80c15b9195e4..5ae02c64fb33 100644
--- a/include/linux/kvmi.h
+++ b/include/linux/kvmi.h
@@ -16,6 +16,8 @@ int kvmi_ioctl_event(struct kvm *kvm, void __user *argp);
 int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset);
 int kvmi_vcpu_init(struct kvm_vcpu *vcpu);
 void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu);
+bool kvmi_queue_exception(struct kvm_vcpu *vcpu);
+void kvmi_trap_event(struct kvm_vcpu *vcpu);
 void kvmi_handle_requests(struct kvm_vcpu *vcpu);
 void kvmi_init_emulate(struct kvm_vcpu *vcpu);
 void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu);
@@ -29,6 +31,8 @@ static inline void kvmi_destroy_vm(struct kvm *kvm) { }
 static inline int kvmi_vcpu_init(struct kvm_vcpu *vcpu) { return 0; }
 static inline void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu) { }
 static inline void kvmi_handle_requests(struct kvm_vcpu *vcpu) { }
+static inline bool kvmi_queue_exception(struct kvm_vcpu *vcpu) { return true; }
+static inline void kvmi_trap_event(struct kvm_vcpu *vcpu) { }
 static inline void kvmi_init_emulate(struct kvm_vcpu *vcpu) { }
 static inline void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu) { }
 
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index ca9c6b6aeed5..a4583de5c2f6 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -215,6 +215,14 @@ struct kvmi_vcpu_hdr {
 	__u32 padding2;
 };
 
+struct kvmi_inject_exception {
+	__u8 nr;
+	__u8 has_error;
+	__u16 padding;
+	__u32 error_code;
+	__u64 address;
+};
+
 struct kvmi_event {
 	__u16 size;
 	__u16 vcpu;
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index a20891d3a2ce..e3f308898a60 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -1058,6 +1058,46 @@ void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL(kvmi_activate_rep_complete);
 
+/*
+ * This function returns false if there is an exception or interrupt pending.
+ * It returns true in all other cases including KVMI not being initialized.
+ */
+bool kvmi_queue_exception(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm;
+	bool ret = true;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return true;
+
+	if (!IVCPU(vcpu)->exception.pending)
+		goto out;
+
+	ret = kvmi_arch_queue_exception(vcpu);
+
+	memset(&IVCPU(vcpu)->exception, 0, sizeof(IVCPU(vcpu)->exception));
+
+out:
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
+void kvmi_trap_event(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return;
+
+	if (is_event_enabled(vcpu, KVMI_EVENT_TRAP))
+		kvmi_arch_trap_event(vcpu);
+
+	kvmi_put(vcpu->kvm);
+}
+
 static bool __kvmi_create_vcpu_event(struct kvm_vcpu *vcpu)
 {
 	u32 action;
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 22508d147495..2eadeb6efde8 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -105,6 +105,14 @@ struct kvmi_vcpu {
 	bool reply_waiting;
 	struct kvmi_vcpu_reply reply;
 
+	struct {
+		u8 nr;
+		u32 error_code;
+		bool error_code_valid;
+		u64 address;
+		bool pending;
+	} exception;
+
 	bool have_delayed_regs;
 	struct kvm_regs delayed_regs;
 
@@ -165,6 +173,9 @@ bool kvmi_sock_get(struct kvmi *ikvm, int fd);
 void kvmi_sock_shutdown(struct kvmi *ikvm);
 void kvmi_sock_put(struct kvmi *ikvm);
 bool kvmi_msg_process(struct kvmi *ikvm);
+int kvmi_send_event(struct kvm_vcpu *vcpu, u32 ev_id,
+		    void *ev, size_t ev_size,
+		    void *rpl, size_t rpl_size, int *action);
 u32 kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u8 access,
 		     bool *singlestep, bool *rep_complete,
 		     u64 *ctx_addr, u8 *ctx, u32 *ctx_size);
@@ -230,10 +241,15 @@ int kvmi_arch_cmd_set_page_write_bitmap(struct kvmi *ikvm,
 void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev);
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access);
+bool kvmi_arch_queue_exception(struct kvm_vcpu *vcpu);
+void kvmi_arch_trap_event(struct kvm_vcpu *vcpu);
 int kvmi_arch_cmd_get_cpuid(struct kvm_vcpu *vcpu,
 			    const struct kvmi_get_cpuid *req,
 			    struct kvmi_get_cpuid_reply *rpl);
 int kvmi_arch_cmd_get_vcpu_info(struct kvm_vcpu *vcpu,
 				struct kvmi_get_vcpu_info_reply *rpl);
+int kvmi_arch_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
+				   bool error_code_valid, u32 error_code,
+				   u64 address);
 
 #endif
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 9548042de618..e80d28dbb061 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -36,6 +36,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_GET_REGISTERS]         = "KVMI_GET_REGISTERS",
 	[KVMI_GET_VCPU_INFO]         = "KVMI_GET_VCPU_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
+	[KVMI_INJECT_EXCEPTION]      = "KVMI_INJECT_EXCEPTION",
 	[KVMI_PAUSE_VCPU]            = "KVMI_PAUSE_VCPU",
 	[KVMI_READ_PHYSICAL]         = "KVMI_READ_PHYSICAL",
 	[KVMI_SET_PAGE_ACCESS]       = "KVMI_SET_PAGE_ACCESS",
@@ -620,6 +621,25 @@ static int handle_set_registers(struct kvm_vcpu *vcpu,
 	return reply_cb(vcpu, msg, err, NULL, 0);
 }
 
+static int handle_inject_exception(struct kvm_vcpu *vcpu,
+				   const struct kvmi_msg_hdr *msg,
+				   const void *_req,
+				   vcpu_reply_fct reply_cb)
+{
+	const struct kvmi_inject_exception *req = _req;
+	int ec;
+
+	if (req->padding)
+		ec = -KVM_EINVAL;
+	else
+		ec = kvmi_arch_cmd_inject_exception(vcpu, req->nr,
+						    req->has_error,
+						    req->error_code,
+						    req->address);
+
+	return reply_cb(vcpu, msg, ec, NULL, 0);
+}
+
 static int handle_control_events(struct kvm_vcpu *vcpu,
 				 const struct kvmi_msg_hdr *msg,
 				 const void *_req,
@@ -670,6 +690,7 @@ static int(*const msg_vcpu[])(struct kvm_vcpu *,
 	[KVMI_GET_CPUID]        = handle_get_cpuid,
 	[KVMI_GET_REGISTERS]    = handle_get_registers,
 	[KVMI_GET_VCPU_INFO]    = handle_get_vcpu_info,
+	[KVMI_INJECT_EXCEPTION] = handle_inject_exception,
 	[KVMI_SET_REGISTERS]    = handle_set_registers,
 };
 

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 54/92] kvm: introspection: add KVMI_CONTROL_CR and KVMI_EVENT_CR
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (52 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 53/92] kvm: introspection: add KVMI_INJECT_EXCEPTION + KVMI_EVENT_TRAP Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR Adalbert Lazăr
                   ` (40 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

Using the KVMI_CONTROL_CR command, the introspection tool subscribes to
KVMI_EVENT_CR events that will be sent when CR{0,3,4} is going to
be changed.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 70 ++++++++++++++++++++++
 arch/x86/include/asm/kvm_host.h    |  2 +
 arch/x86/include/asm/kvmi_host.h   | 16 +++++
 arch/x86/include/uapi/asm/kvmi.h   | 18 ++++++
 arch/x86/kvm/kvmi.c                | 95 ++++++++++++++++++++++++++++++
 arch/x86/kvm/svm.c                 |  5 ++
 arch/x86/kvm/vmx/vmx.c             | 14 +++++
 arch/x86/kvm/x86.c                 | 19 +++++-
 virt/kvm/kvmi_int.h                |  7 +++
 virt/kvm/kvmi_msg.c                | 13 ++++
 10 files changed, 258 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 1eaed7c61148..2e6e285c8e2e 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -1007,6 +1007,41 @@ order to be notified if the expection was not delivered.
 * -KVM_EINVAL - padding is not zero
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 
+21. KVMI_CONTROL_CR
+-------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_control_cr {
+		__u8 enable;
+		__u8 padding1;
+		__u16 padding2;
+		__u32 cr;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Enables/disables introspection for a specific control register and must
+be used in addition to *KVMI_CONTROL_EVENTS* with the *KVMI_EVENT_CR*
+ID set.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified control register is not part of the CR0, CR3
+   or CR4 set
+* -KVM_EINVAL - padding is not zero
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+
 Events
 ======
 
@@ -1238,3 +1273,38 @@ introspection has been enabled for this event (see *KVMI_CONTROL_EVENTS*).
 ``kvmi_event``, exception/interrupt number (vector), exception/interrupt
 type, exception code (``error_code``) and CR2 are sent to the introspector.
 
+6. KVMI_EVENT_CR
+----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_cr {
+		__u16 cr;
+		__u16 padding[3];
+		__u64 old_value;
+		__u64 new_value;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_event_reply;
+	struct kvmi_event_cr_reply {
+		__u64 new_val;
+	};
+
+This event is sent when a control register is going to be changed and the
+introspection has been enabled for this event and for this specific
+register (see **KVMI_CONTROL_EVENTS**).
+
+``kvmi_event``, the control register number, the old value and the new value
+are sent to the introspector. The *CONTINUE* action will set the ``new_val``.
+
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7ee6e1ff5ee9..22f08f2732cc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1013,6 +1013,7 @@ struct kvm_x86_ops {
 	bool (*has_emulated_msr)(int index);
 	void (*cpuid_update)(struct kvm_vcpu *vcpu);
 
+	void (*cr3_write_exiting)(struct kvm_vcpu *vcpu, bool enable);
 	bool (*nested_pagefault)(struct kvm_vcpu *vcpu);
 	bool (*spt_fault)(struct kvm_vcpu *vcpu);
 
@@ -1622,5 +1623,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 
 bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu);
 bool kvm_spt_fault(struct kvm_vcpu *vcpu);
+void kvm_control_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/include/asm/kvmi_host.h b/arch/x86/include/asm/kvmi_host.h
index 7ab6dd71a0c2..83a098dc8939 100644
--- a/arch/x86/include/asm/kvmi_host.h
+++ b/arch/x86/include/asm/kvmi_host.h
@@ -9,4 +9,20 @@ struct kvmi_arch_mem_access {
 	unsigned long active[KVM_PAGE_TRACK_MAX][BITS_TO_LONGS(KVM_MEM_SLOTS_NUM)];
 };
 
+#ifdef CONFIG_KVM_INTROSPECTION
+
+bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
+		   unsigned long old_value, unsigned long *new_value);
+
+#else /* CONFIG_KVM_INTROSPECTION */
+
+static inline bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
+				 unsigned long old_value,
+				 unsigned long *new_value)
+{
+	return true;
+}
+
+#endif /* CONFIG_KVM_INTROSPECTION */
+
 #endif /* _ASM_X86_KVMI_HOST_H */
diff --git a/arch/x86/include/uapi/asm/kvmi.h b/arch/x86/include/uapi/asm/kvmi.h
index b074ad735e84..c983b4bd2c72 100644
--- a/arch/x86/include/uapi/asm/kvmi.h
+++ b/arch/x86/include/uapi/asm/kvmi.h
@@ -61,4 +61,22 @@ struct kvmi_get_cpuid_reply {
 	__u32 edx;
 };
 
+struct kvmi_control_cr {
+	__u8 enable;
+	__u8 padding1;
+	__u16 padding2;
+	__u32 cr;
+};
+
+struct kvmi_event_cr {
+	__u16 cr;
+	__u16 padding[3];
+	__u64 old_value;
+	__u64 new_value;
+};
+
+struct kvmi_event_cr_reply {
+	__u64 new_val;
+};
+
 #endif /* _UAPI_ASM_X86_KVMI_H */
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 8c18030d12f4..b3cab0db6a70 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -171,6 +171,72 @@ void kvmi_arch_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev)
 	kvmi_get_msrs(vcpu, event);
 }
 
+static u32 kvmi_send_cr(struct kvm_vcpu *vcpu, u32 cr, u64 old_value,
+			u64 new_value, u64 *ret_value)
+{
+	struct kvmi_event_cr e = {
+		.cr = cr,
+		.old_value = old_value,
+		.new_value = new_value
+	};
+	struct kvmi_event_cr_reply r;
+	int err, action;
+
+	err = kvmi_send_event(vcpu, KVMI_EVENT_CR, &e, sizeof(e),
+			      &r, sizeof(r), &action);
+	if (err) {
+		*ret_value = new_value;
+		return KVMI_EVENT_ACTION_CONTINUE;
+	}
+
+	*ret_value = r.new_val;
+	return action;
+}
+
+static bool __kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
+			    unsigned long old_value, unsigned long *new_value)
+{
+	u64 ret_value;
+	u32 action;
+	bool ret = false;
+
+	if (!test_bit(cr, IVCPU(vcpu)->cr_mask))
+		return true;
+
+	action = kvmi_send_cr(vcpu, cr, old_value, *new_value, &ret_value);
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		*new_value = ret_value;
+		ret = true;
+		break;
+	default:
+		kvmi_handle_common_event_actions(vcpu, action, "CR");
+	}
+
+	return ret;
+}
+
+bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
+		   unsigned long old_value, unsigned long *new_value)
+{
+	struct kvmi *ikvm;
+	bool ret = true;
+
+	if (old_value == *new_value)
+		return true;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return true;
+
+	if (is_event_enabled(vcpu, KVMI_EVENT_CR))
+		ret = __kvmi_cr_event(vcpu, cr, old_value, new_value);
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access)
 {
@@ -349,6 +415,35 @@ int kvmi_arch_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
 	return 0;
 }
 
+int kvmi_arch_cmd_control_cr(struct kvm_vcpu *vcpu,
+			     const struct kvmi_control_cr *req)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	u32 cr = req->cr;
+
+	if (req->padding1 || req->padding2)
+		return -KVM_EINVAL;
+
+	switch (cr) {
+	case 0:
+		break;
+	case 3:
+		kvm_control_cr3_write_exiting(vcpu, req->enable);
+		break;
+	case 4:
+		break;
+	default:
+		return -KVM_EINVAL;
+	}
+
+	if (req->enable)
+		set_bit(cr, ivcpu->cr_mask);
+	else
+		clear_bit(cr, ivcpu->cr_mask);
+
+	return 0;
+}
+
 static const struct {
 	unsigned int allow_bit;
 	enum kvm_page_track_mode track_mode;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 6b533698c73d..fc78b0052dee 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -7110,6 +7110,10 @@ static bool svm_spt_fault(struct kvm_vcpu *vcpu)
 	return (svm->vmcb->control.exit_code == SVM_EXIT_NPF);
 }
 
+static void svm_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable)
+{
+}
+
 static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -7121,6 +7125,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr,
 	.has_emulated_msr = svm_has_emulated_msr,
 
+	.cr3_write_exiting = svm_cr3_write_exiting,
 	.nested_pagefault = svm_nested_pagefault,
 	.spt_fault = svm_spt_fault,
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5d4b61aaff9a..6450c8c44771 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7784,6 +7784,19 @@ static __exit void hardware_unsetup(void)
 	free_kvm_area();
 }
 
+static void vmx_cr3_write_exiting(struct kvm_vcpu *vcpu,
+					 bool enable)
+{
+	if (enable)
+		vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL,
+				CPU_BASED_CR3_LOAD_EXITING);
+	else
+		vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
+				CPU_BASED_CR3_LOAD_EXITING);
+
+	/* TODO: nested ? vmcs12->cpu_based_vm_exec_control */
+}
+
 static bool vmx_nested_pagefault(struct kvm_vcpu *vcpu)
 {
 	if (vcpu->arch.exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)
@@ -7831,6 +7844,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_accelerated_tpr = report_flexpriority,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
+	.cr3_write_exiting = vmx_cr3_write_exiting,
 	.nested_pagefault = vmx_nested_pagefault,
 	.spt_fault = vmx_spt_fault,
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e38c0b95a0e7..2cd146ccc6ff 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -22,6 +22,7 @@
 #include <linux/kvm_host.h>
 #include <uapi/linux/kvmi.h>
 #include <linux/kvmi.h>
+#include <asm/kvmi_host.h>
 #include "irq.h"
 #include "mmu.h"
 #include "i8254.h"
@@ -777,6 +778,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	if (!(cr0 & X86_CR0_PG) && kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))
 		return 1;
 
+	if (!kvmi_cr_event(vcpu, 0, old_cr0, &cr0))
+		return 1;
+
 	kvm_x86_ops->set_cr0(vcpu, cr0);
 
 	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
@@ -921,6 +925,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 			return 1;
 	}
 
+	if (!kvmi_cr_event(vcpu, 4, old_cr4, &cr4))
+		return 1;
+
 	if (kvm_x86_ops->set_cr4(vcpu, cr4))
 		return 1;
 
@@ -937,6 +944,7 @@ EXPORT_SYMBOL_GPL(kvm_set_cr4);
 
 int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 {
+	unsigned long old_cr3 = kvm_read_cr3(vcpu);
 	bool skip_tlb_flush = false;
 #ifdef CONFIG_X86_64
 	bool pcid_enabled = kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE);
@@ -947,7 +955,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 	}
 #endif
 
-	if (cr3 == kvm_read_cr3(vcpu) && !pdptrs_changed(vcpu)) {
+	if (cr3 == old_cr3 && !pdptrs_changed(vcpu)) {
 		if (!skip_tlb_flush) {
 			kvm_mmu_sync_roots(vcpu);
 			kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
@@ -962,6 +970,9 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		   !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
 		return 1;
 
+	if (!kvmi_cr_event(vcpu, 3, old_cr3, &cr3))
+		return 1;
+
 	kvm_mmu_new_cr3(vcpu, cr3, skip_tlb_flush);
 	vcpu->arch.cr3 = cr3;
 	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
@@ -10072,6 +10083,12 @@ bool kvm_vector_hashing_enabled(void)
 }
 EXPORT_SYMBOL_GPL(kvm_vector_hashing_enabled);
 
+void kvm_control_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable)
+{
+	kvm_x86_ops->cr3_write_exiting(vcpu, enable);
+}
+EXPORT_SYMBOL(kvm_control_cr3_write_exiting);
+
 bool kvm_spt_fault(struct kvm_vcpu *vcpu)
 {
 	return kvm_x86_ops->spt_fault(vcpu);
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 2eadeb6efde8..c92be3c2c131 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -26,6 +26,8 @@
 
 #define IVCPU(vcpu) ((struct kvmi_vcpu *)((vcpu)->kvmi))
 
+#define KVMI_NUM_CR 9
+
 #define KVMI_CTX_DATA_SIZE FIELD_SIZEOF(struct kvmi_event_pf_reply, ctx_data)
 
 #define KVMI_MSG_SIZE_ALLOC (sizeof(struct kvmi_msg_hdr) + KVMI_MSG_SIZE)
@@ -117,6 +119,7 @@ struct kvmi_vcpu {
 	struct kvm_regs delayed_regs;
 
 	DECLARE_BITMAP(ev_mask, KVMI_NUM_EVENTS);
+	DECLARE_BITMAP(cr_mask, KVMI_NUM_CR);
 
 	struct list_head job_list;
 	spinlock_t job_lock;
@@ -204,6 +207,8 @@ int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
 			       bool enable);
 int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu, bool wait);
+struct kvmi * __must_check kvmi_get(struct kvm *kvm);
+void kvmi_put(struct kvm *kvm);
 int kvmi_run_jobs_and_wait(struct kvm_vcpu *vcpu);
 void kvmi_post_reply(struct kvm_vcpu *vcpu);
 int kvmi_add_job(struct kvm_vcpu *vcpu,
@@ -251,5 +256,7 @@ int kvmi_arch_cmd_get_vcpu_info(struct kvm_vcpu *vcpu,
 int kvmi_arch_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
 				   bool error_code_valid, u32 error_code,
 				   u64 address);
+int kvmi_arch_cmd_control_cr(struct kvm_vcpu *vcpu,
+			     const struct kvmi_control_cr *req);
 
 #endif
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index e80d28dbb061..d4f5459722bb 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -24,6 +24,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_CHECK_COMMAND]         = "KVMI_CHECK_COMMAND",
 	[KVMI_CHECK_EVENT]           = "KVMI_CHECK_EVENT",
 	[KVMI_CONTROL_CMD_RESPONSE]  = "KVMI_CONTROL_CMD_RESPONSE",
+	[KVMI_CONTROL_CR]            = "KVMI_CONTROL_CR",
 	[KVMI_CONTROL_EVENTS]        = "KVMI_CONTROL_EVENTS",
 	[KVMI_CONTROL_SPP]           = "KVMI_CONTROL_SPP",
 	[KVMI_CONTROL_VM_EVENTS]     = "KVMI_CONTROL_VM_EVENTS",
@@ -662,6 +663,17 @@ static int handle_control_events(struct kvm_vcpu *vcpu,
 	return reply_cb(vcpu, msg, ec, NULL, 0);
 }
 
+static int handle_control_cr(struct kvm_vcpu *vcpu,
+			     const struct kvmi_msg_hdr *msg, const void *req,
+			     vcpu_reply_fct reply_cb)
+{
+	int ec;
+
+	ec = kvmi_arch_cmd_control_cr(vcpu, req);
+
+	return reply_cb(vcpu, msg, ec, NULL, 0);
+}
+
 static int handle_get_cpuid(struct kvm_vcpu *vcpu,
 			    const struct kvmi_msg_hdr *msg,
 			    const void *req, vcpu_reply_fct reply_cb)
@@ -685,6 +697,7 @@ static int handle_get_cpuid(struct kvm_vcpu *vcpu,
 static int(*const msg_vcpu[])(struct kvm_vcpu *,
 			      const struct kvmi_msg_hdr *, const void *,
 			      vcpu_reply_fct) = {
+	[KVMI_CONTROL_CR]       = handle_control_cr,
 	[KVMI_CONTROL_EVENTS]   = handle_control_events,
 	[KVMI_EVENT_REPLY]      = handle_event_reply,
 	[KVMI_GET_CPUID]        = handle_get_cpuid,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (53 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 54/92] kvm: introspection: add KVMI_CONTROL_CR and KVMI_EVENT_CR Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-12 21:05   ` Sean Christopherson
  2019-08-19 18:52   ` Sean Christopherson
  2019-08-09 16:00 ` [RFC PATCH v6 56/92] kvm: x86: block any attempt to disable MSR interception if tracked by introspection Adalbert Lazăr
                   ` (39 subsequent siblings)
  94 siblings, 2 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

The KVMI_CONTROL_MSR is used to enable/disable introspection for a
specific MSR. The KVMI_EVENT_MSR is send when the tracked MSR is going
to be changed. The introspection tool can respond by allowing the guest
to continue with normal execution or by discarding the change.

This is meant to prevent malicious changes to MSR-s
such as MSR_IA32_SYSENTER_EIP.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst |  73 +++++++++++++++++
 arch/x86/include/asm/kvm_host.h    |   4 +
 arch/x86/include/asm/kvmi_host.h   |   6 ++
 arch/x86/include/uapi/asm/kvmi.h   |  18 ++++
 arch/x86/kvm/kvmi.c                | 127 +++++++++++++++++++++++++++++
 arch/x86/kvm/svm.c                 |  15 ++++
 arch/x86/kvm/vmx/vmx.c             |  10 +++
 arch/x86/kvm/x86.c                 |  10 +++
 virt/kvm/kvmi_int.h                |   8 +-
 virt/kvm/kvmi_msg.c                |  13 +++
 10 files changed, 283 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 2e6e285c8e2e..c41c3edb0134 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -1042,6 +1042,45 @@ ID set.
 * -KVM_EINVAL - padding is not zero
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 
+22. KVMI_CONTROL_MSR
+--------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_control_msr {
+		__u8 enable;
+		__u8 padding1;
+		__u16 padding2;
+		__u32 msr;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Enables/disables introspection for a specific MSR and must be used
+in addition to *KVMI_CONTROL_EVENTS* with the *KVMI_EVENT_MSR* ID set.
+
+Currently, only MSRs within the following two ranges are supported. Trying
+to control events for any other register will fail with -KVM_EINVAL::
+
+	0          ... 0x00001fff
+	0xc0000000 ... 0xc0001fff
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified MSR is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+
 Events
 ======
 
@@ -1308,3 +1347,37 @@ register (see **KVMI_CONTROL_EVENTS**).
 ``kvmi_event``, the control register number, the old value and the new value
 are sent to the introspector. The *CONTINUE* action will set the ``new_val``.
 
+7. KVMI_EVENT_MSR
+-----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_msr {
+		__u32 msr;
+		__u32 padding;
+		__u64 old_value;
+		__u64 new_value;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_event_reply;
+	struct kvmi_event_msr_reply {
+		__u64 new_val;
+	};
+
+This event is sent when a model specific register is going to be changed
+and the introspection has been enabled for this event and for this specific
+register (see **KVMI_CONTROL_EVENTS**).
+
+``kvmi_event``, the MSR number, the old value and the new value are
+sent to the introspector. The *CONTINUE* action will set the ``new_val``.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 22f08f2732cc..91cd43a7a7bf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1013,6 +1013,8 @@ struct kvm_x86_ops {
 	bool (*has_emulated_msr)(int index);
 	void (*cpuid_update)(struct kvm_vcpu *vcpu);
 
+	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr,
+				bool enable);
 	void (*cr3_write_exiting)(struct kvm_vcpu *vcpu, bool enable);
 	bool (*nested_pagefault)(struct kvm_vcpu *vcpu);
 	bool (*spt_fault)(struct kvm_vcpu *vcpu);
@@ -1621,6 +1623,8 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 #define put_smstate(type, buf, offset, val)                      \
 	*(type *)((buf) + (offset) - 0x7e00) = val
 
+void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
+				bool enable);
 bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu);
 bool kvm_spt_fault(struct kvm_vcpu *vcpu);
 void kvm_control_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable);
diff --git a/arch/x86/include/asm/kvmi_host.h b/arch/x86/include/asm/kvmi_host.h
index 83a098dc8939..8285d1eb0db6 100644
--- a/arch/x86/include/asm/kvmi_host.h
+++ b/arch/x86/include/asm/kvmi_host.h
@@ -11,11 +11,17 @@ struct kvmi_arch_mem_access {
 
 #ifdef CONFIG_KVM_INTROSPECTION
 
+bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr);
 bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
 		   unsigned long old_value, unsigned long *new_value);
 
 #else /* CONFIG_KVM_INTROSPECTION */
 
+static inline bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+	return true;
+}
+
 static inline bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
 				 unsigned long old_value,
 				 unsigned long *new_value)
diff --git a/arch/x86/include/uapi/asm/kvmi.h b/arch/x86/include/uapi/asm/kvmi.h
index c983b4bd2c72..08af2eccbdfb 100644
--- a/arch/x86/include/uapi/asm/kvmi.h
+++ b/arch/x86/include/uapi/asm/kvmi.h
@@ -79,4 +79,22 @@ struct kvmi_event_cr_reply {
 	__u64 new_val;
 };
 
+struct kvmi_control_msr {
+	__u8 enable;
+	__u8 padding1;
+	__u16 padding2;
+	__u32 msr;
+};
+
+struct kvmi_event_msr {
+	__u32 msr;
+	__u32 padding;
+	__u64 old_value;
+	__u64 new_value;
+};
+
+struct kvmi_event_msr_reply {
+	__u64 new_val;
+};
+
 #endif /* _UAPI_ASM_X86_KVMI_H */
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index b3cab0db6a70..5dba4f87afef 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -9,6 +9,133 @@
 #include <asm/vmx.h>
 #include "../../../virt/kvm/kvmi_int.h"
 
+static unsigned long *msr_mask(struct kvm_vcpu *vcpu, unsigned int *msr)
+{
+	switch (*msr) {
+	case 0 ... 0x1fff:
+		return IVCPU(vcpu)->msr_mask.low;
+	case 0xc0000000 ... 0xc0001fff:
+		*msr &= 0x1fff;
+		return IVCPU(vcpu)->msr_mask.high;
+	}
+
+	return NULL;
+}
+
+static bool test_msr_mask(struct kvm_vcpu *vcpu, unsigned int msr)
+{
+	unsigned long *mask = msr_mask(vcpu, &msr);
+
+	if (!mask)
+		return false;
+	if (!test_bit(msr, mask))
+		return false;
+
+	return true;
+}
+
+static int msr_control(struct kvm_vcpu *vcpu, unsigned int msr, bool enable)
+{
+	unsigned long *mask = msr_mask(vcpu, &msr);
+
+	if (!mask)
+		return -KVM_EINVAL;
+	if (enable)
+		set_bit(msr, mask);
+	else
+		clear_bit(msr, mask);
+	return 0;
+}
+
+int kvmi_arch_cmd_control_msr(struct kvm_vcpu *vcpu,
+			      const struct kvmi_control_msr *req)
+{
+	int err;
+
+	if (req->padding1 || req->padding2)
+		return -KVM_EINVAL;
+
+	err = msr_control(vcpu, req->msr, req->enable);
+
+	if (!err && req->enable)
+		kvm_arch_msr_intercept(vcpu, req->msr, req->enable);
+
+	return err;
+}
+
+static u32 kvmi_send_msr(struct kvm_vcpu *vcpu, u32 msr, u64 old_value,
+			 u64 new_value, u64 *ret_value)
+{
+	struct kvmi_event_msr e = {
+		.msr = msr,
+		.old_value = old_value,
+		.new_value = new_value,
+	};
+	struct kvmi_event_msr_reply r;
+	int err, action;
+
+	err = kvmi_send_event(vcpu, KVMI_EVENT_MSR, &e, sizeof(e),
+			      &r, sizeof(r), &action);
+	if (err) {
+		*ret_value = new_value;
+		return KVMI_EVENT_ACTION_CONTINUE;
+	}
+
+	*ret_value = r.new_val;
+	return action;
+}
+
+static bool __kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+	struct msr_data old_msr = {
+		.host_initiated = true,
+		.index = msr->index,
+	};
+	bool ret = false;
+	u64 ret_value;
+	u32 action;
+
+	if (!test_msr_mask(vcpu, msr->index))
+		return true;
+	if (kvm_get_msr(vcpu, &old_msr))
+		return true;
+	if (old_msr.data == msr->data)
+		return true;
+
+	action = kvmi_send_msr(vcpu, msr->index, old_msr.data, msr->data,
+			       &ret_value);
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		msr->data = ret_value;
+		ret = true;
+		break;
+	default:
+		kvmi_handle_common_event_actions(vcpu, action, "MSR");
+	}
+
+	return ret;
+}
+
+bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+	struct kvmi *ikvm;
+	bool ret = true;
+
+	if (msr->host_initiated)
+		return true;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return true;
+
+	if (is_event_enabled(vcpu, KVMI_EVENT_MSR))
+		ret = __kvmi_msr_event(vcpu, msr);
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
 static void *alloc_get_registers_reply(const struct kvmi_msg_hdr *msg,
 				       const struct kvmi_get_registers *req,
 				       size_t *rpl_size)
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index fc78b0052dee..cdb315578979 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -7098,6 +7098,20 @@ static int nested_enable_evmcs(struct kvm_vcpu *vcpu,
 	return -ENODEV;
 }
 
+static void svm_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
+				bool enable)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	u32 *msrpm = svm->msrpm;
+
+	/*
+	 * The below code enable or disable the msr interception for both
+	 * read and write. The best way will be to get here the current
+	 * bit status for read and send that value as argument.
+	 */
+	set_msr_interception(msrpm, msr, enable, enable);
+}
+
 static bool svm_nested_pagefault(struct kvm_vcpu *vcpu)
 {
 	return false;
@@ -7126,6 +7140,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.has_emulated_msr = svm_has_emulated_msr,
 
 	.cr3_write_exiting = svm_cr3_write_exiting,
+	.msr_intercept = svm_msr_intercept,
 	.nested_pagefault = svm_nested_pagefault,
 	.spt_fault = svm_spt_fault,
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6450c8c44771..0306c7ef3158 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7784,6 +7784,15 @@ static __exit void hardware_unsetup(void)
 	free_kvm_area();
 }
 
+static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
+			      bool enable)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
+
+	vmx_set_intercept_for_msr(msr_bitmap, msr, MSR_TYPE_W, enable);
+}
+
 static void vmx_cr3_write_exiting(struct kvm_vcpu *vcpu,
 					 bool enable)
 {
@@ -7844,6 +7853,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_accelerated_tpr = report_flexpriority,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
+	.msr_intercept = vmx_msr_intercept,
 	.cr3_write_exiting = vmx_cr3_write_exiting,
 	.nested_pagefault = vmx_nested_pagefault,
 	.spt_fault = vmx_spt_fault,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2cd146ccc6ff..ac027471c4f3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1319,6 +1319,9 @@ EXPORT_SYMBOL_GPL(kvm_enable_efer_bits);
  */
 int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
+	if (!kvmi_msr_event(vcpu, msr))
+		return 1;
+
 	switch (msr->index) {
 	case MSR_FS_BASE:
 	case MSR_GS_BASE:
@@ -10083,6 +10086,13 @@ bool kvm_vector_hashing_enabled(void)
 }
 EXPORT_SYMBOL_GPL(kvm_vector_hashing_enabled);
 
+void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
+				bool enable)
+{
+	kvm_x86_ops->msr_intercept(vcpu, msr, enable);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_msr_intercept);
+
 void kvm_control_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable)
 {
 	kvm_x86_ops->cr3_write_exiting(vcpu, enable);
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index c92be3c2c131..640a78b54947 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -27,7 +27,7 @@
 #define IVCPU(vcpu) ((struct kvmi_vcpu *)((vcpu)->kvmi))
 
 #define KVMI_NUM_CR 9
-
+#define KVMI_NUM_MSR 0x2000
 #define KVMI_CTX_DATA_SIZE FIELD_SIZEOF(struct kvmi_event_pf_reply, ctx_data)
 
 #define KVMI_MSG_SIZE_ALLOC (sizeof(struct kvmi_msg_hdr) + KVMI_MSG_SIZE)
@@ -120,6 +120,10 @@ struct kvmi_vcpu {
 
 	DECLARE_BITMAP(ev_mask, KVMI_NUM_EVENTS);
 	DECLARE_BITMAP(cr_mask, KVMI_NUM_CR);
+	struct {
+		DECLARE_BITMAP(low, KVMI_NUM_MSR);
+		DECLARE_BITMAP(high, KVMI_NUM_MSR);
+	} msr_mask;
 
 	struct list_head job_list;
 	spinlock_t job_lock;
@@ -258,5 +262,7 @@ int kvmi_arch_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
 				   u64 address);
 int kvmi_arch_cmd_control_cr(struct kvm_vcpu *vcpu,
 			     const struct kvmi_control_cr *req);
+int kvmi_arch_cmd_control_msr(struct kvm_vcpu *vcpu,
+			      const struct kvmi_control_msr *req);
 
 #endif
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index d4f5459722bb..8a8951f13f8e 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -26,6 +26,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_CONTROL_CMD_RESPONSE]  = "KVMI_CONTROL_CMD_RESPONSE",
 	[KVMI_CONTROL_CR]            = "KVMI_CONTROL_CR",
 	[KVMI_CONTROL_EVENTS]        = "KVMI_CONTROL_EVENTS",
+	[KVMI_CONTROL_MSR]           = "KVMI_CONTROL_MSR",
 	[KVMI_CONTROL_SPP]           = "KVMI_CONTROL_SPP",
 	[KVMI_CONTROL_VM_EVENTS]     = "KVMI_CONTROL_VM_EVENTS",
 	[KVMI_EVENT]                 = "KVMI_EVENT",
@@ -674,6 +675,17 @@ static int handle_control_cr(struct kvm_vcpu *vcpu,
 	return reply_cb(vcpu, msg, ec, NULL, 0);
 }
 
+static int handle_control_msr(struct kvm_vcpu *vcpu,
+			      const struct kvmi_msg_hdr *msg, const void *req,
+			      vcpu_reply_fct reply_cb)
+{
+	int ec;
+
+	ec = kvmi_arch_cmd_control_msr(vcpu, req);
+
+	return reply_cb(vcpu, msg, ec, NULL, 0);
+}
+
 static int handle_get_cpuid(struct kvm_vcpu *vcpu,
 			    const struct kvmi_msg_hdr *msg,
 			    const void *req, vcpu_reply_fct reply_cb)
@@ -699,6 +711,7 @@ static int(*const msg_vcpu[])(struct kvm_vcpu *,
 			      vcpu_reply_fct) = {
 	[KVMI_CONTROL_CR]       = handle_control_cr,
 	[KVMI_CONTROL_EVENTS]   = handle_control_events,
+	[KVMI_CONTROL_MSR]      = handle_control_msr,
 	[KVMI_EVENT_REPLY]      = handle_event_reply,
 	[KVMI_GET_CPUID]        = handle_get_cpuid,
 	[KVMI_GET_REGISTERS]    = handle_get_registers,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 56/92] kvm: x86: block any attempt to disable MSR interception if tracked by introspection
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (54 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 57/92] kvm: introspection: add KVMI_GET_XSAVE Adalbert Lazăr
                   ` (38 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu,
	Sean Christopherson, Jim Mattson, Joerg Roedel, Vitaly Kuznetsov

From: Nicușor Cîțu <ncitu@bitdefender.com>

Intercept all calls that might disable the MSR interception (writes) and
do nothing if that specific MSR is currently tracked by the introspection
tool.

CC: Sean Christopherson <sean.j.christopherson@intel.com>
CC: Jim Mattson <jmattson@google.com>
CC: Joerg Roedel <joro@8bytes.org>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvmi_host.h |  6 +++
 arch/x86/kvm/kvmi.c              | 25 +++++++++++++
 arch/x86/kvm/svm.c               | 33 ++++++++++-------
 arch/x86/kvm/vmx/vmx.c           | 63 +++++++++++++++++++-------------
 4 files changed, 88 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/kvmi_host.h b/arch/x86/include/asm/kvmi_host.h
index 8285d1eb0db6..86d90b7bed84 100644
--- a/arch/x86/include/asm/kvmi_host.h
+++ b/arch/x86/include/asm/kvmi_host.h
@@ -12,6 +12,7 @@ struct kvmi_arch_mem_access {
 #ifdef CONFIG_KVM_INTROSPECTION
 
 bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr);
+bool kvmi_monitored_msr(struct kvm_vcpu *vcpu, u32 msr);
 bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
 		   unsigned long old_value, unsigned long *new_value);
 
@@ -22,6 +23,11 @@ static inline bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	return true;
 }
 
+static inline bool kvmi_monitored_msr(struct kvm_vcpu *vcpu, u32 msr)
+{
+	return false;
+}
+
 static inline bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
 				 unsigned long old_value,
 				 unsigned long *new_value)
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 5dba4f87afef..fc6956b50da2 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -136,6 +136,31 @@ bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	return ret;
 }
 
+bool kvmi_monitored_msr(struct kvm_vcpu *vcpu, u32 msr)
+{
+	struct kvmi *ikvm;
+	bool ret = false;
+
+	if (!vcpu)
+		return false;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return false;
+
+	if (test_msr_mask(vcpu, msr)) {
+		kvmi_warn_once(ikvm,
+			       "Trying to disable write interception for MSR %x\n",
+			       msr);
+		ret = true;
+	}
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+EXPORT_SYMBOL(kvmi_monitored_msr);
+
 static void *alloc_get_registers_reply(const struct kvmi_msg_hdr *msg,
 				       const struct kvmi_get_registers *req,
 				       size_t *rpl_size)
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cdb315578979..e46a4c423545 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -18,6 +18,7 @@
 #define pr_fmt(fmt) "SVM: " fmt
 
 #include <linux/kvm_host.h>
+#include <asm/kvmi_host.h>
 
 #include "irq.h"
 #include "mmu.h"
@@ -1049,13 +1050,19 @@ static bool msr_write_intercepted(struct kvm_vcpu *vcpu, unsigned msr)
 	return !!test_bit(bit_write,  &tmp);
 }
 
-static void set_msr_interception(u32 *msrpm, unsigned msr,
+static void set_msr_interception(struct vcpu_svm *svm,
+				 u32 *msrpm, unsigned int msr,
 				 int read, int write)
 {
 	u8 bit_read, bit_write;
 	unsigned long tmp;
 	u32 offset;
 
+#ifdef CONFIG_KVM_INTROSPECTION
+	if (!write && kvmi_monitored_msr(&svm->vcpu, msr))
+		return;
+#endif /* CONFIG_KVM_INTROSPECTION */
+
 	/*
 	 * If this warning triggers extend the direct_access_msrs list at the
 	 * beginning of the file
@@ -1085,7 +1092,7 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
 		if (!direct_access_msrs[i].always)
 			continue;
 
-		set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
+		set_msr_interception(NULL, msrpm, direct_access_msrs[i].index, 1, 1);
 	}
 }
 
@@ -1137,10 +1144,10 @@ static void svm_enable_lbrv(struct vcpu_svm *svm)
 	u32 *msrpm = svm->msrpm;
 
 	svm->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK;
-	set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 1, 1);
-	set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1);
-	set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
-	set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
+	set_msr_interception(svm, msrpm, MSR_IA32_LASTBRANCHFROMIP, 1, 1);
+	set_msr_interception(svm, msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1);
+	set_msr_interception(svm, msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
+	set_msr_interception(svm, msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
 }
 
 static void svm_disable_lbrv(struct vcpu_svm *svm)
@@ -1148,10 +1155,10 @@ static void svm_disable_lbrv(struct vcpu_svm *svm)
 	u32 *msrpm = svm->msrpm;
 
 	svm->vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK;
-	set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 0, 0);
-	set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0);
-	set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 0, 0);
-	set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 0, 0);
+	set_msr_interception(svm, msrpm, MSR_IA32_LASTBRANCHFROMIP, 0, 0);
+	set_msr_interception(svm, msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0);
+	set_msr_interception(svm, msrpm, MSR_IA32_LASTINTFROMIP, 0, 0);
+	set_msr_interception(svm, msrpm, MSR_IA32_LASTINTTOIP, 0, 0);
 }
 
 static void disable_nmi_singlestep(struct vcpu_svm *svm)
@@ -4290,7 +4297,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 		 * We update the L1 MSR bit as well since it will end up
 		 * touching the MSR anyway now.
 		 */
-		set_msr_interception(svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
+		set_msr_interception(svm, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
 		break;
 	case MSR_IA32_PRED_CMD:
 		if (!msr->host_initiated &&
@@ -4306,7 +4313,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
 		if (is_guest_mode(vcpu))
 			break;
-		set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1);
+		set_msr_interception(svm, svm->msrpm, MSR_IA32_PRED_CMD, 0, 1);
 		break;
 	case MSR_AMD64_VIRT_SPEC_CTRL:
 		if (!msr->host_initiated &&
@@ -7109,7 +7116,7 @@ static void svm_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
 	 * read and write. The best way will be to get here the current
 	 * bit status for read and send that value as argument.
 	 */
-	set_msr_interception(msrpm, msr, enable, enable);
+	set_msr_interception(svm, msrpm, msr, enable, enable);
 }
 
 static bool svm_nested_pagefault(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0306c7ef3158..fff41adcdffe 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -21,6 +21,7 @@
 #include <linux/hrtimer.h>
 #include <linux/kernel.h>
 #include <linux/kvm_host.h>
+#include <asm/kvmi_host.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/mod_devicetable.h>
@@ -336,7 +337,8 @@ module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);
 
 static bool guest_state_valid(struct kvm_vcpu *vcpu);
 static u32 vmx_segment_access_rights(struct kvm_segment *var);
-static __always_inline void vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
+static __always_inline void vmx_disable_intercept_for_msr(struct kvm_vcpu *vcpu,
+							  unsigned long *msr_bitmap,
 							  u32 msr, int type);
 
 void vmx_vmexit(void);
@@ -1862,7 +1864,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		 * in the merging. We update the vmcs01 here for L1 as well
 		 * since it will end up touching the MSR anyway now.
 		 */
-		vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
+		vmx_disable_intercept_for_msr(vcpu, vmx->vmcs01.msr_bitmap,
 					      MSR_IA32_SPEC_CTRL,
 					      MSR_TYPE_RW);
 		break;
@@ -1890,7 +1892,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		 * vmcs02.msr_bitmap here since it gets completely overwritten
 		 * in the merging.
 		 */
-		vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
+		vmx_disable_intercept_for_msr(vcpu, vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
 					      MSR_TYPE_W);
 		break;
 	case MSR_IA32_ARCH_CAPABILITIES:
@@ -3463,7 +3465,8 @@ void free_vpid(int vpid)
 	spin_unlock(&vmx_vpid_lock);
 }
 
-static __always_inline void vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
+static __always_inline void vmx_disable_intercept_for_msr(struct kvm_vcpu *vcpu,
+							  unsigned long *msr_bitmap,
 							  u32 msr, int type)
 {
 	int f = sizeof(unsigned long);
@@ -3471,6 +3474,11 @@ static __always_inline void vmx_disable_intercept_for_msr(unsigned long *msr_bit
 	if (!cpu_has_vmx_msr_bitmap())
 		return;
 
+#ifdef CONFIG_KVM_INTROSPECTION
+	if ((type & MSR_TYPE_W) && kvmi_monitored_msr(vcpu, msr))
+		return;
+#endif /* CONFIG_KVM_INTROSPECTION */
+
 	if (static_branch_unlikely(&enable_evmcs))
 		evmcs_touch_msr_bitmap();
 
@@ -3539,13 +3547,14 @@ static __always_inline void vmx_enable_intercept_for_msr(unsigned long *msr_bitm
 	}
 }
 
-static __always_inline void vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
-			     			      u32 msr, int type, bool value)
+static __always_inline void vmx_set_intercept_for_msr(struct kvm_vcpu *vcpu,
+						      unsigned long *msr_bitmap,
+						      u32 msr, int type, bool value)
 {
 	if (value)
 		vmx_enable_intercept_for_msr(msr_bitmap, msr, type);
 	else
-		vmx_disable_intercept_for_msr(msr_bitmap, msr, type);
+		vmx_disable_intercept_for_msr(vcpu, msr_bitmap, msr, type);
 }
 
 static u8 vmx_msr_bitmap_mode(struct kvm_vcpu *vcpu)
@@ -3563,7 +3572,8 @@ static u8 vmx_msr_bitmap_mode(struct kvm_vcpu *vcpu)
 	return mode;
 }
 
-static void vmx_update_msr_bitmap_x2apic(unsigned long *msr_bitmap,
+static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu,
+					 unsigned long *msr_bitmap,
 					 u8 mode)
 {
 	int msr;
@@ -3579,11 +3589,11 @@ static void vmx_update_msr_bitmap_x2apic(unsigned long *msr_bitmap,
 		 * TPR reads and writes can be virtualized even if virtual interrupt
 		 * delivery is not in use.
 		 */
-		vmx_disable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_TASKPRI), MSR_TYPE_RW);
+		vmx_disable_intercept_for_msr(vcpu, msr_bitmap, X2APIC_MSR(APIC_TASKPRI), MSR_TYPE_RW);
 		if (mode & MSR_BITMAP_MODE_X2APIC_APICV) {
 			vmx_enable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_R);
-			vmx_disable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
-			vmx_disable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
+			vmx_disable_intercept_for_msr(vcpu, msr_bitmap, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
+			vmx_disable_intercept_for_msr(vcpu, msr_bitmap, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
 		}
 	}
 }
@@ -3599,29 +3609,30 @@ void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu)
 		return;
 
 	if (changed & (MSR_BITMAP_MODE_X2APIC | MSR_BITMAP_MODE_X2APIC_APICV))
-		vmx_update_msr_bitmap_x2apic(msr_bitmap, mode);
+		vmx_update_msr_bitmap_x2apic(vcpu, msr_bitmap, mode);
 
 	vmx->msr_bitmap_mode = mode;
 }
 
 void pt_update_intercept_for_msr(struct vcpu_vmx *vmx)
 {
+	struct kvm_vcpu *vcpu = &vmx->vcpu;
 	unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
 	bool flag = !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN);
 	u32 i;
 
-	vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_RTIT_STATUS,
+	vmx_set_intercept_for_msr(vcpu, msr_bitmap, MSR_IA32_RTIT_STATUS,
 							MSR_TYPE_RW, flag);
-	vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_RTIT_OUTPUT_BASE,
+	vmx_set_intercept_for_msr(vcpu, msr_bitmap, MSR_IA32_RTIT_OUTPUT_BASE,
 							MSR_TYPE_RW, flag);
-	vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_RTIT_OUTPUT_MASK,
+	vmx_set_intercept_for_msr(vcpu, msr_bitmap, MSR_IA32_RTIT_OUTPUT_MASK,
 							MSR_TYPE_RW, flag);
-	vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_RTIT_CR3_MATCH,
+	vmx_set_intercept_for_msr(vcpu, msr_bitmap, MSR_IA32_RTIT_CR3_MATCH,
 							MSR_TYPE_RW, flag);
 	for (i = 0; i < vmx->pt_desc.addr_range; i++) {
-		vmx_set_intercept_for_msr(msr_bitmap,
+		vmx_set_intercept_for_msr(vcpu, msr_bitmap,
 			MSR_IA32_RTIT_ADDR0_A + i * 2, MSR_TYPE_RW, flag);
-		vmx_set_intercept_for_msr(msr_bitmap,
+		vmx_set_intercept_for_msr(vcpu, msr_bitmap,
 			MSR_IA32_RTIT_ADDR0_B + i * 2, MSR_TYPE_RW, flag);
 	}
 }
@@ -6823,13 +6834,13 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 		goto free_msrs;
 
 	msr_bitmap = vmx->vmcs01.msr_bitmap;
-	vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_TSC, MSR_TYPE_R);
-	vmx_disable_intercept_for_msr(msr_bitmap, MSR_FS_BASE, MSR_TYPE_RW);
-	vmx_disable_intercept_for_msr(msr_bitmap, MSR_GS_BASE, MSR_TYPE_RW);
-	vmx_disable_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
-	vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_CS, MSR_TYPE_RW);
-	vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_ESP, MSR_TYPE_RW);
-	vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_EIP, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_IA32_TSC, MSR_TYPE_R);
+	vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_FS_BASE, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_GS_BASE, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_IA32_SYSENTER_CS, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_IA32_SYSENTER_ESP, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(NULL, msr_bitmap, MSR_IA32_SYSENTER_EIP, MSR_TYPE_RW);
 	vmx->msr_bitmap_mode = 0;
 
 	vmx->loaded_vmcs = &vmx->vmcs01;
@@ -7790,7 +7801,7 @@ static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
 
-	vmx_set_intercept_for_msr(msr_bitmap, msr, MSR_TYPE_W, enable);
+	vmx_set_intercept_for_msr(vcpu, msr_bitmap, msr, MSR_TYPE_W, enable);
 }
 
 static void vmx_cr3_write_exiting(struct kvm_vcpu *vcpu,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 57/92] kvm: introspection: add KVMI_GET_XSAVE
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (55 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 56/92] kvm: x86: block any attempt to disable MSR interception if tracked by introspection Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 58/92] kvm: introspection: add KVMI_GET_MTRR_TYPE Adalbert Lazăr
                   ` (37 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This vCPU command is used to get the XSAVE area.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 31 ++++++++++++++++++++++++++++++
 arch/x86/include/uapi/asm/kvmi.h   |  4 ++++
 arch/x86/kvm/kvmi.c                | 21 ++++++++++++++++++++
 arch/x86/kvm/x86.c                 |  4 ++--
 include/linux/kvm_host.h           |  2 ++
 virt/kvm/kvmi_int.h                |  3 +++
 virt/kvm/kvmi_msg.c                | 17 ++++++++++++++++
 7 files changed, 80 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index c41c3edb0134..c43ea1b33a51 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -1081,6 +1081,37 @@ to control events for any other register will fail with -KVM_EINVAL::
 * -KVM_EINVAL - padding is not zero
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 
+23. KVMI_GET_XSAVE
+------------------
+
+:Architecture: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_vcpu_hdr;
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_xsave_reply {
+		__u32 region[0];
+	};
+
+Returns a buffer containing the XSAVE area. Currently, the size of
+``kvm_xsave`` is used, but it could change. The userspace should get
+the buffer size from the message size.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_ENOMEM - not enough memory to allocate the reply
+
 Events
 ======
 
diff --git a/arch/x86/include/uapi/asm/kvmi.h b/arch/x86/include/uapi/asm/kvmi.h
index 08af2eccbdfb..a3fcb1ef8404 100644
--- a/arch/x86/include/uapi/asm/kvmi.h
+++ b/arch/x86/include/uapi/asm/kvmi.h
@@ -97,4 +97,8 @@ struct kvmi_event_msr_reply {
 	__u64 new_val;
 };
 
+struct kvmi_get_xsave_reply {
+	__u32 region[0];
+};
+
 #endif /* _UAPI_ASM_X86_KVMI_H */
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index fc6956b50da2..078d714b59d5 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -790,3 +790,24 @@ int kvmi_arch_cmd_control_spp(struct kvmi *ikvm)
 {
 	return kvm_arch_init_spp(ikvm->kvm);
 }
+
+int kvmi_arch_cmd_get_xsave(struct kvm_vcpu *vcpu,
+			    struct kvmi_get_xsave_reply **dest,
+			    size_t *dest_size)
+{
+	struct kvmi_get_xsave_reply *rpl = NULL;
+	size_t rpl_size = sizeof(*rpl) + sizeof(struct kvm_xsave);
+	struct kvm_xsave *area;
+
+	rpl = kvmi_msg_alloc_check(rpl_size);
+	if (!rpl)
+		return -KVM_ENOMEM;
+
+	area = (struct kvm_xsave *) &rpl->region[0];
+	kvm_vcpu_ioctl_x86_get_xsave(vcpu, area);
+
+	*dest = rpl;
+	*dest_size = rpl_size;
+
+	return 0;
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac027471c4f3..05ff23180355 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3745,8 +3745,8 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
 	}
 }
 
-static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
-					 struct kvm_xsave *guest_xsave)
+void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
+				  struct kvm_xsave *guest_xsave)
 {
 	if (boot_cpu_has(X86_FEATURE_XSAVE)) {
 		memset(guest_xsave, 0, sizeof(struct kvm_xsave));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c8eb1a4d997f..3aad3b96107b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -805,6 +805,8 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 					struct kvm_guest_debug *dbg);
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run);
+void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
+				  struct kvm_xsave *guest_xsave);
 
 int kvm_arch_init(void *opaque);
 void kvm_arch_exit(void);
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 640a78b54947..1a705cba4776 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -255,6 +255,9 @@ void kvmi_arch_trap_event(struct kvm_vcpu *vcpu);
 int kvmi_arch_cmd_get_cpuid(struct kvm_vcpu *vcpu,
 			    const struct kvmi_get_cpuid *req,
 			    struct kvmi_get_cpuid_reply *rpl);
+int kvmi_arch_cmd_get_xsave(struct kvm_vcpu *vcpu,
+			    struct kvmi_get_xsave_reply **dest,
+			    size_t *dest_size);
 int kvmi_arch_cmd_get_vcpu_info(struct kvm_vcpu *vcpu,
 				struct kvmi_get_vcpu_info_reply *rpl);
 int kvmi_arch_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 8a8951f13f8e..6bc18b7973cf 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -38,6 +38,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_GET_REGISTERS]         = "KVMI_GET_REGISTERS",
 	[KVMI_GET_VCPU_INFO]         = "KVMI_GET_VCPU_INFO",
 	[KVMI_GET_VERSION]           = "KVMI_GET_VERSION",
+	[KVMI_GET_XSAVE]             = "KVMI_GET_XSAVE",
 	[KVMI_INJECT_EXCEPTION]      = "KVMI_INJECT_EXCEPTION",
 	[KVMI_PAUSE_VCPU]            = "KVMI_PAUSE_VCPU",
 	[KVMI_READ_PHYSICAL]         = "KVMI_READ_PHYSICAL",
@@ -700,6 +701,21 @@ static int handle_get_cpuid(struct kvm_vcpu *vcpu,
 	return reply_cb(vcpu, msg, ec, &rpl, sizeof(rpl));
 }
 
+static int handle_get_xsave(struct kvm_vcpu *vcpu,
+			    const struct kvmi_msg_hdr *msg, const void *req,
+			    vcpu_reply_fct reply_cb)
+{
+	struct kvmi_get_xsave_reply *rpl = NULL;
+	size_t rpl_size = 0;
+	int err, ec;
+
+	ec = kvmi_arch_cmd_get_xsave(vcpu, &rpl, &rpl_size);
+
+	err = reply_cb(vcpu, msg, ec, rpl, rpl_size);
+	kvmi_msg_free(rpl);
+	return err;
+}
+
 /*
  * These commands are executed on the vCPU thread. The receiving thread
  * passes the messages using a newly allocated 'struct kvmi_vcpu_cmd'
@@ -716,6 +732,7 @@ static int(*const msg_vcpu[])(struct kvm_vcpu *,
 	[KVMI_GET_CPUID]        = handle_get_cpuid,
 	[KVMI_GET_REGISTERS]    = handle_get_registers,
 	[KVMI_GET_VCPU_INFO]    = handle_get_vcpu_info,
+	[KVMI_GET_XSAVE]        = handle_get_xsave,
 	[KVMI_INJECT_EXCEPTION] = handle_inject_exception,
 	[KVMI_SET_REGISTERS]    = handle_set_registers,
 };

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 58/92] kvm: introspection: add KVMI_GET_MTRR_TYPE
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (56 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 57/92] kvm: introspection: add KVMI_GET_XSAVE Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 59/92] kvm: introspection: add KVMI_EVENT_XSETBV Adalbert Lazăr
                   ` (36 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu

From: Mihai Donțu <mdontu@bitdefender.com>

This command returns the memory type for a guest physical address.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 32 ++++++++++++++++++++++++++++++
 arch/x86/include/uapi/asm/kvmi.h   |  9 +++++++++
 arch/x86/kvm/kvmi.c                |  7 +++++++
 virt/kvm/kvmi_int.h                |  1 +
 virt/kvm/kvmi_msg.c                | 17 ++++++++++++++++
 5 files changed, 66 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index c43ea1b33a51..e58f0e22f188 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -1112,6 +1112,38 @@ the buffer size from the message size.
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 * -KVM_ENOMEM - not enough memory to allocate the reply
 
+24. KVMI_GET_MTRR_TYPE
+----------------------
+
+:Architecture: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_get_mtrr_type {
+		__u64 gpa;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_mtrr_type_reply {
+		__u8 type;
+		__u8 padding[7];
+	};
+
+Returns the guest memory type for a specific physical address.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - padding is not zero
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+
 Events
 ======
 
diff --git a/arch/x86/include/uapi/asm/kvmi.h b/arch/x86/include/uapi/asm/kvmi.h
index a3fcb1ef8404..c3c96e6e2a26 100644
--- a/arch/x86/include/uapi/asm/kvmi.h
+++ b/arch/x86/include/uapi/asm/kvmi.h
@@ -101,4 +101,13 @@ struct kvmi_get_xsave_reply {
 	__u32 region[0];
 };
 
+struct kvmi_get_mtrr_type {
+	__u64 gpa;
+};
+
+struct kvmi_get_mtrr_type_reply {
+	__u8 type;
+	__u8 padding[7];
+};
+
 #endif /* _UAPI_ASM_X86_KVMI_H */
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 078d714b59d5..0114ed66f4f3 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -811,3 +811,10 @@ int kvmi_arch_cmd_get_xsave(struct kvm_vcpu *vcpu,
 
 	return 0;
 }
+
+int kvmi_arch_cmd_get_mtrr_type(struct kvm_vcpu *vcpu, u64 gpa, u8 *type)
+{
+	*type = kvm_mtrr_get_guest_memory_type(vcpu, gpa_to_gfn(gpa));
+
+	return 0;
+}
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 1a705cba4776..ac2e13787f01 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -267,5 +267,6 @@ int kvmi_arch_cmd_control_cr(struct kvm_vcpu *vcpu,
 			     const struct kvmi_control_cr *req);
 int kvmi_arch_cmd_control_msr(struct kvm_vcpu *vcpu,
 			      const struct kvmi_control_msr *req);
+int kvmi_arch_cmd_get_mtrr_type(struct kvm_vcpu *vcpu, u64 gpa, u8 *type);
 
 #endif
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 6bc18b7973cf..ee54d92b07ec 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -33,6 +33,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_EVENT_REPLY]           = "KVMI_EVENT_REPLY",
 	[KVMI_GET_CPUID]             = "KVMI_GET_CPUID",
 	[KVMI_GET_GUEST_INFO]        = "KVMI_GET_GUEST_INFO",
+	[KVMI_GET_MTRR_TYPE]         = "KVMI_GET_MTRR_TYPE",
 	[KVMI_GET_PAGE_ACCESS]       = "KVMI_GET_PAGE_ACCESS",
 	[KVMI_GET_PAGE_WRITE_BITMAP] = "KVMI_GET_PAGE_WRITE_BITMAP",
 	[KVMI_GET_REGISTERS]         = "KVMI_GET_REGISTERS",
@@ -701,6 +702,21 @@ static int handle_get_cpuid(struct kvm_vcpu *vcpu,
 	return reply_cb(vcpu, msg, ec, &rpl, sizeof(rpl));
 }
 
+static int handle_get_mtrr_type(struct kvm_vcpu *vcpu,
+				const struct kvmi_msg_hdr *msg,
+				const void *_req, vcpu_reply_fct reply_cb)
+{
+	const struct kvmi_get_mtrr_type *req = _req;
+	struct kvmi_get_mtrr_type_reply rpl;
+	int ec;
+
+	memset(&rpl, 0, sizeof(rpl));
+
+	ec = kvmi_arch_cmd_get_mtrr_type(vcpu, req->gpa, &rpl.type);
+
+	return reply_cb(vcpu, msg, ec, &rpl, sizeof(rpl));
+}
+
 static int handle_get_xsave(struct kvm_vcpu *vcpu,
 			    const struct kvmi_msg_hdr *msg, const void *req,
 			    vcpu_reply_fct reply_cb)
@@ -730,6 +746,7 @@ static int(*const msg_vcpu[])(struct kvm_vcpu *,
 	[KVMI_CONTROL_MSR]      = handle_control_msr,
 	[KVMI_EVENT_REPLY]      = handle_event_reply,
 	[KVMI_GET_CPUID]        = handle_get_cpuid,
+	[KVMI_GET_MTRR_TYPE]    = handle_get_mtrr_type,
 	[KVMI_GET_REGISTERS]    = handle_get_registers,
 	[KVMI_GET_VCPU_INFO]    = handle_get_vcpu_info,
 	[KVMI_GET_XSAVE]        = handle_get_xsave,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 59/92] kvm: introspection: add KVMI_EVENT_XSETBV
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (57 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 58/92] kvm: introspection: add KVMI_GET_MTRR_TYPE Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 60/92] kvm: x86: add kvm_arch_vcpu_set_guest_debug() Adalbert Lazăr
                   ` (35 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This event is sent when the extended control register XCR0 is going to
be changed.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 25 +++++++++++++++++++
 arch/x86/include/asm/kvmi_host.h   |  5 ++++
 arch/x86/kvm/kvmi.c                | 39 ++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c                 |  5 ++++
 4 files changed, 74 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index e58f0e22f188..1d2431639770 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -1444,3 +1444,28 @@ register (see **KVMI_CONTROL_EVENTS**).
 
 ``kvmi_event``, the MSR number, the old value and the new value are
 sent to the introspector. The *CONTINUE* action will set the ``new_val``.
+
+8. KVMI_EVENT_XSETBV
+--------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+
+:Returns:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_event_reply;
+
+This event is sent when the extended control register XCR0 is going
+to be changed and the introspection has been enabled for this event
+(see *KVMI_CONTROL_EVENTS*).
+
+``kvmi_event`` is sent to the introspector.
diff --git a/arch/x86/include/asm/kvmi_host.h b/arch/x86/include/asm/kvmi_host.h
index 86d90b7bed84..3f066e7feee2 100644
--- a/arch/x86/include/asm/kvmi_host.h
+++ b/arch/x86/include/asm/kvmi_host.h
@@ -15,6 +15,7 @@ bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr);
 bool kvmi_monitored_msr(struct kvm_vcpu *vcpu, u32 msr);
 bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
 		   unsigned long old_value, unsigned long *new_value);
+void kvmi_xsetbv_event(struct kvm_vcpu *vcpu);
 
 #else /* CONFIG_KVM_INTROSPECTION */
 
@@ -35,6 +36,10 @@ static inline bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
 	return true;
 }
 
+static inline void kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
+{
+}
+
 #endif /* CONFIG_KVM_INTROSPECTION */
 
 #endif /* _ASM_X86_KVMI_HOST_H */
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 0114ed66f4f3..0e9c91d2f282 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -389,6 +389,45 @@ bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
 	return ret;
 }
 
+static u32 kvmi_send_xsetbv(struct kvm_vcpu *vcpu)
+{
+	int err, action;
+
+	err = kvmi_send_event(vcpu, KVMI_EVENT_XSETBV, NULL, 0,
+			      NULL, 0, &action);
+	if (err)
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return action;
+}
+
+static void __kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+
+	action = kvmi_send_xsetbv(vcpu);
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		kvmi_handle_common_event_actions(vcpu, action, "XSETBV");
+	}
+}
+
+void kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return;
+
+	if (is_event_enabled(vcpu, KVMI_EVENT_XSETBV))
+		__kvmi_xsetbv_event(vcpu);
+
+	kvmi_put(vcpu->kvm);
+}
+
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 05ff23180355..278a286ba262 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -868,6 +868,11 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 
 int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 {
+#ifdef CONFIG_KVM_INTROSPECTION
+	if (xcr != vcpu->arch.xcr0)
+		kvmi_xsetbv_event(vcpu);
+#endif /* CONFIG_KVM_INTROSPECTION */
+
 	if (kvm_x86_ops->get_cpl(vcpu) != 0 ||
 	    __kvm_set_xcr(vcpu, index, xcr)) {
 		kvm_inject_gp(vcpu, 0);

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 60/92] kvm: x86: add kvm_arch_vcpu_set_guest_debug()
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (58 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 59/92] kvm: introspection: add KVMI_EVENT_XSETBV Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 61/92] kvm: introspection: add KVMI_EVENT_BREAKPOINT Adalbert Lazăr
                   ` (34 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

This function is need in order to intercept breakpoints and send
KVMI_EVENT_BREAKPOINT events to the introspection tool.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/x86.c       | 18 +++++++++++++-----
 include/linux/kvm_host.h |  2 ++
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 278a286ba262..e633f297e86d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8747,14 +8747,12 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 	return ret;
 }
 
-int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
-					struct kvm_guest_debug *dbg)
+int kvm_arch_vcpu_set_guest_debug(struct kvm_vcpu *vcpu,
+				  struct kvm_guest_debug *dbg)
 {
 	unsigned long rflags;
 	int i, r;
 
-	vcpu_load(vcpu);
-
 	if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {
 		r = -EBUSY;
 		if (vcpu->arch.exception.pending)
@@ -8800,10 +8798,20 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 	r = 0;
 
 out:
-	vcpu_put(vcpu);
 	return r;
 }
 
+int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
+					struct kvm_guest_debug *dbg)
+{
+	int ret;
+
+	vcpu_load(vcpu);
+	ret = kvm_arch_vcpu_set_guest_debug(vcpu, dbg);
+	vcpu_put(vcpu);
+	return ret;
+}
+
 /*
  * Translate a guest virtual address to a guest physical address.
  */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3aad3b96107b..691c24598b4d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -804,6 +804,8 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 				    struct kvm_mp_state *mp_state);
 int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 					struct kvm_guest_debug *dbg);
+int kvm_arch_vcpu_set_guest_debug(struct kvm_vcpu *vcpu,
+				  struct kvm_guest_debug *dbg);
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run);
 void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
 				  struct kvm_xsave *guest_xsave);

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 61/92] kvm: introspection: add KVMI_EVENT_BREAKPOINT
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (59 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 60/92] kvm: x86: add kvm_arch_vcpu_set_guest_debug() Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 62/92] kvm: introspection: add KVMI_EVENT_HYPERCALL Adalbert Lazăr
                   ` (33 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu

From: Mihai Donțu <mdontu@bitdefender.com>

This event is sent when a breakpoint was reached. It has to
be enabled with the KVMI_CONTROL_EVENTS command first.

The introspection tool can place breakpoints and use them as notification
for when the OS or an application has reached a certain state or is
trying to perform a certain operation (like creating a process).

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 36 +++++++++++++
 arch/x86/kvm/kvmi.c                | 20 +++++++
 arch/x86/kvm/svm.c                 |  6 +++
 arch/x86/kvm/vmx/vmx.c             | 17 ++++--
 arch/x86/kvm/x86.c                 | 12 +++++
 include/linux/kvm_host.h           |  2 +
 include/linux/kvmi.h               |  7 +++
 include/uapi/linux/kvmi.h          |  6 +++
 virt/kvm/kvmi.c                    | 84 ++++++++++++++++++++++++++++--
 virt/kvm/kvmi_int.h                |  3 ++
 virt/kvm/kvmi_msg.c                | 17 ++++++
 11 files changed, 201 insertions(+), 9 deletions(-)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 1d2431639770..da216415bf32 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -1469,3 +1469,39 @@ to be changed and the introspection has been enabled for this event
 (see *KVMI_CONTROL_EVENTS*).
 
 ``kvmi_event`` is sent to the introspector.
+
+9. KVMI_EVENT_BREAKPOINT
+------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH, RETRY
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_breakpoint {
+		__u64 gpa;
+		__u8 insn_len;
+		__u8 padding[7];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_event_reply;
+
+This event is sent when a breakpoint was reached and the introspection has
+been enabled for this event (see *KVMI_CONTROL_EVENTS*).
+
+Some of these breakpoints could have been injected by the introspector,
+placed in the slack space of various functions and used as notification
+for when the OS or an application has reached a certain state or is
+trying to perform a certain operation (like creating a process).
+
+``kvmi_event`` and the guest physical address are sent to the introspector.
+
+The *RETRY* action is used by the introspector for its own breakpoints.
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 0e9c91d2f282..e998223bca1e 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -428,6 +428,26 @@ void kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
 	kvmi_put(vcpu->kvm);
 }
 
+void kvmi_arch_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva, u8 insn_len)
+{
+	u32 action;
+	u64 gpa;
+
+	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
+
+	action = kvmi_msg_send_bp(vcpu, gpa, insn_len);
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		kvm_arch_queue_bp(vcpu);
+		break;
+	case KVMI_EVENT_ACTION_RETRY:
+		/* rip was most likely adjusted past the INT 3 instruction */
+		break;
+	default:
+		kvmi_handle_common_event_actions(vcpu, action, "BP");
+	}
+}
+
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access)
 {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index e46a4c423545..b4e59ef040b7 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -18,6 +18,7 @@
 #define pr_fmt(fmt) "SVM: " fmt
 
 #include <linux/kvm_host.h>
+#include <linux/kvmi.h>
 #include <asm/kvmi_host.h>
 
 #include "irq.h"
@@ -2722,6 +2723,11 @@ static int bp_interception(struct vcpu_svm *svm)
 {
 	struct kvm_run *kvm_run = svm->vcpu.run;
 
+	if (!kvmi_breakpoint_event(&svm->vcpu,
+		svm->vmcb->save.cs.base + svm->vmcb->save.rip,
+		svm->vmcb->control.insn_len))
+		return 1;
+
 	kvm_run->exit_reason = KVM_EXIT_DEBUG;
 	kvm_run->debug.arch.pc = svm->vmcb->save.cs.base + svm->vmcb->save.rip;
 	kvm_run->debug.arch.exception = BP_VECTOR;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fff41adcdffe..d560b583bf30 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -22,6 +22,7 @@
 #include <linux/kernel.h>
 #include <linux/kvm_host.h>
 #include <asm/kvmi_host.h>
+#include <linux/kvmi.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/mod_devicetable.h>
@@ -4484,7 +4485,7 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct kvm_run *kvm_run = vcpu->run;
 	u32 intr_info, ex_no, error_code;
-	unsigned long cr2, rip, dr6;
+	unsigned long cr2, dr6;
 	u32 vect_info;
 	enum emulation_result er;
 
@@ -4562,7 +4563,10 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 		kvm_run->debug.arch.dr6 = dr6 | DR6_FIXED_1;
 		kvm_run->debug.arch.dr7 = vmcs_readl(GUEST_DR7);
 		/* fall through */
-	case BP_VECTOR:
+	case BP_VECTOR: {
+		unsigned long gva = vmcs_readl(GUEST_CS_BASE) +
+			kvm_rip_read(vcpu);
+
 		/*
 		 * Update instruction length as we may reinject #BP from
 		 * user space while in guest debugging mode. Reading it for
@@ -4570,11 +4574,16 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 		 */
 		vmx->vcpu.arch.event_exit_inst_len =
 			vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+
+		if (!kvmi_breakpoint_event(vcpu, gva,
+					   vmx->vcpu.arch.event_exit_inst_len))
+			return 1;
+
 		kvm_run->exit_reason = KVM_EXIT_DEBUG;
-		rip = kvm_rip_read(vcpu);
-		kvm_run->debug.arch.pc = vmcs_readl(GUEST_CS_BASE) + rip;
+		kvm_run->debug.arch.pc = gva;
 		kvm_run->debug.arch.exception = ex_no;
 		break;
+	}
 	default:
 		kvm_run->exit_reason = KVM_EXIT_EXCEPTION;
 		kvm_run->ex.exception = ex_no;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e633f297e86d..a9da8ac0d2b3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8763,6 +8763,13 @@ int kvm_arch_vcpu_set_guest_debug(struct kvm_vcpu *vcpu,
 			kvm_queue_exception(vcpu, BP_VECTOR);
 	}
 
+#ifdef CONFIG_KVM_INTROSPECTION
+	if (kvmi_bp_intercepted(vcpu, dbg->control)) {
+		r = -EBUSY;
+		goto out;
+	}
+#endif
+
 	/*
 	 * Read rflags as long as potentially injected trace flags are still
 	 * filtered out.
@@ -10106,6 +10113,11 @@ void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
 }
 EXPORT_SYMBOL_GPL(kvm_arch_msr_intercept);
 
+void kvm_arch_queue_bp(struct kvm_vcpu *vcpu)
+{
+	kvm_queue_exception(vcpu, BP_VECTOR);
+}
+
 void kvm_control_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable)
 {
 	kvm_x86_ops->cr3_write_exiting(vcpu, enable);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 691c24598b4d..b77914e944a4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1330,4 +1330,6 @@ static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE */
 
+void kvm_arch_queue_bp(struct kvm_vcpu *vcpu);
+
 #endif
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
index 5ae02c64fb33..13b58b3202bb 100644
--- a/include/linux/kvmi.h
+++ b/include/linux/kvmi.h
@@ -16,11 +16,13 @@ int kvmi_ioctl_event(struct kvm *kvm, void __user *argp);
 int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset);
 int kvmi_vcpu_init(struct kvm_vcpu *vcpu);
 void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu);
+bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva, u8 insn_len);
 bool kvmi_queue_exception(struct kvm_vcpu *vcpu);
 void kvmi_trap_event(struct kvm_vcpu *vcpu);
 void kvmi_handle_requests(struct kvm_vcpu *vcpu);
 void kvmi_init_emulate(struct kvm_vcpu *vcpu);
 void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu);
+bool kvmi_bp_intercepted(struct kvm_vcpu *vcpu, u32 dbg);
 
 #else
 
@@ -29,12 +31,17 @@ static inline void kvmi_uninit(void) { }
 static inline void kvmi_create_vm(struct kvm *kvm) { }
 static inline void kvmi_destroy_vm(struct kvm *kvm) { }
 static inline int kvmi_vcpu_init(struct kvm_vcpu *vcpu) { return 0; }
+static inline bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva,
+					 u8 insn_len)
+			{ return true; }
 static inline void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu) { }
 static inline void kvmi_handle_requests(struct kvm_vcpu *vcpu) { }
 static inline bool kvmi_queue_exception(struct kvm_vcpu *vcpu) { return true; }
 static inline void kvmi_trap_event(struct kvm_vcpu *vcpu) { }
 static inline void kvmi_init_emulate(struct kvm_vcpu *vcpu) { }
 static inline void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu) { }
+static inline bool kvmi_bp_intercepted(struct kvm_vcpu *vcpu, u32 dbg)
+			{ return false; }
 
 #endif /* CONFIG_KVM_INTROSPECTION */
 
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index a4583de5c2f6..b072e0a4f33d 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -256,4 +256,10 @@ struct kvmi_event_pf_reply {
 	__u8 ctx_data[256];
 };
 
+struct kvmi_event_breakpoint {
+	__u64 gpa;
+	__u8 insn_len;
+	__u8 padding[7];
+};
+
 #endif /* _UAPI__LINUX_KVMI_H */
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index e3f308898a60..4c868a94ac37 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -219,6 +219,48 @@ static void kvmi_clear_mem_access(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
+static int kvmi_control_event_breakpoint(struct kvm_vcpu *vcpu, bool enable)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvm_guest_debug dbg = {};
+	int err = 0;
+
+	if (enable) {
+		if (!is_event_enabled(vcpu, KVMI_EVENT_BREAKPOINT)) {
+			dbg.control = KVM_GUESTDBG_ENABLE |
+				      KVM_GUESTDBG_USE_SW_BP;
+			ivcpu->bp_intercepted = true;
+			err = kvm_arch_vcpu_set_guest_debug(vcpu, &dbg);
+		}
+	} else if (is_event_enabled(vcpu, KVMI_EVENT_BREAKPOINT)) {
+		ivcpu->bp_intercepted = false;
+		err = kvm_arch_vcpu_set_guest_debug(vcpu, &dbg);
+	}
+
+	return err;
+}
+
+bool kvmi_bp_intercepted(struct kvm_vcpu *vcpu, u32 dbg)
+{
+	struct kvmi *ikvm;
+	bool ret = false;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return false;
+
+	if (IVCPU(vcpu)->bp_intercepted &&
+		!(dbg & (KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP))) {
+		kvmi_warn_once(ikvm, "Trying to disable SW BP interception\n");
+		ret = true;
+	}
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+EXPORT_SYMBOL(kvmi_bp_intercepted);
+
 static void kvmi_cache_destroy(void)
 {
 	kmem_cache_destroy(msg_cache);
@@ -1058,6 +1100,26 @@ void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL(kvmi_activate_rep_complete);
 
+bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva, u8 insn_len)
+{
+	struct kvmi *ikvm;
+	bool ret = false;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return true;
+
+	if (is_event_enabled(vcpu, KVMI_EVENT_BREAKPOINT))
+		kvmi_arch_breakpoint_event(vcpu, gva, insn_len);
+	else
+		ret = true;
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+EXPORT_SYMBOL(kvmi_breakpoint_event);
+
 /*
  * This function returns false if there is an exception or interrupt pending.
  * It returns true in all other cases including KVMI not being initialized.
@@ -1438,13 +1500,25 @@ int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable)
 {
 	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	int err;
 
-	if (enable)
-		set_bit(event_id, ivcpu->ev_mask);
-	else
-		clear_bit(event_id, ivcpu->ev_mask);
+	switch (event_id) {
+	case KVMI_EVENT_BREAKPOINT:
+		err = kvmi_control_event_breakpoint(vcpu, enable);
+		break;
+	default:
+		err = 0;
+		break;
+	}
 
-	return 0;
+	if (!err) {
+		if (enable)
+			set_bit(event_id, ivcpu->ev_mask);
+		else
+			clear_bit(event_id, ivcpu->ev_mask);
+	}
+
+	return err;
 }
 
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index ac2e13787f01..d039446922e6 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -118,6 +118,7 @@ struct kvmi_vcpu {
 	bool have_delayed_regs;
 	struct kvm_regs delayed_regs;
 
+	bool bp_intercepted;
 	DECLARE_BITMAP(ev_mask, KVMI_NUM_EVENTS);
 	DECLARE_BITMAP(cr_mask, KVMI_NUM_CR);
 	struct {
@@ -183,6 +184,7 @@ bool kvmi_msg_process(struct kvmi *ikvm);
 int kvmi_send_event(struct kvm_vcpu *vcpu, u32 ev_id,
 		    void *ev, size_t ev_size,
 		    void *rpl, size_t rpl_size, int *action);
+u32 kvmi_msg_send_bp(struct kvm_vcpu *vcpu, u64 gpa, u8 insn_len);
 u32 kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u8 access,
 		     bool *singlestep, bool *rep_complete,
 		     u64 *ctx_addr, u8 *ctx, u32 *ctx_size);
@@ -252,6 +254,7 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access);
 bool kvmi_arch_queue_exception(struct kvm_vcpu *vcpu);
 void kvmi_arch_trap_event(struct kvm_vcpu *vcpu);
+void kvmi_arch_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva, u8 insn_len);
 int kvmi_arch_cmd_get_cpuid(struct kvm_vcpu *vcpu,
 			    const struct kvmi_get_cpuid *req,
 			    struct kvmi_get_cpuid_reply *rpl);
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index ee54d92b07ec..c7a1fa5f7245 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -1079,6 +1079,23 @@ int kvmi_msg_send_unhook(struct kvmi *ikvm)
 	return kvmi_sock_write(ikvm, vec, n, msg_size);
 }
 
+u32 kvmi_msg_send_bp(struct kvm_vcpu *vcpu, u64 gpa, u8 insn_len)
+{
+	struct kvmi_event_breakpoint e;
+	int err, action;
+
+	memset(&e, 0, sizeof(e));
+	e.gpa = gpa;
+	e.insn_len = insn_len;
+
+	err = kvmi_send_event(vcpu, KVMI_EVENT_BREAKPOINT, &e, sizeof(e),
+			      NULL, 0, &action);
+	if (err)
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return action;
+}
+
 u32 kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u8 access,
 		     bool *singlestep, bool *rep_complete, u64 *ctx_addr,
 		     u8 *ctx_data, u32 *ctx_size)

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 62/92] kvm: introspection: add KVMI_EVENT_HYPERCALL
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (60 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 61/92] kvm: introspection: add KVMI_EVENT_BREAKPOINT Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 63/92] kvm: introspection: add KVMI_EVENT_DESCRIPTOR Adalbert Lazăr
                   ` (32 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This event is sent on a specific user hypercall.

It is used by the code residing inside the introspected guest to call the
introspection tool and to report certain details about its operation. For
example, a classic antimalware remediation tool can report what it has
found during a scan.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/hypercalls.txt | 34 +++++++++++++++++++++++-
 Documentation/virtual/kvm/kvmi.rst       | 31 +++++++++++++++++++++
 arch/x86/kvm/kvmi.c                      | 33 +++++++++++++++++++++++
 arch/x86/kvm/x86.c                       | 16 ++++++++---
 include/linux/kvmi.h                     |  2 ++
 include/uapi/linux/kvm_para.h            |  2 ++
 virt/kvm/kvmi.c                          | 22 +++++++++++++++
 virt/kvm/kvmi_int.h                      |  3 +++
 virt/kvm/kvmi_msg.c                      | 12 +++++++++
 9 files changed, 151 insertions(+), 4 deletions(-)

diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt
index da24c138c8d1..1ab59537b2fb 100644
--- a/Documentation/virtual/kvm/hypercalls.txt
+++ b/Documentation/virtual/kvm/hypercalls.txt
@@ -122,7 +122,7 @@ compute the CLOCK_REALTIME for its clock, at the same instant.
 Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
 or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
 
-6. KVM_HC_SEND_IPI
+7. KVM_HC_SEND_IPI
 ------------------------
 Architecture: x86
 Status: active
@@ -141,3 +141,35 @@ a0 corresponds to the APIC ID in the third argument (a2), bit 1
 corresponds to the APIC ID a2+1, and so on.
 
 Returns the number of CPUs to which the IPIs were delivered successfully.
+
+8. KVM_HC_XEN_HVM_OP
+--------------------
+
+Architecture: x86
+Status: active
+Purpose: To enable communication between a guest agent and a VMI application
+Usage:
+
+An event will be sent to the VMI application (see kvmi.rst) if the following
+registers, which differ between 32bit and 64bit, have the following values:
+
+       32bit       64bit     value
+       ---------------------------
+       ebx (a0)    rdi       KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT
+       ecx (a1)    rsi       0
+
+This specification copies Xen's { __HYPERVISOR_hvm_op,
+HVMOP_guest_request_vm_event } hypercall and can originate from kernel or
+userspace.
+
+It returns 0 if successful, or a negative POSIX.1 error code if it fails. The
+absence of an active VMI application is not signaled in any way.
+
+The following registers are clobbered:
+
+  * 32bit: edx, esi, edi, ebp
+  * 64bit: rdx, r10, r8, r9
+
+In particular, for KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT, the last two
+registers can be poisoned deliberately and cannot be used for passing
+information.
diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index da216415bf32..2603813d1ee6 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -1505,3 +1505,34 @@ trying to perform a certain operation (like creating a process).
 ``kvmi_event`` and the guest physical address are sent to the introspector.
 
 The *RETRY* action is used by the introspector for its own breakpoints.
+
+10. KVMI_EVENT_HYPERCALL
+------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+
+:Returns:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_event_reply;
+
+This event is sent on a specific user hypercall when the introspection has
+been enabled for this event (see *KVMI_CONTROL_EVENTS*).
+
+The hypercall number must be ``KVM_HC_XEN_HVM_OP`` with the
+``KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT`` sub-function
+(see hypercalls.txt).
+
+It is used by the code residing inside the introspected guest to call the
+introspection tool and to report certain details about its operation. For
+example, a classic antimalware remediation tool can report what it has
+found during a scan.
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index e998223bca1e..02e026ef5ed7 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -448,6 +448,39 @@ void kvmi_arch_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva, u8 insn_len)
 	}
 }
 
+#define KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT 24
+bool kvmi_arch_is_agent_hypercall(struct kvm_vcpu *vcpu)
+{
+	unsigned long subfunc1, subfunc2;
+	bool longmode = is_64_bit_mode(vcpu);
+
+	if (longmode) {
+		subfunc1 = kvm_register_read(vcpu, VCPU_REGS_RDI);
+		subfunc2 = kvm_register_read(vcpu, VCPU_REGS_RSI);
+	} else {
+		subfunc1 = kvm_register_read(vcpu, VCPU_REGS_RBX);
+		subfunc1 &= 0xFFFFFFFF;
+		subfunc2 = kvm_register_read(vcpu, VCPU_REGS_RCX);
+		subfunc2 &= 0xFFFFFFFF;
+	}
+
+	return (subfunc1 == KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT
+		&& subfunc2 == 0);
+}
+
+void kvmi_arch_hypercall_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+
+	action = kvmi_msg_send_hypercall(vcpu);
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		kvmi_handle_common_event_actions(vcpu, action, "HYPERCALL");
+	}
+}
+
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 			u8 access)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a9da8ac0d2b3..d568e60ae568 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7231,11 +7231,14 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 {
 	unsigned long nr, a0, a1, a2, a3, ret;
 	int op_64_bit;
+	bool kvmi_hc;
 
-	if (kvm_hv_hypercall_enabled(vcpu->kvm))
+	nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
+	kvmi_hc = (u32)nr == KVM_HC_XEN_HVM_OP;
+
+	if (kvm_hv_hypercall_enabled(vcpu->kvm) && !kvmi_hc)
 		return kvm_hv_hypercall(vcpu);
 
-	nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
 	a0 = kvm_register_read(vcpu, VCPU_REGS_RBX);
 	a1 = kvm_register_read(vcpu, VCPU_REGS_RCX);
 	a2 = kvm_register_read(vcpu, VCPU_REGS_RDX);
@@ -7252,7 +7255,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		a3 &= 0xFFFFFFFF;
 	}
 
-	if (kvm_x86_ops->get_cpl(vcpu) != 0) {
+	if (kvm_x86_ops->get_cpl(vcpu) != 0 && !kvmi_hc) {
 		ret = -KVM_EPERM;
 		goto out;
 	}
@@ -7273,6 +7276,13 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	case KVM_HC_SEND_IPI:
 		ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
 		break;
+#ifdef CONFIG_KVM_INTROSPECTION
+	case KVM_HC_XEN_HVM_OP:
+		ret = 0;
+		if (!kvmi_hypercall_event(vcpu))
+			ret = -KVM_ENOSYS;
+		break;
+#endif /* CONFIG_KVM_INTROSPECTION */
 	default:
 		ret = -KVM_ENOSYS;
 		break;
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
index 13b58b3202bb..59d83d2d0cca 100644
--- a/include/linux/kvmi.h
+++ b/include/linux/kvmi.h
@@ -17,6 +17,7 @@ int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset);
 int kvmi_vcpu_init(struct kvm_vcpu *vcpu);
 void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu);
 bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva, u8 insn_len);
+bool kvmi_hypercall_event(struct kvm_vcpu *vcpu);
 bool kvmi_queue_exception(struct kvm_vcpu *vcpu);
 void kvmi_trap_event(struct kvm_vcpu *vcpu);
 void kvmi_handle_requests(struct kvm_vcpu *vcpu);
@@ -36,6 +37,7 @@ static inline bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva,
 			{ return true; }
 static inline void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu) { }
 static inline void kvmi_handle_requests(struct kvm_vcpu *vcpu) { }
+static inline bool kvmi_hypercall_event(struct kvm_vcpu *vcpu) { return false; }
 static inline bool kvmi_queue_exception(struct kvm_vcpu *vcpu) { return true; }
 static inline void kvmi_trap_event(struct kvm_vcpu *vcpu) { }
 static inline void kvmi_init_emulate(struct kvm_vcpu *vcpu) { }
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 553f168331a4..592bda92b6d5 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -33,6 +33,8 @@
 #define KVM_HC_CLOCK_PAIRING		9
 #define KVM_HC_SEND_IPI		10
 
+#define KVM_HC_XEN_HVM_OP		34 /* Xen's __HYPERVISOR_hvm_op */
+
 /*
  * hypercalls use architecture specific
  */
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 4c868a94ac37..d04e13a0b244 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -1120,6 +1120,28 @@ bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva, u8 insn_len)
 }
 EXPORT_SYMBOL(kvmi_breakpoint_event);
 
+bool kvmi_hypercall_event(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm;
+	bool ret = false;
+
+	if (!kvmi_arch_is_agent_hypercall(vcpu))
+		return ret;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return ret;
+
+	if (is_event_enabled(vcpu, KVMI_EVENT_HYPERCALL)) {
+		kvmi_arch_hypercall_event(vcpu);
+		ret = true;
+	}
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
 /*
  * This function returns false if there is an exception or interrupt pending.
  * It returns true in all other cases including KVMI not being initialized.
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index d039446922e6..793ec269b9fa 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -185,6 +185,7 @@ int kvmi_send_event(struct kvm_vcpu *vcpu, u32 ev_id,
 		    void *ev, size_t ev_size,
 		    void *rpl, size_t rpl_size, int *action);
 u32 kvmi_msg_send_bp(struct kvm_vcpu *vcpu, u64 gpa, u8 insn_len);
+u32 kvmi_msg_send_hypercall(struct kvm_vcpu *vcpu);
 u32 kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u8 access,
 		     bool *singlestep, bool *rep_complete,
 		     u64 *ctx_addr, u8 *ctx, u32 *ctx_size);
@@ -255,6 +256,8 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 bool kvmi_arch_queue_exception(struct kvm_vcpu *vcpu);
 void kvmi_arch_trap_event(struct kvm_vcpu *vcpu);
 void kvmi_arch_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva, u8 insn_len);
+bool kvmi_arch_is_agent_hypercall(struct kvm_vcpu *vcpu);
+void kvmi_arch_hypercall_event(struct kvm_vcpu *vcpu);
 int kvmi_arch_cmd_get_cpuid(struct kvm_vcpu *vcpu,
 			    const struct kvmi_get_cpuid *req,
 			    struct kvmi_get_cpuid_reply *rpl);
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index c7a1fa5f7245..89f63f40f5cc 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -1096,6 +1096,18 @@ u32 kvmi_msg_send_bp(struct kvm_vcpu *vcpu, u64 gpa, u8 insn_len)
 	return action;
 }
 
+u32 kvmi_msg_send_hypercall(struct kvm_vcpu *vcpu)
+{
+	int err, action;
+
+	err = kvmi_send_event(vcpu, KVMI_EVENT_HYPERCALL, NULL, 0,
+			      NULL, 0, &action);
+	if (err)
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return action;
+}
+
 u32 kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u8 access,
 		     bool *singlestep, bool *rep_complete, u64 *ctx_addr,
 		     u8 *ctx_data, u32 *ctx_size)

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 63/92] kvm: introspection: add KVMI_EVENT_DESCRIPTOR
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (61 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 62/92] kvm: introspection: add KVMI_EVENT_HYPERCALL Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 64/92] kvm: introspection: add single-stepping Adalbert Lazăr
                   ` (31 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu

From: Nicușor Cîțu <ncitu@bitdefender.com>

This event is sent when IDTR, GDTR, LDTR or TR are accessed.

These could be used to implement a tiny agent which runs in the context
of an introspected guest and uses virtualized exceptions (#VE) and
alternate EPT views (VMFUNC #0) to filter converted VMEXITS. The events
of interested will be suppressed (after some appropriate guest-side
handling) while the rest will be sent to the introspector via a VMCALL.

Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 38 +++++++++++++++
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/include/uapi/asm/kvmi.h   | 11 +++++
 arch/x86/kvm/kvmi.c                | 70 ++++++++++++++++++++++++++++
 arch/x86/kvm/svm.c                 | 74 ++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c             | 59 +++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h             |  2 +
 arch/x86/kvm/x86.c                 |  6 +++
 include/linux/kvm_host.h           |  1 +
 include/linux/kvmi.h               |  4 ++
 virt/kvm/kvmi.c                    |  2 +-
 virt/kvm/kvmi_int.h                |  3 ++
 virt/kvm/kvmi_msg.c                | 17 +++++++
 13 files changed, 285 insertions(+), 3 deletions(-)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 2603813d1ee6..8721a470de87 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -1536,3 +1536,41 @@ It is used by the code residing inside the introspected guest to call the
 introspection tool and to report certain details about its operation. For
 example, a classic antimalware remediation tool can report what it has
 found during a scan.
+
+11. KVMI_EVENT_DESCRIPTOR
+-------------------------
+
+:Architecture: x86
+:Versions: >= 1
+:Actions: CONTINUE, RETRY, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_descriptor {
+		__u8 descriptor;
+		__u8 write;
+		__u8 padding[6];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_event_reply;
+
+This event is sent when a descriptor table register is accessed and the
+introspection has been enabled for this event (see **KVMI_CONTROL_EVENTS**).
+
+``kvmi_event`` and ``kvmi_event_descriptor`` are sent to the introspector.
+
+``descriptor`` can be one of::
+
+	KVMI_DESC_IDTR
+	KVMI_DESC_GDTR
+	KVMI_DESC_LDTR
+	KVMI_DESC_TR
+
+``write`` is 1 if the descriptor was written, 0 otherwise.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 91cd43a7a7bf..ad36a5fc2048 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1015,6 +1015,7 @@ struct kvm_x86_ops {
 
 	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr,
 				bool enable);
+	bool (*desc_intercept)(struct kvm_vcpu *vcpu, bool enable);
 	void (*cr3_write_exiting)(struct kvm_vcpu *vcpu, bool enable);
 	bool (*nested_pagefault)(struct kvm_vcpu *vcpu);
 	bool (*spt_fault)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/include/uapi/asm/kvmi.h b/arch/x86/include/uapi/asm/kvmi.h
index c3c96e6e2a26..0fa4ac3ed5d1 100644
--- a/arch/x86/include/uapi/asm/kvmi.h
+++ b/arch/x86/include/uapi/asm/kvmi.h
@@ -110,4 +110,15 @@ struct kvmi_get_mtrr_type_reply {
 	__u8 padding[7];
 };
 
+#define KVMI_DESC_IDTR	1
+#define KVMI_DESC_GDTR	2
+#define KVMI_DESC_LDTR	3
+#define KVMI_DESC_TR	4
+
+struct kvmi_event_descriptor {
+	__u8 descriptor;
+	__u8 write;
+	__u8 padding[6];
+};
+
 #endif /* _UAPI_ASM_X86_KVMI_H */
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 02e026ef5ed7..04cac5b8a4d0 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -161,6 +161,38 @@ bool kvmi_monitored_msr(struct kvm_vcpu *vcpu, u32 msr)
 }
 EXPORT_SYMBOL(kvmi_monitored_msr);
 
+static int kvmi_control_event_desc(struct kvm_vcpu *vcpu, bool enable)
+{
+	int err = 0;
+
+	if (enable) {
+		if (!is_event_enabled(vcpu, KVMI_EVENT_DESCRIPTOR))
+			if (!kvm_arch_vcpu_intercept_desc(vcpu, true))
+				err = -KVM_EOPNOTSUPP;
+	} else if (is_event_enabled(vcpu, KVMI_EVENT_DESCRIPTOR)) {
+		kvm_arch_vcpu_intercept_desc(vcpu, false);
+	}
+
+	return err;
+}
+
+int kvmi_arch_cmd_control_event(struct kvm_vcpu *vcpu, unsigned int event_id,
+				bool enable)
+{
+	int err;
+
+	switch (event_id) {
+	case KVMI_EVENT_DESCRIPTOR:
+		err = kvmi_control_event_desc(vcpu, enable);
+		break;
+	default:
+		err = 0;
+		break;
+	}
+
+	return err;
+}
+
 static void *alloc_get_registers_reply(const struct kvmi_msg_hdr *msg,
 				       const struct kvmi_get_registers *req,
 				       size_t *rpl_size)
@@ -604,6 +636,44 @@ void kvmi_arch_trap_event(struct kvm_vcpu *vcpu)
 	}
 }
 
+static bool __kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor,
+				    u8 write)
+{
+	u32 action;
+	bool ret = false;
+
+	action = kvmi_msg_send_descriptor(vcpu, descriptor, write);
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		ret = true;
+		break;
+	case KVMI_EVENT_ACTION_RETRY:
+		break;
+	default:
+		kvmi_handle_common_event_actions(vcpu, action, "DESC");
+	}
+
+	return ret;
+}
+
+bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor, u8 write)
+{
+	struct kvmi *ikvm;
+	bool ret = true;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return true;
+
+	if (is_event_enabled(vcpu, KVMI_EVENT_DESCRIPTOR))
+		ret = __kvmi_descriptor_event(vcpu, descriptor, write);
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+EXPORT_SYMBOL(kvmi_descriptor_event);
+
 int kvmi_arch_cmd_get_cpuid(struct kvm_vcpu *vcpu,
 			    const struct kvmi_get_cpuid *req,
 			    struct kvmi_get_cpuid_reply *rpl)
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index b4e59ef040b7..b178b8900660 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -52,6 +52,7 @@
 #include <asm/kvm_para.h>
 #include <asm/irq_remapping.h>
 #include <asm/spec-ctrl.h>
+#include <asm/kvmi.h>
 
 #include <asm/virtext.h>
 #include "trace.h"
@@ -4754,6 +4755,41 @@ static int avic_unaccelerated_access_interception(struct vcpu_svm *svm)
 	return ret;
 }
 
+#ifdef CONFIG_KVM_INTROSPECTION
+static int descriptor_access_interception(struct vcpu_svm *svm)
+{
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct vmcb_control_area *c = &svm->vmcb->control;
+
+	switch (c->exit_code) {
+	case SVM_EXIT_IDTR_READ:
+	case SVM_EXIT_IDTR_WRITE:
+		kvmi_descriptor_event(vcpu, KVMI_DESC_IDTR,
+				      c->exit_code == SVM_EXIT_IDTR_WRITE);
+		break;
+	case SVM_EXIT_GDTR_READ:
+	case SVM_EXIT_GDTR_WRITE:
+		kvmi_descriptor_event(vcpu, KVMI_DESC_GDTR,
+				      c->exit_code == SVM_EXIT_GDTR_WRITE);
+		break;
+	case SVM_EXIT_LDTR_READ:
+	case SVM_EXIT_LDTR_WRITE:
+		kvmi_descriptor_event(vcpu, KVMI_DESC_LDTR,
+				      c->exit_code == SVM_EXIT_LDTR_WRITE);
+		break;
+	case SVM_EXIT_TR_READ:
+	case SVM_EXIT_TR_WRITE:
+		kvmi_descriptor_event(vcpu, KVMI_DESC_TR,
+				      c->exit_code == SVM_EXIT_TR_WRITE);
+		break;
+	default:
+		break;
+	}
+
+	return 1;
+}
+#endif /* CONFIG_KVM_INTROSPECTION */
+
 static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_READ_CR0]			= cr_interception,
 	[SVM_EXIT_READ_CR3]			= cr_interception,
@@ -4819,6 +4855,16 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_RSM]                          = rsm_interception,
 	[SVM_EXIT_AVIC_INCOMPLETE_IPI]		= avic_incomplete_ipi_interception,
 	[SVM_EXIT_AVIC_UNACCELERATED_ACCESS]	= avic_unaccelerated_access_interception,
+#ifdef CONFIG_KVM_INTROSPECTION
+	[SVM_EXIT_IDTR_READ]			= descriptor_access_interception,
+	[SVM_EXIT_GDTR_READ]			= descriptor_access_interception,
+	[SVM_EXIT_LDTR_READ]			= descriptor_access_interception,
+	[SVM_EXIT_TR_READ]			= descriptor_access_interception,
+	[SVM_EXIT_IDTR_WRITE]			= descriptor_access_interception,
+	[SVM_EXIT_GDTR_WRITE]			= descriptor_access_interception,
+	[SVM_EXIT_LDTR_WRITE]			= descriptor_access_interception,
+	[SVM_EXIT_TR_WRITE]			= descriptor_access_interception,
+#endif /* CONFIG_KVM_INTROSPECTION */
 };
 
 static void dump_vmcb(struct kvm_vcpu *vcpu)
@@ -7141,6 +7187,33 @@ static void svm_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable)
 {
 }
 
+static bool svm_desc_intercept(struct kvm_vcpu *vcpu, bool enable)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	if (enable) {
+		set_intercept(svm, INTERCEPT_STORE_IDTR);
+		set_intercept(svm, INTERCEPT_STORE_GDTR);
+		set_intercept(svm, INTERCEPT_STORE_LDTR);
+		set_intercept(svm, INTERCEPT_STORE_TR);
+		set_intercept(svm, INTERCEPT_LOAD_IDTR);
+		set_intercept(svm, INTERCEPT_LOAD_GDTR);
+		set_intercept(svm, INTERCEPT_LOAD_LDTR);
+		set_intercept(svm, INTERCEPT_LOAD_TR);
+	} else {
+		clr_intercept(svm, INTERCEPT_STORE_IDTR);
+		clr_intercept(svm, INTERCEPT_STORE_GDTR);
+		clr_intercept(svm, INTERCEPT_STORE_LDTR);
+		clr_intercept(svm, INTERCEPT_STORE_TR);
+		clr_intercept(svm, INTERCEPT_LOAD_IDTR);
+		clr_intercept(svm, INTERCEPT_LOAD_GDTR);
+		clr_intercept(svm, INTERCEPT_LOAD_LDTR);
+		clr_intercept(svm, INTERCEPT_LOAD_TR);
+	}
+
+	return true;
+}
+
 static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -7154,6 +7227,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 
 	.cr3_write_exiting = svm_cr3_write_exiting,
 	.msr_intercept = svm_msr_intercept,
+	.desc_intercept = svm_desc_intercept,
 	.nested_pagefault = svm_nested_pagefault,
 	.spt_fault = svm_spt_fault,
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d560b583bf30..7d1e341b51ad 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -22,6 +22,7 @@
 #include <linux/kernel.h>
 #include <linux/kvm_host.h>
 #include <asm/kvmi_host.h>
+#include <uapi/linux/kvmi.h>
 #include <linux/kvmi.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
@@ -2922,8 +2923,9 @@ int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 			hw_cr4 &= ~X86_CR4_UMIP;
 		} else if (!is_guest_mode(vcpu) ||
 			!nested_cpu_has2(get_vmcs12(vcpu), SECONDARY_EXEC_DESC))
-			vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL,
-					SECONDARY_EXEC_DESC);
+			if (!to_vmx(vcpu)->tracking_desc)
+				vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL,
+						SECONDARY_EXEC_DESC);
 	}
 
 	if (cr4 & X86_CR4_VMXE) {
@@ -4691,7 +4693,43 @@ static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val)
 
 static int handle_desc(struct kvm_vcpu *vcpu)
 {
+#ifdef CONFIG_KVM_INTROSPECTION
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u32 exit_reason = vmx->exit_reason;
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	u8 store = (vmx_instruction_info >> 29) & 0x1;
+	u8 descriptor = 0;
+
+	if (!vmx->tracking_desc)
+		goto emulate;
+
+	if (exit_reason == EXIT_REASON_GDTR_IDTR) {
+		if ((vmx_instruction_info >> 28) & 0x1)
+			descriptor = KVMI_DESC_IDTR;
+		else
+			descriptor = KVMI_DESC_GDTR;
+	} else {
+		if ((vmx_instruction_info >> 28) & 0x1)
+			descriptor = KVMI_DESC_TR;
+		else
+			descriptor = KVMI_DESC_LDTR;
+	}
+
+	/*
+	 * For now, this function returns false only when the guest
+	 * is ungracefully stopped (crashed) or the current instruction
+	 * is skipped by the introspection tool.
+	 */
+	if (!kvmi_descriptor_event(vcpu, descriptor, store))
+		return 1;
+emulate:
+	/*
+	 * We are here because X86_CR4_UMIP was set or
+	 * KVMI enabled the interception.
+	 */
+#else
 	WARN_ON(!(vcpu->arch.cr4 & X86_CR4_UMIP));
+#endif /* CONFIG_KVM_INTROSPECTION */
 	return kvm_emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }
 
@@ -7840,6 +7878,22 @@ static bool vmx_spt_fault(struct kvm_vcpu *vcpu)
 	return (vmx->exit_reason == EXIT_REASON_EPT_VIOLATION);
 }
 
+static bool vmx_desc_intercept(struct kvm_vcpu *vcpu, bool enable)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!cpu_has_secondary_exec_ctrls())
+		return false;
+
+	if (enable)
+		vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL, SECONDARY_EXEC_DESC);
+	else
+		vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL, SECONDARY_EXEC_DESC);
+
+	vmx->tracking_desc = enable;
+	return true;
+}
+
 static bool vmx_get_spp_status(void)
 {
 	return spp_supported;
@@ -7875,6 +7929,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 
 	.msr_intercept = vmx_msr_intercept,
 	.cr3_write_exiting = vmx_cr3_write_exiting,
+	.desc_intercept = vmx_desc_intercept,
 	.nested_pagefault = vmx_nested_pagefault,
 	.spt_fault = vmx_spt_fault,
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 0ac0a64c7790..580b02f86011 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -268,6 +268,8 @@ struct vcpu_vmx {
 	u64 msr_ia32_feature_control_valid_bits;
 	u64 ept_pointer;
 
+	bool tracking_desc;
+
 	struct pt_desc pt_desc;
 };
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d568e60ae568..38aaddadb93a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10140,6 +10140,12 @@ bool kvm_spt_fault(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL(kvm_spt_fault);
 
+bool kvm_arch_vcpu_intercept_desc(struct kvm_vcpu *vcpu, bool enable)
+{
+	return kvm_x86_ops->desc_intercept(vcpu, enable);
+}
+EXPORT_SYMBOL(kvm_arch_vcpu_intercept_desc);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b77914e944a4..6c57291414d0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -806,6 +806,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 					struct kvm_guest_debug *dbg);
 int kvm_arch_vcpu_set_guest_debug(struct kvm_vcpu *vcpu,
 				  struct kvm_guest_debug *dbg);
+bool kvm_arch_vcpu_intercept_desc(struct kvm_vcpu *vcpu, bool enable);
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run);
 void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
 				  struct kvm_xsave *guest_xsave);
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
index 59d83d2d0cca..5d162b9e67f2 100644
--- a/include/linux/kvmi.h
+++ b/include/linux/kvmi.h
@@ -20,6 +20,7 @@ bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva, u8 insn_len);
 bool kvmi_hypercall_event(struct kvm_vcpu *vcpu);
 bool kvmi_queue_exception(struct kvm_vcpu *vcpu);
 void kvmi_trap_event(struct kvm_vcpu *vcpu);
+bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor, u8 write);
 void kvmi_handle_requests(struct kvm_vcpu *vcpu);
 void kvmi_init_emulate(struct kvm_vcpu *vcpu);
 void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu);
@@ -35,6 +36,9 @@ static inline int kvmi_vcpu_init(struct kvm_vcpu *vcpu) { return 0; }
 static inline bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva,
 					 u8 insn_len)
 			{ return true; }
+static inline bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor,
+					 u8 write)
+			{ return true; }
 static inline void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu) { }
 static inline void kvmi_handle_requests(struct kvm_vcpu *vcpu) { }
 static inline bool kvmi_hypercall_event(struct kvm_vcpu *vcpu) { return false; }
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index d04e13a0b244..d47a725a4045 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -1529,7 +1529,7 @@ int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 		err = kvmi_control_event_breakpoint(vcpu, enable);
 		break;
 	default:
-		err = 0;
+		err = kvmi_arch_cmd_control_event(vcpu, event_id, enable);
 		break;
 	}
 
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 793ec269b9fa..d7f9858d3e97 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -189,6 +189,7 @@ u32 kvmi_msg_send_hypercall(struct kvm_vcpu *vcpu);
 u32 kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u8 access,
 		     bool *singlestep, bool *rep_complete,
 		     u64 *ctx_addr, u8 *ctx, u32 *ctx_size);
+u32 kvmi_msg_send_descriptor(struct kvm_vcpu *vcpu, u8 descriptor, u8 write);
 u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu);
 u32 kvmi_msg_send_pause_vcpu(struct kvm_vcpu *vcpu);
 int kvmi_msg_send_unhook(struct kvmi *ikvm);
@@ -228,6 +229,8 @@ void kvmi_handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action,
 void kvmi_arch_update_page_tracking(struct kvm *kvm,
 				    struct kvm_memory_slot *slot,
 				    struct kvmi_mem_access *m);
+int kvmi_arch_cmd_control_event(struct kvm_vcpu *vcpu, unsigned int event_id,
+				bool enable);
 int kvmi_arch_cmd_get_registers(struct kvm_vcpu *vcpu,
 				const struct kvmi_msg_hdr *msg,
 				const struct kvmi_get_registers *req,
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 89f63f40f5cc..3e381f95b686 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -1163,6 +1163,23 @@ u32 kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u8 access,
 	return action;
 }
 
+u32 kvmi_msg_send_descriptor(struct kvm_vcpu *vcpu, u8 descriptor, u8 write)
+{
+	struct kvmi_event_descriptor e;
+	int err, action;
+
+	memset(&e, 0, sizeof(e));
+	e.descriptor = descriptor;
+	e.write = write;
+
+	err = kvmi_send_event(vcpu, KVMI_EVENT_DESCRIPTOR, &e, sizeof(e),
+			      NULL, 0, &action);
+	if (err)
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return action;
+}
+
 u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu)
 {
 	int err, action;

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 64/92] kvm: introspection: add single-stepping
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (62 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 63/92] kvm: introspection: add KVMI_EVENT_DESCRIPTOR Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-12 20:50   ` Sean Christopherson
  2019-08-09 16:00 ` [RFC PATCH v6 65/92] kvm: introspection: add KVMI_EVENT_SINGLESTEP Adalbert Lazăr
                   ` (30 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu, Jim Mattson,
	Sean Christopherson, Joerg Roedel

From: Nicușor Cîțu <ncitu@bitdefender.com>

This would be used either if the introspection tool request it as a
reply to a KVMI_EVENT_PF event or to cope with instructions that cannot
be handled by the x86 emulator during the handling of a VMEXIT. In
these situations, all other vCPU-s are kicked and held, the EPT-based
protection is removed and the guest is single stepped by the vCPU that
triggered the initial VMEXIT. Upon completion the EPT-base protection
is reinstalled and all vCPU-s all allowed to return to the guest.

This is a rather slow workaround that kicks in occasionally. In the
future, the most frequently single-stepped instructions should be added
to the emulator (usually, stores to and from memory - SSE/AVX).

For the moment it works only on Intel.

CC: Jim Mattson <jmattson@google.com>
CC: Sean Christopherson <sean.j.christopherson@intel.com>
CC: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Co-developed-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/kvm/kvmi.c             |  47 ++++++++++-
 arch/x86/kvm/svm.c              |   5 ++
 arch/x86/kvm/vmx/vmx.c          |  17 ++++
 arch/x86/kvm/x86.c              |  19 +++++
 include/linux/kvmi.h            |   4 +
 virt/kvm/kvmi.c                 | 145 +++++++++++++++++++++++++++++++-
 virt/kvm/kvmi_int.h             |  16 ++++
 8 files changed, 253 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ad36a5fc2048..60e2c298d469 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1016,6 +1016,7 @@ struct kvm_x86_ops {
 	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr,
 				bool enable);
 	bool (*desc_intercept)(struct kvm_vcpu *vcpu, bool enable);
+	void (*set_mtf)(struct kvm_vcpu *vcpu, bool enable);
 	void (*cr3_write_exiting)(struct kvm_vcpu *vcpu, bool enable);
 	bool (*nested_pagefault)(struct kvm_vcpu *vcpu);
 	bool (*spt_fault)(struct kvm_vcpu *vcpu);
@@ -1628,6 +1629,8 @@ void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
 				bool enable);
 bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu);
 bool kvm_spt_fault(struct kvm_vcpu *vcpu);
+void kvm_set_mtf(struct kvm_vcpu *vcpu, bool enable);
+void kvm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
 void kvm_control_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 04cac5b8a4d0..f0ab4bd9eb37 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -520,7 +520,6 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 	u32 ctx_size;
 	u64 ctx_addr;
 	u32 action;
-	bool singlestep_ignored;
 	bool ret = false;
 
 	if (!kvm_spt_fault(vcpu))
@@ -533,7 +532,7 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 	if (ivcpu->effective_rep_complete)
 		return true;
 
-	action = kvmi_msg_send_pf(vcpu, gpa, gva, access, &singlestep_ignored,
+	action = kvmi_msg_send_pf(vcpu, gpa, gva, access, &ivcpu->ss_requested,
 				  &ivcpu->rep_complete, &ctx_addr,
 				  ivcpu->ctx_data, &ctx_size);
 
@@ -547,6 +546,8 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 		ret = true;
 		break;
 	case KVMI_EVENT_ACTION_RETRY:
+		if (ivcpu->ss_requested && !kvmi_start_ss(vcpu, gpa, access))
+			ret = true;
 		break;
 	default:
 		kvmi_handle_common_event_actions(vcpu, action, "PF");
@@ -758,6 +759,48 @@ int kvmi_arch_cmd_control_cr(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+void kvmi_arch_start_single_step(struct kvm_vcpu *vcpu)
+{
+	kvm_set_mtf(vcpu, true);
+
+	/*
+	 * Set block by STI only if the RFLAGS.IF = 1.
+	 * Blocking by both STI and MOV/POP SS is not possible.
+	 */
+	if (kvm_arch_interrupt_allowed(vcpu))
+		kvm_set_interrupt_shadow(vcpu, KVM_X86_SHADOW_INT_STI);
+
+}
+
+void kvmi_arch_stop_single_step(struct kvm_vcpu *vcpu)
+{
+	kvm_set_mtf(vcpu, false);
+	/*
+	 * The blocking by STI is cleared after the guest
+	 * executes one instruction or incurs an exception.
+	 * However we migh stop the SS before entering to guest,
+	 * so be sure we are clearing the STI blocking.
+	 */
+	kvm_set_interrupt_shadow(vcpu, 0);
+}
+
+u8 kvmi_arch_relax_page_access(u8 old, u8 new)
+{
+	u8 ret = old | new;
+
+	/*
+	 * An SPTE entry with just the -wx bits set can trigger a
+	 * misconfiguration error from the hardware, as it's the case
+	 * for x86 where this access mode is used to mark I/O memory.
+	 * Thus, we make sure that -wx accesses are translated to rwx.
+	 */
+	if ((ret & (KVMI_PAGE_ACCESS_W | KVMI_PAGE_ACCESS_X)) ==
+	    (KVMI_PAGE_ACCESS_W | KVMI_PAGE_ACCESS_X))
+		ret |= KVMI_PAGE_ACCESS_R;
+
+	return ret;
+}
+
 static const struct {
 	unsigned int allow_bit;
 	enum kvm_page_track_mode track_mode;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index b178b8900660..3481c0247680 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -7183,6 +7183,10 @@ static bool svm_spt_fault(struct kvm_vcpu *vcpu)
 	return (svm->vmcb->control.exit_code == SVM_EXIT_NPF);
 }
 
+static void svm_set_mtf(struct kvm_vcpu *vcpu, bool enable)
+{
+}
+
 static void svm_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable)
 {
 }
@@ -7225,6 +7229,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr,
 	.has_emulated_msr = svm_has_emulated_msr,
 
+	.set_mtf = svm_set_mtf,
 	.cr3_write_exiting = svm_cr3_write_exiting,
 	.msr_intercept = svm_msr_intercept,
 	.desc_intercept = svm_desc_intercept,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7d1e341b51ad..f0369d0574dc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5384,6 +5384,7 @@ static int handle_invalid_op(struct kvm_vcpu *vcpu)
 
 static int handle_monitor_trap(struct kvm_vcpu *vcpu)
 {
+	kvmi_stop_ss(vcpu);
 	return 1;
 }
 
@@ -5992,6 +5993,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 		}
 	}
 
+	if (kvmi_vcpu_enabled_ss(vcpu)
+			&& exit_reason != EXIT_REASON_EPT_VIOLATION
+			&& exit_reason != EXIT_REASON_MONITOR_TRAP_FLAG)
+		kvmi_stop_ss(vcpu);
+
 	if (exit_reason < kvm_vmx_max_exit_handlers
 	    && kvm_vmx_exit_handlers[exit_reason])
 		return kvm_vmx_exit_handlers[exit_reason](vcpu);
@@ -7842,6 +7848,16 @@ static __exit void hardware_unsetup(void)
 	free_kvm_area();
 }
 
+static void vmx_set_mtf(struct kvm_vcpu *vcpu, bool enable)
+{
+	if (enable)
+		vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL,
+			      CPU_BASED_MONITOR_TRAP_FLAG);
+	else
+		vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
+				CPU_BASED_MONITOR_TRAP_FLAG);
+}
+
 static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
 			      bool enable)
 {
@@ -7927,6 +7943,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_accelerated_tpr = report_flexpriority,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
+	.set_mtf = vmx_set_mtf,
 	.msr_intercept = vmx_msr_intercept,
 	.cr3_write_exiting = vmx_cr3_write_exiting,
 	.desc_intercept = vmx_desc_intercept,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 38aaddadb93a..65855340249a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7358,6 +7358,13 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
 {
 	int r;
 
+	if (kvmi_vcpu_enabled_ss(vcpu))
+		/*
+		 * We cannot inject events during single-stepping.
+		 * Try again later.
+		 */
+		return -1;
+
 	/* try to reinject previous events if any */
 
 	if (vcpu->arch.exception.injected)
@@ -10134,6 +10141,18 @@ void kvm_control_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable)
 }
 EXPORT_SYMBOL(kvm_control_cr3_write_exiting);
 
+void kvm_set_mtf(struct kvm_vcpu *vcpu, bool enable)
+{
+	kvm_x86_ops->set_mtf(vcpu, enable);
+}
+EXPORT_SYMBOL(kvm_set_mtf);
+
+void kvm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
+{
+	kvm_x86_ops->set_interrupt_shadow(vcpu, mask);
+}
+EXPORT_SYMBOL(kvm_set_interrupt_shadow);
+
 bool kvm_spt_fault(struct kvm_vcpu *vcpu)
 {
 	return kvm_x86_ops->spt_fault(vcpu);
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
index 5d162b9e67f2..1dc90284dc3a 100644
--- a/include/linux/kvmi.h
+++ b/include/linux/kvmi.h
@@ -22,6 +22,8 @@ bool kvmi_queue_exception(struct kvm_vcpu *vcpu);
 void kvmi_trap_event(struct kvm_vcpu *vcpu);
 bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor, u8 write);
 void kvmi_handle_requests(struct kvm_vcpu *vcpu);
+void kvmi_stop_ss(struct kvm_vcpu *vcpu);
+bool kvmi_vcpu_enabled_ss(struct kvm_vcpu *vcpu);
 void kvmi_init_emulate(struct kvm_vcpu *vcpu);
 void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu);
 bool kvmi_bp_intercepted(struct kvm_vcpu *vcpu, u32 dbg);
@@ -44,6 +46,8 @@ static inline void kvmi_handle_requests(struct kvm_vcpu *vcpu) { }
 static inline bool kvmi_hypercall_event(struct kvm_vcpu *vcpu) { return false; }
 static inline bool kvmi_queue_exception(struct kvm_vcpu *vcpu) { return true; }
 static inline void kvmi_trap_event(struct kvm_vcpu *vcpu) { }
+static inline void kvmi_stop_ss(struct kvm_vcpu *vcpu) { }
+static inline bool kvmi_vcpu_enabled_ss(struct kvm_vcpu *vcpu) { return false; }
 static inline void kvmi_init_emulate(struct kvm_vcpu *vcpu) { }
 static inline void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu) { }
 static inline bool kvmi_bp_intercepted(struct kvm_vcpu *vcpu, u32 dbg)
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index d47a725a4045..a3a5af9080a9 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -1260,11 +1260,19 @@ void kvmi_run_jobs(struct kvm_vcpu *vcpu)
 	}
 }
 
+static bool need_to_wait_for_ss(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvmi *ikvm = IKVM(vcpu->kvm);
+
+	return atomic_read(&ikvm->ss_active) && !ivcpu->ss_owner;
+}
+
 static bool need_to_wait(struct kvm_vcpu *vcpu)
 {
 	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
 
-	return ivcpu->reply_waiting;
+	return ivcpu->reply_waiting || need_to_wait_for_ss(vcpu);
 }
 
 static bool done_waiting(struct kvm_vcpu *vcpu)
@@ -1572,6 +1580,141 @@ int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu, bool wait)
 	return 0;
 }
 
+void kvmi_stop_ss(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvm *kvm = vcpu->kvm;
+	struct kvmi *ikvm;
+	int i;
+
+	ikvm = kvmi_get(kvm);
+	if (!ikvm)
+		return;
+
+	if (unlikely(!ivcpu->ss_owner)) {
+		kvmi_warn(ikvm, "%s\n", __func__);
+		goto out;
+	}
+
+	for (i = ikvm->ss_level; i--;)
+		kvmi_set_gfn_access(kvm,
+				    ikvm->ss_context[i].gfn,
+				    ikvm->ss_context[i].old_access,
+				    ikvm->ss_context[i].old_write_bitmap);
+
+	ikvm->ss_level = 0;
+
+	kvmi_arch_stop_single_step(vcpu);
+
+	atomic_set(&ikvm->ss_active, false);
+	/*
+	 * Make ss_active update visible
+	 * before resuming all the other vCPUs.
+	 */
+	smp_mb__after_atomic();
+	kvm_make_all_cpus_request(kvm, 0);
+
+	ivcpu->ss_owner = false;
+
+out:
+	kvmi_put(kvm);
+}
+EXPORT_SYMBOL(kvmi_stop_ss);
+
+static bool kvmi_acquire_ss(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvmi *ikvm = IKVM(vcpu->kvm);
+
+	if (ivcpu->ss_owner)
+		return true;
+
+	if (atomic_cmpxchg(&ikvm->ss_active, false, true) != false)
+		return false;
+
+	kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_INTROSPECTION |
+						KVM_REQUEST_WAIT);
+
+	ivcpu->ss_owner = true;
+
+	return true;
+}
+
+static bool kvmi_run_ss(struct kvm_vcpu *vcpu, gpa_t gpa, u8 access)
+{
+	struct kvmi *ikvm = IKVM(vcpu->kvm);
+	u8 old_access, new_access;
+	u32 old_write_bitmap;
+	gfn_t gfn = gpa_to_gfn(gpa);
+	int err;
+
+	kvmi_arch_start_single_step(vcpu);
+
+	err = kvmi_get_gfn_access(ikvm, gfn, &old_access, &old_write_bitmap);
+	/* likely was removed from radix tree due to rwx */
+	if (err) {
+		kvmi_warn(ikvm, "%s: gfn 0x%llx not found in the radix tree\n",
+			  __func__, gfn);
+		return true;
+	}
+
+	if (ikvm->ss_level == SINGLE_STEP_MAX_DEPTH - 1) {
+		kvmi_err(ikvm, "single step limit reached\n");
+		return false;
+	}
+
+	ikvm->ss_context[ikvm->ss_level].gfn = gfn;
+	ikvm->ss_context[ikvm->ss_level].old_access = old_access;
+	ikvm->ss_context[ikvm->ss_level].old_write_bitmap = old_write_bitmap;
+	ikvm->ss_level++;
+
+	new_access = kvmi_arch_relax_page_access(old_access, access);
+
+	kvmi_set_gfn_access(vcpu->kvm, gfn, new_access, old_write_bitmap);
+
+	return true;
+}
+
+bool kvmi_start_ss(struct kvm_vcpu *vcpu, gpa_t gpa, u8 access)
+{
+	bool ret = false;
+
+	while (!kvmi_acquire_ss(vcpu)) {
+		int err = kvmi_run_jobs_and_wait(vcpu);
+
+		if (err) {
+			kvmi_err(IKVM(vcpu->kvm), "kvmi_acquire_ss() has failed\n");
+			goto out;
+		}
+	}
+
+	if (kvmi_run_ss(vcpu, gpa, access))
+		ret = true;
+	else
+		kvmi_stop_ss(vcpu);
+
+out:
+	return ret;
+}
+
+bool kvmi_vcpu_enabled_ss(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvmi *ikvm;
+	bool ret;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return false;
+
+	ret = ivcpu->ss_owner;
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+EXPORT_SYMBOL(kvmi_vcpu_enabled_ss);
+
 static void kvmi_job_abort(struct kvm_vcpu *vcpu, void *ctx)
 {
 	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index d7f9858d3e97..1550fe33ed48 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -126,6 +126,9 @@ struct kvmi_vcpu {
 		DECLARE_BITMAP(high, KVMI_NUM_MSR);
 	} msr_mask;
 
+	bool ss_owner;
+	bool ss_requested;
+
 	struct list_head job_list;
 	spinlock_t job_lock;
 
@@ -151,6 +154,15 @@ struct kvmi {
 	DECLARE_BITMAP(event_allow_mask, KVMI_NUM_EVENTS);
 	DECLARE_BITMAP(vm_ev_mask, KVMI_NUM_EVENTS);
 
+#define SINGLE_STEP_MAX_DEPTH 8
+	struct {
+		gfn_t gfn;
+		u8 old_access;
+		u32 old_write_bitmap;
+	} ss_context[SINGLE_STEP_MAX_DEPTH];
+	u8 ss_level;
+	atomic_t ss_active;
+
 	struct {
 		bool initialized;
 		atomic_t enabled;
@@ -224,6 +236,7 @@ int kvmi_add_job(struct kvm_vcpu *vcpu,
 		 void *ctx, void (*free_fct)(void *ctx));
 void kvmi_handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action,
 				      const char *str);
+bool kvmi_start_ss(struct kvm_vcpu *vcpu, gpa_t gpa, u8 access);
 
 /* arch */
 void kvmi_arch_update_page_tracking(struct kvm *kvm,
@@ -274,6 +287,9 @@ int kvmi_arch_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
 				   u64 address);
 int kvmi_arch_cmd_control_cr(struct kvm_vcpu *vcpu,
 			     const struct kvmi_control_cr *req);
+void kvmi_arch_start_single_step(struct kvm_vcpu *vcpu);
+void kvmi_arch_stop_single_step(struct kvm_vcpu *vcpu);
+u8 kvmi_arch_relax_page_access(u8 old, u8 new);
 int kvmi_arch_cmd_control_msr(struct kvm_vcpu *vcpu,
 			      const struct kvmi_control_msr *req);
 int kvmi_arch_cmd_get_mtrr_type(struct kvm_vcpu *vcpu, u64 gpa, u8 *type);

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 65/92] kvm: introspection: add KVMI_EVENT_SINGLESTEP
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (63 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 64/92] kvm: introspection: add single-stepping Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 66/92] kvm: introspection: add custom input when single-stepping a vCPU Adalbert Lazăr
                   ` (29 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu

From: Nicușor Cîțu <ncitu@bitdefender.com>

This event is sent when the current instruction has been single stepped
as a result of a KVMI_EVENT_PF event to which the introspection tool
set the singlestep field and responded with CONTINUE.

Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 25 +++++++++++++++++++
 virt/kvm/kvmi.c                    | 40 ++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 8721a470de87..572abab1f6ef 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -1574,3 +1574,28 @@ introspection has been enabled for this event (see **KVMI_CONTROL_EVENTS**).
 	KVMI_DESC_TR
 
 ``write`` is 1 if the descriptor was written, 0 otherwise.
+
+12. KVMI_EVENT_SINGLESTEP
+-------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+
+:Returns:
+
+::
+
+	struct kvmi_vcpu_hdr;
+	struct kvmi_event_reply;
+
+This event is sent when the current instruction has been executed
+(as a result of a *KVMI_EVENT_PF* event to which the introspection
+tool set the ``singlestep`` field and responded with *CONTINUE*)
+and the introspection has been enabled for this event
+(see **KVMI_CONTROL_EVENTS**).
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index a3a5af9080a9..3dfedf3ae739 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -1182,6 +1182,44 @@ void kvmi_trap_event(struct kvm_vcpu *vcpu)
 	kvmi_put(vcpu->kvm);
 }
 
+static u32 kvmi_send_singlestep(struct kvm_vcpu *vcpu)
+{
+	int err, action;
+
+	err = kvmi_send_event(vcpu, KVMI_EVENT_SINGLESTEP, NULL, 0,
+			      NULL, 0, &action);
+	if (err)
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return action;
+}
+
+static void __kvmi_singlestep_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+
+	action = kvmi_send_singlestep(vcpu);
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		kvmi_handle_common_event_actions(vcpu, action, "SINGLESTEP");
+	}
+}
+
+static void kvmi_singlestep_event(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	if (!ivcpu->ss_requested)
+		return;
+
+	if (is_event_enabled(vcpu, KVMI_EVENT_SINGLESTEP))
+		__kvmi_singlestep_event(vcpu);
+
+	ivcpu->ss_requested = false;
+}
+
 static bool __kvmi_create_vcpu_event(struct kvm_vcpu *vcpu)
 {
 	u32 action;
@@ -1616,6 +1654,8 @@ void kvmi_stop_ss(struct kvm_vcpu *vcpu)
 
 	ivcpu->ss_owner = false;
 
+	kvmi_singlestep_event(vcpu);
+
 out:
 	kvmi_put(kvm);
 }

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 66/92] kvm: introspection: add custom input when single-stepping a vCPU
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (64 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 65/92] kvm: introspection: add KVMI_EVENT_SINGLESTEP Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 67/92] kvm: introspection: use single stepping on unimplemented instructions Adalbert Lazăr
                   ` (28 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

The introspection tool can respond to a KVMI_EVENT_PF event with custom
input for the current instruction. This input is used to trick the guest
software into believing it has read certain data, in order to hide the
content of certain memory areas (eg. hide injected code from integrity
checkers). There are cases when this can happen while the vCPU has to
be single stepped, Either the current instruction is not supported by
the KVM emulator or the introspection tool requested single-stepping.

This patch saves the old data, write the custom input, start the single
stepping and restore the old data.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 virt/kvm/kvmi.c     | 119 ++++++++++++++++++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h |   3 ++
 2 files changed, 122 insertions(+)

diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 3dfedf3ae739..06dc23f40ded 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -1618,6 +1618,116 @@ int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu, bool wait)
 	return 0;
 }
 
+static int write_custom_data_to_page(struct kvm_vcpu *vcpu, gva_t gva,
+					u8 *backup, size_t bytes)
+{
+	u8 *ptr_page, *ptr;
+	struct page *page;
+	gpa_t gpa;
+
+	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
+	if (gpa == UNMAPPED_GVA)
+		return -KVM_EINVAL;
+
+	ptr_page = get_page_ptr(vcpu->kvm, gpa, &page, true);
+	if (!ptr_page)
+		return -KVM_EINVAL;
+
+	ptr = ptr_page + (gpa & ~PAGE_MASK);
+
+	memcpy(backup, ptr, bytes);
+	use_custom_input(vcpu, gva, ptr, bytes);
+
+	put_page_ptr(ptr_page, page);
+
+	return 0;
+}
+
+static int write_custom_data(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm = IKVM(vcpu->kvm);
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	size_t bytes = ivcpu->ctx_size;
+	gva_t gva = ivcpu->ctx_addr;
+	u8 *backup;
+
+	if (ikvm->ss_custom_size)
+		return 0;
+
+	if (!bytes)
+		return 0;
+
+	backup = ikvm->ss_custom_data;
+
+	while (bytes) {
+		size_t offset = gva & ~PAGE_MASK;
+		size_t chunk = min(bytes, PAGE_SIZE - offset);
+
+		if (write_custom_data_to_page(vcpu, gva, backup, chunk))
+			return -KVM_EINVAL;
+
+		bytes -= chunk;
+		backup += chunk;
+		gva += chunk;
+		ikvm->ss_custom_size += chunk;
+	}
+
+	return 0;
+}
+
+static int restore_backup_data_to_page(struct kvm_vcpu *vcpu, gva_t gva,
+					u8 *src, size_t bytes)
+{
+	u8 *ptr_page, *ptr;
+	struct page *page;
+	gpa_t gpa;
+
+	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
+	if (gpa == UNMAPPED_GVA)
+		return -KVM_EINVAL;
+
+	ptr_page = get_page_ptr(vcpu->kvm, gpa, &page, true);
+	if (!ptr_page)
+		return -KVM_EINVAL;
+
+	ptr = ptr_page + (gpa & ~PAGE_MASK);
+
+	memcpy(ptr, src, bytes);
+
+	put_page_ptr(ptr_page, page);
+
+	return 0;
+}
+
+static void restore_backup_data(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm = IKVM(vcpu->kvm);
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	size_t bytes = ikvm->ss_custom_size;
+	gva_t gva = ivcpu->ctx_addr;
+	u8 *backup;
+
+	if (!bytes)
+		return;
+
+	backup = ikvm->ss_custom_data;
+
+	while (bytes) {
+		size_t offset = gva & ~PAGE_MASK;
+		size_t chunk = min(bytes, PAGE_SIZE - offset);
+
+		if (restore_backup_data_to_page(vcpu, gva, backup, chunk))
+			goto out;
+
+		bytes -= chunk;
+		backup += chunk;
+		gva += chunk;
+	}
+
+out:
+	ikvm->ss_custom_size = 0;
+}
+
 void kvmi_stop_ss(struct kvm_vcpu *vcpu)
 {
 	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
@@ -1642,6 +1752,8 @@ void kvmi_stop_ss(struct kvm_vcpu *vcpu)
 
 	ikvm->ss_level = 0;
 
+	restore_backup_data(vcpu);
+
 	kvmi_arch_stop_single_step(vcpu);
 
 	atomic_set(&ikvm->ss_active, false);
@@ -1676,6 +1788,7 @@ static bool kvmi_acquire_ss(struct kvm_vcpu *vcpu)
 						KVM_REQUEST_WAIT);
 
 	ivcpu->ss_owner = true;
+	ikvm->ss_custom_size = 0;
 
 	return true;
 }
@@ -1690,6 +1803,12 @@ static bool kvmi_run_ss(struct kvm_vcpu *vcpu, gpa_t gpa, u8 access)
 
 	kvmi_arch_start_single_step(vcpu);
 
+	err = write_custom_data(vcpu);
+	if (err) {
+		kvmi_err(ikvm, "writing custom data failed, err %d\n", err);
+		return false;
+	}
+
 	err = kvmi_get_gfn_access(ikvm, gfn, &old_access, &old_write_bitmap);
 	/* likely was removed from radix tree due to rwx */
 	if (err) {
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 1550fe33ed48..5485529db06b 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -160,6 +160,9 @@ struct kvmi {
 		u8 old_access;
 		u32 old_write_bitmap;
 	} ss_context[SINGLE_STEP_MAX_DEPTH];
+	u8 ss_custom_data[KVMI_CTX_DATA_SIZE];
+	size_t ss_custom_size;
+	gpa_t ss_custom_addr;
 	u8 ss_level;
 	atomic_t ss_active;
 

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 67/92] kvm: introspection: use single stepping on unimplemented instructions
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (65 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 66/92] kvm: introspection: add custom input when single-stepping a vCPU Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 68/92] kvm: x86: emulate a guest page table walk on SPT violations due to A/D bit updates Adalbert Lazăr
                   ` (27 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu

From: Mihai Donțu <mdontu@bitdefender.com>

On emulation failures, we notify the introspection tool for read/write
operations if needed. Unless it responds with RETRY (to re-enter guest),
we continue single stepping the vCPU.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Co-developed-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |  5 +++
 arch/x86/include/asm/vmx.h      |  2 ++
 arch/x86/kvm/kvmi.c             | 21 ++++++++++++
 arch/x86/kvm/mmu.c              |  5 +++
 arch/x86/kvm/svm.c              |  8 +++++
 arch/x86/kvm/vmx/vmx.c          | 13 ++++++--
 arch/x86/kvm/x86.c              | 57 ++++++++++++++++++++++++++++++++-
 include/linux/kvmi.h            |  4 +++
 virt/kvm/kvmi.c                 | 56 ++++++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h             |  1 +
 10 files changed, 169 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 60e2c298d469..2392678dde46 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -770,6 +770,9 @@ struct kvm_vcpu_arch {
 	/* set at EPT violation at this point */
 	unsigned long exit_qualification;
 
+	/* #PF translated error code from EPT/NPT exit reason */
+	u64 error_code;
+
 	/* pv related host specific info */
 	struct {
 		bool pv_unhalted;
@@ -1016,6 +1019,7 @@ struct kvm_x86_ops {
 	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr,
 				bool enable);
 	bool (*desc_intercept)(struct kvm_vcpu *vcpu, bool enable);
+	u64 (*fault_gla)(struct kvm_vcpu *vcpu);
 	void (*set_mtf)(struct kvm_vcpu *vcpu, bool enable);
 	void (*cr3_write_exiting)(struct kvm_vcpu *vcpu, bool enable);
 	bool (*nested_pagefault)(struct kvm_vcpu *vcpu);
@@ -1627,6 +1631,7 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 
 void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
 				bool enable);
+u64 kvm_mmu_fault_gla(struct kvm_vcpu *vcpu);
 bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu);
 bool kvm_spt_fault(struct kvm_vcpu *vcpu);
 void kvm_set_mtf(struct kvm_vcpu *vcpu, bool enable);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 11ca64ced578..bc0f5bbd692c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -538,6 +538,7 @@ struct vmx_msr_entry {
 #define EPT_VIOLATION_READABLE_BIT	3
 #define EPT_VIOLATION_WRITABLE_BIT	4
 #define EPT_VIOLATION_EXECUTABLE_BIT	5
+#define EPT_VIOLATION_GLA_VALID_BIT	7
 #define EPT_VIOLATION_GVA_TRANSLATED_BIT 8
 #define EPT_VIOLATION_ACC_READ		(1 << EPT_VIOLATION_ACC_READ_BIT)
 #define EPT_VIOLATION_ACC_WRITE		(1 << EPT_VIOLATION_ACC_WRITE_BIT)
@@ -545,6 +546,7 @@ struct vmx_msr_entry {
 #define EPT_VIOLATION_READABLE		(1 << EPT_VIOLATION_READABLE_BIT)
 #define EPT_VIOLATION_WRITABLE		(1 << EPT_VIOLATION_WRITABLE_BIT)
 #define EPT_VIOLATION_EXECUTABLE	(1 << EPT_VIOLATION_EXECUTABLE_BIT)
+#define EPT_VIOLATION_GLA_VALID		(1 << EPT_VIOLATION_GLA_VALID_BIT)
 #define EPT_VIOLATION_GVA_TRANSLATED	(1 << EPT_VIOLATION_GVA_TRANSLATED_BIT)
 
 /*
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index f0ab4bd9eb37..9d66c7d6c953 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -759,6 +759,27 @@ int kvmi_arch_cmd_control_cr(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+bool is_ud2_instruction(struct kvm_vcpu *vcpu, int *emulation_type)
+{
+	u8 ud2[] = {0x0F, 0x0B};
+	u8 insn_len = vcpu->arch.emulate_ctxt.fetch.ptr -
+		      vcpu->arch.emulate_ctxt.fetch.data;
+
+	if (insn_len != sizeof(ud2))
+		return false;
+
+	if (memcmp(vcpu->arch.emulate_ctxt.fetch.data, ud2, insn_len))
+		return false;
+
+	/* Do not reexecute the UD2 instruction, else we might enter to an
+	 * endless emulation loop. Let the emulator fall down through the
+	 * handle_emulation_failure() which shall inject the #UD exception.
+	 */
+	*emulation_type &= ~EMULTYPE_ALLOW_RETRY;
+
+	return true;
+}
+
 void kvmi_arch_start_single_step(struct kvm_vcpu *vcpu)
 {
 	kvm_set_mtf(vcpu, true);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0b859b1797f6..c2f863797495 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -6667,6 +6667,11 @@ void kvm_mmu_module_exit(void)
 	mmu_audit_disable();
 }
 
+u64 kvm_mmu_fault_gla(struct kvm_vcpu *vcpu)
+{
+	return kvm_x86_ops->fault_gla(vcpu);
+}
+
 bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu)
 {
 	return kvm_x86_ops->nested_pagefault(vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 3481c0247680..cb536a2611f6 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2675,6 +2675,8 @@ static int pf_interception(struct vcpu_svm *svm)
 	u64 fault_address = __sme_clr(svm->vmcb->control.exit_info_2);
 	u64 error_code = svm->vmcb->control.exit_info_1;
 
+	svm->vcpu.arch.error_code = error_code;
+
 	return kvm_handle_page_fault(&svm->vcpu, error_code, fault_address,
 			static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
 			svm->vmcb->control.insn_bytes : NULL,
@@ -7171,6 +7173,11 @@ static void svm_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
 	set_msr_interception(svm, msrpm, msr, enable, enable);
 }
 
+static u64 svm_fault_gla(struct kvm_vcpu *vcpu)
+{
+	return ~0ull;
+}
+
 static bool svm_nested_pagefault(struct kvm_vcpu *vcpu)
 {
 	return false;
@@ -7233,6 +7240,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cr3_write_exiting = svm_cr3_write_exiting,
 	.msr_intercept = svm_msr_intercept,
 	.desc_intercept = svm_desc_intercept,
+	.fault_gla = svm_fault_gla,
 	.nested_pagefault = svm_nested_pagefault,
 	.spt_fault = svm_spt_fault,
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f0369d0574dc..dc648ba47df3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5171,10 +5171,11 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 			EPT_VIOLATION_EXECUTABLE))
 		      ? PFERR_PRESENT_MASK : 0;
 
-	error_code |= (exit_qualification & 0x100) != 0 ?
-	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
+	error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)
+		      ? PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
 
 	vcpu->arch.exit_qualification = exit_qualification;
+	vcpu->arch.error_code = error_code;
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
@@ -7880,6 +7881,13 @@ static void vmx_cr3_write_exiting(struct kvm_vcpu *vcpu,
 	/* TODO: nested ? vmcs12->cpu_based_vm_exec_control */
 }
 
+static u64 vmx_fault_gla(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->arch.exit_qualification & EPT_VIOLATION_GLA_VALID)
+		return vmcs_readl(GUEST_LINEAR_ADDRESS);
+	return ~0ull;
+}
+
 static bool vmx_nested_pagefault(struct kvm_vcpu *vcpu)
 {
 	if (vcpu->arch.exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)
@@ -7947,6 +7955,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.msr_intercept = vmx_msr_intercept,
 	.cr3_write_exiting = vmx_cr3_write_exiting,
 	.desc_intercept = vmx_desc_intercept,
+	.fault_gla = vmx_fault_gla,
 	.nested_pagefault = vmx_nested_pagefault,
 	.spt_fault = vmx_spt_fault,
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 65855340249a..dd10f9e0c054 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6526,6 +6526,53 @@ static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt)
 	return false;
 }
 
+/*
+ * With introspection enabled, emulation failures translate in events being
+ * missed because the read/write callbacks are not invoked. All we have is
+ * the fetch event (kvm_page_track_preexec). Below we use the EPT/NPT VMEXIT
+ * information to generate the events, but without providing accurate
+ * data and size (the emulator would have computed those). If an instruction
+ * would happen to read and write in the same page, the second event will
+ * initially be missed and we rely on the page tracking mechanism to bring
+ * us back here to send it.
+ */
+static bool kvm_page_track_emulation_failure(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+	u64 error_code = vcpu->arch.error_code;
+	bool data_ready = false;
+	u8 data = 0;
+	gva_t gva;
+	bool ret;
+
+	/* MMIO emulation failures should be treated the normal way */
+	if (unlikely(error_code & PFERR_RSVD_MASK))
+		return true;
+
+	/* EPT/NTP must be enabled */
+	if (unlikely(!vcpu->arch.mmu->direct_map))
+		return true;
+
+	/*
+	 * The A/D bit emulation should make this test unneeded, but just
+	 * in case
+	 */
+	if (unlikely((error_code & PFERR_NESTED_GUEST_PAGE) ==
+		     PFERR_NESTED_GUEST_PAGE))
+		return true;
+
+	gva = kvm_mmu_fault_gla(vcpu);
+
+	if (error_code & PFERR_WRITE_MASK)
+		ret = kvm_page_track_prewrite(vcpu, gpa, gva, &data, 0);
+	else if (error_code & PFERR_USER_MASK)
+		ret = kvm_page_track_preread(vcpu, gpa, gva, &data, 0,
+					     &data_ready);
+	else
+		ret = true;
+
+	return ret;
+}
+
 int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 			    unsigned long cr2,
 			    int emulation_type,
@@ -6574,9 +6621,13 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 		++vcpu->stat.insn_emulation;
 		if (r == EMULATION_RETRY_INSTR)
 			return EMULATE_DONE;
-		if (r != EMULATION_OK)  {
+		if (r != EMULATION_OK) {
 			if (emulation_type & EMULTYPE_TRAP_UD)
 				return EMULATE_FAIL;
+			if (!kvm_page_track_emulation_failure(vcpu, cr2))
+				return EMULATE_DONE;
+			if (kvmi_single_step(vcpu, cr2, &emulation_type))
+				return EMULATE_DONE;
 			if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
 						emulation_type))
 				return EMULATE_DONE;
@@ -6621,6 +6672,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 		return EMULATE_DONE;
 
 	if (r == EMULATION_FAILED) {
+		if (!kvm_page_track_emulation_failure(vcpu, cr2))
+			return EMULATE_DONE;
+		if (kvmi_single_step(vcpu, cr2, &emulation_type))
+			return EMULATE_DONE;
 		if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
 					emulation_type))
 			return EMULATE_DONE;
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
index 1dc90284dc3a..69db02795fc0 100644
--- a/include/linux/kvmi.h
+++ b/include/linux/kvmi.h
@@ -21,6 +21,7 @@ bool kvmi_hypercall_event(struct kvm_vcpu *vcpu);
 bool kvmi_queue_exception(struct kvm_vcpu *vcpu);
 void kvmi_trap_event(struct kvm_vcpu *vcpu);
 bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor, u8 write);
+bool kvmi_single_step(struct kvm_vcpu *vcpu, gpa_t gpa, int *emulation_type);
 void kvmi_handle_requests(struct kvm_vcpu *vcpu);
 void kvmi_stop_ss(struct kvm_vcpu *vcpu);
 bool kvmi_vcpu_enabled_ss(struct kvm_vcpu *vcpu);
@@ -41,6 +42,9 @@ static inline bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva,
 static inline bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor,
 					 u8 write)
 			{ return true; }
+static inline bool kvmi_single_step(struct kvm_vcpu *vcpu, gpa_t gpa,
+				    int *emulation_type)
+			{ return false; }
 static inline void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu) { }
 static inline void kvmi_handle_requests(struct kvm_vcpu *vcpu) { }
 static inline bool kvmi_hypercall_event(struct kvm_vcpu *vcpu) { return false; }
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 06dc23f40ded..14eadc3b9ca9 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -1018,6 +1018,62 @@ void kvmi_destroy_vm(struct kvm *kvm)
 	wait_for_completion_killable(&kvm->kvmi_completed);
 }
 
+static u8 kvmi_translate_pf_error_code(u64 error_code)
+{
+	u8 access = 0;
+
+	if (error_code & PFERR_USER_MASK)
+		access |= KVMI_PAGE_ACCESS_R;
+	if (error_code & PFERR_WRITE_MASK)
+		access |= KVMI_PAGE_ACCESS_W;
+	if (error_code & PFERR_FETCH_MASK)
+		access |= KVMI_PAGE_ACCESS_X;
+
+	return access;
+}
+
+static bool __kvmi_single_step(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       int *emulation_type)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvmi *ikvm = IKVM(kvm);
+	u8 allowed_access, pf_access;
+	u32 ignored_write_bitmap;
+	gfn_t gfn = gpa_to_gfn(gpa);
+	int err;
+
+	if (is_ud2_instruction(vcpu, emulation_type))
+		return false;
+
+	err = kvmi_get_gfn_access(ikvm, gfn, &allowed_access,
+				  &ignored_write_bitmap);
+	if (err) {
+		kvmi_warn(ikvm, "%s: gfn 0x%llx not found in the radix tree\n",
+			  __func__, gpa_to_gfn(gpa));
+		return false;
+	}
+
+	pf_access = kvmi_translate_pf_error_code(vcpu->arch.error_code);
+
+	return kvmi_start_ss(vcpu, gpa, pf_access);
+}
+
+bool kvmi_single_step(struct kvm_vcpu *vcpu, gpa_t gpa, int *emulation_type)
+{
+	struct kvmi *ikvm;
+	bool ret = false;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return false;
+
+	ret = __kvmi_single_step(vcpu, gpa, emulation_type);
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
 static int kvmi_vcpu_kill(int sig, struct kvm_vcpu *vcpu)
 {
 	int err = -ESRCH;
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index 5485529db06b..c96fa2b1e9b7 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -290,6 +290,7 @@ int kvmi_arch_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
 				   u64 address);
 int kvmi_arch_cmd_control_cr(struct kvm_vcpu *vcpu,
 			     const struct kvmi_control_cr *req);
+bool is_ud2_instruction(struct kvm_vcpu *vcpu, int *emulation_type);
 void kvmi_arch_start_single_step(struct kvm_vcpu *vcpu);
 void kvmi_arch_stop_single_step(struct kvm_vcpu *vcpu);
 u8 kvmi_arch_relax_page_access(u8 old, u8 new);

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 68/92] kvm: x86: emulate a guest page table walk on SPT violations due to A/D bit updates
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (66 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 67/92] kvm: introspection: use single stepping on unimplemented instructions Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 69/92] kvm: x86: keep the page protected if tracked by the introspection tool Adalbert Lazăr
                   ` (26 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Sean Christopherson

From: Mihai Donțu <mdontu@bitdefender.com>

On SPT page faults caused by guest page table walks, use the existing
guest page table walk code to make the necessary adjustments to the A/D
bits and return to guest. This effectively bypasses the x86 emulator
who was making the wrong modifications leading one OS (Windows 8.1 x64)
to triple-fault very early in the boot process with the introspection
enabled.

With introspection disabled, these faults are handled by simply removing
the protection from the affected guest page and returning to guest.

CC: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h  |  2 +-
 arch/x86/include/asm/kvmi_host.h |  6 ++++++
 arch/x86/kvm/kvmi.c              | 34 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/mmu.c               | 11 +++++++++--
 arch/x86/kvm/x86.c               |  6 +++---
 include/linux/kvmi.h             |  3 +++
 virt/kvm/kvmi.c                  | 31 +++++++++++++++++++++++++++--
 7 files changed, 84 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2392678dde46..79f3aa6928e5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1425,7 +1425,7 @@ gpa_t kvm_mmu_gva_to_gpa_fetch(struct kvm_vcpu *vcpu, gva_t gva,
 gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, gva_t gva,
 			       struct x86_exception *exception);
 gpa_t kvm_mmu_gva_to_gpa_system(struct kvm_vcpu *vcpu, gva_t gva,
-				struct x86_exception *exception);
+				u32 access, struct x86_exception *exception);
 
 void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/include/asm/kvmi_host.h b/arch/x86/include/asm/kvmi_host.h
index 3f066e7feee2..73369874f3a8 100644
--- a/arch/x86/include/asm/kvmi_host.h
+++ b/arch/x86/include/asm/kvmi_host.h
@@ -16,6 +16,7 @@ bool kvmi_monitored_msr(struct kvm_vcpu *vcpu, u32 msr);
 bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
 		   unsigned long old_value, unsigned long *new_value);
 void kvmi_xsetbv_event(struct kvm_vcpu *vcpu);
+bool kvmi_update_ad_flags(struct kvm_vcpu *vcpu);
 
 #else /* CONFIG_KVM_INTROSPECTION */
 
@@ -40,6 +41,11 @@ static inline void kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
 {
 }
 
+static inline bool kvmi_update_ad_flags(struct kvm_vcpu *vcpu)
+{
+	return false;
+}
+
 #endif /* CONFIG_KVM_INTROSPECTION */
 
 #endif /* _ASM_X86_KVMI_HOST_H */
diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 9d66c7d6c953..5312f179af9c 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -465,7 +465,7 @@ void kvmi_arch_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva, u8 insn_len)
 	u32 action;
 	u64 gpa;
 
-	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
+	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, 0, NULL);
 
 	action = kvmi_msg_send_bp(vcpu, gpa, insn_len);
 	switch (action) {
@@ -822,6 +822,38 @@ u8 kvmi_arch_relax_page_access(u8 old, u8 new)
 	return ret;
 }
 
+bool kvmi_update_ad_flags(struct kvm_vcpu *vcpu)
+{
+	struct x86_exception exception = { };
+	struct kvmi *ikvm;
+	bool ret = false;
+	gva_t gva;
+	gpa_t gpa;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return false;
+
+	gva = kvm_mmu_fault_gla(vcpu);
+
+	if (gva == ~0ull) {
+		kvmi_warn_once(ikvm, "%s: cannot perform translation\n",
+			       __func__);
+		goto out;
+	}
+
+	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, PFERR_WRITE_MASK, NULL);
+	if (gpa == UNMAPPED_GVA)
+		gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, 0, &exception);
+
+	ret = (gpa != UNMAPPED_GVA);
+
+out:
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
 static const struct {
 	unsigned int allow_bit;
 	enum kvm_page_track_mode track_mode;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c2f863797495..65b6acba82da 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -40,7 +40,9 @@
 #include <linux/uaccess.h>
 #include <linux/hash.h>
 #include <linux/kern_levels.h>
+#include <linux/kvmi.h>
 
+#include <asm/kvmi_host.h>
 #include <asm/page.h>
 #include <asm/pat.h>
 #include <asm/cmpxchg.h>
@@ -5960,8 +5962,13 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 	 */
 	if (vcpu->arch.mmu->direct_map &&
 	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
-		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
-		return 1;
+		if (kvmi_tracked_gfn(vcpu, gpa_to_gfn(cr2))) {
+			if (kvmi_update_ad_flags(vcpu))
+				return 1;
+		} else {
+			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
+			return 1;
+		}
 	}
 
 	/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dd10f9e0c054..2c06de73a784 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5175,9 +5175,9 @@ gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, gva_t gva,
 
 /* uses this to access any guest's mapped memory without checking CPL */
 gpa_t kvm_mmu_gva_to_gpa_system(struct kvm_vcpu *vcpu, gva_t gva,
-				struct x86_exception *exception)
+				u32 access, struct x86_exception *exception)
 {
-	return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, 0, exception);
+	return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
 }
 
 static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes,
@@ -8904,7 +8904,7 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
 	vcpu_load(vcpu);
 
 	idx = srcu_read_lock(&vcpu->kvm->srcu);
-	gpa = kvm_mmu_gva_to_gpa_system(vcpu, vaddr, NULL);
+	gpa = kvm_mmu_gva_to_gpa_system(vcpu, vaddr, 0, NULL);
 	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 	tr->physical_address = gpa;
 	tr->valid = gpa != UNMAPPED_GVA;
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
index 69db02795fc0..10cd6c6412d2 100644
--- a/include/linux/kvmi.h
+++ b/include/linux/kvmi.h
@@ -21,6 +21,7 @@ bool kvmi_hypercall_event(struct kvm_vcpu *vcpu);
 bool kvmi_queue_exception(struct kvm_vcpu *vcpu);
 void kvmi_trap_event(struct kvm_vcpu *vcpu);
 bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor, u8 write);
+bool kvmi_tracked_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 bool kvmi_single_step(struct kvm_vcpu *vcpu, gpa_t gpa, int *emulation_type);
 void kvmi_handle_requests(struct kvm_vcpu *vcpu);
 void kvmi_stop_ss(struct kvm_vcpu *vcpu);
@@ -36,6 +37,8 @@ static inline void kvmi_uninit(void) { }
 static inline void kvmi_create_vm(struct kvm *kvm) { }
 static inline void kvmi_destroy_vm(struct kvm *kvm) { }
 static inline int kvmi_vcpu_init(struct kvm_vcpu *vcpu) { return 0; }
+static inline bool kvmi_tracked_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+			{ return false; }
 static inline bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva,
 					 u8 insn_len)
 			{ return true; }
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 14eadc3b9ca9..ca146ffec061 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -193,6 +193,33 @@ static bool kvmi_restricted_access(struct kvmi *ikvm, gpa_t gpa, u8 access)
 	return false;
 }
 
+bool is_tracked_gfn(struct kvmi *ikvm, gfn_t gfn)
+{
+	struct kvmi_mem_access *m;
+
+	read_lock(&ikvm->access_tree_lock);
+	m = __kvmi_get_gfn_access(ikvm, gfn);
+	read_unlock(&ikvm->access_tree_lock);
+
+	return !!m;
+}
+
+bool kvmi_tracked_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	struct kvmi *ikvm;
+	bool ret;
+
+	ikvm = kvmi_get(vcpu->kvm);
+	if (!ikvm)
+		return false;
+
+	ret = is_tracked_gfn(ikvm, gfn);
+
+	kvmi_put(vcpu->kvm);
+
+	return ret;
+}
+
 static void kvmi_clear_mem_access(struct kvm *kvm)
 {
 	void **slot;
@@ -1681,7 +1708,7 @@ static int write_custom_data_to_page(struct kvm_vcpu *vcpu, gva_t gva,
 	struct page *page;
 	gpa_t gpa;
 
-	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
+	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, 0, NULL);
 	if (gpa == UNMAPPED_GVA)
 		return -KVM_EINVAL;
 
@@ -1738,7 +1765,7 @@ static int restore_backup_data_to_page(struct kvm_vcpu *vcpu, gva_t gva,
 	struct page *page;
 	gpa_t gpa;
 
-	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
+	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, 0, NULL);
 	if (gpa == UNMAPPED_GVA)
 		return -KVM_EINVAL;
 

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 69/92] kvm: x86: keep the page protected if tracked by the introspection tool
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (67 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 68/92] kvm: x86: emulate a guest page table walk on SPT violations due to A/D bit updates Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-09-10 14:26   ` Konrad Rzeszutek Wilk
  2019-08-09 16:00 ` [RFC PATCH v6 70/92] kvm: x86: filter out access rights only when " Adalbert Lazăr
                   ` (25 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

This patch might be obsolete thanks to single-stepping.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/x86.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c06de73a784..06f44ce8ed07 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6311,7 +6311,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
 		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
 		spin_unlock(&vcpu->kvm->mmu_lock);
 
-		if (indirect_shadow_pages)
+		if (indirect_shadow_pages
+		    && !kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa)))
 			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
 
 		return true;
@@ -6322,7 +6323,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
 	 * and it failed try to unshadow page and re-enter the
 	 * guest to let CPU execute the instruction.
 	 */
-	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
+	if (!kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa)))
+		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
 
 	/*
 	 * If the access faults on its page table, it can not
@@ -6374,6 +6376,9 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
 	if (!vcpu->arch.mmu->direct_map)
 		gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
 
+	if (kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa)))
+		return false;
+
 	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
 
 	return true;

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 70/92] kvm: x86: filter out access rights only when tracked by the introspection tool
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (68 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 69/92] kvm: x86: keep the page protected if tracked by the introspection tool Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-13  9:08   ` Paolo Bonzini
  2019-08-09 16:00 ` [RFC PATCH v6 71/92] mm: add support for remote mapping Adalbert Lazăr
                   ` (24 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

It should complete the commit fd34a9518173 ("kvm: x86: consult the page tracking from kvm_mmu_get_page() and __direct_map()")

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/mmu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 65b6acba82da..fd64cf1115da 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2660,6 +2660,9 @@ static void clear_sp_write_flooding_count(u64 *spte)
 static unsigned int kvm_mmu_page_track_acc(struct kvm_vcpu *vcpu, gfn_t gfn,
 					   unsigned int acc)
 {
+	if (!kvmi_tracked_gfn(vcpu, gfn))
+		return acc;
+
 	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREREAD))
 		acc &= ~ACC_USER_MASK;
 	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) ||

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 71/92] mm: add support for remote mapping
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (69 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 70/92] kvm: x86: filter out access rights only when " Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:24   ` DANGER WILL ROBINSON, DANGER Matthew Wilcox
  2019-08-09 16:00 ` [RFC PATCH v6 72/92] kvm: introspection: add memory map/unmap support on the guest side Adalbert Lazăr
                   ` (23 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Mircea Cîrjaliu

From: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>

The following two new mm exports are introduced:
 * mm_remote_map(struct mm_struct *req_mm,
                 unsigned long req_hva,
                 unsigned long map_hva)
 * mm_remote_unmap(unsigned long map_hva)
 * mm_remote_reset(void)
 * rmap_walk_remote(struct page *page,
                    struct rmap_walk_control *rwc)

This patch allows one process to map into its address space a page from
another process. The previous page (if it exists) is dropped. There is
no corresponding pair of system calls as this API is meant to be used
by the kernel itself only.

The targeted user is the upcoming KVM VM introspection subsystem (KVMI),
where an introspector running in its own VM will map pages from the
introspected guest in order to eliminate round trips to the host kernel
(read/write guest pages).

The flow is as follows: the introspector identifies a guest physical
address where some information of interest is located. It creates a one
page anonymous mapping with MAP_LOCKED | MAP_POPULATE and calls the
kernel via an IOCTL on /dev/kvmmem giving the map virtual address and
the guest physical address as arguments. The kernel converts the map
va into a physical page (gpa in KVM-speak) and passes it to the host
kernel via a hypercall, along with the introspected guest gpa. The host
kernel converts the two gpa-s into their appropriate hva-s (host virtual
addresses) and makes sure the vma backing up the page belonging to the VM
in which the introspector runs, points to the indicated page into the
introspected guest. I have not included here the use of the mapping
token described in the KVMI documentation.

Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/page-flags.h          |    9 +-
 include/linux/remote_mapping.h      |  167 +++
 include/uapi/linux/remote_mapping.h |   18 +
 mm/Kconfig                          |    8 +
 mm/Makefile                         |    1 +
 mm/memory-failure.c                 |   69 +-
 mm/migrate.c                        |    9 +-
 mm/remote_mapping.c                 | 1834 +++++++++++++++++++++++++++
 mm/rmap.c                           |   13 +-
 mm/vmscan.c                         |    3 +-
 10 files changed, 2108 insertions(+), 23 deletions(-)
 create mode 100644 include/linux/remote_mapping.h
 create mode 100644 include/uapi/linux/remote_mapping.h
 create mode 100644 mm/remote_mapping.c

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 39b4494e29f1..3f65b2833562 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
  */
 #define PAGE_MAPPING_ANON	0x1
 #define PAGE_MAPPING_MOVABLE	0x2
+#define PAGE_MAPPING_REMOTE	0x4
 #define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
-#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
+#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE | \
+				 PAGE_MAPPING_REMOTE)
 
 static __always_inline int PageMappingFlags(struct page *page)
 {
@@ -431,6 +433,11 @@ static __always_inline int PageAnon(struct page *page)
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
 }
 
+static __always_inline int PageRemote(struct page *page)
+{
+	return ((unsigned long)page->mapping & PAGE_MAPPING_REMOTE) != 0;
+}
+
 static __always_inline int __PageMovable(struct page *page)
 {
 	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
diff --git a/include/linux/remote_mapping.h b/include/linux/remote_mapping.h
new file mode 100644
index 000000000000..d30d0d10e51d
--- /dev/null
+++ b/include/linux/remote_mapping.h
@@ -0,0 +1,167 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _REMOTE_MAPPING_H
+#define _REMOTE_MAPPING_H
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/rbtree.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mmdebug.h>
+
+struct page_db {
+	struct mm_struct *target;		/* target for this mapping */
+	unsigned long req_hva;			/* HVA in target */
+	unsigned long map_hva;			/* HVA in client */
+
+	refcount_t refcnt;			/* client-side sharing */
+	int flags;
+
+	/* target links - serialized by target_db->lock */
+	struct list_head target_link;		/* target-side link */
+
+	/* client links - serialized by client_db->lock */
+	struct rb_node file_link;		/* uses map_hva as key */
+
+	/* rmap components - serialized by page lock */
+	struct anon_vma *req_anon_vma;
+	struct anon_vma *map_anon_vma;
+};
+
+struct target_db {
+	struct mm_struct *mm;		/* mm of this struct */
+	struct hlist_node db_link;	/* database link */
+
+	struct mmu_notifier mn;		/* for notifications from mm */
+	struct rcu_head	rcu;		/* for delayed freeing */
+	refcount_t refcnt;
+
+	spinlock_t lock;		/* lock for the following */
+	struct mm_struct *client;	/* client for this target */
+	struct list_head pages_list;	/* mapped HVAs for this target */
+};
+
+struct file_db;
+struct client_db {
+	struct mm_struct *mm;		/* mm of this struct */
+	struct hlist_node db_link;	/* database link */
+
+	struct mmu_notifier mn;		/* for notifications from mm */
+	struct rcu_head	rcu;		/* for delayed freeing */
+	refcount_t refcnt;
+
+	struct file_db *pseudo;		/* kernel interface */
+};
+
+struct file_db {
+	struct client_db *cdb;
+
+	spinlock_t lock;		/* lock for the following */
+	struct rb_root rb_root;		/* mappings indexed by map_hva */
+};
+
+static inline void *PageMapping(struct page_db *pdb)
+{
+	return (void *)pdb + (PAGE_MAPPING_ANON | PAGE_MAPPING_REMOTE);
+}
+
+static inline struct page_db *RemoteMapping(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageRemote(page), page);
+	return (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+}
+
+/*
+ * Template for keyed RB tree.
+ *
+ * RBCTYPE	type of container structure
+ * _rb_root	name of rb_root element
+ * RBNTYPE	type of node structure
+ * _rb_node	name of rb_node element
+ * _key		name of key element
+ */
+
+#define KEYED_RB_TREE(RBPREFIX, RBCTYPE, _rb_root, RBNTYPE, _rb_node, _key)\
+									\
+static bool RBPREFIX ## _insert(RBCTYPE * _container, RBNTYPE * _node)	\
+{									\
+	struct rb_root *root = &_container->_rb_root;			\
+	struct rb_node **new = &root->rb_node;				\
+	struct rb_node *parent = NULL;					\
+									\
+	/* Figure out where to put new node */				\
+	while (*new) {							\
+		RBNTYPE *this = rb_entry(*new, RBNTYPE, _rb_node);	\
+									\
+		parent = *new;						\
+		if (_node->_key < this->_key)				\
+			new = &((*new)->rb_left);			\
+		else if (_node->_key > this->_key)			\
+			new = &((*new)->rb_right);			\
+		else							\
+			return false;					\
+	}								\
+									\
+	/* Add new node and rebalance tree. */				\
+	rb_link_node(&_node->_rb_node, parent, new);			\
+	rb_insert_color(&_node->_rb_node, root);			\
+									\
+	return true;							\
+}									\
+									\
+static RBNTYPE *							\
+RBPREFIX ## _search(RBCTYPE * _container, unsigned long _key)		\
+{									\
+	struct rb_root *root = &_container->_rb_root;			\
+	struct rb_node *node = root->rb_node;				\
+									\
+	while (node) {							\
+		RBNTYPE *_node = rb_entry(node, RBNTYPE, _rb_node);	\
+									\
+		if (_key < _node->_key)					\
+			node = node->rb_left;				\
+		else if (_key > _node->_key)				\
+			node = node->rb_right;				\
+		else							\
+			return _node;					\
+	}								\
+									\
+	return NULL;							\
+}									\
+									\
+static void RBPREFIX ## _remove(RBCTYPE *_container, RBNTYPE *_node)	\
+{									\
+	rb_erase(&_node->_rb_node, &_container->_rb_root);		\
+	RB_CLEAR_NODE(&_node->_rb_node);				\
+}									\
+									\
+static bool RBPREFIX ## _empty(const RBCTYPE *_container)		\
+{									\
+	return RB_EMPTY_ROOT(&_container->_rb_root);			\
+}									\
+
+#ifdef CONFIG_REMOTE_MAPPING
+extern int mm_remote_map(struct mm_struct *req_mm,
+			 unsigned long req_hva, unsigned long map_hva);
+extern int mm_remote_unmap(unsigned long map_hva);
+extern void mm_remote_reset(void);
+extern void rmap_walk_remote(struct page *page, struct rmap_walk_control *rwc);
+#else /* CONFIG_REMOTE_MAPPING */
+static inline int mm_remote_map(struct mm_struct *req_mm,
+				unsigned long req_hva, unsigned long map_hva)
+{
+	return -EINVAL;
+}
+static inline int mm_remote_unmap(unsigned long map_hva)
+{
+	return -EINVAL;
+}
+static inline void mm_remote_reset(void)
+{
+}
+static inline void rmap_walk_remote(struct page *page,
+				    struct rmap_walk_control *rwc)
+{
+}
+#endif /* CONFIG_REMOTE_MAPPING */
+
+#endif /* _REMOTE_MAPPING_H */
diff --git a/include/uapi/linux/remote_mapping.h b/include/uapi/linux/remote_mapping.h
new file mode 100644
index 000000000000..d8b544dd5add
--- /dev/null
+++ b/include/uapi/linux/remote_mapping.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+#ifndef __UAPI_REMOTE_MAPPING_H__
+#define __UAPI_REMOTE_MAPPING_H__
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+struct remote_map_request {
+	__u32 req_pid;
+	__u64 req_hva;
+	__u64 map_hva;
+};
+
+#define REMOTE_MAP       _IOW('r', 0x01, struct remote_map_request)
+#define REMOTE_UNMAP     _IOW('r', 0x02, unsigned long)
+
+#endif /* __UAPI_REMOTE_MAPPING_H__ */
diff --git a/mm/Kconfig b/mm/Kconfig
index 25c71eb8a7db..8451dafd3c91 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -758,4 +758,12 @@ config GUP_BENCHMARK
 config ARCH_HAS_PTE_SPECIAL
 	bool
 
+config REMOTE_MAPPING
+	bool "Remote memory mapping"
+	depends on MMU && !KSM
+	default n
+	help
+	  Allows a given application to map pages of another application in its own
+	  address space.
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..e69a3b15627a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,3 +99,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_REMOTE_MAPPING) += remote_mapping.o
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 831be5ff5f4d..40066271c411 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -9,16 +9,16 @@
  * High level machine check handler. Handles pages reported by the
  * hardware as being corrupted usually due to a multi-bit ECC memory or cache
  * failure.
- * 
+ *
  * In addition there is a "soft offline" entry point that allows stop using
  * not-yet-corrupted-by-suspicious pages without killing anything.
  *
  * Handles page cache pages in various states.	The tricky part
- * here is that we can access any page asynchronously in respect to 
- * other VM users, because memory failures could happen anytime and 
- * anywhere. This could violate some of their assumptions. This is why 
- * this code has to be extremely careful. Generally it tries to use 
- * normal locking rules, as in get the standard locks, even if that means 
+ * here is that we can access any page asynchronously in respect to
+ * other VM users, because memory failures could happen anytime and
+ * anywhere. This could violate some of their assumptions. This is why
+ * this code has to be extremely careful. Generally it tries to use
+ * normal locking rules, as in get the standard locks, even if that means
  * the error handling takes potentially a long time.
  *
  * It can be very tempting to add handling for obscure cases here.
@@ -28,12 +28,12 @@
  *   https://git.kernel.org/cgit/utils/cpu/mce/mce-test.git/
  * - The case actually shows up as a frequent (top 10) page state in
  *   tools/vm/page-types when running a real workload.
- * 
+ *
  * There are several operations here with exponential complexity because
- * of unsuitable VM data structures. For example the operation to map back 
- * from RMAP chains to processes has to walk the complete process list and 
+ * of unsuitable VM data structures. For example the operation to map back
+ * from RMAP chains to processes has to walk the complete process list and
  * has non linear complexity with the number. But since memory corruptions
- * are rare we hope to get away with this. This avoids impacting the core 
+ * are rare we hope to get away with this. This avoids impacting the core
  * VM.
  */
 #include <linux/kernel.h>
@@ -59,6 +59,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
+#include <linux/remote_mapping.h>
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -467,6 +468,45 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 	page_unlock_anon_vma_read(av);
 }
 
+/*
+ * Collect processes when the error hit a remote mapped page.
+ */
+static void collect_procs_remote(struct page *page, struct list_head *to_kill,
+				struct to_kill **tkc, int force_early)
+{
+	struct page_db *pdb;
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct anon_vma *av;
+	pgoff_t pgoff;
+
+	pdb = RemoteMapping(page);
+	av = pdb->req_anon_vma;
+	if (av == NULL)			/* Target has left */
+		return;
+
+	pgoff = page_to_pgoff(page);	/* Offset in target */
+	anon_vma_lock_read(av);
+	read_lock(&tasklist_lock);
+	for_each_process(tsk) {
+		struct anon_vma_chain *vmac;
+		struct task_struct *t = task_early_kill(tsk, force_early);
+
+		if (!t)
+			continue;
+		anon_vma_interval_tree_foreach(vmac, &av->rb_root,
+			pgoff, pgoff) {
+			vma = vmac->vma;
+			if (!page_mapped_in_vma(page, vma))
+				continue;
+			if (vma->vm_mm == t->mm)
+				add_to_kill(t, page, vma, to_kill, tkc);
+		}
+	}
+	read_unlock(&tasklist_lock);
+	anon_vma_unlock_read(av);
+}
+
 /*
  * Collect processes when the error hit a file mapped page.
  */
@@ -519,9 +559,12 @@ static void collect_procs(struct page *page, struct list_head *tokill,
 	tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);
 	if (!tk)
 		return;
-	if (PageAnon(page))
-		collect_procs_anon(page, tokill, &tk, force_early);
-	else
+	if (PageAnon(page)) {
+		if (PageRemote(page))
+			collect_procs_remote(page, tokill, &tk, force_early);
+		else
+			collect_procs_anon(page, tokill, &tk, force_early);
+	} else
 		collect_procs_file(page, tokill, &tk, force_early);
 	kfree(tk);
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index d4fd680be3b0..4d18a8115ffc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -47,6 +47,7 @@
 #include <linux/page_owner.h>
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
+#include <linux/remote_mapping.h>
 
 #include <asm/tlbflush.h>
 
@@ -215,7 +216,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 
 	VM_BUG_ON_PAGE(PageTail(page), page);
 	while (page_vma_mapped_walk(&pvmw)) {
-		if (PageKsm(page))
+		if (PageKsm(page) || PageRemote(page))
 			new = page;
 		else
 			new = page - pvmw.page->index +
@@ -1065,7 +1066,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	 * because that implies that the anon page is no longer mapped
 	 * (and cannot be remapped so long as we hold the page lock).
 	 */
-	if (PageAnon(page) && !PageKsm(page))
+	if (PageAnon(page) && !PageKsm(page) && !PageRemote(page))
 		anon_vma = page_get_anon_vma(page);
 
 	/*
@@ -1104,8 +1105,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		}
 	} else if (page_mapped(page)) {
 		/* Establish migration ptes */
-		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
-				page);
+		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) &&
+				!PageRemote(page) && !anon_vma, page);
 		try_to_unmap(page,
 			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 		page_was_mapped = 1;
diff --git a/mm/remote_mapping.c b/mm/remote_mapping.c
new file mode 100644
index 000000000000..14b0db89c425
--- /dev/null
+++ b/mm/remote_mapping.c
@@ -0,0 +1,1834 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Remote memory mapping.
+ *
+ * Copyright (C) 2017-2019 Bitdefender S.R.L.
+ *
+ * Author:
+ *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
+ */
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/rbtree.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/spinlock.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/printk.h>
+#include <linux/mm.h>
+#include <linux/pid.h>
+#include <linux/oom.h>
+#include <linux/huge_mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/sched/mm.h>
+#include <linux/interval_tree_generic.h>
+#include <linux/hashtable.h>
+#include <linux/refcount.h>
+#include <linux/debugfs.h>
+#include <linux/miscdevice.h>
+#include <linux/remote_mapping.h>
+#include <uapi/linux/remote_mapping.h>
+
+#include "internal.h"
+
+#define ASSERT(exp) BUG_ON(!(exp))
+#define BUSY_BIT 0
+#define MAPPED_BIT 1
+
+#define TDB_HASH_BITS 4
+#define CDB_HASH_BITS 2
+
+static int mm_remote_do_unmap(struct mm_struct *map_mm, unsigned long map_hva);
+static int mm_remote_do_unmap_target(struct page_db *pdb);
+static int mm_remote_make_stale(struct page_db *pdb);
+
+static void mm_remote_db_target_release(struct target_db *tdb);
+static void mm_remote_db_client_release(struct client_db *cdb);
+
+static void tdb_release(struct mmu_notifier *mn, struct mm_struct *mm);
+static void cdb_release(struct mmu_notifier *mn, struct mm_struct *mm);
+
+static const struct mmu_notifier_ops tdb_notifier_ops = {
+	.release = tdb_release,
+};
+
+static const struct mmu_notifier_ops cdb_notifier_ops = {
+	.release = cdb_release,
+};
+
+static DEFINE_HASHTABLE(tdb_hash, TDB_HASH_BITS);
+static DEFINE_SPINLOCK(tdb_lock);
+
+static DEFINE_HASHTABLE(cdb_hash, CDB_HASH_BITS);
+static DEFINE_SPINLOCK(cdb_lock);
+
+static struct kmem_cache *pdb_cache;
+static atomic_t pdb_count = ATOMIC_INIT(0);
+static atomic_t map_count = ATOMIC_INIT(0);
+static atomic_t rpg_count = ATOMIC_INIT(0);
+
+static atomic_t stat_empty_pte = ATOMIC_INIT(0);
+static atomic_t stat_mapped_pte = ATOMIC_INIT(0);
+static atomic_t stat_swap_pte = ATOMIC_INIT(0);
+static atomic_t stat_refault = ATOMIC_INIT(0);
+
+static struct dentry *mm_remote_debugfs_dir;
+
+static void target_db_init(struct target_db *tdb)
+{
+	tdb->mn.ops = &tdb_notifier_ops;
+	refcount_set(&tdb->refcnt, 0);
+
+	tdb->client = NULL;
+	INIT_LIST_HEAD(&tdb->pages_list);
+	spin_lock_init(&tdb->lock);
+}
+
+static struct target_db *target_db_alloc(void)
+{
+	struct target_db *tdb;
+
+	tdb = kzalloc(sizeof(*tdb), GFP_KERNEL);
+	if (tdb != NULL)
+		target_db_init(tdb);
+
+	return tdb;
+}
+
+static void target_db_free(struct target_db *tdb)
+{
+	ASSERT(refcount_read(&tdb->refcnt) == 0);
+	ASSERT(list_empty(&tdb->pages_list));
+
+	kfree(tdb);
+}
+
+static void target_db_insert(struct target_db *tdb, struct page_db *pdb)
+{
+	list_add(&pdb->target_link, &tdb->pages_list);
+}
+
+static bool target_db_empty(const struct target_db *tdb)
+{
+	return list_empty(&tdb->pages_list);
+}
+
+static void target_db_remove(struct target_db *tdb, struct page_db *pdb)
+{
+	list_del(&pdb->target_link);
+}
+
+static void target_db_free_delayed(struct rcu_head *rcu)
+{
+	struct target_db *tdb = container_of(rcu, struct target_db, rcu);
+
+	pr_debug("%s: for mm %016lx\n", __func__, (unsigned long)tdb->mm);
+
+	target_db_free(tdb);
+}
+
+static void target_db_put(struct target_db *tdb)
+{
+	if (refcount_dec_and_test(&tdb->refcnt)) {
+		pr_debug("%s: mm %016lx\n", __func__, (unsigned long)tdb->mm);
+
+		spin_lock(&tdb_lock);
+		hash_del(&tdb->db_link);
+		spin_unlock(&tdb_lock);
+
+		mm_remote_db_target_release(tdb);
+
+		ASSERT(target_db_empty(tdb));
+
+		mmu_notifier_call_srcu(&tdb->rcu, target_db_free_delayed);
+	}
+}
+
+static struct target_db *target_db_lookup(const struct mm_struct *mm)
+{
+	struct target_db *tdb;
+
+	spin_lock(&tdb_lock);
+
+	hash_for_each_possible(tdb_hash, tdb, db_link, (unsigned long)mm)
+		if (tdb->mm == mm && refcount_inc_not_zero(&tdb->refcnt))
+			break;
+
+	spin_unlock(&tdb_lock);
+
+	return tdb;
+}
+
+static struct target_db *target_db_lookup_or_add(struct mm_struct *mm)
+{
+	struct target_db *tdb, *allocated;
+	bool found = false;
+	int result;
+
+	allocated = target_db_alloc();	/* may be NULL */
+
+	spin_lock(&tdb_lock);
+
+	hash_for_each_possible(tdb_hash, tdb, db_link, (unsigned long)mm)
+		if (tdb->mm == mm && refcount_inc_not_zero(&tdb->refcnt)) {
+			found = true;
+			break;
+		}
+
+	if (!found && allocated != NULL) {
+		tdb = allocated;
+		allocated = NULL;
+
+		tdb->mm = mm;
+		hash_add(tdb_hash, &tdb->db_link, (unsigned long)mm);
+		refcount_set(&tdb->refcnt, 1);
+	}
+
+	spin_unlock(&tdb_lock);
+
+	if (allocated != NULL)
+		target_db_free(allocated);
+
+	if (found || tdb == NULL)
+		return tdb;
+
+	/*
+	 * register a mmu notifier when adding this entry to the list - at this
+	 * point other threads may already have hold of this tdb
+	 */
+	result = mmu_notifier_register(&tdb->mn, mm);
+	if (IS_ERR_VALUE((long) result)) {
+		pr_err("mmu_notifier_register() failed: %d\n", result);
+
+		target_db_put(tdb);
+		return ERR_PTR((long) result);
+	}
+
+	pr_debug("%s: new entry for mm %016lx\n",
+		__func__, (unsigned long)tdb->mm);
+
+	refcount_inc(&tdb->refcnt);
+	return tdb;
+}
+
+static void client_db_init(struct client_db *cdb)
+{
+	cdb->mm = NULL;
+	INIT_HLIST_NODE(&cdb->db_link);
+
+	cdb->mn.ops = &cdb_notifier_ops;
+	refcount_set(&cdb->refcnt, 0);
+
+	cdb->pseudo = NULL;
+}
+
+static struct client_db *client_db_alloc(void)
+{
+	struct client_db *cdb;
+
+	cdb = kzalloc(sizeof(*cdb), GFP_KERNEL);
+	if (cdb != NULL)
+		client_db_init(cdb);
+
+	return cdb;
+}
+
+static void client_db_free(struct client_db *cdb)
+{
+	ASSERT(refcount_read(&cdb->refcnt) == 0);
+
+	kfree(cdb);
+}
+
+static void client_db_free_delayed(struct rcu_head *rcu)
+{
+	struct client_db *cdb = container_of(rcu, struct client_db, rcu);
+
+	pr_debug("%s: mm %016lx\n", __func__, (unsigned long)cdb->mm);
+
+	client_db_free(cdb);
+}
+
+static void client_db_put(struct client_db *cdb)
+{
+	if (refcount_dec_and_test(&cdb->refcnt)) {
+		pr_debug("%s: mm %016lx\n", __func__, (unsigned long)cdb->mm);
+
+		spin_lock(&cdb_lock);
+		hash_del(&cdb->db_link);
+		spin_unlock(&cdb_lock);
+
+		mm_remote_db_client_release(cdb);
+
+		mmu_notifier_call_srcu(&cdb->rcu, client_db_free_delayed);
+	}
+}
+
+static struct client_db *client_db_lookup(const struct mm_struct *mm)
+{
+	struct client_db *cdb;
+
+	spin_lock(&cdb_lock);
+
+	hash_for_each_possible(cdb_hash, cdb, db_link, (unsigned long)mm)
+		if (cdb->mm == mm && refcount_inc_not_zero(&cdb->refcnt))
+			break;
+
+	spin_unlock(&cdb_lock);
+
+	return cdb;
+}
+
+// TODO: each mapping request by direct kernel interface calls this function
+// to find its mm association. Temporary allocating a struct client_db for each
+// mapping attempt may pose a performance problem.
+static struct client_db *client_db_lookup_or_add(struct mm_struct *mm)
+{
+	struct client_db *cdb, *allocated;
+	bool found = false;
+	int result;
+
+	allocated = client_db_alloc();	/* may be NULL */
+
+	spin_lock(&cdb_lock);
+
+	hash_for_each_possible(cdb_hash, cdb, db_link, (unsigned long)mm)
+		if (cdb->mm == mm && refcount_inc_not_zero(&cdb->refcnt)) {
+			found = true;
+			break;
+		}
+
+	if (!found && allocated != NULL) {
+		cdb = allocated;
+		allocated = NULL;
+
+		cdb->mm = mm;
+		hash_add(cdb_hash, &cdb->db_link, (unsigned long)mm);
+		refcount_set(&cdb->refcnt, 1);
+	}
+
+	spin_unlock(&cdb_lock);
+
+	if (allocated != NULL)
+		client_db_free(allocated);
+
+	if (found || cdb == NULL)
+		return cdb;
+
+	/*
+	 * register a mmu notifier when adding this entry to the list - at this
+	 * point other threads may already have hold of this cdb
+	 */
+	result = mmu_notifier_register(&cdb->mn, mm);
+	if (IS_ERR_VALUE((long)result)) {
+		pr_err("mmu_notifier_register() failed: %d\n", result);
+
+		client_db_put(cdb);
+		return ERR_PTR((long)result);
+	}
+
+	pr_debug("%s: new entry for mm %016lx\n",
+		__func__, (unsigned long)cdb->mm);
+
+	refcount_inc(&cdb->refcnt);
+	return cdb;
+}
+
+KEYED_RB_TREE(client_db_hva, struct file_db, rb_root,
+	struct page_db, file_link, map_hva)
+
+static void file_db_init(struct file_db *fdb)
+{
+	fdb->cdb = NULL;
+
+	spin_lock_init(&fdb->lock);
+	fdb->rb_root = RB_ROOT;
+}
+
+static struct file_db *file_db_alloc(void)
+{
+	struct file_db *fdb;
+
+	fdb = kmalloc(sizeof(*fdb), GFP_KERNEL);
+	if (fdb != NULL)
+		file_db_init(fdb);
+
+	return fdb;
+}
+
+static void file_db_free(struct file_db *fdb)
+{
+	ASSERT(client_db_hva_empty(fdb));
+
+	kfree(fdb);
+}
+
+static struct file_db *client_db_pseudo_file(struct client_db *cdb)
+{
+	struct file_db *allocated;
+
+	if (cdb->pseudo == NULL) {
+		allocated = file_db_alloc();
+		if (cmpxchg(&cdb->pseudo, NULL, allocated))
+			file_db_free(allocated);
+	}
+
+	return cdb->pseudo;
+}
+
+static struct page_db *page_db_alloc(void)
+{
+	struct page_db *result;
+
+	result = kmem_cache_alloc(pdb_cache, GFP_KERNEL);
+	if (result == NULL)
+		return NULL;
+
+	memset(result, 0, sizeof(*result));
+
+	atomic_inc(&pdb_count);
+
+	return result;
+}
+
+static void page_db_free(struct page_db *pdb)
+{
+	kmem_cache_free(pdb_cache, pdb);
+
+	BUG_ON(atomic_add_negative(-1, &pdb_count));
+}
+
+static void page_db_put(struct page_db *pdb)
+{
+	if (refcount_dec_and_test(&pdb->refcnt)) {
+
+		/* this case is possible if both target and client are
+		 * OOM-killed in quick succession and the release functions
+		 * can't get to the remote mapped page
+		 */
+		if (pdb->map_anon_vma)
+			put_anon_vma(pdb->map_anon_vma);
+		if (pdb->req_anon_vma)
+			put_anon_vma(pdb->req_anon_vma);
+
+		page_db_free(pdb);
+	}
+}
+
+static void page_db_release(struct page_db *pdb)
+{
+	clear_bit(BUSY_BIT, (unsigned long *)&pdb->flags);
+	/* see comments of wake_up_bit(), set_bit() is atomic */
+	smp_mb__after_atomic();
+	wake_up_bit(&pdb->flags, BUSY_BIT);
+}
+
+/* Reserve a mapping entry indexed by map_hva in the file database. */
+static struct page_db *
+page_db_reserve(struct file_db *fdb, struct mm_struct *req_mm,
+	unsigned long req_hva, unsigned long map_hva)
+{
+	struct page_db *pdb;
+
+	pdb = page_db_alloc();
+	if (unlikely(pdb == NULL))
+		return ERR_PTR(-ENOMEM);
+
+	/* fill pdb */
+	pdb->target = req_mm;
+	pdb->req_hva = req_hva;
+	pdb->map_hva = map_hva;
+	refcount_set(&pdb->refcnt, 1);
+	__set_bit(BUSY_BIT, (unsigned long *)&pdb->flags);
+
+	/* insert mapping entry into the client if not already there */
+	spin_lock(&fdb->lock);
+
+	if (likely(client_db_hva_insert(fdb, pdb)))
+		refcount_inc(&pdb->refcnt);
+	else {
+		page_db_free(pdb);
+		pdb = ERR_PTR(-EALREADY);
+	}
+
+	spin_unlock(&fdb->lock);
+
+	return pdb;
+}
+
+/* Reverse of page_db_reserve(), to be called in case of error. */
+static void
+page_db_unreserve(struct file_db *fdb, struct page_db *pdb)
+{
+	spin_lock(&fdb->lock);
+
+	client_db_hva_remove(fdb, pdb);
+	page_db_put(pdb);
+
+	spin_unlock(&fdb->lock);
+
+	page_db_release(pdb);
+	page_db_put(pdb);
+}
+
+/* Marks as mapped & drops reference. */
+static void
+page_db_got_mapped(struct page_db *pdb)
+{
+	__set_bit(MAPPED_BIT, (unsigned long *)&pdb->flags);
+
+	page_db_release(pdb);
+	page_db_put(pdb);
+}
+
+/* Gets exclusive access for unmapping. */
+static struct page_db *
+page_db_begin_unmap(struct file_db *fdb, unsigned long map_hva)
+{
+	struct page_db *pdb;
+	int result;
+
+	spin_lock(&fdb->lock);
+
+	pdb = client_db_hva_search(fdb, map_hva);
+	if (likely(pdb != NULL))
+		refcount_inc(&pdb->refcnt);
+
+	spin_unlock(&fdb->lock);
+
+	if (pdb == NULL)
+		return NULL;
+
+retry:
+	result = wait_on_bit((unsigned long *)&pdb->flags, BUSY_BIT,
+			     TASK_KILLABLE);
+	/* non-zero if interrupted by a signal */
+	if (unlikely(result != 0))
+		return ERR_PTR(-EINTR);
+
+	/* try set bit & spin if failed */
+	if (test_and_set_bit(BUSY_BIT, (unsigned long *)&pdb->flags))
+		goto retry;
+
+	return pdb;
+}
+
+/* Marks as unmapped, removes from tree & drops reference. */
+static void
+page_db_end_unmap(struct file_db *fdb, struct page_db *pdb)
+{
+	__clear_bit(MAPPED_BIT, (unsigned long *)&pdb->flags);
+
+	spin_lock(&fdb->lock);
+
+	client_db_hva_remove(fdb, pdb);
+	page_db_put(pdb);
+
+	spin_unlock(&fdb->lock);
+
+	page_db_release(pdb);
+	page_db_put(pdb);
+}
+
+static int
+page_db_add_target(struct page_db *pdb, struct mm_struct *target,
+		   struct mm_struct *client)
+{
+	struct target_db *tdb;
+	int result = 0;
+
+	/*
+	 * returns a valid pointer or an error value, never NULL
+	 * also gets reference to entry
+	 */
+	tdb = target_db_lookup_or_add(target);
+	if (IS_ERR_VALUE(tdb))
+		return PTR_ERR(tdb);
+
+	/* target-side locking */
+	spin_lock(&tdb->lock);
+
+	/* check that target is not introspected by someone else */
+	if (tdb->client != NULL && tdb->client != client)
+		result = -EINVAL;
+	else {
+		tdb->client = client;
+		target_db_insert(tdb, pdb);
+	}
+
+	spin_unlock(&tdb->lock);
+
+	target_db_put(tdb);
+
+	return result;
+}
+
+static int
+page_db_remove_target(struct page_db *pdb)
+{
+	struct target_db *tdb;
+	int result = 0;
+
+	/* find target entry in the database */
+	tdb = target_db_lookup(pdb->target);
+	if (tdb == NULL)
+		return -ENOENT;
+
+	/* target-side locking */
+	spin_lock(&tdb->lock);
+
+	/* remove mapping from target */
+	target_db_remove(tdb, pdb);
+
+	/* clear the client if no more mappings */
+	if (target_db_empty(tdb)) {
+		tdb->client = NULL;
+		pr_debug("%s: all mappings gone for target mm %016lx\n",
+			__func__, (unsigned long)pdb->target);
+	}
+
+	spin_unlock(&tdb->lock);
+
+	target_db_put(tdb);
+
+	return result;
+}
+
+/* Last resort call if memory of client got unmapped before ioctl(REM_UNMAP) */
+static bool not_mapped_in_client(struct page_db *pdb)
+{
+	int numpages;
+	struct page *req_page;
+	bool result = false;
+
+	numpages = __get_user_pages_fast(pdb->map_hva, 1, 0, &req_page);
+	if (numpages == 0)
+		return true;
+
+	/* page was munmapped & replaced by a normal page */
+	if (!PageRemote(req_page))
+		result = true;
+
+	put_page(req_page);
+	return result;
+}
+
+/*
+ * Clear all the links to a target at once.
+ */
+static void mm_remote_db_cleanup_target(struct client_db *cdb,
+					struct target_db *tdb)
+{
+	struct page_db *pdb, *npdb;
+
+	/* if we ended up here the target must be introspected */
+	ASSERT(tdb->client != NULL);
+	tdb->client = NULL;
+
+	/*
+	 * walk the tree & clear links to target - this function is serialized
+	 * with respect to the main loop in mm_remote_db_client_release() so
+	 * there will be no race on pdb->target
+	 */
+	list_for_each_entry_safe(pdb, npdb, &tdb->pages_list, target_link) {
+		if (mm_is_oom_victim(cdb->mm) || not_mapped_in_client(pdb))
+			mm_remote_do_unmap_target(pdb);
+
+		list_del(&pdb->target_link);
+		pdb->target = NULL;
+	}
+}
+
+/*
+ * A client file is closing. No race with operations of file is possible.
+ */
+static void mm_remote_db_file_release(struct file_db *fdb)
+{
+	struct client_db *cdb = fdb->cdb;
+	struct page_db *pdb, *npdb;
+	struct target_db *tdb;
+
+	if (!client_db_hva_empty(fdb))
+		pr_debug("%s: client file %016lx has mappings\n",
+			__func__, (unsigned long)fdb);
+
+	/* iterate the tree of mappings */
+	rbtree_postorder_for_each_entry_safe(pdb, npdb, &fdb->rb_root, file_link) {
+		/* pdb->target is cleared in the func above, store in var */
+		struct mm_struct *req_mm = pdb->target;
+
+		/* see comments in function above */
+		if (req_mm == NULL)
+			goto just_free;
+
+		/* pin target to avoid race with mm_remote_db_target_release() */
+		if (mmget_not_zero(req_mm)) {
+
+			/* pin entry for target - maybe it has been released */
+			tdb = target_db_lookup(req_mm);
+			if (tdb != NULL) {
+				/* see comments of this function */
+				mm_remote_db_cleanup_target(cdb, tdb);
+
+				/* unpin entry for target */
+				target_db_put(tdb);
+			}
+
+			mmput(req_mm);
+		}
+
+	just_free:
+		/* invalidate links to client */
+		RB_CLEAR_NODE(&pdb->file_link);
+
+		if (!mm_is_oom_victim(cdb->mm))
+			mm_remote_do_unmap(cdb->mm, pdb->map_hva);
+
+		page_db_put(pdb);
+	}
+
+	/* clear root of tree */
+	fdb->rb_root = RB_ROOT;
+}
+
+/*
+ * The client is closing. This means the normal mapping/unmapping logic
+ * does not work anymore. No more locking needed.
+ */
+static void mm_remote_db_client_release(struct client_db *cdb)
+{
+	struct file_db *fdb = cdb->pseudo;
+
+	if (fdb == NULL)
+		return;
+
+	pr_debug("%s: client %016lx has special file\n",
+		__func__, (unsigned long) cdb);
+
+	mm_remote_db_file_release(fdb);
+	file_db_free(fdb);
+}
+
+/*
+ * Called when a target exits and the page must be marked as stale and the
+ * target-side anon-vma released.
+ * This function will not race with mm_remote_remap(), since a reference to the
+ * target MM is taken before mapping being done.
+ * This function may race with mm_remote_do_unmap(), so a check must be
+ * done under page lock to make sure the page is still remote mapped.
+ * After this is run, the pages are still remote mapped pages, but the rmap
+ * only points to the client.
+ */
+static int mm_remote_make_stale(struct page_db *pdb)
+{
+	struct mm_struct *req_mm = pdb->target;
+	struct vm_area_struct *req_vma;
+	struct page *req_page;
+	int result = 0;
+
+	/* this allows faulting to happen */
+	down_read(&req_mm->mmap_sem);
+
+	/* find VMA containing address */
+	req_vma = find_vma(req_mm, pdb->req_hva);
+	if (unlikely(req_vma == NULL)) {
+		result = -ENOENT;
+		pr_err("no remote VMA found for stalling\n");
+		goto out_unlock;
+	}
+
+	/* should be available & unevictable */
+	req_page = follow_page(req_vma, pdb->req_hva, FOLL_MIGRATION | FOLL_GET);
+	if (IS_ERR_VALUE(req_page)) {
+		result = PTR_ERR(req_page);
+		pr_err("follow_page() failed: %d\n", result);
+		goto out_unlock;
+	} else if (unlikely(req_page == NULL)) {
+		result = -ENOENT;
+		pr_err("follow_page() returned no page\n");
+		goto out_unlock;
+	}
+
+	/* access to RMAP components of PDB can only be done under page lock */
+	lock_page(req_page);
+
+	if (likely(PageRemote(req_page))) {
+		ASSERT(pdb->req_anon_vma == req_vma->anon_vma);
+		/* just release target anon_vma - the page will be temporarily
+		 * left with increased mapcount & refcount, which will be
+		 * decremented when the page is unmapped from the target mm
+		 */
+		put_anon_vma(pdb->req_anon_vma);
+		pdb->req_anon_vma = NULL;
+	}
+
+	unlock_page(req_page);
+
+	put_page(req_page);	/* follow_page(... FOLL_GET) */
+
+out_unlock:
+	up_read(&req_mm->mmap_sem);
+
+	return result;
+}
+
+static int mm_remote_make_stale_client(struct mm_struct *map_mm,
+				       struct page_db *pdb)
+{
+	struct vm_area_struct *map_vma;
+	struct page *req_page;
+
+	int result = 0;
+
+	down_read(&map_mm->mmap_sem);
+
+	map_vma = find_vma(map_mm, pdb->map_hva);
+	if (unlikely(map_vma == NULL)) {
+		result = -ENOENT;
+		pr_err("no client VMA found for stalling\n");
+		goto out_unlock;
+	}
+
+	/* should be available & unevictable */
+	req_page = follow_page(map_vma, pdb->map_hva, FOLL_MIGRATION | FOLL_GET);
+	if (IS_ERR_VALUE(req_page)) {
+		result = PTR_ERR(req_page);
+		pr_err("follow_page() failed: %d\n", result);
+		goto out_unlock;
+	} else if (unlikely(req_page == NULL)) {
+		result = -ENOENT;
+		pr_err("follow_page() returned no page\n");
+		goto out_unlock;
+	}
+
+	/* access to RMAP components of PDB can only be done under page lock */
+	lock_page(req_page);
+
+	if (likely(PageRemote(req_page))) {
+		/* just release target anon_vma - the page will be temporarily
+		 * left with increased mapcount & refcount, which will be
+		 * decremented when the page is unmapped from the target mm
+		 */
+		put_anon_vma(pdb->req_anon_vma);
+		pdb->req_anon_vma = NULL;
+	}
+
+	unlock_page(req_page);
+
+	put_page(req_page);	/* follow_page(... FOLL_GET) */
+
+out_unlock:
+	up_read(&map_mm->mmap_sem);
+
+	return result;
+}
+
+/*
+ * The target MM is closing. This means the pages are unmapped by the default
+ * kernel logic on the target side, but we must mark the entries as stale.
+ * This function won't race with the mapping function since we get here
+ * on target MM teardown and the mapping function won't be able to get a
+ * reference to the target MM.
+ * This function may race with the unmapping function, but
+ * access will be done only on the target-side components.
+ */
+static void mm_remote_db_target_release(struct target_db *tdb)
+{
+	struct mm_struct *map_mm;
+	struct page_db *pdb, *npdb;
+
+	/* no client, nothing to do */
+	if (tdb->client == NULL) {
+		ASSERT(target_db_empty(tdb));
+		return;
+	}
+
+	map_mm = tdb->client;
+	tdb->client = NULL;
+
+	/* if the target is killed by OOM, try to pin the client */
+	if (mm_is_oom_victim(tdb->mm) && !mmget_not_zero(map_mm)) {
+		/* out of luck, just unlink from the list */
+		list_for_each_entry_safe(pdb, npdb, &tdb->pages_list, target_link) {
+			list_del(&pdb->target_link);
+			pdb->target = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * all the entries in this tree must be made stale,
+	 * but not removed from the client tree
+	 */
+	list_for_each_entry_safe(pdb, npdb, &tdb->pages_list, target_link) {
+		if (!mm_is_oom_victim(tdb->mm))
+			mm_remote_make_stale(pdb);
+		else
+			mm_remote_make_stale_client(map_mm, pdb);
+
+		list_del(&pdb->target_link);
+		pdb->target = NULL;
+	}
+
+	/* client has been pinned before */
+	if (mm_is_oom_victim(tdb->mm))
+		mmput(map_mm);
+}
+
+static void tdb_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct target_db *tdb = container_of(mn, struct target_db, mn);
+
+	pr_debug("%s: mm %016lx\n", __func__, (unsigned long)mm);
+
+	/* at this point other threads may already have hold of this tdb */
+	target_db_put(tdb);
+}
+
+static void cdb_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct client_db *cdb = container_of(mn, struct client_db, mn);
+
+	pr_debug("%s: mm %016lx\n", __func__, (unsigned long)mm);
+
+	/* at this point other threads may already have hold of this cdb */
+	client_db_put(cdb);
+}
+
+static void mm_remote_page_unevictable(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	if (!isolate_lru_page(page))
+		putback_lru_page(page);
+}
+
+static void mm_remote_page_evictable(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	if (!isolate_lru_page(page))
+		putback_lru_page(page);
+	else {
+		if (PageUnevictable(page))
+			count_vm_event(UNEVICTABLE_PGSTRANDED);
+	}
+}
+
+void rmap_walk_remote(struct page *page, struct rmap_walk_control *rwc)
+{
+	struct page_db *pdb;
+	struct anon_vma *anon_vma;
+	struct anon_vma_chain *avc;
+	struct vm_area_struct *vma;
+	pgoff_t pgoff_start, pgoff_end;
+	unsigned long address;
+
+	VM_BUG_ON_PAGE(!PageRemote(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	pdb = (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+
+	/* iterate on original anon_vma */
+	anon_vma = pdb->req_anon_vma;
+	if (anon_vma != NULL) {
+		anon_vma_lock_read(anon_vma);
+		pgoff_start = page_to_pgoff(page);
+		pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
+		anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+					       pgoff_start, pgoff_end) {
+			vma = avc->vma;
+			address = vma_address(page, vma);
+
+			cond_resched();
+
+			if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
+				continue;
+
+			if (!rwc->rmap_one(page, vma, address, rwc->arg))
+				break;
+
+			if (rwc->done && rwc->done(page))
+				break;
+		}
+		anon_vma_unlock_read(anon_vma);
+	}
+
+	/* iterare on client anon_vma */
+	anon_vma = pdb->map_anon_vma;
+	if (anon_vma != NULL) {
+		anon_vma_lock_read(anon_vma);
+		anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+					       0, ULONG_MAX) {
+			vma = avc->vma;
+			address = pdb->map_hva;
+
+			cond_resched();
+
+			if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
+				continue;
+
+			if (!rwc->rmap_one(page, vma, address, rwc->arg))
+				break;
+
+			if (rwc->done && rwc->done(page))
+				break;
+		}
+		anon_vma_unlock_read(anon_vma);
+	}
+}
+
+static int mm_remote_invalidate_pte(struct vm_area_struct *map_vma,
+	unsigned long map_hva, pmd_t *map_pmd, struct page *map_page)
+{
+	struct mm_struct *map_mm = map_vma->vm_mm;
+	struct mmu_notifier_range range;
+	unsigned long mmun_start;
+	unsigned long mmun_end;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	swp_entry_t entry;
+	int result = 0;
+
+	mmun_start = map_hva;
+	mmun_end = map_hva + PAGE_SIZE;
+	mmu_notifier_range_init(&range, map_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(&range);
+
+	ptep = pte_offset_map_lock(map_mm, map_pmd, map_hva, &ptl);
+
+	/* remove reverse mapping - the caller needs to hold the pte lock */
+	if (likely(map_page != NULL)) {
+		page_remove_rmap(map_page, false);
+
+		/* the zero_page is not anonymous */
+		if (!is_zero_pfn(pte_pfn(*ptep)))
+			dec_mm_counter(map_mm, MM_ANONPAGES);
+
+		/* clear old PTE entry */
+		flush_cache_page(map_vma, map_hva, pte_pfn(*ptep));
+		ptep_clear_flush_notify(map_vma, map_hva, ptep);
+
+		atomic_inc(&stat_mapped_pte);
+	} else {
+		/* fresh PTE or has been cleared before */
+		if (likely(pte_none(*ptep))) {
+			atomic_inc(&stat_empty_pte);
+			goto out_unlock;
+		}
+
+		/* a page was faulted in after follow_page() returned NULL */
+		if (unlikely(pte_present(*ptep))) {
+			atomic_inc(&stat_refault);
+			result = -EAGAIN;
+			goto out_unlock;
+		}
+
+		/* must be swap entry */
+		entry = pte_to_swp_entry(*ptep);
+		/* follow_page(... | FOLL_MIGRATION | ...) */
+		ASSERT(!is_migration_entry(entry));
+		free_swap_and_cache(entry);
+		ptep_clear_flush(map_vma, map_hva, ptep);
+
+		atomic_inc(&stat_swap_pte);
+	}
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+
+	mmu_notifier_invalidate_range_end(&range);
+
+	return result;
+}
+
+static int mm_remote_install_pte(struct vm_area_struct *map_vma,
+	unsigned long map_hva, pmd_t *map_pmd, struct page *req_page)
+{
+	struct mm_struct *map_mm = map_vma->vm_mm;
+	pte_t pte, *ptep;
+	spinlock_t *ptl;
+	int result = 0;
+
+	ptep = pte_offset_map_lock(map_mm, map_pmd, map_hva, &ptl);
+
+	/* a page was faulted in */
+	if (unlikely(pte_present(*ptep))) {
+		atomic_inc(&stat_refault);
+		result = -EAGAIN;
+		goto out_unlock;
+	}
+
+	/* create new PTE based on requested page */
+	pte = mk_pte(req_page, map_vma->vm_page_prot);
+	if (map_vma->vm_flags & VM_WRITE)
+		pte = pte_mkwrite(pte_mkdirty(pte));
+	set_pte_at_notify(map_mm, map_hva, ptep, pte);
+
+	inc_mm_counter(map_mm, MM_ANONPAGES);
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+
+	return result;
+}
+
+static void mm_remote_put_req(struct page *req_page,
+			      struct anon_vma *req_anon_vma)
+{
+	if (req_anon_vma)
+		put_anon_vma(req_anon_vma);
+
+	if (req_page)
+		put_page(req_page);
+}
+
+static int mm_remote_get_req(struct mm_struct *req_mm, unsigned long req_hva,
+			     struct page **preq_page,
+			     struct anon_vma **preq_anon_vma)
+{
+	struct page *req_page = NULL;
+	struct anon_vma *req_anon_vma = NULL;
+	long nrpages;
+	int result = 0;
+
+	/* for now we-re using both pointers */
+	ASSERT(preq_page != NULL);
+	ASSERT(preq_anon_vma != NULL);
+
+	if (check_stable_address_space(req_mm)) {
+		pr_err("address space of target not stable");
+		return -EINVAL;
+	}
+
+	down_read(&req_mm->mmap_sem);
+
+	/* get host page corresponding to requested address */
+	nrpages = get_user_pages_remote(NULL, req_mm, req_hva, 1,
+		FOLL_WRITE | FOLL_FORCE | FOLL_SPLIT | FOLL_MIGRATION,
+		&req_page, NULL, NULL);
+	if (unlikely(nrpages == 0)) {
+		pr_err("no page for req_hva %016lx\n", req_hva);
+		result = -ENOENT;
+		goto out;
+	} else if (IS_ERR_VALUE(nrpages)) {
+		result = nrpages;
+		if (result == -EBUSY)
+			pr_debug("get_user_pages_remote() failed: %d\n", result);
+		else
+			pr_err("get_user_pages_remote() failed: %d\n", result);
+		goto out;
+	}
+
+	/* limit introspection to anon memory (this also excludes zero-page) */
+	if (!PageAnon(req_page)) {
+		result = -EINVAL;
+		pr_err("page at req_hva %016lx not anon\n", req_hva);
+		goto out;
+	}
+
+	/* make sure the application doesn't want remote-double-mapping */
+	if (PageRemote(req_page)) {
+		result = -EALREADY;
+		pr_err("page at req_hva %016lx already mapped\n", req_hva);
+		goto out;
+	}
+
+	/* take & lock this anon vma */
+	req_anon_vma = page_get_anon_vma(req_page);
+	if (unlikely(req_anon_vma == NULL)) {
+		result = -EINVAL;
+		pr_err("no anon vma for req_hva %016lx\n", req_hva);
+		goto out;
+	}
+
+	/* output these values only if successful */
+	*preq_page = req_page;
+	*preq_anon_vma = req_anon_vma;
+
+out:
+	up_read(&req_mm->mmap_sem);
+
+	if (result)
+		mm_remote_put_req(req_page, req_anon_vma);
+
+	return result;
+}
+
+static int mm_remote_remap(struct mm_struct *map_mm, unsigned long map_hva,
+			   struct page *req_page, struct anon_vma *req_anon_vma,
+			   struct page_db *pdb)
+{
+	struct vm_area_struct *map_vma;
+	pmd_t *map_pmd;
+	struct page *map_page = NULL;
+	int result = 0;
+
+	/* this allows faulting to happen */
+	down_read(&map_mm->mmap_sem);
+
+	/* find VMA containing address */
+	map_vma = find_vma(map_mm, map_hva);
+	if (unlikely(map_vma == NULL)) {
+		result = -ENOENT;
+		pr_err("no local VMA found for remapping\n");
+		goto out_unlock;
+	}
+
+	if (unlikely(!vma_is_anonymous(map_vma))) {
+		result = -EINVAL;
+		pr_err("local VMA is not anonymous\n");
+		goto out_unlock;
+	}
+	ASSERT(map_vma->anon_vma != NULL);
+
+retry:
+	/*
+	 * get reference to local page corresponding to target address;
+	 * the result may be NULL in case of swap entry or mapping not present
+	 */
+	map_page = follow_page(map_vma, map_hva,
+			       FOLL_SPLIT | FOLL_MIGRATION | FOLL_GET);
+	if (IS_ERR_VALUE(map_page)) {
+		result = PTR_ERR(map_page);
+		pr_debug("%s: follow_page() failed: %d\n", __func__, result);
+		goto out_unlock;
+	}
+
+	/* in case of THP, the huge page must be split before the PMD exists */
+	map_pmd = mm_find_pmd(map_mm, map_hva);
+	if (unlikely(!map_pmd)) {
+		/* follow_page(... | FOLL_GET) */
+		if (map_page != NULL)
+			put_page(map_page);
+		result = -EFAULT;
+		pr_err("local PMD not found");
+		goto out_unlock;
+	}
+
+	/* unmap map_page from current page tables */
+	if (map_page != NULL)
+		lock_page(map_page);
+
+	/* the only possible error is -EAGAIN when map_page == NULL */
+	result = mm_remote_invalidate_pte(map_vma, map_hva, map_pmd, map_page);
+	if (IS_ERR_VALUE((long)result))
+		goto retry;
+
+	if (map_page != NULL)
+		unlock_page(map_page);
+
+	/* we're done with this page */
+	if (map_page != NULL) {
+		/* reference acquired in follow_page(... | FOLL_GET) */
+		put_page(map_page);
+		free_page_and_swap_cache(map_page);
+	}
+
+	/* map req_page at the same address - page is already PageRemote() */
+	lock_page(req_page);
+
+	/* the only possible error is -EAGAIN when PTE != pte_none() */
+	result = mm_remote_install_pte(map_vma, map_hva, map_pmd, req_page);
+	if (IS_ERR_VALUE((long)result)) {
+		unlock_page(req_page);
+		goto retry;
+	}
+
+	/* increment its reference to outlive OOM */
+	get_anon_vma(map_vma->anon_vma);
+	pdb->map_anon_vma = map_vma->anon_vma;
+
+	/* will only increment the mapcount of this page */
+	page_add_anon_rmap(req_page, map_vma, map_hva, false);
+
+	unlock_page(req_page);
+
+	/* local accounting */
+	atomic_inc(&map_count);
+
+out_unlock:
+	up_read(&map_mm->mmap_sem);
+
+	return result;
+}
+
+static int mm_remote_promote_page(struct page *req_page,
+				  struct anon_vma *req_anon_vma,
+				  struct page_db *pdb)
+{
+	int result = 0;
+
+	lock_page(req_page);
+
+	/*
+	 * maybe some other thread mapping the same page in another file
+	 * reached here before us
+	 */
+	if (PageRemote(req_page)) {
+		result = -EALREADY;
+		goto out_unlock;
+	}
+
+	/* make this page remote, mapped only under the target */
+	pdb->req_anon_vma = req_anon_vma;
+	req_page->mapping = PageMapping(pdb);
+
+	mm_remote_page_unevictable(req_page);
+	atomic_inc(&rpg_count);
+
+out_unlock:
+	unlock_page(req_page);
+
+	return result;
+}
+
+static void mm_remote_revert_promote(struct page *req_page)
+{
+	struct page_db *pdb;
+
+	/* the page must have been made remote by this thread */
+	ASSERT(PageRemote(req_page));
+
+	lock_page(req_page);
+
+	pdb = RemoteMapping(req_page);
+
+	/* revert the mapping back to anon page mapped under target */
+	req_page->mapping = (void *)pdb->req_anon_vma + PAGE_MAPPING_ANON;
+	pdb->req_anon_vma = NULL;
+
+	mm_remote_page_evictable(req_page);
+	BUG_ON(atomic_add_negative(-1, &rpg_count));
+
+	unlock_page(req_page);
+}
+
+static int mm_remote_do_map(struct mm_struct *req_mm, unsigned long req_hva,
+			    struct mm_struct *map_mm, unsigned long map_hva,
+			    struct page_db *pdb)
+{
+	struct page *req_page;
+	struct anon_vma *req_anon_vma;
+	int result;
+
+	result = mm_remote_get_req(req_mm, req_hva, &req_page, &req_anon_vma);
+	if (IS_ERR_VALUE((long)result))
+		return result;
+
+	result = mm_remote_promote_page(req_page, req_anon_vma, pdb);
+	if (IS_ERR_VALUE((long)result))
+		goto out_put;
+
+	result = mm_remote_remap(map_mm, map_hva, req_page, req_anon_vma, pdb);
+	if (IS_ERR_VALUE((long)result))
+		goto out_revert;
+
+	return 0;
+
+out_revert:
+	mm_remote_revert_promote(req_page);
+out_put:
+	mm_remote_put_req(req_page, req_anon_vma);
+
+	return result;
+}
+
+static int mm_remote_map_file(struct file_db *fdb, struct mm_struct *req_mm,
+			      unsigned long req_hva, unsigned long map_hva)
+{
+	struct mm_struct *map_mm = current->mm;
+	struct page_db *pdb;
+	int result = 0;
+
+	/* tries to add the entry in the tree */
+	pdb = page_db_reserve(fdb, req_mm, req_hva, map_hva);
+	if (IS_ERR_VALUE(pdb))
+		return PTR_ERR(pdb);
+
+	/* do the actual memory mapping */
+	result = mm_remote_do_map(req_mm, req_hva, map_mm, map_hva, pdb);
+	if (IS_ERR_VALUE((long)result))
+		goto out_pdb;
+
+	/* add mapping to target database */
+	result = page_db_add_target(pdb, req_mm, map_mm);
+	if (IS_ERR_VALUE((long)result)) {
+		mm_remote_do_unmap(map_mm, map_hva);
+		goto out_pdb;
+	}
+
+	/* marks as mapped & drops reference */
+	page_db_got_mapped(pdb);
+
+	return 0;
+
+out_pdb:
+	/* removes the entry from the tree & drops reference */
+	page_db_unreserve(fdb, pdb);
+
+	return result;
+}
+
+int mm_remote_map(struct mm_struct *req_mm,
+		  unsigned long req_hva, unsigned long map_hva)
+{
+	struct mm_struct *map_mm = current->mm;
+	struct client_db *cdb;
+	struct file_db *fdb;
+	int result = 0;
+
+	pr_debug("%s: req_mm %016lx, req_hva %016lx, map_hva %016lx\n",
+		__func__, (unsigned long)req_mm, req_hva, map_hva);
+
+	cdb = client_db_lookup_or_add(map_mm);
+	if (IS_ERR_OR_NULL(cdb))
+		return (cdb == NULL) ? -ENOMEM : PTR_ERR(cdb);
+
+	fdb = client_db_pseudo_file(cdb);
+	if (fdb == NULL) {
+		result = -ENOMEM;
+		goto out_cdb;
+	}
+
+	/* try to pin the target MM so it won't go away */
+	if (!mmget_not_zero(req_mm)) {
+		result = -EINVAL;
+		goto out_cdb;
+	}
+
+	result = mm_remote_map_file(fdb, req_mm, req_hva, map_hva);
+	mmput(req_mm);
+
+out_cdb:
+	client_db_put(cdb);
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(mm_remote_map);
+
+static int mm_remote_do_unmap(struct mm_struct *map_mm, unsigned long map_hva)
+{
+	struct vm_area_struct *map_vma;
+	pmd_t *map_pmd;
+	struct page *req_page = NULL;
+	struct anon_vma *req_anon_vma = NULL;
+	struct page_db *pdb;
+	int result = 0;
+
+	/* this allows faulting to happen */
+	down_read(&map_mm->mmap_sem);
+
+	/* find destination VMA for mapping */
+	map_vma = find_vma(map_mm, map_hva);
+	if (unlikely(map_vma == NULL)) {
+		result = -ENOENT;
+		pr_err("no local VMA found for unmapping\n");
+		goto out;
+	}
+
+	map_pmd = mm_find_pmd(map_mm, map_hva);
+	if (unlikely(!map_pmd)) {
+		result = -EFAULT;
+		pr_err("local PMD not found");
+		goto out;
+	}
+
+	/* get page mapped to destination address - we know it is there */
+	req_page = follow_page(map_vma, map_hva, FOLL_GET | FOLL_MIGRATION);
+	if (IS_ERR_VALUE(req_page)) {
+		result = PTR_ERR(req_page);
+		req_page = NULL;
+		pr_err("follow_page() failed: %d\n", result);
+		goto out;
+	} else if (unlikely(req_page == NULL)) {
+		result = -ENOENT;
+		pr_err("follow_page() returned no page\n");
+		goto out;
+	}
+
+	ASSERT(PageRemote(req_page));
+	pdb = RemoteMapping(req_page);
+
+	lock_page(req_page);
+
+	/* also calls page_remove_rmap() */
+	mm_remote_invalidate_pte(map_vma, map_hva, map_pmd, req_page);
+
+	req_anon_vma = pdb->req_anon_vma;
+	pdb->req_anon_vma = NULL;
+
+	/* restore original rmap */
+	req_page->mapping = (void *)req_anon_vma + PAGE_MAPPING_ANON;
+	mm_remote_page_evictable(req_page);
+	BUG_ON(atomic_add_negative(-1, &rpg_count));
+
+	/* refcount was increased in mm_remote_remap() */
+	put_anon_vma(pdb->map_anon_vma);
+	pdb->map_anon_vma = NULL;
+
+	unlock_page(req_page);
+
+	/* follow_page(..., FOLL_GET...) */
+	put_page(req_page);
+
+	BUG_ON(atomic_add_negative(-1, &map_count));
+
+	/* reference count was inc during mm_remote_get_req() */
+	mm_remote_put_req(req_page, req_anon_vma);
+
+out:
+	up_read(&map_mm->mmap_sem);
+
+	return result;
+}
+
+/*
+ * In case the client's memory is reaped by the OOM killer, the remote pages'
+ * reference count + mapcount is dropped and they belong just to the target.
+ */
+static int mm_remote_do_unmap_target(struct page_db *pdb)
+{
+	struct mm_struct *req_mm = pdb->target;
+	struct vm_area_struct *req_vma;
+	struct page *req_page = NULL;
+	struct anon_vma *req_anon_vma = NULL;
+	int result = 0;
+
+	down_read(&req_mm->mmap_sem);
+
+	req_vma = find_vma(req_mm, pdb->req_hva);
+	if (unlikely(req_vma == NULL)) {
+		result = -ENOENT;
+		pr_err("no source VMA found for unmapping\n");
+		goto out;
+	}
+
+	/* page is unevictable - should be mapped */
+	req_page = follow_page(req_vma, pdb->req_hva, FOLL_GET | FOLL_MIGRATION);
+	if (IS_ERR_VALUE(req_page)) {
+		result = PTR_ERR(req_page);
+		req_page = NULL;
+		pr_err("follow_page() failed: %d\n", result);
+		goto out;
+	} else if (unlikely(req_page == NULL)) {
+		result = -ENOENT;
+		pr_err("follow_page() returned no page\n");
+		goto out;
+	}
+
+	ASSERT(PageRemote(req_page));
+	ASSERT(pdb == RemoteMapping(req_page));
+
+	/*
+	 * page_remove_rmap() must have been called when the page was unmapped
+	 * from the client, now we must have a higher refcount from
+	 * follow_page(...FOLL_GET...)
+	 */
+
+	lock_page(req_page);
+
+	req_anon_vma = pdb->req_anon_vma;
+	pdb->req_anon_vma = NULL;
+
+	/* restore original rmap */
+	req_page->mapping = (void *)req_anon_vma + PAGE_MAPPING_ANON;
+	mm_remote_page_evictable(req_page);
+	BUG_ON(atomic_add_negative(-1, &rpg_count));
+
+	/* refcount was increased in mm_remote_remap() */
+	put_anon_vma(pdb->map_anon_vma);
+	pdb->map_anon_vma = NULL;
+
+	unlock_page(req_page);
+
+	BUG_ON(atomic_add_negative(-1, &map_count));
+
+	/* client doesn't map this page anymore, a single refcount to drop */
+	mm_remote_put_req(req_page, req_anon_vma);
+
+out:
+	up_read(&req_mm->mmap_sem);
+
+	return result;
+}
+
+static int mm_remote_unmap_file(struct file_db *fdb, unsigned long map_hva)
+{
+	struct mm_struct *map_mm = current->mm;
+	struct mm_struct *req_mm;
+	struct page_db *pdb;
+	int result;
+
+	/* take exclusive access to this pdb */
+	pdb = page_db_begin_unmap(fdb, map_hva);
+	if (IS_ERR_OR_NULL(pdb))
+		return (pdb == NULL) ? -ENOENT : PTR_ERR(pdb);
+
+	/* test if other thread unmapped this address before us */
+	if (!test_bit(MAPPED_BIT, (unsigned long *)&pdb->flags)) {
+		result = -EALREADY;
+		goto just_release;
+	}
+
+	/* also disconnect from target - can fail if target exited */
+	result = page_db_remove_target(pdb);
+	if (IS_ERR_VALUE((long)result))
+		pr_debug("%s: page_db_remove_target() failed: %d\n",
+			__func__, result);
+
+	/* the unmapping is done on local mm only */
+	result = mm_remote_do_unmap(map_mm, map_hva);
+	if (IS_ERR_VALUE((long)result)) {
+		pr_debug("%s: mm_remote_do_unmap() failed: %d, trying target\n",
+			__func__, result);
+
+		req_mm = pdb->target;
+		if (mmget_not_zero(req_mm)) {
+			result = mm_remote_do_unmap_target(pdb);
+
+			mmput(req_mm);
+		}
+	}
+
+just_release:
+	/* marks as unmapped & drops reference */
+	page_db_end_unmap(fdb, pdb);
+
+	return result;
+}
+
+int mm_remote_unmap(unsigned long map_hva)
+{
+	struct mm_struct *map_mm = current->mm;
+	struct client_db *cdb;
+	struct file_db *fdb;
+	int result;
+
+	pr_debug("%s: map_hva %016lx\n", __func__, map_hva);
+
+	cdb = client_db_lookup_or_add(map_mm);
+	if (IS_ERR_OR_NULL(cdb))
+		return (cdb == NULL) ? -ENOMEM : PTR_ERR(cdb);
+
+	fdb = client_db_pseudo_file(cdb);
+	if (fdb == NULL) {
+		result = -ENOMEM;
+		goto out_cdb;
+	}
+
+	result = mm_remote_unmap_file(fdb, map_hva);
+
+out_cdb:
+	client_db_put(cdb);
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(mm_remote_unmap);
+
+/* called on behalf of the client */
+void mm_remote_reset(void)
+{
+	struct mm_struct *map_mm = current->mm;
+	struct client_db *cdb;
+
+	pr_debug("%s\n", __func__);
+
+	/* also gets reference to entry */
+	cdb = client_db_lookup(map_mm);
+	if (cdb == NULL)
+		return;
+
+	/* no locking here, we have exclusive access */
+	mm_remote_db_client_release(cdb);
+
+	client_db_put(cdb);
+}
+EXPORT_SYMBOL_GPL(mm_remote_reset);
+
+static int remmap_dev_open(struct inode *inodep, struct file *filp)
+{
+	struct file_db *fdb;
+	struct client_db *cdb;
+	int result = 0;
+
+	fdb = file_db_alloc();
+	if (fdb == NULL)
+		return -ENOMEM;
+
+	/* we need the mm to exist at file closing time */
+	mmget(current->mm);
+
+	cdb = client_db_lookup_or_add(current->mm);
+	if (IS_ERR_OR_NULL(cdb)) {
+		result = (cdb == NULL) ? -ENOMEM : PTR_ERR(cdb);
+		goto out_err;
+	}
+
+	fdb->cdb = cdb;
+	filp->private_data = fdb;
+
+	/* by pinning the mm we also make sure the cdb does not get released */
+	client_db_put(cdb);
+
+	return 0;
+
+out_err:
+	mmput(current->mm);
+	file_db_free(fdb);
+
+	return result;
+}
+
+static long remmap_dev_ioctl(struct file *filp, unsigned int ioctl,
+			     unsigned long arg)
+{
+	void __user *argp = (void __user *) arg;
+	struct file_db *fdb = filp->private_data;
+	struct client_db *cdb = fdb->cdb;
+	long result = 0;
+
+	if (current->mm != cdb->mm) {
+		pr_err("ioctl request by different process\n");
+		return -EINVAL;
+	}
+
+	switch (ioctl) {
+	case REMOTE_MAP: {
+		struct remote_map_request req;
+		struct task_struct *req_task;
+		struct mm_struct *req_mm;
+
+		result = -EFAULT;
+		if (copy_from_user(&req, argp, sizeof(req)))
+			break;
+
+		result = -EINVAL;
+		if (!access_ok(req.map_hva, PAGE_SIZE))
+			break;
+		if (req.req_hva & ~PAGE_MASK)
+			break;
+		if (req.map_hva & ~PAGE_MASK)
+			break;
+
+		result = -ESRCH;
+		req_task = find_get_task_by_vpid(req.req_pid);
+		if (req_task == NULL)
+			break;
+
+		result = -EINVAL;
+		req_mm = get_task_mm(req_task);
+		put_task_struct(req_task);
+		if (req_mm == NULL)
+			break;
+
+		result = mm_remote_map_file(fdb, req_mm, req.req_hva, req.map_hva);
+		mmput(req_mm);
+
+		break;
+	}
+
+	case REMOTE_UNMAP: {
+		unsigned long map_hva = (unsigned long) arg;
+
+		result = -EINVAL;
+		if (!access_ok(map_hva, PAGE_SIZE))
+			break;
+		if (map_hva & ~PAGE_MASK)
+			break;
+
+		result = mm_remote_unmap_file(fdb, map_hva);
+
+		break;
+	}
+
+	default:
+		pr_err("ioctl %d not implemented\n", ioctl);
+		result = -ENOTTY;
+	}
+
+	return result;
+}
+
+static int remmap_dev_release(struct inode *inodep, struct file *filp)
+{
+	struct file_db *fdb = filp->private_data;
+	struct client_db *cdb = fdb->cdb;
+	struct mm_struct *mm = cdb->mm;
+
+	mm_remote_db_file_release(fdb);
+	file_db_free(fdb);
+
+	/*
+	 * we may have reached here by killing the client process,
+	 * current->mm is not accessible anymore
+	 */
+	mmput(mm);
+
+	return 0;
+}
+
+static const struct file_operations remmap_ops = {
+	.open = remmap_dev_open,
+	.unlocked_ioctl = remmap_dev_ioctl,
+	.compat_ioctl = remmap_dev_ioctl,
+	.release = remmap_dev_release,
+};
+
+static struct miscdevice remmap_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "remote-map",
+	.fops = &remmap_ops,
+};
+
+builtin_misc_device(remmap_dev);
+
+#ifdef CONFIG_DEBUG_FS
+static void __init mm_remote_debugfs_init(void)
+{
+	mm_remote_debugfs_dir = debugfs_create_dir("remote_mapping", NULL);
+	if (mm_remote_debugfs_dir == NULL)
+		return;
+
+	debugfs_create_atomic_t("map_count", 0444, mm_remote_debugfs_dir,
+				&map_count);
+	debugfs_create_atomic_t("pdb_count", 0444, mm_remote_debugfs_dir,
+				&pdb_count);
+	debugfs_create_atomic_t("rpg_count", 0444, mm_remote_debugfs_dir,
+				&rpg_count);
+
+	debugfs_create_atomic_t("stat_empty_pte", 0444, mm_remote_debugfs_dir,
+				&stat_empty_pte);
+	debugfs_create_atomic_t("stat_mapped_pte", 0444, mm_remote_debugfs_dir,
+				&stat_mapped_pte);
+	debugfs_create_atomic_t("stat_swap_pte", 0444, mm_remote_debugfs_dir,
+				&stat_swap_pte);
+	debugfs_create_atomic_t("stat_refault", 0444, mm_remote_debugfs_dir,
+				&stat_refault);
+}
+#else /* CONFIG_DEBUG_FS */
+static void __init mm_remote_debugfs_init(void)
+{
+}
+#endif /* CONFIG_DEBUG_FS */
+
+static int __init mm_remote_init(void)
+{
+	pdb_cache = KMEM_CACHE(page_db, SLAB_PANIC | SLAB_ACCOUNT);
+	if (!pdb_cache)
+		return -ENOMEM;
+
+	mm_remote_debugfs_init();
+
+	return 0;
+}
+device_initcall(mm_remote_init);
diff --git a/mm/rmap.c b/mm/rmap.c
index 0454ecc29537..352570d9ad22 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -65,6 +65,7 @@
 #include <linux/page_idle.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/remote_mapping.h>
 
 #include <asm/tlbflush.h>
 
@@ -856,7 +857,7 @@ int page_referenced(struct page *page,
 	if (!page_rmapping(page))
 		return 0;
 
-	if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
+	if (!is_locked && (!PageAnon(page) || PageKsm(page) || PageRemote(page))) {
 		we_locked = trylock_page(page);
 		if (!we_locked)
 			return 1;
@@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
  * __page_set_anon_rmap - set up new anonymous rmap
  * @page:	Page or Hugepage to add to rmap
  * @vma:	VM area to add page to.
- * @address:	User virtual address of the mapping	
+ * @address:	User virtual address of the mapping
  * @exclusive:	the page is exclusively owned by the current process
  */
 static void __page_set_anon_rmap(struct page *page,
@@ -1125,7 +1126,8 @@ void do_page_add_anon_rmap(struct page *page,
 			__inc_node_page_state(page, NR_ANON_THPS);
 		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
 	}
-	if (unlikely(PageKsm(page)))
+
+	if (unlikely(PageKsm(page) || PageRemote(page)))
 		return;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1897,6 +1899,8 @@ void rmap_walk(struct page *page, struct rmap_walk_control *rwc)
 {
 	if (unlikely(PageKsm(page)))
 		rmap_walk_ksm(page, rwc);
+	else if (unlikely(PageRemote(page)))
+		rmap_walk_remote(page, rwc);
 	else if (PageAnon(page))
 		rmap_walk_anon(page, rwc, false);
 	else
@@ -1906,8 +1910,9 @@ void rmap_walk(struct page *page, struct rmap_walk_control *rwc)
 /* Like rmap_walk, but caller holds relevant rmap lock */
 void rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc)
 {
-	/* no ksm support for now */
+	/* no ksm/remote support for now */
 	VM_BUG_ON_PAGE(PageKsm(page), page);
+	VM_BUG_ON_PAGE(PageRemote(page), page);
 	if (PageAnon(page))
 		rmap_walk_anon(page, rwc, true);
 	else
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e979705bbf32..63e4dfb477de 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4277,7 +4277,8 @@ int page_evictable(struct page *page)
 
 	/* Prevent address_space of inode and swap cache from being freed */
 	rcu_read_lock();
-	ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
+	ret = !mapping_unevictable(page_mapping(page)) &&
+		!PageMlocked(page) && !PageRemote(page);
 	rcu_read_unlock();
 	return ret;
 }

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 72/92] kvm: introspection: add memory map/unmap support on the guest side
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (70 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 71/92] mm: add support for remote mapping Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 73/92] kvm: introspection: use remote mapping Adalbert Lazăr
                   ` (22 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Mircea Cîrjaliu

From: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>

An introspection tool running in a dedicated VM can use the new device
(/dev/kvmmem) to map memory from other introspected VM-s.

Two ioctl operations are supported:
  - KVM_HC_MEM_MAP/struct kvmi_mem_map
  - KVM_HC_MEM_UNMAP/unsigned long

In order to map an introspected gpa to the local gva, the process using
this device needs to obtain a token from the host KVMI subsystem (see
Documentation/virtual/kvm/kvmi.rst - KVMI_GET_MAP_TOKEN).

Both operations use hypercalls (KVM_HC_MEM_MAP, KVM_HC_MEM_UNMAP)
to pass the requests to the host kernel/KVMi (see hypercalls.txt).

Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/hypercalls.txt |  34 ++
 arch/x86/Kconfig                         |   9 +
 arch/x86/include/asm/kvmi_guest.h        |  10 +
 arch/x86/kernel/Makefile                 |   1 +
 arch/x86/kernel/kvmi_mem_guest.c         |  26 +
 include/uapi/linux/kvm_para.h            |   2 +
 include/uapi/linux/kvmi.h                |  21 +
 virt/kvm/kvmi_mem_guest.c                | 651 +++++++++++++++++++++++
 8 files changed, 754 insertions(+)
 create mode 100644 arch/x86/include/asm/kvmi_guest.h
 create mode 100644 arch/x86/kernel/kvmi_mem_guest.c
 create mode 100644 virt/kvm/kvmi_mem_guest.c

diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt
index 1ab59537b2fb..a47fae926201 100644
--- a/Documentation/virtual/kvm/hypercalls.txt
+++ b/Documentation/virtual/kvm/hypercalls.txt
@@ -173,3 +173,37 @@ The following registers are clobbered:
 In particular, for KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT, the last two
 registers can be poisoned deliberately and cannot be used for passing
 information.
+
+9. KVM_HC_MEM_MAP
+-----------------
+
+Architecture: x86
+Status: active
+Purpose: Map a guest physical page to another VM (the introspector).
+Usage:
+
+a0: pointer to a token obtained with a KVMI_GET_MAP_TOKEN command (see kvmi.rst)
+	struct kvmi_map_mem_token {
+		__u64 token[4];
+	};
+
+a1: guest physical address to be mapped
+
+a2: guest physical address from introspector that will be replaced
+
+Both guest physical addresses will end up poiting to the same physical page.
+
+Returns any error that the memory manager can return.
+
+10. KVM_HC_MEM_UNMAP
+-------------------
+
+Architecture: x86
+Status: active
+Purpose: Unmap a previously mapped page.
+Usage:
+
+a0: guest physical address from introspector
+
+The address will stop pointing to the introspected page and a new physical
+page is allocated for this gpa.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 68261430fe6e..a7527c1f90a0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -820,6 +820,15 @@ config KVM_DEBUG_FS
 	  Statistics are displayed in debugfs filesystem. Enabling this option
 	  may incur significant overhead.
 
+config KVM_INTROSPECTION_GUEST
+	bool "KVM Memory Introspection support on Guest"
+	depends on KVM_GUEST
+	default n
+	help
+	  This option enables functions and hypercalls for security applications
+	  running in a separate VM to control the execution of other VM-s, query
+	  the state of the vCPU-s (GPR-s, MSR-s etc.).
+
 config PARAVIRT_TIME_ACCOUNTING
 	bool "Paravirtual steal time accounting"
 	depends on PARAVIRT
diff --git a/arch/x86/include/asm/kvmi_guest.h b/arch/x86/include/asm/kvmi_guest.h
new file mode 100644
index 000000000000..c7ed53a938e0
--- /dev/null
+++ b/arch/x86/include/asm/kvmi_guest.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVMI_GUEST_H__
+#define __KVMI_GUEST_H__
+
+long kvmi_arch_map_hc(struct kvmi_map_mem_token *tknp,
+	gpa_t req_gpa, gpa_t map_gpa);
+long kvmi_arch_unmap_hc(gpa_t map_gpa);
+
+
+#endif
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 00b7e27bc2b7..995652ba53b3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -116,6 +116,7 @@ obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
+obj-$(CONFIG_KVM_INTROSPECTION_GUEST)	+= kvmi_mem_guest.o ../../../virt/kvm/kvmi_mem_guest.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
 
diff --git a/arch/x86/kernel/kvmi_mem_guest.c b/arch/x86/kernel/kvmi_mem_guest.c
new file mode 100644
index 000000000000..c4e2613f90f3
--- /dev/null
+++ b/arch/x86/kernel/kvmi_mem_guest.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection guest implementation
+ *
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ * Author:
+ *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
+ */
+
+#include <uapi/linux/kvmi.h>
+#include <uapi/linux/kvm_para.h>
+#include <linux/kvm_types.h>
+#include <asm/kvm_para.h>
+
+long kvmi_arch_map_hc(struct kvmi_map_mem_token *tknp,
+		       gpa_t req_gpa, gpa_t map_gpa)
+{
+	return kvm_hypercall3(KVM_HC_MEM_MAP, (unsigned long)tknp,
+			      req_gpa, map_gpa);
+}
+
+long kvmi_arch_unmap_hc(gpa_t map_gpa)
+{
+	return kvm_hypercall1(KVM_HC_MEM_UNMAP, map_gpa);
+}
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 592bda92b6d5..a083e3e66de6 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -33,6 +33,8 @@
 #define KVM_HC_CLOCK_PAIRING		9
 #define KVM_HC_SEND_IPI		10
 
+#define KVM_HC_MEM_MAP			32
+#define KVM_HC_MEM_UNMAP		33
 #define KVM_HC_XEN_HVM_OP		34 /* Xen's __HYPERVISOR_hvm_op */
 
 /*
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
index b072e0a4f33d..8591c748524f 100644
--- a/include/uapi/linux/kvmi.h
+++ b/include/uapi/linux/kvmi.h
@@ -262,4 +262,25 @@ struct kvmi_event_breakpoint {
 	__u8 padding[7];
 };
 
+struct kvmi_map_mem_token {
+	__u64 token[4];
+};
+
+struct kvmi_get_map_token_reply {
+	struct kvmi_map_mem_token token;
+};
+
+/* Map other guest's gpa to local gva */
+struct kvmi_mem_map {
+	struct kvmi_map_mem_token token;
+	__u64 gpa;
+	__u64 gva;
+};
+
+/*
+ * ioctls for /dev/kvmmem
+ */
+#define KVM_INTRO_MEM_MAP       _IOW('i', 0x01, struct kvmi_mem_map)
+#define KVM_INTRO_MEM_UNMAP     _IOW('i', 0x02, unsigned long)
+
 #endif /* _UAPI__LINUX_KVMI_H */
diff --git a/virt/kvm/kvmi_mem_guest.c b/virt/kvm/kvmi_mem_guest.c
new file mode 100644
index 000000000000..bec473b45289
--- /dev/null
+++ b/virt/kvm/kvmi_mem_guest.c
@@ -0,0 +1,651 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection guest implementation
+ *
+ * Copyright (C) 2017-2019 Bitdefender S.R.L.
+ *
+ * Author:
+ *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/sched/mm.h>
+#include <linux/types.h>
+#include <linux/kvm_types.h>
+#include <linux/kvm_para.h>
+#include <linux/uaccess.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rwlock.h>
+#include <linux/hashtable.h>
+#include <linux/refcount.h>
+#include <linux/ioctl.h>
+
+#include <uapi/linux/kvmi.h>
+#include <asm/kvmi_guest.h>
+
+#define ASSERT(exp) BUG_ON(!(exp))
+#define DB_HASH_BITS 4
+
+static struct kmem_cache *proc_map_cachep;
+static struct kmem_cache *file_map_cachep;
+static struct kmem_cache *page_map_cachep;
+
+/* process/mm to proc_map */
+static DEFINE_HASHTABLE(db_hash, DB_HASH_BITS);
+static DEFINE_SPINLOCK(db_hash_lock);
+
+struct proc_map {
+	struct mm_struct *mm;		/* database key */
+	struct hlist_node db_link;	/* database link */
+	refcount_t refcnt;
+
+	struct rb_root entries;		/* mapping entries for this mm */
+	rwlock_t entries_lock;
+};
+
+struct file_map {
+	struct proc_map *proc;
+
+	struct list_head entries;	/* mapping entries for this file */
+	spinlock_t entries_lock;
+};
+
+struct page_map {
+	struct rb_node proc_link;	/* link to struct proc_map */
+	struct list_head file_link;	/* link to struct file_map */
+
+	gpa_t gpa;			/* target GPA */
+	gva_t vaddr;			/* local GVA */
+};
+
+static void proc_map_init(struct proc_map *pmap)
+{
+	pmap->mm = NULL;
+	INIT_HLIST_NODE(&pmap->db_link);
+	refcount_set(&pmap->refcnt, 0);
+
+	pmap->entries = RB_ROOT;
+	rwlock_init(&pmap->entries_lock);
+}
+
+static struct proc_map *proc_map_alloc(void)
+{
+	struct proc_map *obj;
+
+	obj = kmem_cache_alloc(proc_map_cachep, GFP_KERNEL);
+	if (obj != NULL)
+		proc_map_init(obj);
+
+	return obj;
+}
+
+static void proc_map_free(struct proc_map *pmap)
+{
+	ASSERT(hlist_unhashed(&pmap->db_link));
+	ASSERT(refcount_read(&pmap->refcnt) == 0);
+	ASSERT(RB_EMPTY_ROOT(&pmap->entries));
+
+	kmem_cache_free(proc_map_cachep, pmap);
+}
+
+static void file_map_init(struct file_map *fmp)
+{
+	INIT_LIST_HEAD(&fmp->entries);
+	spin_lock_init(&fmp->entries_lock);
+}
+
+static struct file_map *file_map_alloc(void)
+{
+	struct file_map *obj;
+
+	obj = kmem_cache_alloc(file_map_cachep, GFP_KERNEL);
+	if (obj != NULL)
+		file_map_init(obj);
+
+	return obj;
+}
+
+static void file_map_free(struct file_map *fmp)
+{
+	ASSERT(list_empty(&fmp->entries));
+
+	kmem_cache_free(file_map_cachep, fmp);
+}
+
+static void page_map_init(struct page_map *pmp)
+{
+	memset(pmp, 0, sizeof(*pmp));
+
+	RB_CLEAR_NODE(&pmp->proc_link);
+	INIT_LIST_HEAD(&pmp->file_link);
+}
+
+static struct page_map *page_map_alloc(void)
+{
+	struct page_map *obj;
+
+	obj = kmem_cache_alloc(page_map_cachep, GFP_KERNEL);
+	if (obj != NULL)
+		page_map_init(obj);
+
+	return obj;
+}
+
+static void page_map_free(struct page_map *pmp)
+{
+	ASSERT(RB_EMPTY_NODE(&pmp->proc_link));
+
+	kmem_cache_free(page_map_cachep, pmp);
+}
+
+static struct proc_map *get_proc_map(void)
+{
+	struct proc_map *pmap, *allocated;
+	struct mm_struct *mm;
+	bool found = false;
+
+	if (!mmget_not_zero(current->mm))
+		return NULL;
+	mm = current->mm;
+
+	allocated = proc_map_alloc();	/* may be NULL */
+
+	spin_lock(&db_hash_lock);
+
+	hash_for_each_possible(db_hash, pmap, db_link, (unsigned long)mm)
+		if (pmap->mm == mm && refcount_inc_not_zero(&pmap->refcnt)) {
+			found = true;
+			break;
+		}
+
+	if (!found && allocated != NULL) {
+		pmap = allocated;
+		allocated = NULL;
+
+		pmap->mm = mm;
+		hash_add(db_hash, &pmap->db_link, (unsigned long)mm);
+		refcount_set(&pmap->refcnt, 1);
+	} else
+		mmput(mm);
+
+	spin_unlock(&db_hash_lock);
+
+	if (allocated != NULL)
+		proc_map_free(allocated);
+
+	return pmap;
+}
+
+static void put_proc_map(struct proc_map *pmap)
+{
+	if (refcount_dec_and_test(&pmap->refcnt)) {
+		mmput(pmap->mm);
+
+		/* remove from hash table */
+		spin_lock(&db_hash_lock);
+		hash_del(&pmap->db_link);
+		spin_unlock(&db_hash_lock);
+
+		proc_map_free(pmap);
+	}
+}
+
+static bool proc_map_insert(struct proc_map *pmap, struct page_map *pmp)
+{
+	struct rb_root *root = &pmap->entries;
+	struct rb_node **new = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct page_map *this;
+	bool inserted = true;
+
+	write_lock(&pmap->entries_lock);
+
+	/* Figure out where to put new node */
+	while (*new) {
+		this = rb_entry(*new, struct page_map, proc_link);
+
+		parent = *new;
+		if (pmp->vaddr < this->vaddr)
+			new = &((*new)->rb_left);
+		else if (pmp->vaddr > this->vaddr)
+			new = &((*new)->rb_right);
+		else {
+			/* Already have this address */
+			inserted = false;
+			goto out;
+		}
+	}
+
+	/* Add new node and rebalance tree. */
+	rb_link_node(&pmp->proc_link, parent, new);
+	rb_insert_color(&pmp->proc_link, root);
+
+out:
+	write_unlock(&pmap->entries_lock);
+
+	return inserted;
+}
+
+#if 0 /* will use this later */
+static struct page_map *proc_map_search(struct proc_map *pmap,
+					unsigned long vaddr)
+{
+	struct rb_root *root = &pmap->entries;
+	struct rb_node *node;
+	struct page_map *pmp;
+
+	read_lock(&pmap->entries_lock);
+
+	node = root->rb_node;
+
+	while (node) {
+		pmp = rb_entry(node, struct page_map, proc_link);
+
+		if (vaddr < pmp->vaddr)
+			node = node->rb_left;
+		else if (vaddr > pmp->vaddr)
+			node = node->rb_right;
+		else
+			break;
+	}
+
+	if (!node)
+		pmp = NULL;
+
+	read_unlock(&pmap->entries_lock);
+
+	return pmp;
+}
+#endif
+
+static struct page_map *proc_map_search_extract(struct proc_map *pmap,
+						unsigned long vaddr)
+{
+	struct rb_root *root = &pmap->entries;
+	struct rb_node *node;
+	struct page_map *pmp;
+
+	write_lock(&pmap->entries_lock);
+
+	node = root->rb_node;
+
+	while (node) {
+		pmp = rb_entry(node, struct page_map, proc_link);
+
+		if (vaddr < pmp->vaddr)
+			node = node->rb_left;
+		else if (vaddr > pmp->vaddr)
+			node = node->rb_right;
+		else
+			break;
+	}
+
+	if (node) {
+		rb_erase(&pmp->proc_link, &pmap->entries);
+		RB_CLEAR_NODE(&pmp->proc_link);
+	} else
+		pmp = NULL;
+
+	write_unlock(&pmap->entries_lock);
+
+	return pmp;
+}
+
+static void proc_map_remove(struct proc_map *pmap, struct page_map *pmp)
+{
+	write_lock(&pmap->entries_lock);
+	rb_erase(&pmp->proc_link, &pmap->entries);
+	RB_CLEAR_NODE(&pmp->proc_link);
+	write_unlock(&pmap->entries_lock);
+}
+
+static void file_map_insert(struct file_map *fmp, struct page_map *pmp)
+{
+	spin_lock(&fmp->entries_lock);
+	list_add(&pmp->file_link, &fmp->entries);
+	spin_unlock(&fmp->entries_lock);
+}
+
+static void file_map_remove(struct file_map *fmp, struct page_map *pmp)
+{
+	spin_lock(&fmp->entries_lock);
+	list_del(&pmp->file_link);
+	spin_unlock(&fmp->entries_lock);
+}
+
+/*
+ * Opens the device for map/unmap operations. The mm of this process is
+ * associated with these files in a 1:many relationship.
+ * Operations on this file must be done within the same process that opened it.
+ */
+static int kvm_dev_open(struct inode *inodep, struct file *filp)
+{
+	struct proc_map *pmap;
+	struct file_map *fmp;
+
+	pr_debug("kvmi: file %016lx opened by mm %016lx\n",
+		 (unsigned long) filp, (unsigned long)current->mm);
+
+	pmap = get_proc_map();
+	if (pmap == NULL)
+		return -ENOENT;
+
+	/* link the file 1:1 with such a structure */
+	fmp = file_map_alloc();
+	if (fmp == NULL)
+		return -ENOMEM;
+
+	fmp->proc = pmap;
+	filp->private_data = fmp;
+
+	return 0;
+}
+
+static long _do_mapping(struct kvmi_mem_map *map_req, struct page_map *pmp)
+{
+	struct page *page;
+	phys_addr_t paddr;
+	long nrpages;
+	long result = 0;
+
+	down_read(&current->mm->mmap_sem);
+
+	/* pin the page to be replaced (also swaps in the page) */
+	nrpages = get_user_pages_locked(map_req->gva, 1,
+					FOLL_SPLIT | FOLL_MIGRATION,
+					&page, NULL);
+	if (unlikely(nrpages == 0)) {
+		result = -ENOENT;
+		pr_err("kvmi: found no page for %016llx\n", map_req->gva);
+		goto out;
+	} else if (IS_ERR_VALUE(nrpages)) {
+		result = nrpages;
+		pr_err("kvmi: get_user_pages_locked() failed (%ld)\n", result);
+		goto out;
+	}
+
+	paddr = page_to_phys(page);
+	pr_debug("%s: page phys addr %016llx\n", __func__, paddr);
+
+	/* last thing to do is host mapping */
+	result = kvmi_arch_map_hc(&map_req->token, map_req->gpa, paddr);
+	if (IS_ERR_VALUE(result)) {
+		pr_err("kvmi: mapping failed for %016llx -> %016lx (%ld)\n",
+			pmp->gpa, pmp->vaddr, result);
+
+		/* don't need this page anymore */
+		put_page(page);
+	}
+
+out:
+	up_read(&current->mm->mmap_sem);
+
+	return result;
+}
+
+static long _do_unmapping(struct mm_struct *mm, struct page_map *pmp)
+{
+	struct vm_area_struct *vma;
+	struct page *page;
+	phys_addr_t paddr;
+	long result = 0;
+
+	down_read(&mm->mmap_sem);
+
+	/* find the VMA for the virtual address */
+	vma = find_vma(mm, pmp->vaddr);
+	if (vma == NULL) {
+		result = -ENOENT;
+		pr_err("kvmi: find_vma() found no VMA\n");
+		goto out;
+	}
+
+	/* the page is pinned, thus easy to access */
+	page = follow_page(vma, pmp->vaddr, 0);
+	if (IS_ERR_VALUE(page)) {
+		result = PTR_ERR(page);
+		pr_err("kvmi: follow_page() failed (%ld)\n", result);
+		goto out;
+	} else if (page == NULL) {
+		result = -ENOENT;
+		pr_err("kvmi: follow_page() found no page\n");
+		goto out;
+	}
+
+	paddr = page_to_phys(page);
+	pr_debug("%s: page phys addr %016llx\n", __func__, paddr);
+
+	/* last thing to do is host unmapping */
+	result = kvmi_arch_unmap_hc(paddr);
+	if (IS_ERR_VALUE(result))
+		pr_warn("kvmi: unmapping failed for %016lx (%ld)\n",
+			pmp->vaddr, result);
+
+	/* finally unpin the page */
+	put_page(page);
+
+out:
+	up_read(&mm->mmap_sem);
+
+	return result;
+}
+
+static noinline long kvm_dev_ioctl_map(struct file_map *fmp,
+				       struct kvmi_mem_map *map)
+{
+	struct proc_map *pmap = fmp->proc;
+	struct page_map *pmp;
+	bool added;
+	long result = 0;
+
+	pr_debug("kvmi: mm %016lx map request %016llx -> %016llx\n",
+		(unsigned long)current->mm, map->gpa, map->gva);
+
+	if (!access_ok(map->gva, PAGE_SIZE))
+		return -EINVAL;
+
+	/* prepare list entry */
+	pmp = page_map_alloc();
+	if (pmp == NULL)
+		return -ENOMEM;
+
+	pmp->gpa = map->gpa;
+	pmp->vaddr = map->gva;
+
+	added = proc_map_insert(pmap, pmp);
+	if (added == false) {
+		result = -EALREADY;
+		pr_err("kvmi: address %016llx already mapped into\n", map->gva);
+		goto out_free;
+	}
+	file_map_insert(fmp, pmp);
+
+	/* actual mapping here */
+	result = _do_mapping(map, pmp);
+	if (IS_ERR_VALUE(result))
+		goto out_remove;
+
+	return 0;
+
+out_remove:
+	proc_map_remove(pmap, pmp);
+	file_map_remove(fmp, pmp);
+
+out_free:
+	page_map_free(pmp);
+
+	return result;
+}
+
+static noinline long kvm_dev_ioctl_unmap(struct file_map *fmp,
+					 unsigned long vaddr)
+{
+	struct proc_map *pmap = fmp->proc;
+	struct page_map *pmp;
+	long result = 0;
+
+	pr_debug("kvmi: mm %016lx unmap request %016lx\n",
+		(unsigned long)current->mm, vaddr);
+
+	pmp = proc_map_search_extract(pmap, vaddr);
+	if (pmp == NULL) {
+		pr_err("kvmi: address %016lx not mapped\n", vaddr);
+		return -ENOENT;
+	}
+
+	/* actual unmapping here */
+	result = _do_unmapping(current->mm, pmp);
+
+	file_map_remove(fmp, pmp);
+	page_map_free(pmp);
+
+	return result;
+}
+
+/*
+ * Operations on this file must be done within the same process that opened it.
+ */
+static long kvm_dev_ioctl(struct file *filp,
+			  unsigned int ioctl, unsigned long arg)
+{
+	void __user *argp = (void __user *) arg;
+	struct file_map *fmp = filp->private_data;
+	struct proc_map *pmap = fmp->proc;
+	long result;
+
+	/* this helps keep my code simpler */
+	if (current->mm != pmap->mm) {
+		pr_err("kvmi: ioctl request by different process\n");
+		return -EINVAL;
+	}
+
+	switch (ioctl) {
+	case KVM_INTRO_MEM_MAP: {
+		struct kvmi_mem_map map;
+
+		result = -EFAULT;
+		if (copy_from_user(&map, argp, sizeof(map)))
+			break;
+
+		result = kvm_dev_ioctl_map(fmp, &map);
+		break;
+	}
+	case KVM_INTRO_MEM_UNMAP: {
+		unsigned long vaddr = (unsigned long) arg;
+
+		result = kvm_dev_ioctl_unmap(fmp, vaddr);
+		break;
+	}
+	default:
+		pr_err("kvmi: ioctl %d not implemented\n", ioctl);
+		result = -ENOTTY;
+	}
+
+	return result;
+}
+
+/*
+ * No constraint on closing the device.
+ */
+static int kvm_dev_release(struct inode *inodep, struct file *filp)
+{
+	struct file_map *fmp = filp->private_data;
+	struct proc_map *pmap = fmp->proc;
+	struct page_map *pmp, *temp;
+
+	pr_debug("kvmi: file %016lx closed by mm %016lx\n",
+		 (unsigned long) filp, (unsigned long)current->mm);
+
+	/* this file_map has no more users, thus no more concurrent access */
+	list_for_each_entry_safe(pmp, temp, &fmp->entries, file_link) {
+		proc_map_remove(pmap, pmp);
+		list_del(&pmp->file_link);
+
+		_do_unmapping(pmap->mm, pmp);
+
+		page_map_free(pmp);
+	}
+
+	file_map_free(fmp);
+	put_proc_map(pmap);
+
+	return 0;
+}
+
+static const struct file_operations kvmmem_ops = {
+	.open		= kvm_dev_open,
+	.unlocked_ioctl = kvm_dev_ioctl,
+	.compat_ioctl   = kvm_dev_ioctl,
+	.release	= kvm_dev_release,
+};
+
+static struct miscdevice kvm_mem_dev = {
+	.minor		= MISC_DYNAMIC_MINOR,
+	.name		= "kvmmem",
+	.fops		= &kvmmem_ops,
+};
+
+static int __init kvm_intro_guest_init(void)
+{
+	int result = 0;
+
+	if (!kvm_para_available()) {
+		pr_err("kvmi: paravirt not available\n");
+		return -EPERM;
+	}
+
+	proc_map_cachep = KMEM_CACHE(proc_map, SLAB_PANIC | SLAB_ACCOUNT);
+	if (proc_map_cachep == NULL) {
+		result = -ENOMEM;
+		goto out_err;
+	}
+
+	file_map_cachep = KMEM_CACHE(file_map, SLAB_PANIC | SLAB_ACCOUNT);
+	if (file_map_cachep == NULL) {
+		result = -ENOMEM;
+		goto out_err;
+	}
+
+	page_map_cachep = KMEM_CACHE(page_map, SLAB_PANIC | SLAB_ACCOUNT);
+	if (page_map_cachep == NULL) {
+		result = -ENOMEM;
+		goto out_err;
+	}
+
+	result = misc_register(&kvm_mem_dev);
+	if (result) {
+		pr_err("kvmi: misc device register failed (%d)\n", result);
+		goto out_err;
+	}
+
+	pr_debug("kvmi: guest memory introspection device created\n");
+
+	return 0;
+
+out_err:
+	kmem_cache_destroy(page_map_cachep);
+	kmem_cache_destroy(file_map_cachep);
+	kmem_cache_destroy(proc_map_cachep);
+
+	return result;
+}
+
+static void __exit kvm_intro_guest_exit(void)
+{
+	misc_deregister(&kvm_mem_dev);
+
+	kmem_cache_destroy(page_map_cachep);
+	kmem_cache_destroy(file_map_cachep);
+	kmem_cache_destroy(proc_map_cachep);
+}
+
+module_init(kvm_intro_guest_init)
+module_exit(kvm_intro_guest_exit)

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 73/92] kvm: introspection: use remote mapping
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (71 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 72/92] kvm: introspection: add memory map/unmap support on the guest side Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 74/92] kvm: x86: do not unconditionally patch the hypercall instruction during emulation Adalbert Lazăr
                   ` (21 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Mircea Cîrjaliu

From: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>

This commit adds the missing KVMI_GET_MAP_TOKEN command and handle the
hypercalls used to map/unmap guest pages.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst |  39 ++++
 arch/x86/kvm/Makefile              |   2 +-
 arch/x86/kvm/x86.c                 |   6 +
 include/linux/kvmi.h               |   3 +
 virt/kvm/kvmi.c                    |  12 +-
 virt/kvm/kvmi_int.h                |  10 +
 virt/kvm/kvmi_mem.c                | 319 +++++++++++++++++++++++++++++
 virt/kvm/kvmi_msg.c                |  15 ++
 8 files changed, 404 insertions(+), 2 deletions(-)
 create mode 100644 virt/kvm/kvmi_mem.c

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
index 572abab1f6ef..b12e14f14c21 100644
--- a/Documentation/virtual/kvm/kvmi.rst
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -1144,6 +1144,45 @@ Returns the guest memory type for a specific physical address.
 * -KVM_EINVAL - padding is not zero
 * -KVM_EAGAIN - the selected vCPU can't be introspected yet
 
+25. KVMI_GET_MAP_TOKEN
+----------------------
+
+:Architecture: all
+:Versions: >= 1
+:Parameters: none
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_map_token_reply {
+		struct kvmi_map_mem_token token;
+	};
+
+Where::
+
+	struct kvmi_map_mem_token {
+		__u64 token[4];
+	};
+
+Requests a token for a memory map operation.
+
+On this command, the host generates a random token to be used (once)
+to map a physical page from the introspected guest. The introspector
+could use the token with the KVM_INTRO_MEM_MAP ioctl (on /dev/kvmmem)
+to map a guest physical page to one of its memory pages. The ioctl,
+in turn, will use the KVM_HC_MEM_MAP hypercall (see hypercalls.txt).
+
+The guest kernel exposing /dev/kvmmem keeps a list with all the mappings
+(to all the guests introspected by the tool) in order to unmap them
+(using the KVM_HC_MEM_UNMAP hypercall) when /dev/kvmmem is closed or on
+demand (using the KVM_INTRO_MEM_UNMAP ioctl).
+
+:Errors:
+
+* -KVM_EAGAIN - too many tokens have accumulated
+* -KVM_ENOMEM - not enough memory to allocate a new token
+
 Events
 ======
 
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 673cf37c0747..5bea446219ca 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -7,7 +7,7 @@ KVM := ../../../virt/kvm
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
 				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
-kvm-$(CONFIG_KVM_INTROSPECTION) += $(KVM)/kvmi.o $(KVM)/kvmi_msg.o kvmi.o
+kvm-$(CONFIG_KVM_INTROSPECTION) += $(KVM)/kvmi.o $(KVM)/kvmi_msg.o $(KVM)/kvmi_mem.o kvmi.o
 
 kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 06f44ce8ed07..04b1d2916a0a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7337,6 +7337,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
 		break;
 #ifdef CONFIG_KVM_INTROSPECTION
+	case KVM_HC_MEM_MAP:
+		ret = kvmi_host_mem_map(vcpu, (gva_t)a0, (gpa_t)a1, (gpa_t)a2);
+		break;
+	case KVM_HC_MEM_UNMAP:
+		ret = kvmi_host_mem_unmap(vcpu, (gpa_t)a0);
+		break;
 	case KVM_HC_XEN_HVM_OP:
 		ret = 0;
 		if (!kvmi_hypercall_event(vcpu))
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
index 10cd6c6412d2..dd980fb0ebcd 100644
--- a/include/linux/kvmi.h
+++ b/include/linux/kvmi.h
@@ -24,6 +24,9 @@ bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor, u8 write);
 bool kvmi_tracked_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 bool kvmi_single_step(struct kvm_vcpu *vcpu, gpa_t gpa, int *emulation_type);
 void kvmi_handle_requests(struct kvm_vcpu *vcpu);
+int kvmi_host_mem_map(struct kvm_vcpu *vcpu, gva_t tkn_gva,
+			     gpa_t req_gpa, gpa_t map_gpa);
+int kvmi_host_mem_unmap(struct kvm_vcpu *vcpu, gpa_t map_gpa);
 void kvmi_stop_ss(struct kvm_vcpu *vcpu);
 bool kvmi_vcpu_enabled_ss(struct kvm_vcpu *vcpu);
 void kvmi_init_emulate(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index ca146ffec061..157f3a401d64 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -10,6 +10,7 @@
 #include "kvmi_int.h"
 #include <linux/kthread.h>
 #include <linux/bitmap.h>
+#include <linux/remote_mapping.h>
 
 #define MAX_PAUSE_REQUESTS 1001
 
@@ -320,11 +321,13 @@ static int kvmi_cache_create(void)
 
 int kvmi_init(void)
 {
+	kvmi_mem_init();
 	return kvmi_cache_create();
 }
 
 void kvmi_uninit(void)
 {
+	kvmi_mem_exit();
 	kvmi_cache_destroy();
 }
 
@@ -1647,6 +1650,11 @@ int kvmi_cmd_write_physical(struct kvm *kvm, u64 gpa, u64 size, const void *buf)
 	return 0;
 }
 
+int kvmi_cmd_alloc_token(struct kvm *kvm, struct kvmi_map_mem_token *token)
+{
+	return kvmi_mem_generate_token(kvm, token);
+}
+
 int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable)
 {
@@ -2015,7 +2023,9 @@ int kvmi_ioctl_unhook(struct kvm *kvm, bool force_reset)
 	if (!ikvm)
 		return -EFAULT;
 
-	if (!force_reset && !kvmi_unhook_event(kvm))
+	if (force_reset)
+		mm_remote_reset();
+	else if (!kvmi_unhook_event(kvm))
 		err = -ENOENT;
 
 	kvmi_put(kvm);
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
index c96fa2b1e9b7..2432377d6371 100644
--- a/virt/kvm/kvmi_int.h
+++ b/virt/kvm/kvmi_int.h
@@ -148,6 +148,8 @@ struct kvmi {
 	struct task_struct *recv;
 	atomic_t ev_seq;
 
+	atomic_t num_tokens;
+
 	uuid_t uuid;
 
 	DECLARE_BITMAP(cmd_allow_mask, KVMI_NUM_COMMANDS);
@@ -229,7 +231,9 @@ int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, unsigned int event_id,
 			    bool enable);
 int kvmi_cmd_control_vm_events(struct kvmi *ikvm, unsigned int event_id,
 			       bool enable);
+int kvmi_cmd_alloc_token(struct kvm *kvm, struct kvmi_map_mem_token *token);
 int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu, bool wait);
+unsigned long gfn_to_hva_safe(struct kvm *kvm, gfn_t gfn);
 struct kvmi * __must_check kvmi_get(struct kvm *kvm);
 void kvmi_put(struct kvm *kvm);
 int kvmi_run_jobs_and_wait(struct kvm_vcpu *vcpu);
@@ -298,4 +302,10 @@ int kvmi_arch_cmd_control_msr(struct kvm_vcpu *vcpu,
 			      const struct kvmi_control_msr *req);
 int kvmi_arch_cmd_get_mtrr_type(struct kvm_vcpu *vcpu, u64 gpa, u8 *type);
 
+/* kvmi_mem.c */
+void kvmi_mem_init(void);
+void kvmi_mem_exit(void);
+int kvmi_mem_generate_token(struct kvm *kvm, struct kvmi_map_mem_token *token);
+void kvmi_clear_vm_tokens(struct kvm *kvm);
+
 #endif
diff --git a/virt/kvm/kvmi_mem.c b/virt/kvm/kvmi_mem.c
new file mode 100644
index 000000000000..6244add60062
--- /dev/null
+++ b/virt/kvm/kvmi_mem.c
@@ -0,0 +1,319 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection memory mapping implementation
+ *
+ * Copyright (C) 2017-2019 Bitdefender S.R.L.
+ *
+ * Author:
+ *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/kvm_host.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/spinlock.h>
+#include <linux/printk.h>
+#include <linux/random.h>
+#include <linux/kvmi.h>
+#include <linux/ktime.h>
+#include <linux/hrtimer.h>
+#include <linux/workqueue.h>
+#include <linux/remote_mapping.h>
+
+#include <uapi/linux/kvmi.h>
+
+#include "kvmi_int.h"
+
+#define KVMI_MEM_MAX_TOKENS 8
+#define KVMI_MEM_TOKEN_TIMEOUT 3
+#define TOKEN_TIMEOUT_NSEC (KVMI_MEM_TOKEN_TIMEOUT * NSEC_PER_SEC)
+
+static struct list_head token_list;
+static spinlock_t token_lock;
+static struct hrtimer token_timer;
+static struct work_struct token_work;
+
+struct token_entry {
+	struct list_head token_list;
+	struct kvmi_map_mem_token token;
+	struct kvm *kvm;
+	ktime_t timestamp;
+};
+
+void kvmi_clear_vm_tokens(struct kvm *kvm)
+{
+	struct token_entry *cur, *next;
+	struct kvmi *ikvm = IKVM(kvm);
+	struct list_head temp;
+
+	INIT_LIST_HEAD(&temp);
+
+	spin_lock(&token_lock);
+	list_for_each_entry_safe(cur, next, &token_list, token_list) {
+		if (cur->kvm == kvm) {
+			atomic_dec(&ikvm->num_tokens);
+
+			list_del(&cur->token_list);
+			list_add(&cur->token_list, &temp);
+		}
+	}
+	spin_unlock(&token_lock);
+
+	/* freeing a KVM may sleep */
+	list_for_each_entry_safe(cur, next, &temp, token_list) {
+		kvm_put_kvm(cur->kvm);
+		kfree(cur);
+	}
+}
+
+static void token_timeout_work(struct work_struct *work)
+{
+	struct token_entry *cur, *next;
+	ktime_t now = ktime_get();
+	struct kvmi *ikvm;
+	struct list_head temp;
+
+	INIT_LIST_HEAD(&temp);
+
+	spin_lock(&token_lock);
+	list_for_each_entry_safe(cur, next, &token_list, token_list)
+		if (ktime_sub(now, cur->timestamp) > TOKEN_TIMEOUT_NSEC) {
+			ikvm = kvmi_get(cur->kvm);
+			if (ikvm) {
+				atomic_dec(&ikvm->num_tokens);
+				kvmi_put(cur->kvm);
+			}
+
+			list_del(&cur->token_list);
+			list_add(&cur->token_list, &temp);
+		}
+	spin_unlock(&token_lock);
+
+	if (!list_empty(&temp))
+		kvm_info("kvmi: token(s) timed out\n");
+
+	/* freeing a KVM may sleep */
+	list_for_each_entry_safe(cur, next, &temp, token_list) {
+		kvm_put_kvm(cur->kvm);
+		kfree(cur);
+	}
+}
+
+static enum hrtimer_restart token_timer_fn(struct hrtimer *timer)
+{
+	schedule_work(&token_work);
+
+	hrtimer_add_expires_ns(timer, NSEC_PER_SEC);
+	return HRTIMER_RESTART;
+}
+
+int kvmi_mem_generate_token(struct kvm *kvm, struct kvmi_map_mem_token *token)
+{
+	struct kvmi *ikvm;
+	struct token_entry *tep;
+
+	/* too many tokens have accumulated, retry later */
+	ikvm = IKVM(kvm);
+	if (atomic_read(&ikvm->num_tokens) > KVMI_MEM_MAX_TOKENS)
+		return -KVM_EAGAIN;
+
+	print_hex_dump_debug("kvmi: new token ", DUMP_PREFIX_NONE,
+			     32, 1, token, sizeof(*token), false);
+
+	tep = kmalloc(sizeof(*tep), GFP_KERNEL);
+	if (tep == NULL)
+		return -KVM_ENOMEM;
+
+	/* pin KVM so it won't go away while we wait for HC */
+	kvm_get_kvm(kvm);
+	get_random_bytes(token, sizeof(*token));
+	atomic_inc(&ikvm->num_tokens);
+
+	/* init token entry */
+	INIT_LIST_HEAD(&tep->token_list);
+	memcpy(&tep->token, token, sizeof(*token));
+	tep->kvm = kvm;
+	tep->timestamp = ktime_get();
+
+	/* add to list */
+	spin_lock(&token_lock);
+	list_add_tail(&tep->token_list, &token_list);
+	spin_unlock(&token_lock);
+
+	return 0;
+}
+
+static struct kvm *find_machine_at(struct kvm_vcpu *vcpu, gva_t tkn_gva)
+{
+	long result;
+	gpa_t tkn_gpa;
+	struct kvmi_map_mem_token token;
+	struct list_head *cur;
+	struct token_entry *tep, *found = NULL;
+	struct kvm *target_kvm = NULL;
+	struct kvmi *ikvm;
+
+	/* machine token is passed as pointer */
+	tkn_gpa = kvm_mmu_gva_to_gpa_system(vcpu, tkn_gva, 0, NULL);
+	if (tkn_gpa == UNMAPPED_GVA)
+		return NULL;
+
+	/* copy token to local address space */
+	result = kvm_read_guest(vcpu->kvm, tkn_gpa, &token, sizeof(token));
+	if (IS_ERR_VALUE(result)) {
+		kvm_err("kvmi: failed copying token from user\n");
+		return ERR_PTR(result);
+	}
+
+	/* consume token & find the VM */
+	spin_lock(&token_lock);
+	list_for_each(cur, &token_list) {
+		tep = list_entry(cur, struct token_entry, token_list);
+
+		if (!memcmp(&token, &tep->token, sizeof(token))) {
+			list_del(&tep->token_list);
+			found = tep;
+			break;
+		}
+	}
+	spin_unlock(&token_lock);
+
+	if (found != NULL) {
+		target_kvm = found->kvm;
+		kfree(found);
+
+		ikvm = kvmi_get(target_kvm);
+		if (ikvm) {
+			atomic_dec(&ikvm->num_tokens);
+			kvmi_put(target_kvm);
+		}
+	}
+
+	return target_kvm;
+}
+
+
+int kvmi_host_mem_map(struct kvm_vcpu *vcpu, gva_t tkn_gva,
+		      gpa_t req_gpa, gpa_t map_gpa)
+{
+	int result = 0;
+	struct kvm *target_kvm;
+
+	gfn_t req_gfn;
+	hva_t req_hva;
+	struct mm_struct *req_mm;
+
+	gfn_t map_gfn;
+	hva_t map_hva;
+
+	kvm_debug("kvmi: mapping request req_gpa %016llx, map_gpa %016llx\n",
+		  req_gpa, map_gpa);
+
+	/* get the struct kvm * corresponding to the token */
+	target_kvm = find_machine_at(vcpu, tkn_gva);
+	if (IS_ERR_VALUE(target_kvm)) {
+		return PTR_ERR(target_kvm);
+	} else if (target_kvm == NULL) {
+		kvm_err("kvmi: unable to find target machine\n");
+		return -KVM_ENOENT;
+	}
+	req_mm = target_kvm->mm;
+
+	/* translate source addresses */
+	req_gfn = gpa_to_gfn(req_gpa);
+	req_hva = gfn_to_hva_safe(target_kvm, req_gfn);
+	if (kvm_is_error_hva(req_hva)) {
+		kvm_err("kvmi: invalid req_gpa %016llx\n", req_gpa);
+		result = -KVM_EFAULT;
+		goto out;
+	}
+
+	kvm_debug("kvmi: req_gpa %016llx -> req_hva %016lx\n",
+		  req_gpa, req_hva);
+
+	/* translate destination addresses */
+	map_gfn = gpa_to_gfn(map_gpa);
+	map_hva = gfn_to_hva_safe(vcpu->kvm, map_gfn);
+	if (kvm_is_error_hva(map_hva)) {
+		kvm_err("kvmi: invalid map_gpa %016llx\n", map_gpa);
+		result = -KVM_EFAULT;
+		goto out;
+	}
+
+	kvm_debug("kvmi: map_gpa %016llx -> map_hva %016lx\n",
+		map_gpa, map_hva);
+
+	/* actually do the mapping */
+	result = mm_remote_map(req_mm, req_hva, map_hva);
+	if (IS_ERR_VALUE((long)result)) {
+		if (result == -EBUSY)
+			kvm_debug("kvmi: mapping of req_gpa %016llx failed: %d.\n",
+				req_gpa, result);
+		else
+			kvm_err("kvmi: mapping of req_gpa %016llx failed: %d.\n",
+				req_gpa, result);
+		goto out;
+	}
+
+	/* all fine */
+	kvm_debug("kvmi: mapping of req_gpa %016llx successful\n", req_gpa);
+
+out:
+	kvm_put_kvm(target_kvm);
+
+	return result;
+}
+
+int kvmi_host_mem_unmap(struct kvm_vcpu *vcpu, gpa_t map_gpa)
+{
+	gfn_t map_gfn;
+	hva_t map_hva;
+	int result;
+
+	kvm_debug("kvmi: unmapping request for map_gpa %016llx\n", map_gpa);
+
+	/* convert GPA -> HVA */
+	map_gfn = gpa_to_gfn(map_gpa);
+	map_hva = gfn_to_hva_safe(vcpu->kvm, map_gfn);
+	if (kvm_is_error_hva(map_hva)) {
+		result = -KVM_EFAULT;
+		kvm_err("kvmi: invalid map_gpa %016llx\n", map_gpa);
+		goto out;
+	}
+
+	kvm_debug("kvmi: map_gpa %016llx -> map_hva %016lx\n",
+		map_gpa, map_hva);
+
+	/* actually do the unmapping */
+	result = mm_remote_unmap(map_hva);
+	if (IS_ERR_VALUE((long)result))
+		goto out;
+
+	kvm_debug("kvmi: unmapping of map_gpa %016llx successful\n", map_gpa);
+
+out:
+	return result;
+}
+
+void kvmi_mem_init(void)
+{
+	ktime_t expire;
+
+	INIT_LIST_HEAD(&token_list);
+	spin_lock_init(&token_lock);
+	INIT_WORK(&token_work, token_timeout_work);
+
+	hrtimer_init(&token_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	token_timer.function = token_timer_fn;
+	expire = ktime_add_ns(ktime_get(), NSEC_PER_SEC);
+	hrtimer_start(&token_timer, expire, HRTIMER_MODE_ABS);
+
+	kvm_info("kvmi: initialized host memory introspection\n");
+}
+
+void kvmi_mem_exit(void)
+{
+	hrtimer_cancel(&token_timer);
+}
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index 3e381f95b686..a5f87aafa237 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -33,6 +33,7 @@ static const char *const msg_IDs[] = {
 	[KVMI_EVENT_REPLY]           = "KVMI_EVENT_REPLY",
 	[KVMI_GET_CPUID]             = "KVMI_GET_CPUID",
 	[KVMI_GET_GUEST_INFO]        = "KVMI_GET_GUEST_INFO",
+	[KVMI_GET_MAP_TOKEN]         = "KVMI_GET_MAP_TOKEN",
 	[KVMI_GET_MTRR_TYPE]         = "KVMI_GET_MTRR_TYPE",
 	[KVMI_GET_PAGE_ACCESS]       = "KVMI_GET_PAGE_ACCESS",
 	[KVMI_GET_PAGE_WRITE_BITMAP] = "KVMI_GET_PAGE_WRITE_BITMAP",
@@ -352,6 +353,19 @@ static int handle_write_physical(struct kvmi *ikvm,
 	return kvmi_msg_vm_maybe_reply(ikvm, msg, ec, NULL, 0);
 }
 
+static int handle_get_map_token(struct kvmi *ikvm,
+				const struct kvmi_msg_hdr *msg,
+				const void *_req)
+{
+	struct kvmi_get_map_token_reply rpl;
+	int ec;
+
+	memset(&rpl, 0, sizeof(rpl));
+	ec = kvmi_cmd_alloc_token(ikvm->kvm, &rpl.token);
+
+	return kvmi_msg_vm_maybe_reply(ikvm, msg, ec, &rpl, sizeof(rpl));
+}
+
 static bool enable_spp(struct kvmi *ikvm)
 {
 	if (!ikvm->spp.initialized) {
@@ -524,6 +538,7 @@ static int(*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
 	[KVMI_CONTROL_SPP]           = handle_control_spp,
 	[KVMI_CONTROL_VM_EVENTS]     = handle_control_vm_events,
 	[KVMI_GET_GUEST_INFO]        = handle_get_guest_info,
+	[KVMI_GET_MAP_TOKEN]         = handle_get_map_token,
 	[KVMI_GET_PAGE_ACCESS]       = handle_get_page_access,
 	[KVMI_GET_PAGE_WRITE_BITMAP] = handle_get_page_write_bitmap,
 	[KVMI_GET_VERSION]           = handle_get_version,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 74/92] kvm: x86: do not unconditionally patch the hypercall instruction during emulation
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (72 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 73/92] kvm: introspection: use remote mapping Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-13  9:20   ` Paolo Bonzini
  2019-08-09 16:00 ` [RFC PATCH v6 75/92] kvm: x86: disable gpa_available optimization in emulator_read_write_onepage() Adalbert Lazăr
                   ` (20 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Joerg Roedel

From: Mihai Donțu <mdontu@bitdefender.com>

It can happened for us to end up emulating the VMCALL instruction as a
result of the handling of an EPT write fault. In this situation, the
emulator will try to unconditionally patch the correct hypercall opcode
bytes using emulator_write_emulated(). However, this last call uses the
fault GPA (if available) or walks the guest page tables at RIP,
otherwise. The trouble begins when using KVMI, when we forbid the use of
the fault GPA and fallback to the guest pt walk: in Windows (8.1 and
newer) the page that we try to write into is marked read-execute and as
such emulator_write_emulated() fails and we inject a write #PF, leading
to a guest crash.

The fix is rather simple: check the existing instruction bytes before
doing the patching. This does not change the normal KVM behaviour, but
does help when using KVMI as we no longer inject a write #PF.

CC: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/x86.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04b1d2916a0a..965c4f0108eb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7363,16 +7363,33 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
 
+#define KVM_HYPERCALL_INSN_LEN 3
+
 static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
 {
+	int err;
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-	char instruction[3];
+	char buf[KVM_HYPERCALL_INSN_LEN];
+	char instruction[KVM_HYPERCALL_INSN_LEN];
 	unsigned long rip = kvm_rip_read(vcpu);
 
+	err = emulator_read_emulated(ctxt, rip, buf, sizeof(buf),
+				     &ctxt->exception);
+	if (err != X86EMUL_CONTINUE)
+		return err;
+
 	kvm_x86_ops->patch_hypercall(vcpu, instruction);
+	if (!memcmp(instruction, buf, sizeof(instruction)))
+		/*
+		 * The hypercall instruction is the correct one. Retry
+		 * its execution maybe we got here as a result of an
+		 * event other than #UD which has been resolved in the
+		 * mean time.
+		 */
+		return X86EMUL_CONTINUE;
 
-	return emulator_write_emulated(ctxt, rip, instruction, 3,
-		&ctxt->exception);
+	return emulator_write_emulated(ctxt, rip, instruction,
+				       sizeof(instruction), &ctxt->exception);
 }
 
 static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu)

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 75/92] kvm: x86: disable gpa_available optimization in emulator_read_write_onepage()
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (73 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 74/92] kvm: x86: do not unconditionally patch the hypercall instruction during emulation Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-13  8:47   ` Paolo Bonzini
  2019-08-09 16:00 ` [RFC PATCH v6 76/92] kvm: x86: disable EPT A/D bits if introspection is present Adalbert Lazăr
                   ` (19 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

If the EPT violation was caused by an execute restriction imposed by the
introspection tool, gpa_available will point to the instruction pointer,
not the to the read/write location that has to be used to emulate the
current instruction.

This optimization should be disabled only when the VM is introspected,
not just because the introspection subsystem is present.

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 965c4f0108eb..3975331230b9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5532,7 +5532,7 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
 	 * operation using rep will only have the initial GPA from the NPF
 	 * occurred.
 	 */
-	if (vcpu->arch.gpa_available &&
+	if (vcpu->arch.gpa_available && !kvmi_is_present() &&
 	    emulator_can_use_gpa(ctxt) &&
 	    (addr & ~PAGE_MASK) == (vcpu->arch.gpa_val & ~PAGE_MASK)) {
 		gpa = vcpu->arch.gpa_val;

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 76/92] kvm: x86: disable EPT A/D bits if introspection is present
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (74 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 75/92] kvm: x86: disable gpa_available optimization in emulator_read_write_onepage() Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-13  9:18   ` Paolo Bonzini
  2019-08-09 16:00 ` [RFC PATCH v6 77/92] kvm: introspection: add trace functions Adalbert Lazăr
                   ` (18 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/vmx/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index dc648ba47df3..152c58b63f69 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7718,7 +7718,7 @@ static __init int hardware_setup(void)
 	    !cpu_has_vmx_invept_global())
 		enable_ept = 0;
 
-	if (!cpu_has_vmx_ept_ad_bits() || !enable_ept)
+	if (!cpu_has_vmx_ept_ad_bits() || !enable_ept || kvmi_is_present())
 		enable_ept_ad_bits = 0;
 
 	if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 77/92] kvm: introspection: add trace functions
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (75 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 76/92] kvm: x86: disable EPT A/D bits if introspection is present Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 78/92] kvm: x86: add tracepoints for interrupt and exception injections Adalbert Lazăr
                   ` (17 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu,
	Mircea Cîrjaliu, Marian Rotariu

Co-developed-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Co-developed-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Co-developed-by: Marian Rotariu <marian.c.rotariu@gmail.com>
Signed-off-by: Marian Rotariu <marian.c.rotariu@gmail.com>
Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/kvmi.c         |  63 ++++
 include/trace/events/kvmi.h | 680 ++++++++++++++++++++++++++++++++++++
 virt/kvm/kvmi.c             |  20 ++
 virt/kvm/kvmi_mem.c         |   5 +
 virt/kvm/kvmi_msg.c         |  16 +
 5 files changed, 784 insertions(+)
 create mode 100644 include/trace/events/kvmi.h

diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
index 5312f179af9c..171e76449271 100644
--- a/arch/x86/kvm/kvmi.c
+++ b/arch/x86/kvm/kvmi.c
@@ -9,6 +9,8 @@
 #include <asm/vmx.h>
 #include "../../../virt/kvm/kvmi_int.h"
 
+#include <trace/events/kvmi.h>
+
 static unsigned long *msr_mask(struct kvm_vcpu *vcpu, unsigned int *msr)
 {
 	switch (*msr) {
@@ -102,6 +104,9 @@ static bool __kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	if (old_msr.data == msr->data)
 		return true;
 
+	trace_kvmi_event_msr_send(vcpu->vcpu_id, msr->index, old_msr.data,
+				  msr->data);
+
 	action = kvmi_send_msr(vcpu, msr->index, old_msr.data, msr->data,
 			       &ret_value);
 	switch (action) {
@@ -113,6 +118,8 @@ static bool __kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr)
 		kvmi_handle_common_event_actions(vcpu, action, "MSR");
 	}
 
+	trace_kvmi_event_msr_recv(vcpu->vcpu_id, action, ret_value);
+
 	return ret;
 }
 
@@ -387,6 +394,8 @@ static bool __kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
 	if (!test_bit(cr, IVCPU(vcpu)->cr_mask))
 		return true;
 
+	trace_kvmi_event_cr_send(vcpu->vcpu_id, cr, old_value, *new_value);
+
 	action = kvmi_send_cr(vcpu, cr, old_value, *new_value, &ret_value);
 	switch (action) {
 	case KVMI_EVENT_ACTION_CONTINUE:
@@ -397,6 +406,8 @@ static bool __kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
 		kvmi_handle_common_event_actions(vcpu, action, "CR");
 	}
 
+	trace_kvmi_event_cr_recv(vcpu->vcpu_id, action, ret_value);
+
 	return ret;
 }
 
@@ -437,6 +448,8 @@ static void __kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
 {
 	u32 action;
 
+	trace_kvmi_event_xsetbv_send(vcpu->vcpu_id);
+
 	action = kvmi_send_xsetbv(vcpu);
 	switch (action) {
 	case KVMI_EVENT_ACTION_CONTINUE:
@@ -444,6 +457,8 @@ static void __kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
 	default:
 		kvmi_handle_common_event_actions(vcpu, action, "XSETBV");
 	}
+
+	trace_kvmi_event_xsetbv_recv(vcpu->vcpu_id, action);
 }
 
 void kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
@@ -460,12 +475,26 @@ void kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
 	kvmi_put(vcpu->kvm);
 }
 
+static u64 get_next_rip(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	if (ivcpu->have_delayed_regs)
+		return ivcpu->delayed_regs.rip;
+	else
+		return kvm_rip_read(vcpu);
+}
+
 void kvmi_arch_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva, u8 insn_len)
 {
 	u32 action;
 	u64 gpa;
+	u64 old_rip;
 
 	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, 0, NULL);
+	old_rip = kvm_rip_read(vcpu);
+
+	trace_kvmi_event_bp_send(vcpu->vcpu_id, gpa, old_rip);
 
 	action = kvmi_msg_send_bp(vcpu, gpa, insn_len);
 	switch (action) {
@@ -478,6 +507,8 @@ void kvmi_arch_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva, u8 insn_len)
 	default:
 		kvmi_handle_common_event_actions(vcpu, action, "BP");
 	}
+
+	trace_kvmi_event_bp_recv(vcpu->vcpu_id, action, get_next_rip(vcpu));
 }
 
 #define KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT 24
@@ -504,6 +535,8 @@ void kvmi_arch_hypercall_event(struct kvm_vcpu *vcpu)
 {
 	u32 action;
 
+	trace_kvmi_event_hc_send(vcpu->vcpu_id);
+
 	action = kvmi_msg_send_hypercall(vcpu);
 	switch (action) {
 	case KVMI_EVENT_ACTION_CONTINUE:
@@ -511,6 +544,8 @@ void kvmi_arch_hypercall_event(struct kvm_vcpu *vcpu)
 	default:
 		kvmi_handle_common_event_actions(vcpu, action, "HYPERCALL");
 	}
+
+	trace_kvmi_event_hc_recv(vcpu->vcpu_id, action);
 }
 
 bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
@@ -532,6 +567,9 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 	if (ivcpu->effective_rep_complete)
 		return true;
 
+	trace_kvmi_event_pf_send(vcpu->vcpu_id, gpa, gva, access,
+				 kvm_rip_read(vcpu));
+
 	action = kvmi_msg_send_pf(vcpu, gpa, gva, access, &ivcpu->ss_requested,
 				  &ivcpu->rep_complete, &ctx_addr,
 				  ivcpu->ctx_data, &ctx_size);
@@ -553,6 +591,9 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
 		kvmi_handle_common_event_actions(vcpu, action, "PF");
 	}
 
+	trace_kvmi_event_pf_recv(vcpu->vcpu_id, action, get_next_rip(vcpu),
+				 ctx_size, ivcpu->ss_requested, ret);
+
 	return ret;
 }
 
@@ -628,6 +669,11 @@ void kvmi_arch_trap_event(struct kvm_vcpu *vcpu)
 		err = 0;
 	}
 
+	trace_kvmi_event_trap_send(vcpu->vcpu_id, vector,
+				   IVCPU(vcpu)->exception.nr,
+				   err, IVCPU(vcpu)->exception.error_code,
+				   vcpu->arch.cr2);
+
 	action = kvmi_send_trap(vcpu, vector, type, err, vcpu->arch.cr2);
 	switch (action) {
 	case KVMI_EVENT_ACTION_CONTINUE:
@@ -635,6 +681,8 @@ void kvmi_arch_trap_event(struct kvm_vcpu *vcpu)
 	default:
 		kvmi_handle_common_event_actions(vcpu, action, "TRAP");
 	}
+
+	trace_kvmi_event_trap_recv(vcpu->vcpu_id, action);
 }
 
 static bool __kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor,
@@ -643,6 +691,8 @@ static bool __kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor,
 	u32 action;
 	bool ret = false;
 
+	trace_kvmi_event_desc_send(vcpu->vcpu_id, descriptor, write);
+
 	action = kvmi_msg_send_descriptor(vcpu, descriptor, write);
 	switch (action) {
 	case KVMI_EVENT_ACTION_CONTINUE:
@@ -654,6 +704,8 @@ static bool __kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor,
 		kvmi_handle_common_event_actions(vcpu, action, "DESC");
 	}
 
+	trace_kvmi_event_desc_recv(vcpu->vcpu_id, action);
+
 	return ret;
 }
 
@@ -718,6 +770,15 @@ int kvmi_arch_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
 				   bool error_code_valid,
 				   u32 error_code, u64 address)
 {
+	struct x86_exception e = {
+		.error_code_valid = error_code_valid,
+		.error_code = error_code,
+		.address = address,
+		.vector = vector,
+	};
+
+	trace_kvmi_cmd_inject_exception(vcpu, &e);
+
 	if (!(is_vector_valid(vector) && is_gva_valid(vcpu, address)))
 		return -KVM_EINVAL;
 
@@ -876,6 +937,8 @@ void kvmi_arch_update_page_tracking(struct kvm *kvm,
 			return;
 	}
 
+	trace_kvmi_set_gfn_access(m->gfn, m->access, m->write_bitmap, slot->id);
+
 	for (i = 0; i < ARRAY_SIZE(track_modes); i++) {
 		unsigned int allow_bit = track_modes[i].allow_bit;
 		enum kvm_page_track_mode mode = track_modes[i].track_mode;
diff --git a/include/trace/events/kvmi.h b/include/trace/events/kvmi.h
new file mode 100644
index 000000000000..442189437fe7
--- /dev/null
+++ b/include/trace/events/kvmi.h
@@ -0,0 +1,680 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM kvmi
+
+#if !defined(_TRACE_KVMI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_KVMI_H
+
+#include <linux/tracepoint.h>
+
+#ifndef __TRACE_KVMI_STRUCTURES
+#define __TRACE_KVMI_STRUCTURES
+
+#undef EN
+#define EN(x) { x, #x }
+
+static const struct trace_print_flags kvmi_msg_id_symbol[] = {
+	EN(KVMI_GET_VERSION),
+	EN(KVMI_CHECK_COMMAND),
+	EN(KVMI_CHECK_EVENT),
+	EN(KVMI_GET_GUEST_INFO),
+	EN(KVMI_GET_VCPU_INFO),
+	EN(KVMI_GET_REGISTERS),
+	EN(KVMI_SET_REGISTERS),
+	EN(KVMI_GET_PAGE_ACCESS),
+	EN(KVMI_SET_PAGE_ACCESS),
+	EN(KVMI_GET_PAGE_WRITE_BITMAP),
+	EN(KVMI_SET_PAGE_WRITE_BITMAP),
+	EN(KVMI_INJECT_EXCEPTION),
+	EN(KVMI_READ_PHYSICAL),
+	EN(KVMI_WRITE_PHYSICAL),
+	EN(KVMI_GET_MAP_TOKEN),
+	EN(KVMI_CONTROL_EVENTS),
+	EN(KVMI_CONTROL_CR),
+	EN(KVMI_CONTROL_MSR),
+	EN(KVMI_EVENT),
+	EN(KVMI_EVENT_REPLY),
+	EN(KVMI_GET_CPUID),
+	EN(KVMI_GET_XSAVE),
+	EN(KVMI_PAUSE_VCPU),
+	EN(KVMI_CONTROL_VM_EVENTS),
+	EN(KVMI_GET_MTRR_TYPE),
+	EN(KVMI_CONTROL_SPP),
+	EN(KVMI_CONTROL_CMD_RESPONSE),
+	{-1, NULL}
+};
+
+static const struct trace_print_flags kvmi_descriptor_symbol[] = {
+	EN(KVMI_DESC_IDTR),
+	EN(KVMI_DESC_GDTR),
+	EN(KVMI_DESC_LDTR),
+	EN(KVMI_DESC_TR),
+	{-1, NULL}
+};
+
+static const struct trace_print_flags kvmi_event_symbol[] = {
+	EN(KVMI_EVENT_UNHOOK),
+	EN(KVMI_EVENT_CR),
+	EN(KVMI_EVENT_MSR),
+	EN(KVMI_EVENT_XSETBV),
+	EN(KVMI_EVENT_BREAKPOINT),
+	EN(KVMI_EVENT_HYPERCALL),
+	EN(KVMI_EVENT_PF),
+	EN(KVMI_EVENT_TRAP),
+	EN(KVMI_EVENT_DESCRIPTOR),
+	EN(KVMI_EVENT_CREATE_VCPU),
+	EN(KVMI_EVENT_PAUSE_VCPU),
+	EN(KVMI_EVENT_SINGLESTEP),
+	{ -1, NULL }
+};
+
+static const struct trace_print_flags kvmi_action_symbol[] = {
+	{KVMI_EVENT_ACTION_CONTINUE, "continue"},
+	{KVMI_EVENT_ACTION_RETRY, "retry"},
+	{KVMI_EVENT_ACTION_CRASH, "crash"},
+	{-1, NULL}
+};
+
+#endif /* __TRACE_KVMI_STRUCTURES */
+
+TRACE_EVENT(
+	kvmi_vm_command,
+	TP_PROTO(__u16 id, __u32 seq),
+	TP_ARGS(id, seq),
+	TP_STRUCT__entry(
+		__field(__u16, id)
+		__field(__u32, seq)
+	),
+	TP_fast_assign(
+		__entry->id = id;
+		__entry->seq = seq;
+	),
+	TP_printk("%s seq %d",
+		  trace_print_symbols_seq(p, __entry->id, kvmi_msg_id_symbol),
+		  __entry->seq)
+);
+
+TRACE_EVENT(
+	kvmi_vm_reply,
+	TP_PROTO(__u16 id, __u32 seq, __s32 err),
+	TP_ARGS(id, seq, err),
+	TP_STRUCT__entry(
+		__field(__u16, id)
+		__field(__u32, seq)
+		__field(__s32, err)
+	),
+	TP_fast_assign(
+		__entry->id = id;
+		__entry->seq = seq;
+		__entry->err = err;
+	),
+	TP_printk("%s seq %d err %d",
+		  trace_print_symbols_seq(p, __entry->id, kvmi_msg_id_symbol),
+		  __entry->seq,
+		  __entry->err)
+);
+
+TRACE_EVENT(
+	kvmi_vcpu_command,
+	TP_PROTO(__u16 vcpu, __u16 id, __u32 seq),
+	TP_ARGS(vcpu, id, seq),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u16, id)
+		__field(__u32, seq)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->id = id;
+		__entry->seq = seq;
+	),
+	TP_printk("vcpu %d %s seq %d",
+		  __entry->vcpu,
+		  trace_print_symbols_seq(p, __entry->id, kvmi_msg_id_symbol),
+		  __entry->seq)
+);
+
+TRACE_EVENT(
+	kvmi_vcpu_reply,
+	TP_PROTO(__u16 vcpu, __u16 id, __u32 seq, __s32 err),
+	TP_ARGS(vcpu, id, seq, err),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u16, id)
+		__field(__u32, seq)
+		__field(__s32, err)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->id = id;
+		__entry->seq = seq;
+		__entry->err = err;
+	),
+	TP_printk("vcpu %d %s seq %d err %d",
+		  __entry->vcpu,
+		  trace_print_symbols_seq(p, __entry->id, kvmi_msg_id_symbol),
+		  __entry->seq,
+		  __entry->err)
+);
+
+TRACE_EVENT(
+	kvmi_event,
+	TP_PROTO(__u16 vcpu, __u32 id, __u32 seq),
+	TP_ARGS(vcpu, id, seq),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u32, id)
+		__field(__u32, seq)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->id = id;
+		__entry->seq = seq;
+	),
+	TP_printk("vcpu %d %s seq %d",
+		__entry->vcpu,
+		trace_print_symbols_seq(p, __entry->id, kvmi_event_symbol),
+		__entry->seq)
+);
+
+TRACE_EVENT(
+	kvmi_event_reply,
+	TP_PROTO(__u32 id, __u32 seq),
+	TP_ARGS(id, seq),
+	TP_STRUCT__entry(
+		__field(__u32, id)
+		__field(__u32, seq)
+	),
+	TP_fast_assign(
+		__entry->id = id;
+		__entry->seq = seq;
+	),
+	TP_printk("%s seq %d",
+		trace_print_symbols_seq(p, __entry->id, kvmi_event_symbol),
+		__entry->seq)
+);
+
+#define KVMI_ACCESS_PRINTK() ({						\
+	const char *saved_ptr = trace_seq_buffer_ptr(p);		\
+	static const char * const access_str[] = {			\
+		"---", "r--", "-w-", "rw-", "--x", "r-x", "-wx", "rwx"	\
+	};								\
+	trace_seq_printf(p, "%s", access_str[__entry->access & 7]);	\
+	saved_ptr;							\
+})
+
+TRACE_EVENT(
+	kvmi_set_gfn_access,
+	TP_PROTO(__u64 gfn, __u8 access, __u32 bitmap, __u16 slot),
+	TP_ARGS(gfn, access, bitmap, slot),
+	TP_STRUCT__entry(
+		__field(__u64, gfn)
+		__field(__u8, access)
+		__field(__u32, bitmap)
+		__field(__u16, slot)
+	),
+	TP_fast_assign(
+		__entry->gfn = gfn;
+		__entry->access = access;
+		__entry->bitmap = bitmap;
+		__entry->slot = slot;
+	),
+	TP_printk("gfn %llx %s write bitmap %x slot %d",
+		  __entry->gfn, KVMI_ACCESS_PRINTK(),
+		  __entry->bitmap, __entry->slot)
+);
+
+DECLARE_EVENT_CLASS(
+	kvmi_event_send_template,
+	TP_PROTO(__u16 vcpu),
+	TP_ARGS(vcpu),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+	),
+	TP_printk("vcpu %d",
+		  __entry->vcpu
+	)
+);
+DECLARE_EVENT_CLASS(
+	kvmi_event_recv_template,
+	TP_PROTO(__u16 vcpu, __u32 action),
+	TP_ARGS(vcpu, action),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u32, action)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->action = action;
+	),
+	TP_printk("vcpu %d %s",
+		  __entry->vcpu,
+		  trace_print_symbols_seq(p, __entry->action,
+					  kvmi_action_symbol)
+	)
+);
+
+TRACE_EVENT(
+	kvmi_event_cr_send,
+	TP_PROTO(__u16 vcpu, __u32 cr, __u64 old_value, __u64 new_value),
+	TP_ARGS(vcpu, cr, old_value, new_value),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u32, cr)
+		__field(__u64, old_value)
+		__field(__u64, new_value)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->cr = cr;
+		__entry->old_value = old_value;
+		__entry->new_value = new_value;
+	),
+	TP_printk("vcpu %d cr %x old_value %llx new_value %llx",
+		  __entry->vcpu,
+		  __entry->cr,
+		  __entry->old_value,
+		  __entry->new_value
+	)
+);
+TRACE_EVENT(
+	kvmi_event_cr_recv,
+	TP_PROTO(__u16 vcpu, __u32 action, __u64 new_value),
+	TP_ARGS(vcpu, action, new_value),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u32, action)
+		__field(__u64, new_value)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->action = action;
+		__entry->new_value = new_value;
+	),
+	TP_printk("vcpu %d %s new_value %llx",
+		  __entry->vcpu,
+		  trace_print_symbols_seq(p, __entry->action,
+					  kvmi_action_symbol),
+		  __entry->new_value
+	)
+);
+
+TRACE_EVENT(
+	kvmi_event_msr_send,
+	TP_PROTO(__u16 vcpu, __u32 msr, __u64 old_value, __u64 new_value),
+	TP_ARGS(vcpu, msr, old_value, new_value),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u32, msr)
+		__field(__u64, old_value)
+		__field(__u64, new_value)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->msr = msr;
+		__entry->old_value = old_value;
+		__entry->new_value = new_value;
+	),
+	TP_printk("vcpu %d msr %x old_value %llx new_value %llx",
+		  __entry->vcpu,
+		  __entry->msr,
+		  __entry->old_value,
+		  __entry->new_value
+	)
+);
+TRACE_EVENT(
+	kvmi_event_msr_recv,
+	TP_PROTO(__u16 vcpu, __u32 action, __u64 new_value),
+	TP_ARGS(vcpu, action, new_value),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u32, action)
+		__field(__u64, new_value)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->action = action;
+		__entry->new_value = new_value;
+	),
+	TP_printk("vcpu %d %s new_value %llx",
+		  __entry->vcpu,
+		  trace_print_symbols_seq(p, __entry->action,
+					  kvmi_action_symbol),
+		  __entry->new_value
+	)
+);
+
+DEFINE_EVENT(kvmi_event_send_template, kvmi_event_xsetbv_send,
+	TP_PROTO(__u16 vcpu),
+	TP_ARGS(vcpu)
+);
+DEFINE_EVENT(kvmi_event_recv_template, kvmi_event_xsetbv_recv,
+	TP_PROTO(__u16 vcpu, __u32 action),
+	TP_ARGS(vcpu, action)
+);
+
+TRACE_EVENT(
+	kvmi_event_bp_send,
+	TP_PROTO(__u16 vcpu, __u64 gpa, __u64 old_rip),
+	TP_ARGS(vcpu, gpa, old_rip),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u64, gpa)
+		__field(__u64, old_rip)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->gpa = gpa;
+		__entry->old_rip = old_rip;
+	),
+	TP_printk("vcpu %d gpa %llx rip %llx",
+		  __entry->vcpu,
+		  __entry->gpa,
+		  __entry->old_rip
+	)
+);
+TRACE_EVENT(
+	kvmi_event_bp_recv,
+	TP_PROTO(__u16 vcpu, __u32 action, __u64 new_rip),
+	TP_ARGS(vcpu, action, new_rip),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u32, action)
+		__field(__u64, new_rip)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->action = action;
+		__entry->new_rip = new_rip;
+	),
+	TP_printk("vcpu %d %s rip %llx",
+		  __entry->vcpu,
+		  trace_print_symbols_seq(p, __entry->action,
+					  kvmi_action_symbol),
+		  __entry->new_rip
+	)
+);
+
+DEFINE_EVENT(kvmi_event_send_template, kvmi_event_hc_send,
+	TP_PROTO(__u16 vcpu),
+	TP_ARGS(vcpu)
+);
+DEFINE_EVENT(kvmi_event_recv_template, kvmi_event_hc_recv,
+	TP_PROTO(__u16 vcpu, __u32 action),
+	TP_ARGS(vcpu, action)
+);
+
+TRACE_EVENT(
+	kvmi_event_pf_send,
+	TP_PROTO(__u16 vcpu, __u64 gpa, __u64 gva, __u8 access, __u64 rip),
+	TP_ARGS(vcpu, gpa, gva, access, rip),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u64, gpa)
+		__field(__u64, gva)
+		__field(__u8, access)
+		__field(__u64, rip)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->gpa = gpa;
+		__entry->gva = gva;
+		__entry->access = access;
+		__entry->rip = rip;
+	),
+	TP_printk("vcpu %d gpa %llx %s gva %llx rip %llx",
+		  __entry->vcpu,
+		  __entry->gpa,
+		  KVMI_ACCESS_PRINTK(),
+		  __entry->gva,
+		  __entry->rip
+	)
+);
+TRACE_EVENT(
+	kvmi_event_pf_recv,
+	TP_PROTO(__u16 vcpu, __u32 action, __u64 next_rip, size_t custom_data,
+		 bool singlestep, bool ret),
+	TP_ARGS(vcpu, action, next_rip, custom_data, singlestep, ret),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u32, action)
+		__field(__u64, next_rip)
+		__field(size_t, custom_data)
+		__field(bool, singlestep)
+		__field(bool, ret)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->action = action;
+		__entry->next_rip = next_rip;
+		__entry->custom_data = custom_data;
+		__entry->singlestep = singlestep;
+		__entry->ret = ret;
+	),
+	TP_printk("vcpu %d %s rip %llx custom %zu %s",
+		  __entry->vcpu,
+		  trace_print_symbols_seq(p, __entry->action,
+					  kvmi_action_symbol),
+		  __entry->next_rip, __entry->custom_data,
+		  (__entry->singlestep ? (__entry->ret ? "singlestep failed" :
+							 "singlestep running")
+					: "")
+	)
+);
+
+TRACE_EVENT(
+	kvmi_event_trap_send,
+	TP_PROTO(__u16 vcpu, __u32 vector, __u8 nr, __u32 err, __u16 error_code,
+		 __u64 cr2),
+	TP_ARGS(vcpu, vector, nr, err, error_code, cr2),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u32, vector)
+		__field(__u8, nr)
+		__field(__u32, err)
+		__field(__u16, error_code)
+		__field(__u64, cr2)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->vector = vector;
+		__entry->nr = nr;
+		__entry->err = err;
+		__entry->error_code = error_code;
+		__entry->cr2 = cr2;
+	),
+	TP_printk("vcpu %d vector %x/%x err %x/%x address %llx",
+		  __entry->vcpu,
+		  __entry->vector, __entry->nr,
+		  __entry->err, __entry->error_code,
+		  __entry->cr2
+	)
+);
+DEFINE_EVENT(kvmi_event_recv_template, kvmi_event_trap_recv,
+	TP_PROTO(__u16 vcpu, __u32 action),
+	TP_ARGS(vcpu, action)
+);
+
+TRACE_EVENT(
+	kvmi_event_desc_send,
+	TP_PROTO(__u16 vcpu, __u8 descriptor, __u8 write),
+	TP_ARGS(vcpu, descriptor, write),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+		__field(__u8, descriptor)
+		__field(__u8, write)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+		__entry->descriptor = descriptor;
+		__entry->write = write;
+	),
+	TP_printk("vcpu %d %s %s",
+		  __entry->vcpu,
+		  __entry->write ? "write" : "read",
+		  trace_print_symbols_seq(p, __entry->descriptor,
+					  kvmi_descriptor_symbol)
+	)
+);
+DEFINE_EVENT(kvmi_event_recv_template, kvmi_event_desc_recv,
+	TP_PROTO(__u16 vcpu, __u32 action),
+	TP_ARGS(vcpu, action)
+);
+
+DEFINE_EVENT(kvmi_event_send_template, kvmi_event_create_vcpu_send,
+	TP_PROTO(__u16 vcpu),
+	TP_ARGS(vcpu)
+);
+DEFINE_EVENT(kvmi_event_recv_template, kvmi_event_create_vcpu_recv,
+	TP_PROTO(__u16 vcpu, __u32 action),
+	TP_ARGS(vcpu, action)
+);
+
+DEFINE_EVENT(kvmi_event_send_template, kvmi_event_pause_vcpu_send,
+	TP_PROTO(__u16 vcpu),
+	TP_ARGS(vcpu)
+);
+DEFINE_EVENT(kvmi_event_recv_template, kvmi_event_pause_vcpu_recv,
+	TP_PROTO(__u16 vcpu, __u32 action),
+	TP_ARGS(vcpu, action)
+);
+
+DEFINE_EVENT(kvmi_event_send_template, kvmi_event_singlestep_send,
+	TP_PROTO(__u16 vcpu),
+	TP_ARGS(vcpu)
+);
+DEFINE_EVENT(kvmi_event_recv_template, kvmi_event_singlestep_recv,
+	TP_PROTO(__u16 vcpu, __u32 action),
+	TP_ARGS(vcpu, action)
+);
+
+TRACE_EVENT(
+	kvmi_run_singlestep,
+	TP_PROTO(struct kvm_vcpu *vcpu, __u64 gpa, __u8 access, __u8 level,
+		 size_t custom_data),
+	TP_ARGS(vcpu, gpa, access, level, custom_data),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu_id)
+		__field(__u64, gpa)
+		__field(__u8, access)
+		__field(size_t, len)
+		__array(__u8, insn, 15)
+		__field(__u8, level)
+		__field(size_t, custom_data)
+	),
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu->vcpu_id;
+		__entry->gpa = gpa;
+		__entry->access = access;
+		__entry->len = min_t(size_t, 15,
+				     vcpu->arch.emulate_ctxt.fetch.ptr
+				     - vcpu->arch.emulate_ctxt.fetch.data);
+		memcpy(__entry->insn, vcpu->arch.emulate_ctxt.fetch.data, 15);
+		__entry->level = level;
+		__entry->custom_data = custom_data;
+	),
+	TP_printk("vcpu %d gpa %llx %s insn %s level %x custom %zu",
+		  __entry->vcpu_id,
+		  __entry->gpa,
+		  KVMI_ACCESS_PRINTK(),
+		  __print_hex(__entry->insn, __entry->len),
+		  __entry->level,
+		  __entry->custom_data
+	)
+);
+
+TRACE_EVENT(
+	kvmi_stop_singlestep,
+	TP_PROTO(__u16 vcpu),
+	TP_ARGS(vcpu),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu)
+	),
+	TP_fast_assign(
+		__entry->vcpu = vcpu;
+	),
+	TP_printk("vcpu %d", __entry->vcpu
+	)
+);
+
+TRACE_EVENT(
+	kvmi_mem_map,
+	TP_PROTO(struct kvm *kvm, gpa_t req_gpa, gpa_t map_gpa),
+	TP_ARGS(kvm, req_gpa, map_gpa),
+	TP_STRUCT__entry(
+		__field_struct(uuid_t, uuid)
+		__field(gpa_t, req_gpa)
+		__field(gpa_t, map_gpa)
+	),
+	TP_fast_assign(
+		struct kvmi *ikvm = kvmi_get(kvm);
+
+		if (ikvm) {
+			memcpy(&__entry->uuid, &ikvm->uuid, sizeof(uuid_t));
+			kvmi_put(kvm);
+		} else
+			memset(&__entry->uuid, 0, sizeof(uuid_t));
+		__entry->req_gpa = req_gpa;
+		__entry->map_gpa = map_gpa;
+	),
+	TP_printk("vm %pU req_gpa %llx map_gpa %llx",
+		&__entry->uuid,
+		__entry->req_gpa,
+		__entry->map_gpa
+	)
+);
+
+TRACE_EVENT(
+	kvmi_mem_unmap,
+	TP_PROTO(gpa_t map_gpa),
+	TP_ARGS(map_gpa),
+	TP_STRUCT__entry(
+		__field(gpa_t, map_gpa)
+	),
+	TP_fast_assign(
+		__entry->map_gpa = map_gpa;
+	),
+	TP_printk("map_gpa %llx",
+		__entry->map_gpa
+	)
+);
+
+#define EXS(x) { x##_VECTOR, "#" #x }
+
+#define kvm_trace_sym_exc						\
+	EXS(DE), EXS(DB), EXS(BP), EXS(OF), EXS(BR), EXS(UD), EXS(NM),	\
+	EXS(DF), EXS(TS), EXS(NP), EXS(SS), EXS(GP), EXS(PF),		\
+	EXS(MF), EXS(AC), EXS(MC)
+
+TRACE_EVENT(
+	kvmi_cmd_inject_exception,
+	TP_PROTO(struct kvm_vcpu *vcpu, struct x86_exception *fault),
+	TP_ARGS(vcpu, fault),
+	TP_STRUCT__entry(
+		__field(__u16, vcpu_id)
+		__field(__u8, vector)
+		__field(__u64, address)
+		__field(__u16, error_code)
+		__field(bool, error_code_valid)
+	),
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu->vcpu_id;
+		__entry->vector = fault->vector;
+		__entry->address = fault->address;
+		__entry->error_code = fault->error_code;
+		__entry->error_code_valid = fault->error_code_valid;
+	),
+	TP_printk("vcpu %d %s address %llx error %x",
+		  __entry->vcpu_id,
+		  __print_symbolic(__entry->vector, kvm_trace_sym_exc),
+		  __entry->vector == PF_VECTOR ? __entry->address : 0,
+		  __entry->error_code_valid ? __entry->error_code : 0
+	)
+);
+
+#endif /* _TRACE_KVMI_H */
+
+#include <trace/define_trace.h>
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
index 157f3a401d64..ce28ca8c8d77 100644
--- a/virt/kvm/kvmi.c
+++ b/virt/kvm/kvmi.c
@@ -12,6 +12,9 @@
 #include <linux/bitmap.h>
 #include <linux/remote_mapping.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/kvmi.h>
+
 #define MAX_PAUSE_REQUESTS 1001
 
 static struct kmem_cache *msg_cache;
@@ -1284,6 +1287,8 @@ static void __kvmi_singlestep_event(struct kvm_vcpu *vcpu)
 {
 	u32 action;
 
+	trace_kvmi_event_singlestep_send(vcpu->vcpu_id);
+
 	action = kvmi_send_singlestep(vcpu);
 	switch (action) {
 	case KVMI_EVENT_ACTION_CONTINUE:
@@ -1291,6 +1296,8 @@ static void __kvmi_singlestep_event(struct kvm_vcpu *vcpu)
 	default:
 		kvmi_handle_common_event_actions(vcpu, action, "SINGLESTEP");
 	}
+
+	trace_kvmi_event_singlestep_recv(vcpu->vcpu_id, action);
 }
 
 static void kvmi_singlestep_event(struct kvm_vcpu *vcpu)
@@ -1311,6 +1318,8 @@ static bool __kvmi_create_vcpu_event(struct kvm_vcpu *vcpu)
 	u32 action;
 	bool ret = false;
 
+	trace_kvmi_event_create_vcpu_send(vcpu->vcpu_id);
+
 	action = kvmi_msg_send_create_vcpu(vcpu);
 	switch (action) {
 	case KVMI_EVENT_ACTION_CONTINUE:
@@ -1320,6 +1329,8 @@ static bool __kvmi_create_vcpu_event(struct kvm_vcpu *vcpu)
 		kvmi_handle_common_event_actions(vcpu, action, "CREATE");
 	}
 
+	trace_kvmi_event_create_vcpu_recv(vcpu->vcpu_id, action);
+
 	return ret;
 }
 
@@ -1345,6 +1356,8 @@ static bool __kvmi_pause_vcpu_event(struct kvm_vcpu *vcpu)
 	u32 action;
 	bool ret = false;
 
+	trace_kvmi_event_pause_vcpu_send(vcpu->vcpu_id);
+
 	action = kvmi_msg_send_pause_vcpu(vcpu);
 	switch (action) {
 	case KVMI_EVENT_ACTION_CONTINUE:
@@ -1354,6 +1367,8 @@ static bool __kvmi_pause_vcpu_event(struct kvm_vcpu *vcpu)
 		kvmi_handle_common_event_actions(vcpu, action, "PAUSE");
 	}
 
+	trace_kvmi_event_pause_vcpu_recv(vcpu->vcpu_id, action);
+
 	return ret;
 }
 
@@ -1857,6 +1872,8 @@ void kvmi_stop_ss(struct kvm_vcpu *vcpu)
 
 	ivcpu->ss_owner = false;
 
+	trace_kvmi_stop_singlestep(vcpu->vcpu_id);
+
 	kvmi_singlestep_event(vcpu);
 
 out:
@@ -1892,6 +1909,9 @@ static bool kvmi_run_ss(struct kvm_vcpu *vcpu, gpa_t gpa, u8 access)
 	gfn_t gfn = gpa_to_gfn(gpa);
 	int err;
 
+	trace_kvmi_run_singlestep(vcpu, gpa, access, ikvm->ss_level,
+				  IVCPU(vcpu)->ctx_size);
+
 	kvmi_arch_start_single_step(vcpu);
 
 	err = write_custom_data(vcpu);
diff --git a/virt/kvm/kvmi_mem.c b/virt/kvm/kvmi_mem.c
index 6244add60062..a7a01646ea5c 100644
--- a/virt/kvm/kvmi_mem.c
+++ b/virt/kvm/kvmi_mem.c
@@ -23,6 +23,7 @@
 #include <linux/remote_mapping.h>
 
 #include <uapi/linux/kvmi.h>
+#include <trace/events/kvmi.h>
 
 #include "kvmi_int.h"
 
@@ -221,6 +222,8 @@ int kvmi_host_mem_map(struct kvm_vcpu *vcpu, gva_t tkn_gva,
 	}
 	req_mm = target_kvm->mm;
 
+	trace_kvmi_mem_map(target_kvm, req_gpa, map_gpa);
+
 	/* translate source addresses */
 	req_gfn = gpa_to_gfn(req_gpa);
 	req_hva = gfn_to_hva_safe(target_kvm, req_gfn);
@@ -274,6 +277,8 @@ int kvmi_host_mem_unmap(struct kvm_vcpu *vcpu, gpa_t map_gpa)
 
 	kvm_debug("kvmi: unmapping request for map_gpa %016llx\n", map_gpa);
 
+	trace_kvmi_mem_unmap(map_gpa);
+
 	/* convert GPA -> HVA */
 	map_gfn = gpa_to_gfn(map_gpa);
 	map_hva = gfn_to_hva_safe(vcpu->kvm, map_gfn);
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
index a5f87aafa237..bdb1e60906f9 100644
--- a/virt/kvm/kvmi_msg.c
+++ b/virt/kvm/kvmi_msg.c
@@ -8,6 +8,8 @@
 #include <linux/net.h>
 #include "kvmi_int.h"
 
+#include <trace/events/kvmi.h>
+
 typedef int (*vcpu_reply_fct)(struct kvm_vcpu *vcpu,
 			      const struct kvmi_msg_hdr *msg, int err,
 			      const void *rpl, size_t rpl_size);
@@ -165,6 +167,8 @@ static int kvmi_msg_vm_reply(struct kvmi *ikvm,
 			     const struct kvmi_msg_hdr *msg, int err,
 			     const void *rpl, size_t rpl_size)
 {
+	trace_kvmi_vm_reply(msg->id, msg->seq, err);
+
 	return kvmi_msg_reply(ikvm, msg, err, rpl, rpl_size);
 }
 
@@ -202,6 +206,8 @@ int kvmi_msg_vcpu_reply(struct kvm_vcpu *vcpu,
 			const struct kvmi_msg_hdr *msg, int err,
 			const void *rpl, size_t rpl_size)
 {
+	trace_kvmi_vcpu_reply(vcpu->vcpu_id, msg->id, msg->seq, err);
+
 	return kvmi_msg_reply(IKVM(vcpu->kvm), msg, err, rpl, rpl_size);
 }
 
@@ -559,6 +565,8 @@ static int handle_event_reply(struct kvm_vcpu *vcpu,
 	struct kvmi_vcpu_reply *expected = &ivcpu->reply;
 	size_t useful, received, common;
 
+	trace_kvmi_event_reply(reply->event, msg->seq);
+
 	if (unlikely(msg->seq != expected->seq))
 		goto out;
 
@@ -883,6 +891,8 @@ static struct kvmi_msg_hdr *kvmi_msg_recv(struct kvmi *ikvm, bool *unsupported)
 static int kvmi_msg_dispatch_vm_cmd(struct kvmi *ikvm,
 				    const struct kvmi_msg_hdr *msg)
 {
+	trace_kvmi_vm_command(msg->id, msg->seq);
+
 	return msg_vm[msg->id](ikvm, msg, msg + 1);
 }
 
@@ -895,6 +905,8 @@ static int kvmi_msg_dispatch_vcpu_job(struct kvmi *ikvm,
 	struct kvm_vcpu *vcpu = NULL;
 	int err;
 
+	trace_kvmi_vcpu_command(cmd->vcpu, hdr->id, hdr->seq);
+
 	if (invalid_vcpu_hdr(cmd))
 		return -KVM_EINVAL;
 
@@ -1051,6 +1063,8 @@ int kvmi_send_event(struct kvm_vcpu *vcpu, u32 ev_id,
 	ivcpu->reply.size = rpl_size;
 	ivcpu->reply.error = -EINTR;
 
+	trace_kvmi_event(vcpu->vcpu_id, common.event, hdr.seq);
+
 	err = kvmi_sock_write(ikvm, vec, n, msg_size);
 	if (err)
 		goto out;
@@ -1091,6 +1105,8 @@ int kvmi_msg_send_unhook(struct kvmi *ikvm)
 
 	kvmi_setup_event_common(&common, KVMI_EVENT_UNHOOK, 0);
 
+	trace_kvmi_event(0, common.event, hdr.seq);
+
 	return kvmi_sock_write(ikvm, vec, n, msg_size);
 }
 

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 78/92] kvm: x86: add tracepoints for interrupt and exception injections
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (76 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 77/92] kvm: introspection: add trace functions Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 79/92] kvm: x86: emulate movsd xmm, m64 Adalbert Lazăr
                   ` (16 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr, Nicușor Cîțu

From: Nicușor Cîțu <ncitu@bitdefender.com>

This patch introduces additional tracepoints that are meant to help
in following the flow of interrupts and exceptions queued to a guest
VM. At the same time the kvm_exit tracepoint is enhanced with the
vCPU ID.

One scenario in which these help is debugging lost interrupts due to
a buggy VMEXIT handler.

Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/svm.c     |   9 +++-
 arch/x86/kvm/trace.h   | 118 ++++++++++++++++++++++++++++++++---------
 arch/x86/kvm/vmx/vmx.c |   8 ++-
 arch/x86/kvm/x86.c     |  12 +++--
 4 files changed, 116 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cb536a2611f6..00bdf885f9a4 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -799,6 +799,8 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu)
 	bool reinject = vcpu->arch.exception.injected;
 	u32 error_code = vcpu->arch.exception.error_code;
 
+	trace_kvm_inj_exception(vcpu);
+
 	/*
 	 * If we are within a nested VM we'd better #VMEXIT and let the guest
 	 * handle the exception
@@ -5108,6 +5110,8 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	trace_kvm_inj_nmi(vcpu);
+
 	svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
 	vcpu->arch.hflags |= HF_NMI_MASK;
 	set_intercept(svm, INTERCEPT_IRET);
@@ -5133,7 +5137,8 @@ static void svm_set_irq(struct kvm_vcpu *vcpu)
 
 	BUG_ON(!(gif_set(svm)));
 
-	trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
+	trace_kvm_inj_interrupt(vcpu);
+
 	++vcpu->stat.irq_injections;
 
 	svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
@@ -5637,6 +5642,8 @@ static void svm_cancel_injection(struct kvm_vcpu *vcpu)
 	struct vcpu_svm *svm = to_svm(vcpu);
 	struct vmcb_control_area *control = &svm->vmcb->control;
 
+	trace_kvm_cancel_inj(vcpu);
+
 	control->exit_int_info = control->event_inj;
 	control->exit_int_info_err = control->event_inj_err;
 	control->event_inj = 0;
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 6432d08c7de7..cb47889ddc2c 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -227,6 +227,7 @@ TRACE_EVENT(kvm_exit,
 	TP_ARGS(exit_reason, vcpu, isa),
 
 	TP_STRUCT__entry(
+		__field(	unsigned int,	vcpu_id		)
 		__field(	unsigned int,	exit_reason	)
 		__field(	unsigned long,	guest_rip	)
 		__field(	u32,	        isa             )
@@ -235,6 +236,7 @@ TRACE_EVENT(kvm_exit,
 	),
 
 	TP_fast_assign(
+		__entry->vcpu_id	= vcpu->vcpu_id;
 		__entry->exit_reason	= exit_reason;
 		__entry->guest_rip	= kvm_rip_read(vcpu);
 		__entry->isa            = isa;
@@ -242,7 +244,8 @@ TRACE_EVENT(kvm_exit,
 					   &__entry->info2);
 	),
 
-	TP_printk("reason %s rip 0x%lx info %llx %llx",
+	TP_printk("vcpu %u reason %s rip 0x%lx info %llx %llx",
+		 __entry->vcpu_id,
 		 (__entry->isa == KVM_ISA_VMX) ?
 		 __print_symbolic(__entry->exit_reason, VMX_EXIT_REASONS) :
 		 __print_symbolic(__entry->exit_reason, SVM_EXIT_REASONS),
@@ -252,19 +255,38 @@ TRACE_EVENT(kvm_exit,
 /*
  * Tracepoint for kvm interrupt injection:
  */
-TRACE_EVENT(kvm_inj_virq,
-	TP_PROTO(unsigned int irq),
-	TP_ARGS(irq),
-
+TRACE_EVENT(kvm_inj_interrupt,
+	TP_PROTO(struct kvm_vcpu *vcpu),
+	TP_ARGS(vcpu),
 	TP_STRUCT__entry(
-		__field(	unsigned int,	irq		)
+		__field(__u32, vcpu_id)
+		__field(__u32, nr)
 	),
-
 	TP_fast_assign(
-		__entry->irq		= irq;
+		__entry->vcpu_id = vcpu->vcpu_id;
+		__entry->nr = vcpu->arch.interrupt.nr;
 	),
+	TP_printk("vcpu %u irq %u",
+		  __entry->vcpu_id,
+		  __entry->nr
+	)
+);
 
-	TP_printk("irq %u", __entry->irq)
+/*
+ * Tracepoint for kvm nmi injection:
+ */
+TRACE_EVENT(kvm_inj_nmi,
+	TP_PROTO(struct kvm_vcpu *vcpu),
+	TP_ARGS(vcpu),
+	TP_STRUCT__entry(
+		__field(__u32, vcpu_id)
+	),
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu->vcpu_id;
+	),
+	TP_printk("vcpu %u",
+		  __entry->vcpu_id
+	)
 );
 
 #define EXS(x) { x##_VECTOR, "#" #x }
@@ -275,28 +297,76 @@ TRACE_EVENT(kvm_inj_virq,
 	EXS(MF), EXS(AC), EXS(MC)
 
 /*
- * Tracepoint for kvm interrupt injection:
+ * Tracepoint for kvm exception injection:
  */
-TRACE_EVENT(kvm_inj_exception,
-	TP_PROTO(unsigned exception, bool has_error, unsigned error_code),
-	TP_ARGS(exception, has_error, error_code),
-
+TRACE_EVENT(
+	kvm_inj_exception,
+	TP_PROTO(struct kvm_vcpu *vcpu),
+	TP_ARGS(vcpu),
 	TP_STRUCT__entry(
-		__field(	u8,	exception	)
-		__field(	u8,	has_error	)
-		__field(	u32,	error_code	)
+		__field(__u32, vcpu_id)
+		__field(__u8, nr)
+		__field(__u64, address)
+		__field(__u16, error_code)
+		__field(bool, has_error_code)
 	),
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu->vcpu_id;
+		__entry->nr = vcpu->arch.exception.nr;
+		__entry->address = vcpu->arch.exception.nested_apf ?
+			vcpu->arch.apf.nested_apf_token : vcpu->arch.cr2;
+		__entry->error_code = vcpu->arch.exception.error_code;
+		__entry->has_error_code = vcpu->arch.exception.has_error_code;
+	),
+	TP_printk("vcpu %u %s address %llx error %x",
+		  __entry->vcpu_id,
+		  __print_symbolic(__entry->nr, kvm_trace_sym_exc),
+		  __entry->nr == PF_VECTOR ? __entry->address : 0,
+		  __entry->has_error_code ? __entry->error_code : 0
+	)
+);
 
+TRACE_EVENT(
+	kvm_inj_emul_exception,
+	TP_PROTO(struct kvm_vcpu *vcpu, struct x86_exception *fault),
+	TP_ARGS(vcpu, fault),
+	TP_STRUCT__entry(
+		__field(__u32, vcpu_id)
+		__field(__u8, vector)
+		__field(__u64, address)
+		__field(__u16, error_code)
+		__field(bool, error_code_valid)
+	),
 	TP_fast_assign(
-		__entry->exception	= exception;
-		__entry->has_error	= has_error;
-		__entry->error_code	= error_code;
+		__entry->vcpu_id = vcpu->vcpu_id;
+		__entry->vector = fault->vector;
+		__entry->address = fault->address;
+		__entry->error_code = fault->error_code;
+		__entry->error_code_valid = fault->error_code_valid;
 	),
+	TP_printk("vcpu %u %s address %llx error %x",
+		  __entry->vcpu_id,
+		  __print_symbolic(__entry->vector, kvm_trace_sym_exc),
+		  __entry->vector == PF_VECTOR ? __entry->address : 0,
+		  __entry->error_code_valid ? __entry->error_code : 0
+	)
+);
 
-	TP_printk("%s (0x%x)",
-		  __print_symbolic(__entry->exception, kvm_trace_sym_exc),
-		  /* FIXME: don't print error_code if not present */
-		  __entry->has_error ? __entry->error_code : 0)
+/*
+ * Tracepoint for kvm cancel injection:
+ */
+TRACE_EVENT(kvm_cancel_inj,
+	TP_PROTO(struct kvm_vcpu *vcpu),
+	TP_ARGS(vcpu),
+	TP_STRUCT__entry(
+		__field(__u32, vcpu_id)
+	),
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu->vcpu_id;
+	),
+	TP_printk("vcpu %u",
+		  __entry->vcpu_id
+	)
 );
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 152c58b63f69..85561994661a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1494,6 +1494,8 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu)
 	u32 error_code = vcpu->arch.exception.error_code;
 	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+	trace_kvm_inj_exception(vcpu);
+
 	kvm_deliver_exception_payload(vcpu);
 
 	if (has_error_code) {
@@ -4266,7 +4268,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu)
 	uint32_t intr;
 	int irq = vcpu->arch.interrupt.nr;
 
-	trace_kvm_inj_virq(irq);
+	trace_kvm_inj_interrupt(vcpu);
 
 	++vcpu->stat.irq_injections;
 	if (vmx->rmode.vm86_active) {
@@ -4293,6 +4295,8 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	trace_kvm_inj_nmi(vcpu);
+
 	if (!enable_vnmi) {
 		/*
 		 * Tracking the NMI-blocked state in software is built upon
@@ -6452,6 +6456,8 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
+	trace_kvm_cancel_inj(vcpu);
+
 	__vmx_complete_interrupts(vcpu,
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 				  VM_ENTRY_INSTRUCTION_LEN,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3975331230b9..e09a76179c4b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6178,6 +6178,9 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
 static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
 {
 	struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
+
+	trace_kvm_inj_emul_exception(vcpu, &ctxt->exception);
+
 	if (ctxt->exception.vector == PF_VECTOR)
 		return kvm_propagate_fault(vcpu, &ctxt->exception);
 
@@ -7487,10 +7490,6 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
 
 	/* try to inject new event if pending */
 	if (vcpu->arch.exception.pending) {
-		trace_kvm_inj_exception(vcpu->arch.exception.nr,
-					vcpu->arch.exception.has_error_code,
-					vcpu->arch.exception.error_code);
-
 		WARN_ON_ONCE(vcpu->arch.exception.injected);
 		vcpu->arch.exception.pending = false;
 		vcpu->arch.exception.injected = true;
@@ -10250,7 +10249,10 @@ EXPORT_SYMBOL(kvm_arch_vcpu_intercept_desc);
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
-EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_interrupt);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_nmi);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_exception);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cancel_inj);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr);

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 79/92] kvm: x86: emulate movsd xmm, m64
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (77 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 78/92] kvm: x86: add tracepoints for interrupt and exception injections Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-13  9:17   ` Paolo Bonzini
  2019-08-09 16:00 ` [RFC PATCH v6 80/92] kvm: x86: emulate movss xmm, m32 Adalbert Lazăr
                   ` (15 subsequent siblings)
  94 siblings, 1 reply; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This is needed in order to be able to support guest code that uses movsd to
write into pages that are marked for write tracking.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/emulate.c | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 34431cf31f74..9d38f892beea 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1177,6 +1177,27 @@ static int em_fnstsw(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }
 
+static u8 simd_prefix_to_bytes(const struct x86_emulate_ctxt *ctxt,
+			       int simd_prefix)
+{
+	u8 bytes;
+
+	switch (ctxt->b) {
+	case 0x11:
+		/* movsd xmm, m64 */
+		/* movups xmm, m128 */
+		if (simd_prefix == 0xf2) {
+			bytes = 8;
+			break;
+		}
+		/* fallthrough */
+	default:
+		bytes = 16;
+		break;
+	}
+	return bytes;
+}
+
 static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
 				    struct operand *op)
 {
@@ -1187,7 +1208,7 @@ static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
 
 	if (ctxt->d & Sse) {
 		op->type = OP_XMM;
-		op->bytes = 16;
+		op->bytes = ctxt->op_bytes;
 		op->addr.xmm = reg;
 		read_sse_reg(ctxt, &op->vec_val, reg);
 		return;
@@ -1238,7 +1259,7 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
 				ctxt->d & ByteOp);
 		if (ctxt->d & Sse) {
 			op->type = OP_XMM;
-			op->bytes = 16;
+			op->bytes = ctxt->op_bytes;
 			op->addr.xmm = ctxt->modrm_rm;
 			read_sse_reg(ctxt, &op->vec_val, ctxt->modrm_rm);
 			return rc;
@@ -4529,7 +4550,7 @@ static const struct gprefix pfx_0f_2b = {
 };
 
 static const struct gprefix pfx_0f_10_0f_11 = {
-	I(Unaligned, em_mov), I(Unaligned, em_mov), N, N,
+	I(Unaligned, em_mov), I(Unaligned, em_mov), I(Unaligned, em_mov), N,
 };
 
 static const struct gprefix pfx_0f_28_0f_29 = {
@@ -5097,7 +5118,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 {
 	int rc = X86EMUL_CONTINUE;
 	int mode = ctxt->mode;
-	int def_op_bytes, def_ad_bytes, goffset, simd_prefix;
+	int def_op_bytes, def_ad_bytes, goffset, simd_prefix = 0;
 	bool op_prefix = false;
 	bool has_seg_override = false;
 	struct opcode opcode;
@@ -5320,7 +5341,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 			ctxt->op_bytes = 4;
 
 		if (ctxt->d & Sse)
-			ctxt->op_bytes = 16;
+			ctxt->op_bytes = simd_prefix_to_bytes(ctxt,
+							      simd_prefix);
 		else if (ctxt->d & Mmx)
 			ctxt->op_bytes = 8;
 	}

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 80/92] kvm: x86: emulate movss xmm, m32
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (78 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 79/92] kvm: x86: emulate movsd xmm, m64 Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 81/92] kvm: x86: emulate movq xmm, m64 Adalbert Lazăr
                   ` (14 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This is needed in order to be able to support guest code that uses movss to
write into pages that are marked for write tracking.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/emulate.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 9d38f892beea..b8a412b8b087 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1184,9 +1184,13 @@ static u8 simd_prefix_to_bytes(const struct x86_emulate_ctxt *ctxt,
 
 	switch (ctxt->b) {
 	case 0x11:
+		/* movss xmm, m32 */
 		/* movsd xmm, m64 */
 		/* movups xmm, m128 */
-		if (simd_prefix == 0xf2) {
+		if (simd_prefix == 0xf3) {
+			bytes = 4;
+			break;
+		} else if (simd_prefix == 0xf2) {
 			bytes = 8;
 			break;
 		}
@@ -4550,7 +4554,7 @@ static const struct gprefix pfx_0f_2b = {
 };
 
 static const struct gprefix pfx_0f_10_0f_11 = {
-	I(Unaligned, em_mov), I(Unaligned, em_mov), I(Unaligned, em_mov), N,
+	I(Unaligned, em_mov), I(Unaligned, em_mov), I(Unaligned, em_mov), I(Unaligned, em_mov),
 };
 
 static const struct gprefix pfx_0f_28_0f_29 = {

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 81/92] kvm: x86: emulate movq xmm, m64
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (79 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 80/92] kvm: x86: emulate movss xmm, m32 Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 82/92] kvm: x86: emulate movq r, xmm Adalbert Lazăr
                   ` (13 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This is needed in order to be able to support guest code that uses movq to
write into pages that are marked for write tracking.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/emulate.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index b8a412b8b087..2297955d0934 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1180,23 +1180,24 @@ static int em_fnstsw(struct x86_emulate_ctxt *ctxt)
 static u8 simd_prefix_to_bytes(const struct x86_emulate_ctxt *ctxt,
 			       int simd_prefix)
 {
-	u8 bytes;
+	u8 bytes = 16;
 
 	switch (ctxt->b) {
 	case 0x11:
 		/* movss xmm, m32 */
 		/* movsd xmm, m64 */
 		/* movups xmm, m128 */
-		if (simd_prefix == 0xf3) {
+		if (simd_prefix == 0xf3)
 			bytes = 4;
-			break;
-		} else if (simd_prefix == 0xf2) {
+		else if (simd_prefix == 0xf2)
 			bytes = 8;
-			break;
-		}
-		/* fallthrough */
+		break;
+	case 0xd6:
+		/* movq xmm, m64 */
+		if (simd_prefix == 0x66)
+			bytes = 8;
+		break;
 	default:
-		bytes = 16;
 		break;
 	}
 	return bytes;
@@ -4549,6 +4550,10 @@ static const struct instr_dual instr_dual_0f_2b = {
 	I(0, em_mov), N
 };
 
+static const struct gprefix pfx_0f_d6 = {
+	N, I(0, em_mov), N, N,
+};
+
 static const struct gprefix pfx_0f_2b = {
 	ID(0, &instr_dual_0f_2b), ID(0, &instr_dual_0f_2b), N, N,
 };
@@ -4846,7 +4851,8 @@ static const struct opcode twobyte_table[256] = {
 	/* 0xC8 - 0xCF */
 	X8(I(DstReg, em_bswap)),
 	/* 0xD0 - 0xDF */
-	N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,
+	N, N, N, N, N, N, GP(ModRM | SrcReg | DstMem | Mov | Sse, &pfx_0f_d6),
+	N, N, N, N, N, N, N, N, N,
 	/* 0xE0 - 0xEF */
 	N, N, N, N, N, N, N, GP(SrcReg | DstMem | ModRM | Mov, &pfx_0f_e7),
 	N, N, N, N, N, N, N, N,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 82/92] kvm: x86: emulate movq r, xmm
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (80 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 81/92] kvm: x86: emulate movq xmm, m64 Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 83/92] kvm: x86: emulate movd xmm, m32 Adalbert Lazăr
                   ` (12 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This adds support for movq r, xmm. It introduces a new flag (GPRModRM)
to indicate decode_modrm() that the encoded register is a general purpose
one.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/emulate.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 2297955d0934..7c79504e58cd 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -172,6 +172,7 @@
 #define NoMod	    ((u64)1 << 47)  /* Mod field is ignored */
 #define Intercept   ((u64)1 << 48)  /* Has valid intercept field */
 #define CheckPerm   ((u64)1 << 49)  /* Has valid check_perm field */
+#define GPRModRM    ((u64)1 << 50)  /* The ModRM encoded register is a GP one */
 #define PrivUD      ((u64)1 << 51)  /* #UD instead of #GP on CPL > 0 */
 #define NearBranch  ((u64)1 << 52)  /* Near branches */
 #define No16	    ((u64)1 << 53)  /* No 16 bit operand */
@@ -1197,6 +1198,11 @@ static u8 simd_prefix_to_bytes(const struct x86_emulate_ctxt *ctxt,
 		if (simd_prefix == 0x66)
 			bytes = 8;
 		break;
+	case 0x6e:
+		/* movq r/m64, xmm */
+		if (simd_prefix == 0x66)
+			bytes = 8;
+		break;
 	default:
 		break;
 	}
@@ -1262,7 +1268,7 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
 		op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
 		op->addr.reg = decode_register(ctxt, ctxt->modrm_rm,
 				ctxt->d & ByteOp);
-		if (ctxt->d & Sse) {
+		if ((ctxt->d & Sse) && !(ctxt->d & GPRModRM)) {
 			op->type = OP_XMM;
 			op->bytes = ctxt->op_bytes;
 			op->addr.xmm = ctxt->modrm_rm;
@@ -4546,6 +4552,10 @@ static const struct gprefix pfx_0f_6f_0f_7f = {
 	I(Mmx, em_mov), I(Sse | Aligned, em_mov), N, I(Sse | Unaligned, em_mov),
 };
 
+static const struct gprefix pfx_0f_6e_0f_7e = {
+	N, I(Sse, em_mov), N, N
+};
+
 static const struct instr_dual instr_dual_0f_2b = {
 	I(0, em_mov), N
 };
@@ -4807,7 +4817,8 @@ static const struct opcode twobyte_table[256] = {
 	N, N, N, N,
 	N, N, N, N,
 	N, N, N, N,
-	N, N, N, GP(SrcMem | DstReg | ModRM | Mov, &pfx_0f_6f_0f_7f),
+	N, N, GP(SrcMem | DstReg | ModRM | GPRModRM | Mov, &pfx_0f_6e_0f_7e),
+	GP(SrcMem | DstReg | ModRM | Mov, &pfx_0f_6f_0f_7f),
 	/* 0x70 - 0x7F */
 	N, N, N, N,
 	N, N, N, N,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 83/92] kvm: x86: emulate movd xmm, m32
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (81 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 82/92] kvm: x86: emulate movq r, xmm Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 84/92] kvm: x86: enable the half part of movss, movsd, movups Adalbert Lazăr
                   ` (11 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This is needed in order to be able to support guest code that uses movd to
write into pages that are marked for write tracking.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/emulate.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 7c79504e58cd..b42a71653622 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1203,6 +1203,11 @@ static u8 simd_prefix_to_bytes(const struct x86_emulate_ctxt *ctxt,
 		if (simd_prefix == 0x66)
 			bytes = 8;
 		break;
+	case 0x7e:
+		/* movd xmm, m32 */
+		if (simd_prefix == 0x66)
+			bytes = 4;
+		break;
 	default:
 		break;
 	}
@@ -4564,6 +4569,10 @@ static const struct gprefix pfx_0f_d6 = {
 	N, I(0, em_mov), N, N,
 };
 
+static const struct gprefix pfx_0f_7e = {
+	N, I(0, em_mov), N, N,
+};
+
 static const struct gprefix pfx_0f_2b = {
 	ID(0, &instr_dual_0f_2b), ID(0, &instr_dual_0f_2b), N, N,
 };
@@ -4823,7 +4832,8 @@ static const struct opcode twobyte_table[256] = {
 	N, N, N, N,
 	N, N, N, N,
 	N, N, N, N,
-	N, N, N, GP(SrcReg | DstMem | ModRM | Mov, &pfx_0f_6f_0f_7f),
+	N, N, GP(ModRM | SrcReg | DstMem | GPRModRM | Mov | Sse, &pfx_0f_7e),
+	GP(SrcReg | DstMem | ModRM | Mov, &pfx_0f_6f_0f_7f),
 	/* 0x80 - 0x8F */
 	X16(D(SrcImm | NearBranch)),
 	/* 0x90 - 0x9F */

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 84/92] kvm: x86: enable the half part of movss, movsd, movups
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (82 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 83/92] kvm: x86: emulate movd xmm, m32 Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 85/92] kvm: x86: emulate lfence Adalbert Lazăr
                   ` (10 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

A previous patch added emulation support for these instructions with a
register source and memory destination. This patch adds the variants
with a memory source and a register destination.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/emulate.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index b42a71653622..a2e5e63bd94a 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1184,6 +1184,10 @@ static u8 simd_prefix_to_bytes(const struct x86_emulate_ctxt *ctxt,
 	u8 bytes = 16;
 
 	switch (ctxt->b) {
+	case 0x10:
+		/* movss m32, xmm */
+		/* movsd m64, xmm */
+		/* movups m128, xmm */
 	case 0x11:
 		/* movss xmm, m32 */
 		/* movsd xmm, m64 */

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 85/92] kvm: x86: emulate lfence
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (83 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 84/92] kvm: x86: enable the half part of movss, movsd, movups Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 86/92] kvm: x86: emulate xorpd xmm2/m128, xmm1 Adalbert Lazăr
                   ` (9 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This adds support for all encoding variants of lfence (0x0f 0xae 0xe[8-f]).

I did not use rmb() in case it will be made to use a different instruction
on future architectures.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/emulate.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a2e5e63bd94a..287d3751675d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4168,6 +4168,12 @@ static int em_fxrstor(struct x86_emulate_ctxt *ctxt)
 	return rc;
 }
 
+static int em_lfence(struct x86_emulate_ctxt *ctxt)
+{
+	asm volatile ("lfence" ::: "memory");
+	return X86EMUL_CONTINUE;
+}
+
 static bool valid_cr(int nr)
 {
 	switch (nr) {
@@ -4554,7 +4560,7 @@ static const struct group_dual group15 = { {
 	I(ModRM | Aligned16, em_fxrstor),
 	N, N, N, N, N, GP(0, &pfx_0f_ae_7),
 }, {
-	N, N, N, N, N, N, N, N,
+	N, N, N, N, N, I(ModRM | Sse, em_lfence), N, N,
 } };
 
 static const struct gprefix pfx_0f_6f_0f_7f = {

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 86/92] kvm: x86: emulate xorpd xmm2/m128, xmm1
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (84 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 85/92] kvm: x86: emulate lfence Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 87/92] kvm: x86: emulate xorps xmm/m128, xmm Adalbert Lazăr
                   ` (8 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This adds support for xorpd xmm2/m128, xmm1.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/emulate.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 287d3751675d..28aac552b34b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1178,6 +1178,22 @@ static int em_fnstsw(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }
 
+static int em_xorpd(struct x86_emulate_ctxt *ctxt)
+{
+	const sse128_t *src = &ctxt->src.vec_val;
+	sse128_t *dst = &ctxt->dst.vec_val;
+	sse128_t xmm0;
+
+	asm volatile("movdqu %%xmm0, %0\n"
+		     "movdqu %1, %%xmm0\n"
+		     "xorpd %2, %%xmm0\n"
+		     "movdqu %%xmm0, %1\n"
+		     "movdqu %0, %%xmm0"
+		     : "+m"(xmm0), "+m"(*dst) : "m"(*src));
+
+	return X86EMUL_CONTINUE;
+}
+
 static u8 simd_prefix_to_bytes(const struct x86_emulate_ctxt *ctxt,
 			       int simd_prefix)
 {
@@ -4831,7 +4847,8 @@ static const struct opcode twobyte_table[256] = {
 	/* 0x40 - 0x4F */
 	X16(D(DstReg | SrcMem | ModRM)),
 	/* 0x50 - 0x5F */
-	N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,
+	N, N, N, N, N, N, N, I(SrcMem | DstReg | ModRM | Unaligned | Sse, em_xorpd),
+	N, N, N, N, N, N, N, N,
 	/* 0x60 - 0x6F */
 	N, N, N, N,
 	N, N, N, N,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 87/92] kvm: x86: emulate xorps xmm/m128, xmm
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (85 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 86/92] kvm: x86: emulate xorpd xmm2/m128, xmm1 Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 88/92] kvm: x86: emulate fst/fstp m64fp Adalbert Lazăr
                   ` (7 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This extends the previous xorpd by creating a dedicated group, something
I should have done since the very beginning.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/emulate.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 28aac552b34b..14895c043edc 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1178,6 +1178,22 @@ static int em_fnstsw(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }
 
+static int em_xorps(struct x86_emulate_ctxt *ctxt)
+{
+	const sse128_t *src = &ctxt->src.vec_val;
+	sse128_t *dst = &ctxt->dst.vec_val;
+	sse128_t xmm0;
+
+	asm volatile("movdqu %%xmm0, %0\n"
+		     "movdqu %1, %%xmm0\n"
+		     "xorps %2, %%xmm0\n"
+		     "movdqu %%xmm0, %1\n"
+		     "movdqu %0, %%xmm0"
+		     : "+m"(xmm0), "+m"(*dst) : "m"(*src));
+
+	return X86EMUL_CONTINUE;
+}
+
 static int em_xorpd(struct x86_emulate_ctxt *ctxt)
 {
 	const sse128_t *src = &ctxt->src.vec_val;
@@ -4615,6 +4631,10 @@ static const struct gprefix pfx_0f_e7 = {
 	N, I(Sse, em_mov), N, N,
 };
 
+static const struct gprefix pfx_0f_57 = {
+	I(Unaligned, em_xorps), I(Unaligned, em_xorpd), N, N
+};
+
 static const struct escape escape_d9 = { {
 	N, N, N, N, N, N, N, I(DstMem16 | Mov, em_fnstcw),
 }, {
@@ -4847,7 +4867,7 @@ static const struct opcode twobyte_table[256] = {
 	/* 0x40 - 0x4F */
 	X16(D(DstReg | SrcMem | ModRM)),
 	/* 0x50 - 0x5F */
-	N, N, N, N, N, N, N, I(SrcMem | DstReg | ModRM | Unaligned | Sse, em_xorpd),
+	N, N, N, N, N, N, N, GP(SrcMem | DstReg | ModRM | Sse, &pfx_0f_57),
 	N, N, N, N, N, N, N, N,
 	/* 0x60 - 0x6F */
 	N, N, N, N,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 88/92] kvm: x86: emulate fst/fstp m64fp
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (86 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 87/92] kvm: x86: emulate xorps xmm/m128, xmm Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 89/92] kvm: x86: make lock cmpxchg r, r/m atomic Adalbert Lazăr
                   ` (6 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This adds support for fst m64fp and fstp m64fp.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/emulate.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 14895c043edc..7261b94c6c00 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1178,6 +1178,26 @@ static int em_fnstsw(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }
 
+static int em_fstp(struct x86_emulate_ctxt *ctxt)
+{
+	if (ctxt->ops->get_cr(ctxt, 0) & (X86_CR0_TS | X86_CR0_EM))
+		return emulate_nm(ctxt);
+
+	asm volatile("fstpl %0" : "=m"(ctxt->dst.val));
+
+	return X86EMUL_CONTINUE;
+}
+
+static int em_fst(struct x86_emulate_ctxt *ctxt)
+{
+	if (ctxt->ops->get_cr(ctxt, 0) & (X86_CR0_TS | X86_CR0_EM))
+		return emulate_nm(ctxt);
+
+	asm volatile("fstl %0" : "=m"(ctxt->dst.val));
+
+	return X86EMUL_CONTINUE;
+}
+
 static int em_xorps(struct x86_emulate_ctxt *ctxt)
 {
 	const sse128_t *src = &ctxt->src.vec_val;
@@ -4678,7 +4698,8 @@ static const struct escape escape_db = { {
 } };
 
 static const struct escape escape_dd = { {
-	N, N, N, N, N, N, N, I(DstMem16 | Mov, em_fnstsw),
+	N, N, I(DstMem64 | Mov, em_fst), I(DstMem64 | Mov, em_fstp),
+	N, N, N, I(DstMem16 | Mov, em_fnstsw),
 }, {
 	/* 0xC0 - 0xC7 */
 	N, N, N, N, N, N, N, N,

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 89/92] kvm: x86: make lock cmpxchg r, r/m atomic
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (87 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 88/92] kvm: x86: emulate fst/fstp m64fp Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 90/92] kvm: x86: emulate lock cmpxchg8b atomically Adalbert Lazăr
                   ` (5 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

The current emulation takes place in two steps: the first does all the
actions that an cmpxchg would do, sets ZF and saves all results in a
temporary storage (the emulation context). It's the second step that
does the actual atomic operation (actually uses cmpxchg). The problem
with this approach is that steps one and two can observe different
values in memory and when that happens RAX and RFLAGS will have invalid
values when returning to the guest as emulator_cmpxchg_emulated() does
not set these.

This patch modifies the prototype of emulator_cmpxchg_emulated() so that
when cmpxchg fails, it returns in *old the current value. We also modify
em_cmpxchg() so that if the LOCK prefix is present we invoke
emulator_cmpxchg_emulated() directly and set RAX and RFLAGS. Note that we
also disable writeback as it is no longer needed.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_emulate.h |  2 +-
 arch/x86/kvm/emulate.c             | 57 +++++++++++++++++++++++++++---
 arch/x86/kvm/x86.c                 | 48 ++++++++++++++++++-------
 3 files changed, 89 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 97cb592687cb..863c04561a37 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -178,7 +178,7 @@ struct x86_emulate_ops {
 	 */
 	int (*cmpxchg_emulated)(struct x86_emulate_ctxt *ctxt,
 				unsigned long addr,
-				const void *old,
+				void *old,
 				const void *new,
 				unsigned int bytes,
 				struct x86_exception *fault);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 7261b94c6c00..dac4c0ca1ee3 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1547,11 +1547,15 @@ static int segmented_cmpxchg(struct x86_emulate_ctxt *ctxt,
 {
 	int rc;
 	ulong linear;
+	unsigned char buf[16];
 
 	rc = linearize(ctxt, addr, size, true, &linear);
 	if (rc != X86EMUL_CONTINUE)
 		return rc;
-	return ctxt->ops->cmpxchg_emulated(ctxt, linear, orig_data, data,
+	if (size > sizeof(buf))
+		return X86EMUL_UNHANDLEABLE;
+	memcpy(buf, orig_data, size);
+	return ctxt->ops->cmpxchg_emulated(ctxt, linear, buf, data,
 					   size, &ctxt->exception);
 }
 
@@ -1803,16 +1807,21 @@ static int __load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
 		/* CS(RPL) <- CPL */
 		selector = (selector & 0xfffc) | cpl;
 		break;
-	case VCPU_SREG_TR:
+	case VCPU_SREG_TR: {
+		struct desc_struct buf;
+
 		if (seg_desc.s || (seg_desc.type != 1 && seg_desc.type != 9))
 			goto exception;
-		old_desc = seg_desc;
+		buf = old_desc = seg_desc;
 		seg_desc.type |= 2; /* busy */
-		ret = ctxt->ops->cmpxchg_emulated(ctxt, desc_addr, &old_desc, &seg_desc,
-						  sizeof(seg_desc), &ctxt->exception);
+		ret = ctxt->ops->cmpxchg_emulated(ctxt, desc_addr, &buf,
+						  &seg_desc,
+						  sizeof(seg_desc),
+						  &ctxt->exception);
 		if (ret != X86EMUL_CONTINUE)
 			return ret;
 		break;
+	}
 	case VCPU_SREG_LDTR:
 		if (seg_desc.s || seg_desc.type != 2)
 			goto exception;
@@ -2384,6 +2393,44 @@ static int em_ret_far_imm(struct x86_emulate_ctxt *ctxt)
 
 static int em_cmpxchg(struct x86_emulate_ctxt *ctxt)
 {
+	if (ctxt->lock_prefix) {
+		int rc;
+		ulong linear;
+		u64 old = reg_read(ctxt, VCPU_REGS_RAX);
+		u64 new = ctxt->src.val64;
+
+		/* disable writeback altogether */
+		ctxt->d &= ~SrcWrite;
+		ctxt->d |= NoWrite;
+
+		rc = linearize(ctxt, ctxt->dst.addr.mem, ctxt->dst.bytes, true,
+			       &linear);
+		if (rc != X86EMUL_CONTINUE)
+			return rc;
+
+		rc = ctxt->ops->cmpxchg_emulated(ctxt, linear, &old, &new,
+						 ctxt->dst.bytes,
+						 &ctxt->exception);
+
+		switch (rc) {
+		case X86EMUL_CONTINUE:
+			ctxt->eflags |= X86_EFLAGS_ZF;
+			break;
+		case X86EMUL_CMPXCHG_FAILED: {
+			u64 mask = BITMAP_LAST_WORD_MASK(ctxt->dst.bytes * 8);
+
+			*reg_write(ctxt, VCPU_REGS_RAX) = old & mask;
+
+			ctxt->eflags &= ~X86_EFLAGS_ZF;
+
+			rc = X86EMUL_CONTINUE;
+			break;
+		}
+		}
+
+		return rc;
+	}
+
 	/* Save real source value, then compare EAX against destination. */
 	ctxt->dst.orig_val = ctxt->dst.val;
 	ctxt->dst.val = reg_read(ctxt, VCPU_REGS_RAX);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e09a76179c4b..346ce6c5887b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5643,18 +5643,18 @@ static int emulator_write_emulated(struct x86_emulate_ctxt *ctxt,
 }
 
 #define CMPXCHG_TYPE(t, ptr, old, new) \
-	(cmpxchg((t *)(ptr), *(t *)(old), *(t *)(new)) == *(t *)(old))
+	cmpxchg((t *)(ptr), *(t *)(old), *(t *)(new))
 
 #ifdef CONFIG_X86_64
 #  define CMPXCHG64(ptr, old, new) CMPXCHG_TYPE(u64, ptr, old, new)
 #else
 #  define CMPXCHG64(ptr, old, new) \
-	(cmpxchg64((u64 *)(ptr), *(u64 *)(old), *(u64 *)(new)) == *(u64 *)(old))
+	cmpxchg64((u64 *)(ptr), *(u64 *)(old), *(u64 *)(new))
 #endif
 
 static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 				     unsigned long addr,
-				     const void *old,
+				     void *old,
 				     const void *new,
 				     unsigned int bytes,
 				     struct x86_exception *exception)
@@ -5663,7 +5663,7 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	gpa_t gpa;
 	struct page *page;
 	char *kaddr;
-	bool exchanged;
+	bool exchanged = false;
 
 	/* guests cmpxchg8b have to be emulated atomically */
 	if (bytes > 8 || (bytes & (bytes - 1)))
@@ -5688,18 +5688,42 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	kaddr = kmap_atomic(page);
 	kaddr += offset_in_page(gpa);
 	switch (bytes) {
-	case 1:
-		exchanged = CMPXCHG_TYPE(u8, kaddr, old, new);
+	case 1: {
+		u8 val = CMPXCHG_TYPE(u8, kaddr, old, new);
+
+		if (*((u8 *)old) == val)
+			exchanged = true;
+		else
+			*((u8 *)old) = val;
 		break;
-	case 2:
-		exchanged = CMPXCHG_TYPE(u16, kaddr, old, new);
+	}
+	case 2: {
+		u16 val = CMPXCHG_TYPE(u16, kaddr, old, new);
+
+		if (*((u16 *)old) == val)
+			exchanged = true;
+		else
+			*((u16 *)old) = val;
 		break;
-	case 4:
-		exchanged = CMPXCHG_TYPE(u32, kaddr, old, new);
+	}
+	case 4: {
+		u32 val = CMPXCHG_TYPE(u32, kaddr, old, new);
+
+		if (*((u32 *)old) == val)
+			exchanged = true;
+		else
+			*((u32 *)old) = val;
 		break;
-	case 8:
-		exchanged = CMPXCHG64(kaddr, old, new);
+	}
+	case 8: {
+		u64 val = CMPXCHG64(kaddr, old, new);
+
+		if (*((u64 *)old) == val)
+			exchanged = true;
+		else
+			*((u64 *)old) = val;
 		break;
+	}
 	default:
 		BUG();
 	}

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 90/92] kvm: x86: emulate lock cmpxchg8b atomically
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (88 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 89/92] kvm: x86: make lock cmpxchg r, r/m atomic Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 91/92] kvm: x86: emulate lock cmpxchg16b m128 Adalbert Lazăr
                   ` (4 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

As it was the case for lock cmpxchg, lock cmpxchg8b was emulated in two
steps the first one setting/clearing the zero flag and the last one
making the actual atomic operation.

This patch fixes that by combining the two, ie. the writeback step is
no longer necessary as the first step made the changes directly in
memory.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/emulate.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index dac4c0ca1ee3..2038e42c1eae 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2320,7 +2320,47 @@ static int em_call_near_abs(struct x86_emulate_ctxt *ctxt)
 
 static int em_cmpxchg8b(struct x86_emulate_ctxt *ctxt)
 {
-	u64 old = ctxt->dst.orig_val64;
+	u64 old;
+
+	if (ctxt->lock_prefix) {
+		int rc;
+		ulong linear;
+		u64 new = (reg_read(ctxt, VCPU_REGS_RBX) & (u32)-1) |
+			((reg_read(ctxt, VCPU_REGS_RCX) & (u32)-1) << 32);
+
+		old = (reg_read(ctxt, VCPU_REGS_RAX) & (u32)-1) |
+			((reg_read(ctxt, VCPU_REGS_RDX) & (u32)-1) << 32);
+
+		/* disable writeback altogether */
+		ctxt->d &= ~SrcWrite;
+		ctxt->d |= NoWrite;
+
+		rc = linearize(ctxt, ctxt->dst.addr.mem, 8, true, &linear);
+		if (rc != X86EMUL_CONTINUE)
+			return rc;
+
+		rc = ctxt->ops->cmpxchg_emulated(ctxt, linear, &old, &new,
+						 ctxt->dst.bytes,
+						 &ctxt->exception);
+
+		switch (rc) {
+		case X86EMUL_CONTINUE:
+			ctxt->eflags |= X86_EFLAGS_ZF;
+			break;
+		case X86EMUL_CMPXCHG_FAILED:
+			*reg_write(ctxt, VCPU_REGS_RAX) = old & (u32)-1;
+			*reg_write(ctxt, VCPU_REGS_RDX) = (old >> 32) & (u32)-1;
+
+			ctxt->eflags &= ~X86_EFLAGS_ZF;
+
+			rc = X86EMUL_CONTINUE;
+			break;
+		}
+
+		return rc;
+	}
+
+	old = ctxt->dst.orig_val64;
 
 	if (ctxt->dst.bytes == 16)
 		return X86EMUL_UNHANDLEABLE;

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 91/92] kvm: x86: emulate lock cmpxchg16b m128
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (89 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 90/92] kvm: x86: emulate lock cmpxchg8b atomically Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-09 16:00 ` [RFC PATCH v6 92/92] kvm: x86: fallback to the single-step on multipage CMPXCHG emulation Adalbert Lazăr
                   ` (3 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

This patch adds support for lock cmpxchg16b m128 by extending the
existent emulation for lock cmpxchg8b m64.

For implementing the atomic operation, we use an explicit assembler
statement, as cmpxchg_double() does not provide the contents of the
memory on failure. As before, writeback is completely disabled as the
operation is executed directly on guest memory, unless the architecture
does not advertise CMPXCHG16B in CPUID.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/emulate.c | 117 ++++++++++++++++++++++++++++++-----------
 arch/x86/kvm/x86.c     |  37 ++++++++++++-
 2 files changed, 122 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 2038e42c1eae..a37ad63836ea 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2318,46 +2318,103 @@ static int em_call_near_abs(struct x86_emulate_ctxt *ctxt)
 	return rc;
 }
 
-static int em_cmpxchg8b(struct x86_emulate_ctxt *ctxt)
+static int em_cmpxchg8b_locked(struct x86_emulate_ctxt *ctxt)
 {
-	u64 old;
+	int rc;
+	ulong linear;
+	u64 new = (reg_read(ctxt, VCPU_REGS_RBX) & (u32)-1) |
+		((reg_read(ctxt, VCPU_REGS_RCX) & (u32)-1) << 32);
+	u64 old = (reg_read(ctxt, VCPU_REGS_RAX) & (u32)-1) |
+		((reg_read(ctxt, VCPU_REGS_RDX) & (u32)-1) << 32);
 
-	if (ctxt->lock_prefix) {
-		int rc;
-		ulong linear;
-		u64 new = (reg_read(ctxt, VCPU_REGS_RBX) & (u32)-1) |
-			((reg_read(ctxt, VCPU_REGS_RCX) & (u32)-1) << 32);
+	/* disable writeback altogether */
+	ctxt->d |= NoWrite;
 
-		old = (reg_read(ctxt, VCPU_REGS_RAX) & (u32)-1) |
-			((reg_read(ctxt, VCPU_REGS_RDX) & (u32)-1) << 32);
+	rc = linearize(ctxt, ctxt->dst.addr.mem, 8, true, &linear);
+	if (rc != X86EMUL_CONTINUE)
+		return rc;
 
-		/* disable writeback altogether */
-		ctxt->d &= ~SrcWrite;
-		ctxt->d |= NoWrite;
+	rc = ctxt->ops->cmpxchg_emulated(ctxt, linear, &old, &new,
+					 8, &ctxt->exception);
 
-		rc = linearize(ctxt, ctxt->dst.addr.mem, 8, true, &linear);
-		if (rc != X86EMUL_CONTINUE)
-			return rc;
 
-		rc = ctxt->ops->cmpxchg_emulated(ctxt, linear, &old, &new,
-						 ctxt->dst.bytes,
-						 &ctxt->exception);
+	switch (rc) {
+	case X86EMUL_CONTINUE:
+		ctxt->eflags |= X86_EFLAGS_ZF;
+		break;
+	case X86EMUL_CMPXCHG_FAILED:
+		*reg_write(ctxt, VCPU_REGS_RAX) = old & (u32)-1;
+		*reg_write(ctxt, VCPU_REGS_RDX) = (old >> 32) & (u32)-1;
 
-		switch (rc) {
-		case X86EMUL_CONTINUE:
-			ctxt->eflags |= X86_EFLAGS_ZF;
-			break;
-		case X86EMUL_CMPXCHG_FAILED:
-			*reg_write(ctxt, VCPU_REGS_RAX) = old & (u32)-1;
-			*reg_write(ctxt, VCPU_REGS_RDX) = (old >> 32) & (u32)-1;
+		ctxt->eflags &= ~X86_EFLAGS_ZF;
 
-			ctxt->eflags &= ~X86_EFLAGS_ZF;
+		rc = X86EMUL_CONTINUE;
+		break;
+	}
 
-			rc = X86EMUL_CONTINUE;
-			break;
-		}
+	return rc;
+}
+
+#ifdef CONFIG_X86_64
+static int em_cmpxchg16b_locked(struct x86_emulate_ctxt *ctxt)
+{
+	int rc;
+	ulong linear;
+	u64 new[2] = {
+		reg_read(ctxt, VCPU_REGS_RBX),
+		reg_read(ctxt, VCPU_REGS_RCX)
+	};
+	u64 old[2] = {
+		reg_read(ctxt, VCPU_REGS_RAX),
+		reg_read(ctxt, VCPU_REGS_RDX)
+	};
 
+	/* disable writeback altogether */
+	ctxt->d |= NoWrite;
+
+	rc = linearize(ctxt, ctxt->dst.addr.mem, 16, true, &linear);
+	if (rc != X86EMUL_CONTINUE)
 		return rc;
+
+	if (linear % 16)
+		return emulate_gp(ctxt, 0);
+
+	rc = ctxt->ops->cmpxchg_emulated(ctxt, linear, old, new,
+					 16, &ctxt->exception);
+
+	switch (rc) {
+	case X86EMUL_CONTINUE:
+		ctxt->eflags |= X86_EFLAGS_ZF;
+		break;
+	case X86EMUL_CMPXCHG_FAILED:
+		*reg_write(ctxt, VCPU_REGS_RAX) = old[0];
+		*reg_write(ctxt, VCPU_REGS_RDX) = old[1];
+
+		ctxt->eflags &= ~X86_EFLAGS_ZF;
+
+		rc = X86EMUL_CONTINUE;
+		break;
+	}
+
+	return rc;
+}
+#else
+static int em_cmpxchg16b_locked(struct x86_emulate_ctxt *ctxt)
+{
+	return X86EMUL_UNHANDLEABLE;
+}
+#endif
+
+static int em_cmpxchg8_16b(struct x86_emulate_ctxt *ctxt)
+{
+	u64 old;
+
+	if (ctxt->lock_prefix) {
+		if (ctxt->dst.bytes == 8)
+			return em_cmpxchg8b_locked(ctxt);
+		else if (ctxt->dst.bytes == 16)
+			return em_cmpxchg16b_locked(ctxt);
+		return X86EMUL_UNHANDLEABLE;
 	}
 
 	old = ctxt->dst.orig_val64;
@@ -4679,7 +4736,7 @@ static const struct gprefix pfx_0f_c7_7 = {
 
 
 static const struct group_dual group9 = { {
-	N, I(DstMem64 | Lock | PageTable, em_cmpxchg8b), N, N, N, N, N, N,
+	N, I(DstMem64 | Lock | PageTable, em_cmpxchg8_16b), N, N, N, N, N, N,
 }, {
 	N, N, N, N, N, N, N,
 	GP(0, &pfx_0f_c7_7),
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 346ce6c5887b..0e904782d303 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5665,8 +5665,17 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	char *kaddr;
 	bool exchanged = false;
 
-	/* guests cmpxchg8b have to be emulated atomically */
-	if (bytes > 8 || (bytes & (bytes - 1)))
+#ifdef CONFIG_X86_64
+#define CMPXCHG_MAX_BYTES 16
+#else
+#define CMPXCHG_MAX_BYTES 8
+#endif
+
+	/* guests cmpxchg{8,16}b have to be emulated atomically */
+	if (bytes > CMPXCHG_MAX_BYTES || (bytes & (bytes - 1)))
+		goto emul_write;
+
+	if (bytes == 16 && !system_has_cmpxchg_double())
 		goto emul_write;
 
 	gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, NULL);
@@ -5724,6 +5733,30 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 			*((u64 *)old) = val;
 		break;
 	}
+#ifdef CONFIG_X86_64
+	case 16: {
+		u64 *p1 = (u64 *)kaddr;
+		u64 *p2 = p1 + 1;
+		u64 *o1 = old;
+		u64 *o2 = o1 + 1;
+		const u64 *n1 = new;
+		const u64 *n2 = n1 + 1;
+		const u64 __o1 = *o1;
+		const u64 __o2 = *o2;
+
+		/*
+		 * We use an explicit asm statement because cmpxchg_double()
+		 * does not return the previous memory contents on failure
+		 */
+		asm volatile ("lock cmpxchg16b %2\n"
+			      : "+a"(*o1), "+d"(*o2), "+m"(*p1), "+m"(*p2)
+			      : "b"(*n1), "c"(*n2) : "memory");
+
+		if (__o1 == *o1 && __o2 == *o2)
+			exchanged = true;
+		break;
+	}
+#endif
 	default:
 		BUG();
 	}

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [RFC PATCH v6 92/92] kvm: x86: fallback to the single-step on multipage CMPXCHG emulation
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (90 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 91/92] kvm: x86: emulate lock cmpxchg16b m128 Adalbert Lazăr
@ 2019-08-09 16:00 ` Adalbert Lazăr
  2019-08-12 18:23 ` [RFC PATCH v6 00/92] VM introspection Sean Christopherson
                   ` (2 subsequent siblings)
  94 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-08-09 16:00 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Adalbert Lazăr

From: Mihai Donțu <mdontu@bitdefender.com>

There are cases where we need to emulate a CMPXCHG that touches two
pages (4 in one and another 4 in the next, for example). Because it
is not easy to map two pages in the kernel so that we can directly
execute the exchange instruction, we fallback to single-stepping.
Luckly, this is an uncommon occurrence making the overhead of the
single-step mechanism acceptable.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/x86.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0e904782d303..e283b074db26 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5671,6 +5671,12 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 #define CMPXCHG_MAX_BYTES 8
 #endif
 
+	gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, NULL);
+
+	if (gpa == UNMAPPED_GVA ||
+	    (gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
+		goto emul_write;
+
 	/* guests cmpxchg{8,16}b have to be emulated atomically */
 	if (bytes > CMPXCHG_MAX_BYTES || (bytes & (bytes - 1)))
 		goto emul_write;
@@ -5678,12 +5684,6 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	if (bytes == 16 && !system_has_cmpxchg_double())
 		goto emul_write;
 
-	gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, NULL);
-
-	if (gpa == UNMAPPED_GVA ||
-	    (gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
-		goto emul_write;
-
 	if (((gpa + bytes - 1) & PAGE_MASK) != (gpa & PAGE_MASK))
 		goto emul_write;
 
@@ -5772,6 +5772,9 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	return X86EMUL_CONTINUE;
 
 emul_write:
+	if (kvmi_tracked_gfn(vcpu, gpa >> PAGE_SHIFT))
+		return X86EMUL_UNHANDLEABLE;
+
 	printk_once(KERN_WARNING "kvm: emulating exchange as write\n");
 
 	return emulator_write_emulated(ctxt, addr, new, bytes, exception);

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* DANGER WILL ROBINSON, DANGER
  2019-08-09 16:00 ` [RFC PATCH v6 71/92] mm: add support for remote mapping Adalbert Lazăr
@ 2019-08-09 16:24   ` Matthew Wilcox
  2019-08-13  9:29     ` Paolo Bonzini
       [not found]     ` <1565694095.D172a51.28640.@15f23d3a749365d981e968181cce585d2dcb3ffa>
  0 siblings, 2 replies; 158+ messages in thread
From: Matthew Wilcox @ 2019-08-09 16:24 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Mircea Cîrjaliu

On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> +++ b/include/linux/page-flags.h
> @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
>   */
>  #define PAGE_MAPPING_ANON	0x1
>  #define PAGE_MAPPING_MOVABLE	0x2
> +#define PAGE_MAPPING_REMOTE	0x4

Uh.  How do you know page->mapping would otherwise have bit 2 clear?
Who's guaranteeing that?

This is an awfully big patch to the memory management code, buried in
the middle of a gigantic series which almost guarantees nobody would
look at it.  I call shenanigans.

> @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
>   * __page_set_anon_rmap - set up new anonymous rmap
>   * @page:	Page or Hugepage to add to rmap
>   * @vma:	VM area to add page to.
> - * @address:	User virtual address of the mapping	
> + * @address:	User virtual address of the mapping

And mixing in fluff changes like this is a real no-no.  Try again.


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 00/92] VM introspection
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (91 preceding siblings ...)
  2019-08-09 16:00 ` [RFC PATCH v6 92/92] kvm: x86: fallback to the single-step on multipage CMPXCHG emulation Adalbert Lazăr
@ 2019-08-12 18:23 ` Sean Christopherson
  2019-08-12 21:40 ` Sean Christopherson
  2019-08-13  9:34 ` Paolo Bonzini
  94 siblings, 0 replies; 158+ messages in thread
From: Sean Christopherson @ 2019-08-12 18:23 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu

On Fri, Aug 09, 2019 at 06:59:15PM +0300, Adalbert Lazăr wrote:
>  virt/kvm/kvm_main.c                      |   70 +-
>  virt/kvm/kvmi.c                          | 2054 ++++++++++++++++++++++
>  virt/kvm/kvmi_int.h                      |  311 ++++
>  virt/kvm/kvmi_mem.c                      |  324 ++++
>  virt/kvm/kvmi_mem_guest.c                |  651 +++++++
>  virt/kvm/kvmi_msg.c                      | 1236 +++++++++++++

That's a lot of code.  An 'introspection' sub-directory might be warranted.

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem)
  2019-08-09 15:59 ` [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem) Adalbert Lazăr
@ 2019-08-12 20:20   ` Sean Christopherson
  2019-08-13  9:11     ` Paolo Bonzini
       [not found]     ` <5d52a5ae.1c69fb81.5c260.1573SMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 2 replies; 158+ messages in thread
From: Sean Christopherson @ 2019-08-12 20:20 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Mircea Cîrjaliu

On Fri, Aug 09, 2019 at 06:59:16PM +0300, Adalbert Lazăr wrote:
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 72fa955f4a15..f70a6a1b6814 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -96,6 +96,13 @@ config KVM_MMU_AUDIT
>  	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
>  	 auditing of KVM MMU events at runtime.
>  
> +config KVM_INTROSPECTION
> +	bool "VM Introspection"
> +	depends on KVM && (KVM_INTEL || KVM_AMD)
> +	help
> +	 This option enables functions to control the execution of VM-s, query
> +	 the state of the vCPU-s (GPR-s, MSR-s etc.).

This does a lot more than enable functions, it allows userspace to do all
of these things *while the VM is running*.  Everything above can already
be done by userspace.

The "-s" syntax is difficult to read and unnecessary, e.g. at first I
thought VM-s was referring to a new subsystem or feature introduced by
introspection.  VMs, vCPUs, GPRs, MSRs, etc...

> +
>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
>  # the virtualization menu.
>  source "drivers/vhost/Kconfig"
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 31ecf7a76d5a..312597bd47c7 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -7,6 +7,7 @@ KVM := ../../../virt/kvm
>  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>  				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
>  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
> +kvm-$(CONFIG_KVM_INTROSPECTION) += $(KVM)/kvmi.o
>  
>  kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
>  			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c38cc5eb7e73..582b0187f5a4 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -455,6 +455,10 @@ struct kvm {
>  	struct srcu_struct srcu;
>  	struct srcu_struct irq_srcu;
>  	pid_t userspace_pid;
> +
> +	struct completion kvmi_completed;
> +	refcount_t kvmi_ref;

The refcounting approach seems a bit backwards, and AFAICT is driven by
implementing unhook via a message, which also seems backwards.  I assume
hook and unhook are relatively rare events and not performance critical,
so make those the restricted/slow flows, e.g. force userspace to quiesce
the VM by making unhook() mutually exclusive with every vcpu ioctl() and
maybe anything that takes kvm->lock. 

Then kvmi_ioctl_unhook() can use thread_stop() and kvmi_recv() just needs
to check kthread_should_stop().

That way kvmi doesn't need to be refcounted since it's guaranteed to be
alive if the pointer is non-null.  Eliminating the refcounting will clean
up a lot of the code by eliminating calls to kvmi_{get,put}(), e.g.
wrappers like kvmi_breakpoint_event() just check vcpu->kvmi, or maybe
even get dropped altogether.

> +	void *kvmi;

Why is this a void*?  Just forward declare struct kvmi in kvmi.h.

IMO this should be 'struct kvm_introspection *introspection', similar to
'struct kvm_vcpu_arch arch' and 'struct kvm_vmx'.  Ditto for the vCPU
flavor.  Local variables could be kvmi+vcpui, kvm_i+vcpu_i, or maybe
a more long form if someone can come up with a good abbreviation?

Using 'ikvm' as the local variable name when everything else refers to
introspection as 'kvmi' is especially funky.

>  };
>  
>  #define kvm_err(fmt, ...) \
> diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
> new file mode 100644
> index 000000000000..e36de3f9f3de
> --- /dev/null
> +++ b/include/linux/kvmi.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __KVMI_H__
> +#define __KVMI_H__
> +
> +#define kvmi_is_present() IS_ENABLED(CONFIG_KVM_INTROSPECTION)

Peeking forward a few patches, introspection should have a module param.
The code is also inconsistent in its usage of kvmi_is_present() versus
#ifdef CONFIG_KVM_INTROSPECTION.

And maybe kvm_is_instrospection_enabled() so that the gating function has
a more descriptive name for first-time readers?

> +
> +#ifdef CONFIG_KVM_INTROSPECTION
> +
> +int kvmi_init(void);
> +void kvmi_uninit(void);
> +void kvmi_create_vm(struct kvm *kvm);
> +void kvmi_destroy_vm(struct kvm *kvm);
> +
> +#else
> +
> +static inline int kvmi_init(void) { return 0; }
> +static inline void kvmi_uninit(void) { }
> +static inline void kvmi_create_vm(struct kvm *kvm) { }
> +static inline void kvmi_destroy_vm(struct kvm *kvm) { }
> +
> +#endif /* CONFIG_KVM_INTROSPECTION */
> +
> +#endif
> diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
> new file mode 100644
> index 000000000000..dbf63ad0862f
> --- /dev/null
> +++ b/include/uapi/linux/kvmi.h
> @@ -0,0 +1,68 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI__LINUX_KVMI_H
> +#define _UAPI__LINUX_KVMI_H
> +
> +/*
> + * KVMI structures and definitions
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +
> +#define KVMI_VERSION 0x00000001
> +
> +enum {
> +	KVMI_EVENT_REPLY           = 0,
> +	KVMI_EVENT                 = 1,
> +
> +	KVMI_FIRST_COMMAND         = 2,
> +
> +	KVMI_GET_VERSION           = 2,
> +	KVMI_CHECK_COMMAND         = 3,
> +	KVMI_CHECK_EVENT           = 4,
> +	KVMI_GET_GUEST_INFO        = 5,
> +	KVMI_GET_VCPU_INFO         = 6,
> +	KVMI_PAUSE_VCPU            = 7,
> +	KVMI_CONTROL_VM_EVENTS     = 8,
> +	KVMI_CONTROL_EVENTS        = 9,
> +	KVMI_CONTROL_CR            = 10,
> +	KVMI_CONTROL_MSR           = 11,
> +	KVMI_CONTROL_VE            = 12,
> +	KVMI_GET_REGISTERS         = 13,
> +	KVMI_SET_REGISTERS         = 14,
> +	KVMI_GET_CPUID             = 15,
> +	KVMI_GET_XSAVE             = 16,
> +	KVMI_READ_PHYSICAL         = 17,
> +	KVMI_WRITE_PHYSICAL        = 18,
> +	KVMI_INJECT_EXCEPTION      = 19,
> +	KVMI_GET_PAGE_ACCESS       = 20,
> +	KVMI_SET_PAGE_ACCESS       = 21,
> +	KVMI_GET_MAP_TOKEN         = 22,
> +	KVMI_GET_MTRR_TYPE         = 23,
> +	KVMI_CONTROL_SPP           = 24,
> +	KVMI_GET_PAGE_WRITE_BITMAP = 25,
> +	KVMI_SET_PAGE_WRITE_BITMAP = 26,
> +	KVMI_CONTROL_CMD_RESPONSE  = 27,

Each command should be introduced along with the patch that adds the
associated functionality.

It'd be helpful to incorporate the scope of the command in the name,
e.g. VM vs. vCPU.

Why are VM and vCPU commands smushed together?

> +
> +	KVMI_NEXT_AVAILABLE_COMMAND,

Why not KVMI_NR_COMMANDS or KVM_NUM_COMMANDS?  At least be consistent
between COMMANDS and EVENTS below.

> +
> +};
> +
> +enum {
> +	KVMI_EVENT_UNHOOK      = 0,
> +	KVMI_EVENT_CR	       = 1,
> +	KVMI_EVENT_MSR	       = 2,
> +	KVMI_EVENT_XSETBV      = 3,
> +	KVMI_EVENT_BREAKPOINT  = 4,
> +	KVMI_EVENT_HYPERCALL   = 5,
> +	KVMI_EVENT_PF	       = 6,
> +	KVMI_EVENT_TRAP	       = 7,
> +	KVMI_EVENT_DESCRIPTOR  = 8,
> +	KVMI_EVENT_CREATE_VCPU = 9,
> +	KVMI_EVENT_PAUSE_VCPU  = 10,
> +	KVMI_EVENT_SINGLESTEP  = 11,
> +
> +	KVMI_NUM_EVENTS
> +};
> +
> +#endif /* _UAPI__LINUX_KVMI_H */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 585845203db8..90e432d225ab 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -51,6 +51,7 @@
>  #include <linux/slab.h>
>  #include <linux/sort.h>
>  #include <linux/bsearch.h>
> +#include <linux/kvmi.h>
>  
>  #include <asm/processor.h>
>  #include <asm/io.h>
> @@ -680,6 +681,8 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  	if (r)
>  		goto out_err;
>  
> +	kvmi_create_vm(kvm);
> +
>  	spin_lock(&kvm_lock);
>  	list_add(&kvm->vm_list, &vm_list);
>  	spin_unlock(&kvm_lock);
> @@ -725,6 +728,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	int i;
>  	struct mm_struct *mm = kvm->mm;
>  
> +	kvmi_destroy_vm(kvm);
>  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
>  	kvm_destroy_vm_debugfs(kvm);
>  	kvm_arch_sync_events(kvm);
> @@ -1556,7 +1560,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  	 * Whoever called remap_pfn_range is also going to call e.g.
>  	 * unmap_mapping_range before the underlying pages are freed,
>  	 * causing a call to our MMU notifier.
> -	 */ 
> +	 */

Spurious whitespace change.

>  	kvm_get_pfn(pfn);
>  
>  	*p_pfn = pfn;
> @@ -4204,6 +4208,9 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
>  	r = kvm_vfio_ops_init();
>  	WARN_ON(r);
>  
> +	r = kvmi_init();
> +	WARN_ON(r);

Leftover development/debugging crud.

> +
>  	return 0;
>  
>  out_unreg:
> @@ -4229,6 +4236,7 @@ EXPORT_SYMBOL_GPL(kvm_init);
>  
>  void kvm_exit(void)
>  {
> +	kvmi_uninit();
>  	debugfs_remove_recursive(kvm_debugfs_dir);
>  	misc_deregister(&kvm_dev);
>  	kmem_cache_destroy(kvm_vcpu_cache);
> diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
> new file mode 100644
> index 000000000000..20638743bd03
> --- /dev/null
> +++ b/virt/kvm/kvmi.c
> @@ -0,0 +1,64 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KVM introspection
> + *
> + * Copyright (C) 2017-2019 Bitdefender S.R.L.
> + *
> + */
> +#include <uapi/linux/kvmi.h>
> +#include "kvmi_int.h"
> +
> +int kvmi_init(void)
> +{
> +	return 0;
> +}
> +
> +void kvmi_uninit(void)
> +{
> +}
> +
> +struct kvmi * __must_check kvmi_get(struct kvm *kvm)
> +{
> +	if (refcount_inc_not_zero(&kvm->kvmi_ref))
> +		return kvm->kvmi;
> +
> +	return NULL;
> +}
> +
> +static void kvmi_destroy(struct kvm *kvm)
> +{
> +}
> +
> +static void kvmi_release(struct kvm *kvm)
> +{
> +	kvmi_destroy(kvm);
> +
> +	complete(&kvm->kvmi_completed);
> +}
> +
> +/* This function may be called from atomic context and must not sleep */
> +void kvmi_put(struct kvm *kvm)
> +{
> +	if (refcount_dec_and_test(&kvm->kvmi_ref))
> +		kvmi_release(kvm);
> +}
> +
> +void kvmi_create_vm(struct kvm *kvm)
> +{
> +	init_completion(&kvm->kvmi_completed);
> +	complete(&kvm->kvmi_completed);

Pretty sure you don't want to be calling complete() here.

> +}
> +
> +void kvmi_destroy_vm(struct kvm *kvm)
> +{
> +	struct kvmi *ikvm;
> +
> +	ikvm = kvmi_get(kvm);
> +	if (!ikvm)
> +		return;
> +
> +	kvmi_put(kvm);
> +
> +	/* wait for introspection resources to be released */
> +	wait_for_completion_killable(&kvm->kvmi_completed);
> +}
> diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
> new file mode 100644
> index 000000000000..ac23ad6fc4df
> --- /dev/null
> +++ b/virt/kvm/kvmi_int.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __KVMI_INT_H__
> +#define __KVMI_INT_H__
> +
> +#include <linux/kvm_host.h>
> +
> +#define IKVM(kvm) ((struct kvmi *)((kvm)->kvmi))
> +
> +struct kvmi {
> +};
> +
> +#endif

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 64/92] kvm: introspection: add single-stepping
  2019-08-09 16:00 ` [RFC PATCH v6 64/92] kvm: introspection: add single-stepping Adalbert Lazăr
@ 2019-08-12 20:50   ` Sean Christopherson
  2019-08-14 12:36     ` Nicusor CITU
  0 siblings, 1 reply; 158+ messages in thread
From: Sean Christopherson @ 2019-08-12 20:50 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Nicușor Cîțu, Jim Mattson, Joerg Roedel

On Fri, Aug 09, 2019 at 07:00:19PM +0300, Adalbert Lazăr wrote:
> From: Nicușor Cîțu <ncitu@bitdefender.com>
> 
> This would be used either if the introspection tool request it as a
> reply to a KVMI_EVENT_PF event or to cope with instructions that cannot
> be handled by the x86 emulator during the handling of a VMEXIT. In
> these situations, all other vCPU-s are kicked and held, the EPT-based
> protection is removed and the guest is single stepped by the vCPU that
> triggered the initial VMEXIT. Upon completion the EPT-base protection
> is reinstalled and all vCPU-s all allowed to return to the guest.
> 
> This is a rather slow workaround that kicks in occasionally. In the
> future, the most frequently single-stepped instructions should be added
> to the emulator (usually, stores to and from memory - SSE/AVX).
> 
> For the moment it works only on Intel.
> 
> CC: Jim Mattson <jmattson@google.com>
> CC: Sean Christopherson <sean.j.christopherson@intel.com>
> CC: Joerg Roedel <joro@8bytes.org>
> Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
> Co-developed-by: Mihai Donțu <mdontu@bitdefender.com>
> Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   3 +
>  arch/x86/kvm/kvmi.c             |  47 ++++++++++-
>  arch/x86/kvm/svm.c              |   5 ++
>  arch/x86/kvm/vmx/vmx.c          |  17 ++++
>  arch/x86/kvm/x86.c              |  19 +++++
>  include/linux/kvmi.h            |   4 +
>  virt/kvm/kvmi.c                 | 145 +++++++++++++++++++++++++++++++-
>  virt/kvm/kvmi_int.h             |  16 ++++
>  8 files changed, 253 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index ad36a5fc2048..60e2c298d469 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1016,6 +1016,7 @@ struct kvm_x86_ops {
>  	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr,
>  				bool enable);
>  	bool (*desc_intercept)(struct kvm_vcpu *vcpu, bool enable);
> +	void (*set_mtf)(struct kvm_vcpu *vcpu, bool enable);

MTF is a VMX specific implementation of single-stepping, this should be
enable_single_step() or something along those lines.  For example, I assume
SVM could implement something that is mostly functional via RFLAGS.TF.

>  	void (*cr3_write_exiting)(struct kvm_vcpu *vcpu, bool enable);
>  	bool (*nested_pagefault)(struct kvm_vcpu *vcpu);
>  	bool (*spt_fault)(struct kvm_vcpu *vcpu);
> @@ -1628,6 +1629,8 @@ void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
>  				bool enable);
>  bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu);
>  bool kvm_spt_fault(struct kvm_vcpu *vcpu);
> +void kvm_set_mtf(struct kvm_vcpu *vcpu, bool enable);
> +void kvm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
>  void kvm_control_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable);
>  
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
> index 04cac5b8a4d0..f0ab4bd9eb37 100644
> --- a/arch/x86/kvm/kvmi.c
> +++ b/arch/x86/kvm/kvmi.c
> @@ -520,7 +520,6 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
>  	u32 ctx_size;
>  	u64 ctx_addr;
>  	u32 action;
> -	bool singlestep_ignored;
>  	bool ret = false;
>  
>  	if (!kvm_spt_fault(vcpu))
> @@ -533,7 +532,7 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
>  	if (ivcpu->effective_rep_complete)
>  		return true;
>  
> -	action = kvmi_msg_send_pf(vcpu, gpa, gva, access, &singlestep_ignored,
> +	action = kvmi_msg_send_pf(vcpu, gpa, gva, access, &ivcpu->ss_requested,
>  				  &ivcpu->rep_complete, &ctx_addr,
>  				  ivcpu->ctx_data, &ctx_size);
>  
> @@ -547,6 +546,8 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu, gpa_t gpa, gva_t gva,
>  		ret = true;
>  		break;
>  	case KVMI_EVENT_ACTION_RETRY:
> +		if (ivcpu->ss_requested && !kvmi_start_ss(vcpu, gpa, access))
> +			ret = true;
>  		break;
>  	default:
>  		kvmi_handle_common_event_actions(vcpu, action, "PF");
> @@ -758,6 +759,48 @@ int kvmi_arch_cmd_control_cr(struct kvm_vcpu *vcpu,
>  	return 0;
>  }
>  
> +void kvmi_arch_start_single_step(struct kvm_vcpu *vcpu)
> +{
> +	kvm_set_mtf(vcpu, true);
> +
> +	/*
> +	 * Set block by STI only if the RFLAGS.IF = 1.
> +	 * Blocking by both STI and MOV/POP SS is not possible.
> +	 */
> +	if (kvm_arch_interrupt_allowed(vcpu))
> +		kvm_set_interrupt_shadow(vcpu, KVM_X86_SHADOW_INT_STI);

This is wrong, the STI shadow only exists if interrupts were unblocked
prior to STI.  I'm guessing this is a hack to workaround
kvmi_arch_stop_single_step() not properly handling the clearing case.

> +
> +}
> +
> +void kvmi_arch_stop_single_step(struct kvm_vcpu *vcpu)
> +{
> +	kvm_set_mtf(vcpu, false);
> +	/*
> +	 * The blocking by STI is cleared after the guest
> +	 * executes one instruction or incurs an exception.
> +	 * However we migh stop the SS before entering to guest,
> +	 * so be sure we are clearing the STI blocking.
> +	 */
> +	kvm_set_interrupt_shadow(vcpu, 0);

There are only three callers of kvmi_stop_ss(), it should be possible
to accurately update interruptibility:

  - kvmi_run_ss() fail, do nothing
  - VM-Exit that wasn't a single-step - clear interruptibility if the
    guest executed an instruction (including faulted on an instr).
  - MTF VM-Exit - do nothing (VMCS should already be up-to-date).

> +}
> +
> +u8 kvmi_arch_relax_page_access(u8 old, u8 new)
> +{
> +	u8 ret = old | new;
> +
> +	/*
> +	 * An SPTE entry with just the -wx bits set can trigger a
> +	 * misconfiguration error from the hardware, as it's the case
> +	 * for x86 where this access mode is used to mark I/O memory.
> +	 * Thus, we make sure that -wx accesses are translated to rwx.
> +	 */
> +	if ((ret & (KVMI_PAGE_ACCESS_W | KVMI_PAGE_ACCESS_X)) ==
> +	    (KVMI_PAGE_ACCESS_W | KVMI_PAGE_ACCESS_X))
> +		ret |= KVMI_PAGE_ACCESS_R;
> +
> +	return ret;
> +}
> +
>  static const struct {
>  	unsigned int allow_bit;
>  	enum kvm_page_track_mode track_mode;
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index b178b8900660..3481c0247680 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -7183,6 +7183,10 @@ static bool svm_spt_fault(struct kvm_vcpu *vcpu)
>  	return (svm->vmcb->control.exit_code == SVM_EXIT_NPF);
>  }
>  
> +static void svm_set_mtf(struct kvm_vcpu *vcpu, bool enable)
> +{
> +}
> +
>  static void svm_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable)
>  {
>  }
> @@ -7225,6 +7229,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
>  	.cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr,
>  	.has_emulated_msr = svm_has_emulated_msr,
>  
> +	.set_mtf = svm_set_mtf,
>  	.cr3_write_exiting = svm_cr3_write_exiting,
>  	.msr_intercept = svm_msr_intercept,
>  	.desc_intercept = svm_desc_intercept,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7d1e341b51ad..f0369d0574dc 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5384,6 +5384,7 @@ static int handle_invalid_op(struct kvm_vcpu *vcpu)
>  
>  static int handle_monitor_trap(struct kvm_vcpu *vcpu)
>  {
> +	kvmi_stop_ss(vcpu);
>  	return 1;
>  }
>  
> @@ -5992,6 +5993,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
>  		}
>  	}
>  
> +	if (kvmi_vcpu_enabled_ss(vcpu)
> +			&& exit_reason != EXIT_REASON_EPT_VIOLATION
> +			&& exit_reason != EXIT_REASON_MONITOR_TRAP_FLAG)

Bad indentation.  This is prevelant through the series.

> +		kvmi_stop_ss(vcpu);
> +
>  	if (exit_reason < kvm_vmx_max_exit_handlers
>  	    && kvm_vmx_exit_handlers[exit_reason])
>  		return kvm_vmx_exit_handlers[exit_reason](vcpu);
> @@ -7842,6 +7848,16 @@ static __exit void hardware_unsetup(void)
>  	free_kvm_area();
>  }
>  
> +static void vmx_set_mtf(struct kvm_vcpu *vcpu, bool enable)
> +{
> +	if (enable)
> +		vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL,
> +			      CPU_BASED_MONITOR_TRAP_FLAG);
> +	else
> +		vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
> +				CPU_BASED_MONITOR_TRAP_FLAG);
> +}
> +
>  static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
>  			      bool enable)
>  {
> @@ -7927,6 +7943,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>  	.cpu_has_accelerated_tpr = report_flexpriority,
>  	.has_emulated_msr = vmx_has_emulated_msr,
>  
> +	.set_mtf = vmx_set_mtf,
>  	.msr_intercept = vmx_msr_intercept,
>  	.cr3_write_exiting = vmx_cr3_write_exiting,
>  	.desc_intercept = vmx_desc_intercept,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 38aaddadb93a..65855340249a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7358,6 +7358,13 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>  {
>  	int r;
>  
> +	if (kvmi_vcpu_enabled_ss(vcpu))
> +		/*
> +		 * We cannot inject events during single-stepping.
> +		 * Try again later.
> +		 */
> +		return -1;
> +
>  	/* try to reinject previous events if any */
>  
>  	if (vcpu->arch.exception.injected)
> @@ -10134,6 +10141,18 @@ void kvm_control_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable)
>  }
>  EXPORT_SYMBOL(kvm_control_cr3_write_exiting);
>  
> +void kvm_set_mtf(struct kvm_vcpu *vcpu, bool enable)
> +{
> +	kvm_x86_ops->set_mtf(vcpu, enable);
> +}
> +EXPORT_SYMBOL(kvm_set_mtf);
> +
> +void kvm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
> +{
> +	kvm_x86_ops->set_interrupt_shadow(vcpu, mask);
> +}
> +EXPORT_SYMBOL(kvm_set_interrupt_shadow);

Why do these wrappers exist, and why are they exported?  Introspection is
built into kvm, any reason not to use kvm_x86_ops directly?  The most
definitely don't need to be exported.

> +
>  bool kvm_spt_fault(struct kvm_vcpu *vcpu)
>  {
>  	return kvm_x86_ops->spt_fault(vcpu);
> diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
> index 5d162b9e67f2..1dc90284dc3a 100644
> --- a/include/linux/kvmi.h
> +++ b/include/linux/kvmi.h
> @@ -22,6 +22,8 @@ bool kvmi_queue_exception(struct kvm_vcpu *vcpu);
>  void kvmi_trap_event(struct kvm_vcpu *vcpu);
>  bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor, u8 write);
>  void kvmi_handle_requests(struct kvm_vcpu *vcpu);
> +void kvmi_stop_ss(struct kvm_vcpu *vcpu);
> +bool kvmi_vcpu_enabled_ss(struct kvm_vcpu *vcpu);

Spell out single step, and be consistent between single_step and singlestep.
That applies to pretty much every variable and function unless doing so
really makes the verbosity obnoxious.

>  void kvmi_init_emulate(struct kvm_vcpu *vcpu);
>  void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu);
>  bool kvmi_bp_intercepted(struct kvm_vcpu *vcpu, u32 dbg);
> @@ -44,6 +46,8 @@ static inline void kvmi_handle_requests(struct kvm_vcpu *vcpu) { }
>  static inline bool kvmi_hypercall_event(struct kvm_vcpu *vcpu) { return false; }
>  static inline bool kvmi_queue_exception(struct kvm_vcpu *vcpu) { return true; }
>  static inline void kvmi_trap_event(struct kvm_vcpu *vcpu) { }
> +static inline void kvmi_stop_ss(struct kvm_vcpu *vcpu) { }
> +static inline bool kvmi_vcpu_enabled_ss(struct kvm_vcpu *vcpu) { return false; }
>  static inline void kvmi_init_emulate(struct kvm_vcpu *vcpu) { }
>  static inline void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu) { }
>  static inline bool kvmi_bp_intercepted(struct kvm_vcpu *vcpu, u32 dbg)
> diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
> index d47a725a4045..a3a5af9080a9 100644
> --- a/virt/kvm/kvmi.c
> +++ b/virt/kvm/kvmi.c
> @@ -1260,11 +1260,19 @@ void kvmi_run_jobs(struct kvm_vcpu *vcpu)
>  	}
>  }
>  
> +static bool need_to_wait_for_ss(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +	struct kvmi *ikvm = IKVM(vcpu->kvm);
> +
> +	return atomic_read(&ikvm->ss_active) && !ivcpu->ss_owner;
> +}
> +
>  static bool need_to_wait(struct kvm_vcpu *vcpu)
>  {
>  	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
>  
> -	return ivcpu->reply_waiting;
> +	return ivcpu->reply_waiting || need_to_wait_for_ss(vcpu);
>  }
>  
>  static bool done_waiting(struct kvm_vcpu *vcpu)
> @@ -1572,6 +1580,141 @@ int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu, bool wait)
>  	return 0;
>  }
>  
> +void kvmi_stop_ss(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +	struct kvm *kvm = vcpu->kvm;
> +	struct kvmi *ikvm;
> +	int i;
> +
> +	ikvm = kvmi_get(kvm);
> +	if (!ikvm)
> +		return;
> +
> +	if (unlikely(!ivcpu->ss_owner)) {
> +		kvmi_warn(ikvm, "%s\n", __func__);
> +		goto out;
> +	}
> +
> +	for (i = ikvm->ss_level; i--;)
> +		kvmi_set_gfn_access(kvm,
> +				    ikvm->ss_context[i].gfn,
> +				    ikvm->ss_context[i].old_access,
> +				    ikvm->ss_context[i].old_write_bitmap);
> +
> +	ikvm->ss_level = 0;
> +
> +	kvmi_arch_stop_single_step(vcpu);
> +
> +	atomic_set(&ikvm->ss_active, false);
> +	/*
> +	 * Make ss_active update visible
> +	 * before resuming all the other vCPUs.
> +	 */
> +	smp_mb__after_atomic();
> +	kvm_make_all_cpus_request(kvm, 0);
> +
> +	ivcpu->ss_owner = false;
> +
> +out:
> +	kvmi_put(kvm);
> +}
> +EXPORT_SYMBOL(kvmi_stop_ss);
> +
> +static bool kvmi_acquire_ss(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +	struct kvmi *ikvm = IKVM(vcpu->kvm);
> +
> +	if (ivcpu->ss_owner)
> +		return true;
> +
> +	if (atomic_cmpxchg(&ikvm->ss_active, false, true) != false)
> +		return false;
> +
> +	kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_INTROSPECTION |
> +						KVM_REQUEST_WAIT);
> +
> +	ivcpu->ss_owner = true;
> +
> +	return true;
> +}
> +
> +static bool kvmi_run_ss(struct kvm_vcpu *vcpu, gpa_t gpa, u8 access)
> +{
> +	struct kvmi *ikvm = IKVM(vcpu->kvm);
> +	u8 old_access, new_access;
> +	u32 old_write_bitmap;
> +	gfn_t gfn = gpa_to_gfn(gpa);
> +	int err;
> +
> +	kvmi_arch_start_single_step(vcpu);
> +
> +	err = kvmi_get_gfn_access(ikvm, gfn, &old_access, &old_write_bitmap);
> +	/* likely was removed from radix tree due to rwx */
> +	if (err) {
> +		kvmi_warn(ikvm, "%s: gfn 0x%llx not found in the radix tree\n",
> +			  __func__, gfn);
> +		return true;
> +	}
> +
> +	if (ikvm->ss_level == SINGLE_STEP_MAX_DEPTH - 1) {
> +		kvmi_err(ikvm, "single step limit reached\n");
> +		return false;
> +	}
> +
> +	ikvm->ss_context[ikvm->ss_level].gfn = gfn;
> +	ikvm->ss_context[ikvm->ss_level].old_access = old_access;
> +	ikvm->ss_context[ikvm->ss_level].old_write_bitmap = old_write_bitmap;
> +	ikvm->ss_level++;
> +
> +	new_access = kvmi_arch_relax_page_access(old_access, access);
> +
> +	kvmi_set_gfn_access(vcpu->kvm, gfn, new_access, old_write_bitmap);
> +
> +	return true;
> +}
> +
> +bool kvmi_start_ss(struct kvm_vcpu *vcpu, gpa_t gpa, u8 access)
> +{
> +	bool ret = false;
> +
> +	while (!kvmi_acquire_ss(vcpu)) {
> +		int err = kvmi_run_jobs_and_wait(vcpu);
> +
> +		if (err) {
> +			kvmi_err(IKVM(vcpu->kvm), "kvmi_acquire_ss() has failed\n");
> +			goto out;
> +		}
> +	}
> +
> +	if (kvmi_run_ss(vcpu, gpa, access))
> +		ret = true;
> +	else
> +		kvmi_stop_ss(vcpu);
> +
> +out:
> +	return ret;
> +}
> +
> +bool kvmi_vcpu_enabled_ss(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +	struct kvmi *ikvm;
> +	bool ret;
> +
> +	ikvm = kvmi_get(vcpu->kvm);
> +	if (!ikvm)
> +		return false;
> +
> +	ret = ivcpu->ss_owner;
> +
> +	kvmi_put(vcpu->kvm);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(kvmi_vcpu_enabled_ss);
> +
>  static void kvmi_job_abort(struct kvm_vcpu *vcpu, void *ctx)
>  {
>  	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
> index d7f9858d3e97..1550fe33ed48 100644
> --- a/virt/kvm/kvmi_int.h
> +++ b/virt/kvm/kvmi_int.h
> @@ -126,6 +126,9 @@ struct kvmi_vcpu {
>  		DECLARE_BITMAP(high, KVMI_NUM_MSR);
>  	} msr_mask;
>  
> +	bool ss_owner;

Why is single-stepping mutually exclusive across all vCPUs?  Does that
always have to be the case?

> +	bool ss_requested;
> +
>  	struct list_head job_list;
>  	spinlock_t job_lock;
>  
> @@ -151,6 +154,15 @@ struct kvmi {
>  	DECLARE_BITMAP(event_allow_mask, KVMI_NUM_EVENTS);
>  	DECLARE_BITMAP(vm_ev_mask, KVMI_NUM_EVENTS);
>  
> +#define SINGLE_STEP_MAX_DEPTH 8
> +	struct {
> +		gfn_t gfn;
> +		u8 old_access;
> +		u32 old_write_bitmap;
> +	} ss_context[SINGLE_STEP_MAX_DEPTH];
> +	u8 ss_level;
> +	atomic_t ss_active;

Good opportunity for an unnamed struct, e.g.

	struct {
		struct single_step_context[...];
		bool owner;
		bool requested;
		u8 level
		atomic_t active;
	} single_step;

> +
>  	struct {
>  		bool initialized;
>  		atomic_t enabled;
> @@ -224,6 +236,7 @@ int kvmi_add_job(struct kvm_vcpu *vcpu,
>  		 void *ctx, void (*free_fct)(void *ctx));
>  void kvmi_handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action,
>  				      const char *str);
> +bool kvmi_start_ss(struct kvm_vcpu *vcpu, gpa_t gpa, u8 access);
>  
>  /* arch */
>  void kvmi_arch_update_page_tracking(struct kvm *kvm,
> @@ -274,6 +287,9 @@ int kvmi_arch_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
>  				   u64 address);
>  int kvmi_arch_cmd_control_cr(struct kvm_vcpu *vcpu,
>  			     const struct kvmi_control_cr *req);
> +void kvmi_arch_start_single_step(struct kvm_vcpu *vcpu);
> +void kvmi_arch_stop_single_step(struct kvm_vcpu *vcpu);
> +u8 kvmi_arch_relax_page_access(u8 old, u8 new);
>  int kvmi_arch_cmd_control_msr(struct kvm_vcpu *vcpu,
>  			      const struct kvmi_control_msr *req);
>  int kvmi_arch_cmd_get_mtrr_type(struct kvm_vcpu *vcpu, u64 gpa, u8 *type);

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR
  2019-08-09 16:00 ` [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR Adalbert Lazăr
@ 2019-08-12 21:05   ` Sean Christopherson
  2019-08-15  6:36     ` Nicusor CITU
  2019-08-19 18:52   ` Sean Christopherson
  1 sibling, 1 reply; 158+ messages in thread
From: Sean Christopherson @ 2019-08-12 21:05 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu

On Fri, Aug 09, 2019 at 07:00:10PM +0300, Adalbert Lazăr wrote:
> From: Mihai Donțu <mdontu@bitdefender.com>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 22f08f2732cc..91cd43a7a7bf 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1013,6 +1013,8 @@ struct kvm_x86_ops {
>  	bool (*has_emulated_msr)(int index);
>  	void (*cpuid_update)(struct kvm_vcpu *vcpu);
>  
> +	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr,
> +				bool enable);

This should be toggle_wrmsr_intercept(), or toggle_msr_intercept() with
a paramter to control RDMSR vs. WRMSR.

>  	void (*cr3_write_exiting)(struct kvm_vcpu *vcpu, bool enable);
>  	bool (*nested_pagefault)(struct kvm_vcpu *vcpu);
>  	bool (*spt_fault)(struct kvm_vcpu *vcpu);
> @@ -1621,6 +1623,8 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
>  #define put_smstate(type, buf, offset, val)                      \
>  	*(type *)((buf) + (offset) - 0x7e00) = val
>  
> +void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
> +				bool enable);
>  bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu);
>  bool kvm_spt_fault(struct kvm_vcpu *vcpu);
>  void kvm_control_cr3_write_exiting(struct kvm_vcpu *vcpu, bool enable);
> diff --git a/arch/x86/include/asm/kvmi_host.h b/arch/x86/include/asm/kvmi_host.h
> index 83a098dc8939..8285d1eb0db6 100644

...

> diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
> index b3cab0db6a70..5dba4f87afef 100644
> --- a/arch/x86/kvm/kvmi.c
> +++ b/arch/x86/kvm/kvmi.c
> @@ -9,6 +9,133 @@
>  #include <asm/vmx.h>
>  #include "../../../virt/kvm/kvmi_int.h"
>  
> +static unsigned long *msr_mask(struct kvm_vcpu *vcpu, unsigned int *msr)
> +{
> +	switch (*msr) {
> +	case 0 ... 0x1fff:
> +		return IVCPU(vcpu)->msr_mask.low;
> +	case 0xc0000000 ... 0xc0001fff:
> +		*msr &= 0x1fff;
> +		return IVCPU(vcpu)->msr_mask.high;
> +	}
> +
> +	return NULL;
> +}

...

> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 6450c8c44771..0306c7ef3158 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7784,6 +7784,15 @@ static __exit void hardware_unsetup(void)
>  	free_kvm_area();
>  }
>  
> +static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
> +			      bool enable)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
> +
> +	vmx_set_intercept_for_msr(msr_bitmap, msr, MSR_TYPE_W, enable);
> +}

Unless I overlooked a check, this will allow userspace to disable WRMSR
interception for any MSR in the above range, i.e. userspace can use KVM
to gain full write access to pretty much all the interesting MSRs.  This
needs to only disable interception if KVM had interception disabled before
introspection started modifying state.

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 00/92] VM introspection
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (92 preceding siblings ...)
  2019-08-12 18:23 ` [RFC PATCH v6 00/92] VM introspection Sean Christopherson
@ 2019-08-12 21:40 ` Sean Christopherson
  2019-08-13  9:34 ` Paolo Bonzini
  94 siblings, 0 replies; 158+ messages in thread
From: Sean Christopherson @ 2019-08-12 21:40 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu

On Fri, Aug 09, 2019 at 06:59:15PM +0300, Adalbert Lazăr wrote:
>  55 files changed, 13485 insertions(+), 225 deletions(-)

The size of this series is overwhelming, to say the least.  The remote
pages concept and SPP patches on their own would be hefty series to review.

It would be very helpful to reviewers to reorder the patches and only send
the bits that are absolutely mandatory for initial support.  For example,
AFAICT the SPP support and remote pages concept are largely performance
related and not functionally required.

Note, this wouldn't prevent you from carrying the series in its entirety
in your own branches.

Possible reordering:

  - Bug fixes (if any patches qualify as such)
  - Emulator changes
  - KVM preparatory patches
  - Basic instrospection functionality

------>8-------- cut the series here

  - Optional introspection functionality (if there is any)
  - SPP and introspection integration
  - Remote pages and introspection integration

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 26/92] kvm: x86: add kvm_mmu_nested_pagefault()
  2019-08-09 15:59 ` [RFC PATCH v6 26/92] kvm: x86: add kvm_mmu_nested_pagefault() Adalbert Lazăr
@ 2019-08-13  8:12   ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  8:12 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu,
	Nicușor Cîțu

On 09/08/19 17:59, Adalbert Lazăr wrote:
> +static bool vmx_nested_pagefault(struct kvm_vcpu *vcpu)
> +{
> +	if (vcpu->arch.exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)
> +		return false;
> +	return true;
> +}
> +

This hook is misnamed; it has nothing to do with nested virtualization.
 Rather, it returns true if it the failure happened while translating
the address of a guest page table.

SVM makes the same information available in EXITINFO[33].

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 14/92] kvm: introspection: handle introspection commands before returning to guest
  2019-08-09 15:59 ` [RFC PATCH v6 14/92] kvm: introspection: handle introspection commands before returning to guest Adalbert Lazăr
@ 2019-08-13  8:26   ` Paolo Bonzini
       [not found]     ` <5d52c10e.1c69fb81.26904.fd34SMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 1 reply; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  8:26 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu,
	Mircea Cîrjaliu

On 09/08/19 17:59, Adalbert Lazăr wrote:
> +			prepare_to_swait_exclusive(&vcpu->wq, &wait,
> +						   TASK_INTERRUPTIBLE);
> +
> +			if (kvm_vcpu_check_block(vcpu) < 0)
> +				break;
> +
> +			waited = true;
> +			schedule();
> +
> +			if (kvm_check_request(KVM_REQ_INTROSPECTION, vcpu)) {
> +				do_kvmi_work = true;
> +				break;
> +			}
> +		}
>  
> -		waited = true;
> -		schedule();
> +		finish_swait(&vcpu->wq, &wait);
> +
> +		if (do_kvmi_work)
> +			kvmi_handle_requests(vcpu);
> +		else
> +			break;
>  	}

Is this needed?  Or can it just go back to KVM_RUN and handle
KVM_REQ_INTROSPECTION there (in which case it would be basically
premature optimization)?

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 13/92] kvm: introspection: make the vCPU wait even when its jobs list is empty
  2019-08-09 15:59 ` [RFC PATCH v6 13/92] kvm: introspection: make the vCPU wait even when its jobs list is empty Adalbert Lazăr
@ 2019-08-13  8:43   ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  8:43 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu

On 09/08/19 17:59, Adalbert Lazăr wrote:
> +void kvmi_handle_requests(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi *ikvm;
> +
> +	ikvm = kvmi_get(vcpu->kvm);
> +	if (!ikvm)
> +		return;
> +
> +	for (;;) {
> +		int err = kvmi_run_jobs_and_wait(vcpu);
> +
> +		if (err)
> +			break;
> +	}
> +
> +	kvmi_put(vcpu->kvm);
> +}
> +

Using kvmi_run_jobs_and_wait from two places (here and kvmi_send_event)
is very confusing.  Does kvmi_handle_requests need to do this, or can it
just use kvmi_run_jobs?

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 02/92] kvm: introspection: add basic ioctls (hook/unhook)
  2019-08-09 15:59 ` [RFC PATCH v6 02/92] kvm: introspection: add basic ioctls (hook/unhook) Adalbert Lazăr
@ 2019-08-13  8:44   ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  8:44 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu,
	Mircea Cîrjaliu

On 09/08/19 17:59, Adalbert Lazăr wrote:
> +static int kvmi_recv(void *arg)
> +{
> +	struct kvmi *ikvm = arg;
> +
> +	kvmi_info(ikvm, "Hooking VM\n");
> +
> +	while (kvmi_msg_process(ikvm))
> +		;
> +
> +	kvmi_info(ikvm, "Unhooking VM\n");
> +
> +	kvmi_end_introspection(ikvm);
> +
> +	return 0;
> +}
> +

Rename this to kvmi_recv_thread instead, please.

> +
> +	/*
> +	 * Make sure all the KVM/KVMI structures are linked and no pointer
> +	 * is read as NULL after the reference count has been set.
> +	 */
> +	smp_mb__before_atomic();

This is an smp_wmb(), not an smp_mb__before_atomic().  Add a comment
that it pairs with the refcount_inc_not_zero in kvmi_get.

> +	refcount_set(&kvm->kvmi_ref, 1);
> +


> @@ -57,8 +183,27 @@ void kvmi_destroy_vm(struct kvm *kvm)
>  	if (!ikvm)
>  		return;
>  
> +	/* trigger socket shutdown - kvmi_recv() will start shutdown process */
> +	kvmi_sock_shutdown(ikvm);
> +
>  	kvmi_put(kvm);
>  
>  	/* wait for introspection resources to be released */
>  	wait_for_completion_killable(&kvm->kvmi_completed);
>  }
> +

This addition means that kvmi_destroy_vm should have called
kvmi_end_introspection instead.  In patch 1, kvmi_end_introspection
should have been just kvmi_put, now this patch can add kvmi_sock_shutdown.

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 75/92] kvm: x86: disable gpa_available optimization in emulator_read_write_onepage()
  2019-08-09 16:00 ` [RFC PATCH v6 75/92] kvm: x86: disable gpa_available optimization in emulator_read_write_onepage() Adalbert Lazăr
@ 2019-08-13  8:47   ` Paolo Bonzini
       [not found]     ` <5d52ca22.1c69fb81.4ceb8.e90bSMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 1 reply; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  8:47 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu

On 09/08/19 18:00, Adalbert Lazăr wrote:
> If the EPT violation was caused by an execute restriction imposed by the
> introspection tool, gpa_available will point to the instruction pointer,
> not the to the read/write location that has to be used to emulate the
> current instruction.
> 
> This optimization should be disabled only when the VM is introspected,
> not just because the introspection subsystem is present.
> 
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>

The right thing to do is to not set gpa_available for fetch failures in 
kvm_mmu_page_fault instead:

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..1bdca40fa831 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5364,8 +5364,12 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 	enum emulation_result er;
 	bool direct = vcpu->arch.mmu->direct_map;
 
-	/* With shadow page tables, fault_address contains a GVA or nGPA.  */
-	if (vcpu->arch.mmu->direct_map) {
+	/*
+	 * With shadow page tables, fault_address contains a GVA or nGPA.
+	 * On a fetch fault, fault_address contains the instruction pointer.
+	 */
+	if (vcpu->arch.mmu->direct_map &&
+	    likely(!(error_code & PFERR_FETCH_MASK)) {
 		vcpu->arch.gpa_available = true;
 		vcpu->arch.gpa_val = cr2;
 	}


Paolo

> ---
>  arch/x86/kvm/x86.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 965c4f0108eb..3975331230b9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5532,7 +5532,7 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
>  	 * operation using rep will only have the initial GPA from the NPF
>  	 * occurred.
>  	 */
> -	if (vcpu->arch.gpa_available &&
> +	if (vcpu->arch.gpa_available && !kvmi_is_present() &&
>  	    emulator_can_use_gpa(ctxt) &&
>  	    (addr & ~PAGE_MASK) == (vcpu->arch.gpa_val & ~PAGE_MASK)) {
>  		gpa = vcpu->arch.gpa_val;
> 


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 16/92] kvm: introspection: handle events and event replies
  2019-08-09 15:59 ` [RFC PATCH v6 16/92] kvm: introspection: handle events and event replies Adalbert Lazăr
@ 2019-08-13  8:55   ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  8:55 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu

On 09/08/19 17:59, Adalbert Lazăr wrote:
> 
> +			 reply->padding2);
> +
> +	ivcpu->reply_waiting = false;
> +	return expected->error;
> +}
> +
>  /*

Is this missing a wakeup?

>  
> +static bool need_to_wait(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +
> +	return ivcpu->reply_waiting;
> +}
> +

Do you actually need this function?  It seems to me that everywhere you
call it you already have an ivcpu, so you can just access the field.

Also, "reply_waiting" means "there is a reply that is waiting".  What
you mean is "waiting_for_reply".

The overall structure of the jobs code is confusing.  The same function
kvm_run_jobs_and_wait is an infinite loop before and gets a "break"
later.  It is also not clear why kvmi_job_wait is called through a job.
 Can you have instead just kvm_run_jobs in KVM_REQ_INTROSPECTION, and
something like this instead when sending an event:

int kvmi_wait_for_reply(struct kvm_vcpu *vcpu)
{
	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);

	while (ivcpu->waiting_for_reply) {
		kvmi_run_jobs(vcpu);

		err = swait_event_killable(*wq,
				!ivcpu->waiting_for_reply ||
				!list_empty(&ivcpu->job_list));

		if (err)
			return -EINTR;
	}

	return 0;
}

?

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 27/92] kvm: introspection: use page track
  2019-08-09 15:59 ` [RFC PATCH v6 27/92] kvm: introspection: use page track Adalbert Lazăr
@ 2019-08-13  9:06   ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  9:06 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu,
	Nicușor Cîțu, Marian Rotariu

On 09/08/19 17:59, Adalbert Lazăr wrote:
> +
> +	/*
> +	 * This function uses kvm->mmu_lock so it's not allowed to be
> +	 * called under kvmi_put(). It can reach a deadlock if called
> +	 * from kvm_mmu_load -> kvmi_tracked_gfn -> kvmi_put.
> +	 */
> +	kvmi_clear_mem_access(kvm);

kvmi_tracked_gfn does not exist yet.

More in general, this comment says why you are calling this here, but it
says nothing about the split of responsibility between
kvmi_end_introspection and kvmi_release.  Please add a comment for this
as soon as you add kvmi_end_introspection (which according to my earlier
review should be patch 1).

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 70/92] kvm: x86: filter out access rights only when tracked by the introspection tool
  2019-08-09 16:00 ` [RFC PATCH v6 70/92] kvm: x86: filter out access rights only when " Adalbert Lazăr
@ 2019-08-13  9:08   ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  9:08 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu

On 09/08/19 18:00, Adalbert Lazăr wrote:
> It should complete the commit fd34a9518173 ("kvm: x86: consult the page tracking from kvm_mmu_get_page() and __direct_map()")
> 
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> ---
>  arch/x86/kvm/mmu.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 65b6acba82da..fd64cf1115da 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2660,6 +2660,9 @@ static void clear_sp_write_flooding_count(u64 *spte)
>  static unsigned int kvm_mmu_page_track_acc(struct kvm_vcpu *vcpu, gfn_t gfn,
>  					   unsigned int acc)
>  {
> +	if (!kvmi_tracked_gfn(vcpu, gfn))
> +		return acc;
> +
>  	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREREAD))
>  		acc &= ~ACC_USER_MASK;
>  	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) ||
> 

If this patch is always needed, then the function should be named
something like kvm_mmu_apply_introspection_access and kvmi_tracked_gfn
should be tested from the moment it is introduced.

But the commit message says nothing about _why_ it is needed, so I
cannot guess.  I would very much avoid it however.  Is it just an
optimization?

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem)
  2019-08-12 20:20   ` Sean Christopherson
@ 2019-08-13  9:11     ` Paolo Bonzini
       [not found]     ` <5d52a5ae.1c69fb81.5c260.1573SMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  9:11 UTC (permalink / raw)
  To: Sean Christopherson, Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Mircea Cîrjaliu

On 12/08/19 22:20, Sean Christopherson wrote:
> The refcounting approach seems a bit backwards, and AFAICT is driven by
> implementing unhook via a message, which also seems backwards.  I assume
> hook and unhook are relatively rare events and not performance critical,
> so make those the restricted/slow flows, e.g. force userspace to quiesce
> the VM by making unhook() mutually exclusive with every vcpu ioctl() and
> maybe anything that takes kvm->lock. 

The reason for the unhook event, as far as I understand, is because the
introspection appliance can poke int3 into the guest and needs an
opportunity to undo that.

I don't have a big problem with that and the refcounting, at least for
this first iteration---it can be tackled later, once the general event
loop is simplified---however I agree with the other comments that Sean
made.  Fortunately it should not be hard to apply them to the whole
patchset with search and replace on the patches themselves.

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 06/92] kvm: introspection: add KVMI_CONTROL_CMD_RESPONSE
  2019-08-09 15:59 ` [RFC PATCH v6 06/92] kvm: introspection: add KVMI_CONTROL_CMD_RESPONSE Adalbert Lazăr
@ 2019-08-13  9:15   ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  9:15 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu

On 09/08/19 17:59, Adalbert Lazăr wrote:
> +If `now` is 1, the command reply is enabled/disabled (according to
> +`enable`) starting with the current command. For example, `enable=0`
> +and `now=1` means that the reply is disabled for this command too,
> +while `enable=0` and `now=0` means that a reply will be send for this
> +command, but not for the next ones (until enabled back with another
> +*KVMI_CONTROL_CMD_RESPONSE*).
> +
> +This command is used by the introspection tool to disable the replies
> +for commands returning an error code only (eg. *KVMI_SET_REGISTERS*)
> +when an error is less likely to happen. For example, the following
> +commands can be used to reply to an event with a single `write()` call:
> +
> +	KVMI_CONTROL_CMD_RESPONSE enable=0 now=1
> +	KVMI_SET_REGISTERS vcpu=N
> +	KVMI_EVENT_REPLY   vcpu=N
> +	KVMI_CONTROL_CMD_RESPONSE enable=1 now=0

I don't understand the usage.  Is there any case where you want now == 1
actually?  Can you just say that KVMI_CONTROL_CMD_RESPONSE never has a
reply, or to make now==enable?

> +	if (err)
> +		kvmi_warn(ikvm, "Error code %d discarded for message id %d\n",
> +			  err, msg->id);
> +

Would it make sense to even close the socket if there is an error?

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 07/92] kvm: introspection: honor the reply option when handling the KVMI_GET_VERSION command
  2019-08-09 15:59 ` [RFC PATCH v6 07/92] kvm: introspection: honor the reply option when handling the KVMI_GET_VERSION command Adalbert Lazăr
@ 2019-08-13  9:16   ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  9:16 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu

On 09/08/19 17:59, Adalbert Lazăr wrote:
> Obviously, the KVMI_GET_VERSION command must not be used when the command
> reply is disabled by a previous KVMI_CONTROL_CMD_RESPONSE command.
> 
> This commit changes the code path in order to check the reply option
> (enabled/disabled) before trying to reply to this command. If the command
> reply is disabled it will return an error to the caller. In the end, the
> receiving worker will finish and the introspection socket will be closed.
> 
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> ---
>  virt/kvm/kvmi_msg.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
> index ea5c7e23669a..2237a6ed25f6 100644
> --- a/virt/kvm/kvmi_msg.c
> +++ b/virt/kvm/kvmi_msg.c
> @@ -169,7 +169,7 @@ static int handle_get_version(struct kvmi *ikvm,
>  	memset(&rpl, 0, sizeof(rpl));
>  	rpl.version = KVMI_VERSION;
>  
> -	return kvmi_msg_vm_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
> +	return kvmi_msg_vm_maybe_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
>  }
>  
>  static bool is_command_allowed(struct kvmi *ikvm, int id)
> 

Go ahead and squash this in the previous patch.

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 79/92] kvm: x86: emulate movsd xmm, m64
  2019-08-09 16:00 ` [RFC PATCH v6 79/92] kvm: x86: emulate movsd xmm, m64 Adalbert Lazăr
@ 2019-08-13  9:17   ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  9:17 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu

On 09/08/19 18:00, Adalbert Lazăr wrote:
> From: Mihai Donțu <mdontu@bitdefender.com>
> 
> This is needed in order to be able to support guest code that uses movsd to
> write into pages that are marked for write tracking.
> 
> Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> ---
>  arch/x86/kvm/emulate.c | 32 +++++++++++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 34431cf31f74..9d38f892beea 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -1177,6 +1177,27 @@ static int em_fnstsw(struct x86_emulate_ctxt *ctxt)
>  	return X86EMUL_CONTINUE;
>  }
>  
> +static u8 simd_prefix_to_bytes(const struct x86_emulate_ctxt *ctxt,
> +			       int simd_prefix)
> +{
> +	u8 bytes;
> +
> +	switch (ctxt->b) {
> +	case 0x11:
> +		/* movsd xmm, m64 */
> +		/* movups xmm, m128 */
> +		if (simd_prefix == 0xf2) {
> +			bytes = 8;
> +			break;
> +		}
> +		/* fallthrough */
> +	default:
> +		bytes = 16;
> +		break;
> +	}
> +	return bytes;
> +}
> +
>  static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
>  				    struct operand *op)
>  {
> @@ -1187,7 +1208,7 @@ static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
>  
>  	if (ctxt->d & Sse) {
>  		op->type = OP_XMM;
> -		op->bytes = 16;
> +		op->bytes = ctxt->op_bytes;
>  		op->addr.xmm = reg;
>  		read_sse_reg(ctxt, &op->vec_val, reg);
>  		return;
> @@ -1238,7 +1259,7 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
>  				ctxt->d & ByteOp);
>  		if (ctxt->d & Sse) {
>  			op->type = OP_XMM;
> -			op->bytes = 16;
> +			op->bytes = ctxt->op_bytes;
>  			op->addr.xmm = ctxt->modrm_rm;
>  			read_sse_reg(ctxt, &op->vec_val, ctxt->modrm_rm);
>  			return rc;
> @@ -4529,7 +4550,7 @@ static const struct gprefix pfx_0f_2b = {
>  };
>  
>  static const struct gprefix pfx_0f_10_0f_11 = {
> -	I(Unaligned, em_mov), I(Unaligned, em_mov), N, N,
> +	I(Unaligned, em_mov), I(Unaligned, em_mov), I(Unaligned, em_mov), N,
>  };
>  
>  static const struct gprefix pfx_0f_28_0f_29 = {
> @@ -5097,7 +5118,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
>  {
>  	int rc = X86EMUL_CONTINUE;
>  	int mode = ctxt->mode;
> -	int def_op_bytes, def_ad_bytes, goffset, simd_prefix;
> +	int def_op_bytes, def_ad_bytes, goffset, simd_prefix = 0;
>  	bool op_prefix = false;
>  	bool has_seg_override = false;
>  	struct opcode opcode;
> @@ -5320,7 +5341,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
>  			ctxt->op_bytes = 4;
>  
>  		if (ctxt->d & Sse)
> -			ctxt->op_bytes = 16;
> +			ctxt->op_bytes = simd_prefix_to_bytes(ctxt,
> +							      simd_prefix);
>  		else if (ctxt->d & Mmx)
>  			ctxt->op_bytes = 8;
>  	}
> 

Please submit all these emulator patches as a separate series, complete
with testcases for kvm-unit-tests.

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 76/92] kvm: x86: disable EPT A/D bits if introspection is present
  2019-08-09 16:00 ` [RFC PATCH v6 76/92] kvm: x86: disable EPT A/D bits if introspection is present Adalbert Lazăr
@ 2019-08-13  9:18   ` Paolo Bonzini
       [not found]     ` <0550f8d65bb97486e98d88255ea45d490da6b802.camel@bitdefender.com>
  0 siblings, 1 reply; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  9:18 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu

On 09/08/19 18:00, Adalbert Lazăr wrote:
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index dc648ba47df3..152c58b63f69 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7718,7 +7718,7 @@ static __init int hardware_setup(void)
>  	    !cpu_has_vmx_invept_global())
>  		enable_ept = 0;
>  
> -	if (!cpu_has_vmx_ept_ad_bits() || !enable_ept)
> +	if (!cpu_has_vmx_ept_ad_bits() || !enable_ept || kvmi_is_present())
>  		enable_ept_ad_bits = 0;
>  
>  	if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)
> 

Why?

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 74/92] kvm: x86: do not unconditionally patch the hypercall instruction during emulation
  2019-08-09 16:00 ` [RFC PATCH v6 74/92] kvm: x86: do not unconditionally patch the hypercall instruction during emulation Adalbert Lazăr
@ 2019-08-13  9:20   ` Paolo Bonzini
       [not found]     ` <5d53f965.1c69fb81.cd952.035bSMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 1 reply; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  9:20 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu, Joerg Roedel

On 09/08/19 18:00, Adalbert Lazăr wrote:
> From: Mihai Donțu <mdontu@bitdefender.com>
> 
> It can happened for us to end up emulating the VMCALL instruction as a
> result of the handling of an EPT write fault. In this situation, the
> emulator will try to unconditionally patch the correct hypercall opcode
> bytes using emulator_write_emulated(). However, this last call uses the
> fault GPA (if available) or walks the guest page tables at RIP,
> otherwise. The trouble begins when using KVMI, when we forbid the use of
> the fault GPA and fallback to the guest pt walk: in Windows (8.1 and
> newer) the page that we try to write into is marked read-execute and as
> such emulator_write_emulated() fails and we inject a write #PF, leading
> to a guest crash.
> 
> The fix is rather simple: check the existing instruction bytes before
> doing the patching. This does not change the normal KVM behaviour, but
> does help when using KVMI as we no longer inject a write #PF.

Fixing the hypercall is just an optimization.  Can we just hush and
return to the guest if emulator_write_emulated returns
X86EMUL_PROPAGATE_FAULT?

Paolo

> CC: Joerg Roedel <joro@8bytes.org>
> Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> ---
>  arch/x86/kvm/x86.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 04b1d2916a0a..965c4f0108eb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7363,16 +7363,33 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
>  
> +#define KVM_HYPERCALL_INSN_LEN 3
> +
>  static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
>  {
> +	int err;
>  	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
> -	char instruction[3];
> +	char buf[KVM_HYPERCALL_INSN_LEN];
> +	char instruction[KVM_HYPERCALL_INSN_LEN];
>  	unsigned long rip = kvm_rip_read(vcpu);
>  
> +	err = emulator_read_emulated(ctxt, rip, buf, sizeof(buf),
> +				     &ctxt->exception);
> +	if (err != X86EMUL_CONTINUE)
> +		return err;
> +
>  	kvm_x86_ops->patch_hypercall(vcpu, instruction);
> +	if (!memcmp(instruction, buf, sizeof(instruction)))
> +		/*
> +		 * The hypercall instruction is the correct one. Retry
> +		 * its execution maybe we got here as a result of an
> +		 * event other than #UD which has been resolved in the
> +		 * mean time.
> +		 */
> +		return X86EMUL_CONTINUE;
>  
> -	return emulator_write_emulated(ctxt, rip, instruction, 3,
> -		&ctxt->exception);
> +	return emulator_write_emulated(ctxt, rip, instruction,
> +				       sizeof(instruction), &ctxt->exception);
>  }
>  
>  static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu)
> 


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-08-09 16:24   ` DANGER WILL ROBINSON, DANGER Matthew Wilcox
@ 2019-08-13  9:29     ` Paolo Bonzini
  2019-08-13 11:24       ` Matthew Wilcox
       [not found]     ` <1565694095.D172a51.28640.@15f23d3a749365d981e968181cce585d2dcb3ffa>
  1 sibling, 1 reply; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  9:29 UTC (permalink / raw)
  To: Matthew Wilcox, Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Mircea Cîrjaliu

On 09/08/19 18:24, Matthew Wilcox wrote:
> On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
>> +++ b/include/linux/page-flags.h
>> @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
>>   */
>>  #define PAGE_MAPPING_ANON	0x1
>>  #define PAGE_MAPPING_MOVABLE	0x2
>> +#define PAGE_MAPPING_REMOTE	0x4
> Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> Who's guaranteeing that?
> 
> This is an awfully big patch to the memory management code, buried in
> the middle of a gigantic series which almost guarantees nobody would
> look at it.  I call shenanigans.

Are you calling shenanigans on the patch submitter (which is gratuitous)
or on the KVM maintainers/reviewers?

It's not true that nobody would look at it.  Of course no one from
linux-mm is going to look at it, but the maintainer that looks at the
gigantic series is very much expected to look at it and explain to the
submitter that this patch is unacceptable as is.

In fact I shouldn't have to to explain this to you; you know better than
believing that I would try to sneak it past the mm folks.  I am puzzled.

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 00/92] VM introspection
  2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
                   ` (93 preceding siblings ...)
  2019-08-12 21:40 ` Sean Christopherson
@ 2019-08-13  9:34 ` Paolo Bonzini
  94 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13  9:34 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu

On 09/08/19 17:59, Adalbert Lazăr wrote:
> 
> Patches 1-20: unroll a big part of the KVM introspection subsystem,
> sent in one patch in the previous versions.
> 
> Patches 21-24: extend the current page tracking code.
> 
> Patches 25-33: make use of page tracking to support the
> KVMI_SET_PAGE_ACCESS introspection command and the KVMI_EVENT_PF event
> (on EPT violations caused by the tracking settings).
> 
> Patches 34-42: include the SPP feature (Enable Sub-page
> Write Protection Support), already sent to KVM list:
> 
> 	https://lore.kernel.org/lkml/20190717133751.12910-1-weijiang.yang@intel.com/
> 
> Patches 43-46: add the commands needed to use SPP.
> 
> Patches 47-63: unroll almost all the rest of the introspection code.
> 
> Patches 64-67: add single-stepping, mostly as a way to overcome the
> unimplemented instructions, but also as a feature for the introspection
> tool.
> 
> Patches 68-70: cover more cases related to EPT violations.
> 
> Patches 71-73: add the remote mapping feature, allowing the introspection
> tool to map into its address space a page from guest memory.
> 
> Patches 74: add a fix to hypercall emulation.
> 
> Patches 75-76: disable some features/optimizations when the introspection
> code is present.
> 
> Patches 77-78: add trace functions for the introspection code and change
> some related to interrupts/exceptions injection.
> 
> Patches 79-92: new instruction for the x86 emulator, including cmpxchg
> fixes.

Thanks for the very good explanation.  Apart from the complicated flow
of KVM request handling and KVM reply, the main issue is the complete
lack of testcases.  There should be a kvmi_test in
tools/testing/selftests/kvm, and each patch adding a new ioctl or event
should add a new testcase.

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-08-13  9:29     ` Paolo Bonzini
@ 2019-08-13 11:24       ` Matthew Wilcox
  2019-08-13 12:02         ` Paolo Bonzini
  0 siblings, 1 reply; 158+ messages in thread
From: Matthew Wilcox @ 2019-08-13 11:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Adalbert Lazăr, kvm, linux-mm, virtualization,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Mircea Cîrjaliu

On Tue, Aug 13, 2019 at 11:29:07AM +0200, Paolo Bonzini wrote:
> On 09/08/19 18:24, Matthew Wilcox wrote:
> > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> >> +++ b/include/linux/page-flags.h
> >> @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> >>   */
> >>  #define PAGE_MAPPING_ANON	0x1
> >>  #define PAGE_MAPPING_MOVABLE	0x2
> >> +#define PAGE_MAPPING_REMOTE	0x4
> > Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> > Who's guaranteeing that?
> > 
> > This is an awfully big patch to the memory management code, buried in
> > the middle of a gigantic series which almost guarantees nobody would
> > look at it.  I call shenanigans.
> 
> Are you calling shenanigans on the patch submitter (which is gratuitous)
> or on the KVM maintainers/reviewers?

On the patch submitter, of course.  How can I possibly be criticising you
for something you didn't do?


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-08-13 11:24       ` Matthew Wilcox
@ 2019-08-13 12:02         ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13 12:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Adalbert Lazăr, kvm, linux-mm, virtualization,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Mircea Cîrjaliu

On 13/08/19 13:24, Matthew Wilcox wrote:
>>>
>>> This is an awfully big patch to the memory management code, buried in
>>> the middle of a gigantic series which almost guarantees nobody would
>>> look at it.  I call shenanigans.
>> Are you calling shenanigans on the patch submitter (which is gratuitous)
>> or on the KVM maintainers/reviewers?
>
> On the patch submitter, of course.  How can I possibly be criticising you
> for something you didn't do?

No idea.  "Nobody would look at it" definitely includes me though.

In any case, water under the bridge.  The submitter did duly mark the
series as RFC, I don't see anything wrong in what he did apart from not
having testcases. :)

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem)
       [not found]     ` <5d52a5ae.1c69fb81.5c260.1573SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2019-08-13 12:09       ` Paolo Bonzini
  2019-08-13 15:01         ` Sean Christopherson
  0 siblings, 1 reply; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13 12:09 UTC (permalink / raw)
  To: Adalbert Lazăr, Sean Christopherson
  Cc: kvm, linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Mircea Cîrjaliu

On 13/08/19 13:57, Adalbert Lazăr wrote:
>> The refcounting approach seems a bit backwards, and AFAICT is driven by
>> implementing unhook via a message, which also seems backwards.  I assume
>> hook and unhook are relatively rare events and not performance critical,
>> so make those the restricted/slow flows, e.g. force userspace to quiesce
>> the VM by making unhook() mutually exclusive with every vcpu ioctl() and
>> maybe anything that takes kvm->lock. 
>>
>> Then kvmi_ioctl_unhook() can use thread_stop() and kvmi_recv() just needs
>> to check kthread_should_stop().
>>
>> That way kvmi doesn't need to be refcounted since it's guaranteed to be
>> alive if the pointer is non-null.  Eliminating the refcounting will clean
>> up a lot of the code by eliminating calls to kvmi_{get,put}(), e.g.
>> wrappers like kvmi_breakpoint_event() just check vcpu->kvmi, or maybe
>> even get dropped altogether.
> 
> The unhook event has been added to cover the following case: while the
> introspection tool runs in another VM, both VMs, the virtual appliance
> and the introspected VM, could be paused by the user. We needed a way
> to signal this to the introspection tool and give it time to unhook
> (the introspected VM has to run and execute the introspection commands
> during this phase). The receiving threads quits when the socket is closed
> (by QEMU or by the introspection tool).
> 
> It's a bit unclear how, but we'll try to get ride of the refcount object,
> which will remove a lot of code, indeed.

You can keep it for now.  It may become clearer how to fix it after the
event loop is cleaned up.

>>
>>> +void kvmi_create_vm(struct kvm *kvm)
>>> +{
>>> +	init_completion(&kvm->kvmi_completed);
>>> +	complete(&kvm->kvmi_completed);
>> Pretty sure you don't want to be calling complete() here.
> The intention was to stop the hooking ioctl until the VM is
> created. A better name for 'kvmi_completed' would have been
> 'ready_to_be_introspected', as kvmi_hook() will wait for it.
> 
> We'll see how we can get ride of the completion object.

The ioctls are not accessible while kvm_create_vm runs (only after
kvm_dev_ioctl_create_vm calls fd_install).  Even if it were, however,
you should have placed init_completion much earlier, otherwise
wait_for_completion would access uninitialized memory.

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 75/92] kvm: x86: disable gpa_available optimization in emulator_read_write_onepage()
       [not found]     ` <5d52ca22.1c69fb81.4ceb8.e90bSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2019-08-13 14:35       ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13 14:35 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu

On 13/08/19 16:33, Adalbert Lazăr wrote:
> On Tue, 13 Aug 2019 10:47:34 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 09/08/19 18:00, Adalbert Lazăr wrote:
>>> If the EPT violation was caused by an execute restriction imposed by the
>>> introspection tool, gpa_available will point to the instruction pointer,
>>> not the to the read/write location that has to be used to emulate the
>>> current instruction.
>>>
>>> This optimization should be disabled only when the VM is introspected,
>>> not just because the introspection subsystem is present.
>>>
>>> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
>>
>> The right thing to do is to not set gpa_available for fetch failures in 
>> kvm_mmu_page_fault instead:
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 24843cf49579..1bdca40fa831 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -5364,8 +5364,12 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
>>  	enum emulation_result er;
>>  	bool direct = vcpu->arch.mmu->direct_map;
>>  
>> -	/* With shadow page tables, fault_address contains a GVA or nGPA.  */
>> -	if (vcpu->arch.mmu->direct_map) {
>> +	/*
>> +	 * With shadow page tables, fault_address contains a GVA or nGPA.
>> +	 * On a fetch fault, fault_address contains the instruction pointer.
>> +	 */
>> +	if (vcpu->arch.mmu->direct_map &&
>> +	    likely(!(error_code & PFERR_FETCH_MASK)) {
>>  		vcpu->arch.gpa_available = true;
>>  		vcpu->arch.gpa_val = cr2;
>>  	}
>
> Sure, but I think we'll have to extend the check.
> 
> Searching the logs I've found:
> 
>     kvm/x86: re-translate broken translation that caused EPT violation
>     
>     Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> 
>  arch/x86/kvm/x86.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> /home/b/kvmi@9cad844~1/arch/x86/kvm/x86.c:4757,4762 - /home/b/kvmi@9cad844/arch/x86/kvm/x86.c:4757,4763
>   	 */
>   	if (vcpu->arch.gpa_available &&
>   	    emulator_can_use_gpa(ctxt) &&
> + 	    (vcpu->arch.error_code & PFERR_GUEST_FINAL_MASK) &&
>   	    (addr & ~PAGE_MASK) == (vcpu->arch.gpa_val & ~PAGE_MASK)) {
>   		gpa = vcpu->arch.gpa_val;
>   		ret = vcpu_is_mmio_gpa(vcpu, addr, gpa, write);
> 

Yes, adding that check makes sense as well (still in kvm_mmu_page_fault).

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 14/92] kvm: introspection: handle introspection commands before returning to guest
       [not found]     ` <5d52c10e.1c69fb81.26904.fd34SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2019-08-13 14:45       ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13 14:45 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu,
	Mircea Cîrjaliu

On 13/08/19 15:54, Adalbert Lazăr wrote:
>     Leaving kvm_vcpu_block() in order to handle a request such as 'pause',
>     would cause the vCPU to enter the guest when resumed. Most of the
>     time this does not appear to be an issue, but during early boot it
>     can happen for a non-boot vCPU to start executing code from areas that
>     first needed to be set up by vCPU #0.
>     
>     In a particular case, vCPU #1 executed code which resided in an area
>     not covered by a memslot, which caused an EPT violation that got
>     turned in mmu_set_spte() into a MMIO request that required emulation.
>     Unfortunatelly, the emulator tripped, exited to userspace and the VM
>     was aborted.

Okay, this makes sense.  Maybe you want to handle KVM_REQ_INTROSPECTION
in vcpu_run rather than vcpu_enter_guest?

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem)
  2019-08-13 12:09       ` Paolo Bonzini
@ 2019-08-13 15:01         ` Sean Christopherson
  2019-08-13 21:03           ` Paolo Bonzini
       [not found]           ` <5d53d8d1.1c69fb81.7d32.0bedSMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 2 replies; 158+ messages in thread
From: Sean Christopherson @ 2019-08-13 15:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Adalbert Lazăr, kvm, linux-mm, virtualization,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Mircea Cîrjaliu

On Tue, Aug 13, 2019 at 02:09:51PM +0200, Paolo Bonzini wrote:
> On 13/08/19 13:57, Adalbert Lazăr wrote:
> >> The refcounting approach seems a bit backwards, and AFAICT is driven by
> >> implementing unhook via a message, which also seems backwards.  I assume
> >> hook and unhook are relatively rare events and not performance critical,
> >> so make those the restricted/slow flows, e.g. force userspace to quiesce
> >> the VM by making unhook() mutually exclusive with every vcpu ioctl() and
> >> maybe anything that takes kvm->lock. 
> >>
> >> Then kvmi_ioctl_unhook() can use thread_stop() and kvmi_recv() just needs
> >> to check kthread_should_stop().
> >>
> >> That way kvmi doesn't need to be refcounted since it's guaranteed to be
> >> alive if the pointer is non-null.  Eliminating the refcounting will clean
> >> up a lot of the code by eliminating calls to kvmi_{get,put}(), e.g.
> >> wrappers like kvmi_breakpoint_event() just check vcpu->kvmi, or maybe
> >> even get dropped altogether.
> > 
> > The unhook event has been added to cover the following case: while the
> > introspection tool runs in another VM, both VMs, the virtual appliance
> > and the introspected VM, could be paused by the user. We needed a way
> > to signal this to the introspection tool and give it time to unhook
> > (the introspected VM has to run and execute the introspection commands
> > during this phase). The receiving threads quits when the socket is closed
> > (by QEMU or by the introspection tool).

Why does closing the socket require destroying the kvmi object?  E.g. can
it be marked as defunct or whatever and only fully removed on a synchronous
unhook from userspace?  Re-hooking could either require said unhook, or
maybe reuse the existing kvmi object with a new socket.

> > It's a bit unclear how, but we'll try to get ride of the refcount object,
> > which will remove a lot of code, indeed.
> 
> You can keep it for now.  It may become clearer how to fix it after the
> event loop is cleaned up.

By event loop, do you mean the per-vCPU jobs list?

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem)
  2019-08-13 15:01         ` Sean Christopherson
@ 2019-08-13 21:03           ` Paolo Bonzini
       [not found]           ` <5d53d8d1.1c69fb81.7d32.0bedSMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13 21:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Adalbert Lazăr, kvm, linux-mm, virtualization,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Mircea Cîrjaliu

On 13/08/19 17:01, Sean Christopherson wrote:
>>> It's a bit unclear how, but we'll try to get ride of the refcount object,
>>> which will remove a lot of code, indeed.
>> You can keep it for now.  It may become clearer how to fix it after the
>> event loop is cleaned up.
> By event loop, do you mean the per-vCPU jobs list?

Yes, I meant event handling (which involves the jobs list).

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 76/92] kvm: x86: disable EPT A/D bits if introspection is present
       [not found]     ` <0550f8d65bb97486e98d88255ea45d490da6b802.camel@bitdefender.com>
@ 2019-08-13 21:05       ` Paolo Bonzini
  2019-08-14  8:53         ` Mihai Donțu
  0 siblings, 1 reply; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-13 21:05 UTC (permalink / raw)
  To: Mihai Donțu, KVM list

On 13/08/19 20:36, Mihai Donțu wrote:
>> Why?
> When EPT A/D is enabled, all guest page table walks are treated as
> writes (like AMD's NPT). Thus, an introspection tool hooking the guest
> page tables would trigger a flood of VMEXITs (EPT write violations)
> that will get the introspected VM into an unusable state.
> 
> Our implementation of such an introspection tool builds a cache of
> {cr3, gva} -> gpa, which is why it needs to monitor all guest PTs by
> hooking them for write.

Please include the kvm list too.

One issue here is that it changes the nested VMX ABI.  Can you leave EPT
A/D in place for the shadow EPT MMU, but not for "regular" EPT pages?

Also, what is the state of introspection support on AMD?

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 76/92] kvm: x86: disable EPT A/D bits if introspection is present
  2019-08-13 21:05       ` Paolo Bonzini
@ 2019-08-14  8:53         ` Mihai Donțu
  2019-08-14 10:36           ` Paolo Bonzini
  0 siblings, 1 reply; 158+ messages in thread
From: Mihai Donțu @ 2019-08-14  8:53 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Adalbert Lazăr

On Tue, 2019-08-13 at 23:05 +0200, Paolo Bonzini wrote:
> On 13/08/19 20:36, Mihai Donțu wrote:
> > > Why?
> > When EPT A/D is enabled, all guest page table walks are treated as
> > writes (like AMD's NPT). Thus, an introspection tool hooking the guest
> > page tables would trigger a flood of VMEXITs (EPT write violations)
> > that will get the introspected VM into an unusable state.
> > 
> > Our implementation of such an introspection tool builds a cache of
> > {cr3, gva} -> gpa, which is why it needs to monitor all guest PTs by
> > hooking them for write.
> 
> Please include the kvm list too.

I apologize. I did not notice I trimmed the CC list.

> One issue here is that it changes the nested VMX ABI.  Can you leave EPT
> A/D in place for the shadow EPT MMU, but not for "regular" EPT pages?

I'm not sure I follow. Something like this?

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 48e3cdd7b009..569e8f4d5dd7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2853,7 +2853,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa)
        eptp |= (get_ept_level(vcpu) == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4;
 
        if (enable_ept_ad_bits &&
-           (!is_guest_mode(vcpu) || nested_ept_ad_enabled(vcpu)))
+           ((!is_guest_mode(vcpu) && !kvmi_is_present()) || nested_ept_ad_enabled(vcpu)))
                eptp |= VMX_EPTP_AD_ENABLE_BIT;
        eptp |= (root_hpa & PAGE_MASK);

> Also, what is the state of introspection support on AMD?

The way we'd like to do introspection is not currently possible on AMD
CPUs. The reasons being:
 * the NPT has the behavior I talked above (guest PT walks translate to
   writes)
 * it is not possible to mark a guest page as execute-only
 * there is no equivalent to Intel's MTF, though it can _probably_ be
   emulated using a creative trick involving the debug registers

If, however, none of the above are of importance for other users,
everything else should work. I have to admit, though, we have not done
any tests.

Regards,

-- 
Mihai Donțu



^ permalink raw reply related	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 76/92] kvm: x86: disable EPT A/D bits if introspection is present
  2019-08-14  8:53         ` Mihai Donțu
@ 2019-08-14 10:36           ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-14 10:36 UTC (permalink / raw)
  To: Mihai Donțu; +Cc: kvm, Adalbert Lazăr

On 14/08/19 10:53, Mihai Donțu wrote:
> On Tue, 2019-08-13 at 23:05 +0200, Paolo Bonzini wrote:
>> On 13/08/19 20:36, Mihai Donțu wrote:
>>>> Why?
>>> When EPT A/D is enabled, all guest page table walks are treated as
>>> writes (like AMD's NPT). Thus, an introspection tool hooking the guest
>>> page tables would trigger a flood of VMEXITs (EPT write violations)
>>> that will get the introspected VM into an unusable state.
>>>
>>> Our implementation of such an introspection tool builds a cache of
>>> {cr3, gva} -> gpa, which is why it needs to monitor all guest PTs by
>>> hooking them for write.
>>
>> Please include the kvm list too.
> 
> I apologize. I did not notice I trimmed the CC list.
> 
>> One issue here is that it changes the nested VMX ABI.  Can you leave EPT
>> A/D in place for the shadow EPT MMU, but not for "regular" EPT pages?
> 
> I'm not sure I follow. Something like this?
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 48e3cdd7b009..569e8f4d5dd7 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2853,7 +2853,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa)
>         eptp |= (get_ept_level(vcpu) == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4;
>  
>         if (enable_ept_ad_bits &&
> -           (!is_guest_mode(vcpu) || nested_ept_ad_enabled(vcpu)))
> +           ((!is_guest_mode(vcpu) && !kvmi_is_present()) || nested_ept_ad_enabled(vcpu)))
>                 eptp |= VMX_EPTP_AD_ENABLE_BIT;
>         eptp |= (root_hpa & PAGE_MASK);

Almost, because you also have to change the accessed/dirty bits for
mmu.c.  Probably by moving the call to kvm_mmu_set_mask_ptes from
vmx_enable_tdp to vmx_set_cr3.

Paolo

>> Also, what is the state of introspection support on AMD?
> 
> The way we'd like to do introspection is not currently possible on AMD
> CPUs. The reasons being:
>  * the NPT has the behavior I talked above (guest PT walks translate to
>    writes)
>  * it is not possible to mark a guest page as execute-only
>  * there is no equivalent to Intel's MTF, though it can _probably_ be
>    emulated using a creative trick involving the debug registers
> 
> If, however, none of the above are of importance for other users,
> everything else should work. I have to admit, though, we have not done
> any tests.
> 
> Regards,
> 


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem)
       [not found]           ` <5d53d8d1.1c69fb81.7d32.0bedSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2019-08-14 10:37             ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-14 10:37 UTC (permalink / raw)
  To: Adalbert Lazăr, Sean Christopherson
  Cc: kvm, linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu,
	Mircea Cîrjaliu

On 14/08/19 11:48, Adalbert Lazăr wrote:
>> Why does closing the socket require destroying the kvmi object?  E.g. can
>> it be marked as defunct or whatever and only fully removed on a synchronous
>> unhook from userspace?  Re-hooking could either require said unhook, or
>> maybe reuse the existing kvmi object with a new socket.
> Will it be better to have the following ioctls?
> 
>   - hook (alloc kvmi and kvmi_vcpu structs)
>   - notify_imminent_unhook (send the KVMI_EVENT_UNHOOK event)
>   - unhook (free kvmi and kvmi_vcpu structs)

Yeah, that is nice also because it leaves the timeout policy to
userspace.  (BTW, please change references to QEMU to "userspace").

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 74/92] kvm: x86: do not unconditionally patch the hypercall instruction during emulation
       [not found]     ` <5d53f965.1c69fb81.cd952.035bSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2019-08-14 12:33       ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-14 12:33 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C Zhang, Mihai Donțu, Joerg Roedel

On 14/08/19 14:07, Adalbert Lazăr wrote:
> On Tue, 13 Aug 2019 11:20:45 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 09/08/19 18:00, Adalbert Lazăr wrote:
>>> From: Mihai Donțu <mdontu@bitdefender.com>
>>>
>>> It can happened for us to end up emulating the VMCALL instruction as a
>>> result of the handling of an EPT write fault. In this situation, the
>>> emulator will try to unconditionally patch the correct hypercall opcode
>>> bytes using emulator_write_emulated(). However, this last call uses the
>>> fault GPA (if available) or walks the guest page tables at RIP,
>>> otherwise. The trouble begins when using KVMI, when we forbid the use of
>>> the fault GPA and fallback to the guest pt walk: in Windows (8.1 and
>>> newer) the page that we try to write into is marked read-execute and as
>>> such emulator_write_emulated() fails and we inject a write #PF, leading
>>> to a guest crash.
>>>
>>> The fix is rather simple: check the existing instruction bytes before
>>> doing the patching. This does not change the normal KVM behaviour, but
>>> does help when using KVMI as we no longer inject a write #PF.
>>
>> Fixing the hypercall is just an optimization.  Can we just hush and
>> return to the guest if emulator_write_emulated returns
>> X86EMUL_PROPAGATE_FAULT?
>>
>> Paolo
> 
> Something like this?
> 
> 	err = emulator_write_emulated(...);
> 	if (err == X86EMUL_PROPAGATE_FAULT)
> 		err = X86EMUL_CONTINUE;
> 	return err;

Yes.  The only difference will be that you'll keep getting #UD vmexits
instead of hypercall vmexits.  It's also safer, we want to obey those
r-x permissions because PatchGuard would crash the system if it noticed
the rewriting for whatever reason.

Paolo

>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 04b1d2916a0a..965c4f0108eb 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -7363,16 +7363,33 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>>>  }
>>>  EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
>>>  
>>> +#define KVM_HYPERCALL_INSN_LEN 3
>>> +
>>>  static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
>>>  {
>>> +	int err;
>>>  	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
>>> -	char instruction[3];
>>> +	char buf[KVM_HYPERCALL_INSN_LEN];
>>> +	char instruction[KVM_HYPERCALL_INSN_LEN];
>>>  	unsigned long rip = kvm_rip_read(vcpu);
>>>  
>>> +	err = emulator_read_emulated(ctxt, rip, buf, sizeof(buf),
>>> +				     &ctxt->exception);
>>> +	if (err != X86EMUL_CONTINUE)
>>> +		return err;
>>> +
>>>  	kvm_x86_ops->patch_hypercall(vcpu, instruction);
>>> +	if (!memcmp(instruction, buf, sizeof(instruction)))
>>> +		/*
>>> +		 * The hypercall instruction is the correct one. Retry
>>> +		 * its execution maybe we got here as a result of an
>>> +		 * event other than #UD which has been resolved in the
>>> +		 * mean time.
>>> +		 */
>>> +		return X86EMUL_CONTINUE;
>>>  
>>> -	return emulator_write_emulated(ctxt, rip, instruction, 3,
>>> -		&ctxt->exception);
>>> +	return emulator_write_emulated(ctxt, rip, instruction,
>>> +				       sizeof(instruction), &ctxt->exception);
>>>  }


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 64/92] kvm: introspection: add single-stepping
  2019-08-12 20:50   ` Sean Christopherson
@ 2019-08-14 12:36     ` Nicusor CITU
  2019-08-14 12:53       ` Paolo Bonzini
  0 siblings, 1 reply; 158+ messages in thread
From: Nicusor CITU @ 2019-08-14 12:36 UTC (permalink / raw)
  To: Sean Christopherson, Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu, Jim Mattson,
	Joerg Roedel

On Mon, 2019-08-12 at 13:50 -0700, Sean Christopherson wrote:
> On Fri, Aug 09, 2019 at 07:00:19PM +0300, Adalbert Lazăr wrote:
> > From: Nicușor Cîțu <ncitu@bitdefender.com>
> > 
> > This would be used either if the introspection tool request it as a
> > reply to a KVMI_EVENT_PF event or to cope with instructions that
> > cannot
> > be handled by the x86 emulator during the handling of a VMEXIT. In
> > these situations, all other vCPU-s are kicked and held, the EPT-
> > based
> > protection is removed and the guest is single stepped by the vCPU
> > that
> > triggered the initial VMEXIT. Upon completion the EPT-base
> > protection
> > is reinstalled and all vCPU-s all allowed to return to the guest.
> > 
> > This is a rather slow workaround that kicks in occasionally. In the
> > future, the most frequently single-stepped instructions should be
> > added
> > to the emulator (usually, stores to and from memory - SSE/AVX).
> > 
> > For the moment it works only on Intel.
> > 
> > CC: Jim Mattson <jmattson@google.com>
> > CC: Sean Christopherson <sean.j.christopherson@intel.com>
> > CC: Joerg Roedel <joro@8bytes.org>
> > Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
> > Co-developed-by: Mihai Donțu <mdontu@bitdefender.com>
> > Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> > Co-developed-by: Adalbert Lazăr <alazar@bitdefender.com>
> > Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |   3 +
> >  arch/x86/kvm/kvmi.c             |  47 ++++++++++-
> >  arch/x86/kvm/svm.c              |   5 ++
> >  arch/x86/kvm/vmx/vmx.c          |  17 ++++
> >  arch/x86/kvm/x86.c              |  19 +++++
> >  include/linux/kvmi.h            |   4 +
> >  virt/kvm/kvmi.c                 | 145
> > +++++++++++++++++++++++++++++++-
> >  virt/kvm/kvmi_int.h             |  16 ++++
> >  8 files changed, 253 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h
> > b/arch/x86/include/asm/kvm_host.h
> > index ad36a5fc2048..60e2c298d469 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1016,6 +1016,7 @@ struct kvm_x86_ops {
> >  	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr,
> >  				bool enable);
> >  	bool (*desc_intercept)(struct kvm_vcpu *vcpu, bool enable);
> > +	void (*set_mtf)(struct kvm_vcpu *vcpu, bool enable);
> 
> MTF is a VMX specific implementation of single-stepping, this should
> be
> enable_single_step() or something along those lines.  For example, I
> assume
> SVM could implement something that is mostly functional via
> RFLAGS.TF.
> 
> >  	void (*cr3_write_exiting)(struct kvm_vcpu *vcpu, bool enable);
> >  	bool (*nested_pagefault)(struct kvm_vcpu *vcpu);
> >  	bool (*spt_fault)(struct kvm_vcpu *vcpu);
> > @@ -1628,6 +1629,8 @@ void kvm_arch_msr_intercept(struct kvm_vcpu
> > *vcpu, unsigned int msr,
> >  				bool enable);
> >  bool kvm_mmu_nested_pagefault(struct kvm_vcpu *vcpu);
> >  bool kvm_spt_fault(struct kvm_vcpu *vcpu);
> > +void kvm_set_mtf(struct kvm_vcpu *vcpu, bool enable);
> > +void kvm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
> >  void kvm_control_cr3_write_exiting(struct kvm_vcpu *vcpu, bool
> > enable);
> >  
> >  #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/kvmi.c b/arch/x86/kvm/kvmi.c
> > index 04cac5b8a4d0..f0ab4bd9eb37 100644
> > --- a/arch/x86/kvm/kvmi.c
> > +++ b/arch/x86/kvm/kvmi.c
> > @@ -520,7 +520,6 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu,
> > gpa_t gpa, gva_t gva,
> >  	u32 ctx_size;
> >  	u64 ctx_addr;
> >  	u32 action;
> > -	bool singlestep_ignored;
> >  	bool ret = false;
> >  
> >  	if (!kvm_spt_fault(vcpu))
> > @@ -533,7 +532,7 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu,
> > gpa_t gpa, gva_t gva,
> >  	if (ivcpu->effective_rep_complete)
> >  		return true;
> >  
> > -	action = kvmi_msg_send_pf(vcpu, gpa, gva, access,
> > &singlestep_ignored,
> > +	action = kvmi_msg_send_pf(vcpu, gpa, gva, access, &ivcpu-
> > >ss_requested,
> >  				  &ivcpu->rep_complete, &ctx_addr,
> >  				  ivcpu->ctx_data, &ctx_size);
> >  
> > @@ -547,6 +546,8 @@ bool kvmi_arch_pf_event(struct kvm_vcpu *vcpu,
> > gpa_t gpa, gva_t gva,
> >  		ret = true;
> >  		break;
> >  	case KVMI_EVENT_ACTION_RETRY:
> > +		if (ivcpu->ss_requested && !kvmi_start_ss(vcpu, gpa,
> > access))
> > +			ret = true;
> >  		break;
> >  	default:
> >  		kvmi_handle_common_event_actions(vcpu, action, "PF");
> > @@ -758,6 +759,48 @@ int kvmi_arch_cmd_control_cr(struct kvm_vcpu
> > *vcpu,
> >  	return 0;
> >  }
> >  
> > +void kvmi_arch_start_single_step(struct kvm_vcpu *vcpu)
> > +{
> > +	kvm_set_mtf(vcpu, true);
> > +
> > +	/*
> > +	 * Set block by STI only if the RFLAGS.IF = 1.
> > +	 * Blocking by both STI and MOV/POP SS is not possible.
> > +	 */
> > +	if (kvm_arch_interrupt_allowed(vcpu))
> > +		kvm_set_interrupt_shadow(vcpu, KVM_X86_SHADOW_INT_STI);
> 
> This is wrong, the STI shadow only exists if interrupts were
> unblocked
> prior to STI.  I'm guessing this is a hack to workaround
> kvmi_arch_stop_single_step() not properly handling the clearing case.
> 

Thank you for signaling this. This piece of code is leftover from the
initial attempt to make single step running.
Based on latest results, we do not actually need to change
interruptibility during the singlestep. It is enough to enable the MTF
and just suppress any interrupt injection (if any) before leaving the
vcpu entering in guest.

> > ++}
> > +
> > +void kvmi_arch_stop_single_step(struct kvm_vcpu *vcpu)
> > +{
> > +	kvm_set_mtf(vcpu, false);
> > +	/*
> > +	 * The blocking by STI is cleared after the guest
> > +	 * executes one instruction or incurs an exception.
> > +	 * However we migh stop the SS before entering to guest,
> > +	 * so be sure we are clearing the STI blocking.
> > +	 */
> > +	kvm_set_interrupt_shadow(vcpu, 0);
> 
> There are only three callers of kvmi_stop_ss(), it should be possible
> to accurately update interruptibility:
> 
>   - kvmi_run_ss() fail, do nothing
>   - VM-Exit that wasn't a single-step - clear interruptibility if the
>     guest executed an instruction (including faulted on an instr).
>   - MTF VM-Exit - do nothing (VMCS should already be up-to-date).
> 
> > +}
> > +
> > +u8 kvmi_arch_relax_page_access(u8 old, u8 new)
> > +{
> > +	u8 ret = old | new;
> > +
> > +	/*
> > +	 * An SPTE entry with just the -wx bits set can trigger a
> > +	 * misconfiguration error from the hardware, as it's the case
> > +	 * for x86 where this access mode is used to mark I/O memory.
> > +	 * Thus, we make sure that -wx accesses are translated to rwx.
> > +	 */
> > +	if ((ret & (KVMI_PAGE_ACCESS_W | KVMI_PAGE_ACCESS_X)) ==
> > +	    (KVMI_PAGE_ACCESS_W | KVMI_PAGE_ACCESS_X))
> > +		ret |= KVMI_PAGE_ACCESS_R;
> > +
> > +	return ret;
> > +}
> > +
> >  static const struct {
> >  	unsigned int allow_bit;
> >  	enum kvm_page_track_mode track_mode;
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index b178b8900660..3481c0247680 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -7183,6 +7183,10 @@ static bool svm_spt_fault(struct kvm_vcpu
> > *vcpu)
> >  	return (svm->vmcb->control.exit_code == SVM_EXIT_NPF);
> >  }
> >  
> > +static void svm_set_mtf(struct kvm_vcpu *vcpu, bool enable)
> > +{
> > +}
> > +
> >  static void svm_cr3_write_exiting(struct kvm_vcpu *vcpu, bool
> > enable)
> >  {
> >  }
> > @@ -7225,6 +7229,7 @@ static struct kvm_x86_ops svm_x86_ops
> > __ro_after_init = {
> >  	.cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr,
> >  	.has_emulated_msr = svm_has_emulated_msr,
> >  
> > +	.set_mtf = svm_set_mtf,
> >  	.cr3_write_exiting = svm_cr3_write_exiting,
> >  	.msr_intercept = svm_msr_intercept,
> >  	.desc_intercept = svm_desc_intercept,
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 7d1e341b51ad..f0369d0574dc 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -5384,6 +5384,7 @@ static int handle_invalid_op(struct kvm_vcpu
> > *vcpu)
> >  
> >  static int handle_monitor_trap(struct kvm_vcpu *vcpu)
> >  {
> > +	kvmi_stop_ss(vcpu);
> >  	return 1;
> >  }
> >  
> > @@ -5992,6 +5993,11 @@ static int vmx_handle_exit(struct kvm_vcpu
> > *vcpu)
> >  		}
> >  	}
> >  
> > +	if (kvmi_vcpu_enabled_ss(vcpu)
> > +			&& exit_reason != EXIT_REASON_EPT_VIOLATION
> > +			&& exit_reason !=
> > EXIT_REASON_MONITOR_TRAP_FLAG)
> 
> Bad indentation.  This is prevelant through the series.
> 
> > +		kvmi_stop_ss(vcpu);
> > +
> >  	if (exit_reason < kvm_vmx_max_exit_handlers
> >  	    && kvm_vmx_exit_handlers[exit_reason])
> >  		return kvm_vmx_exit_handlers[exit_reason](vcpu);
> > @@ -7842,6 +7848,16 @@ static __exit void hardware_unsetup(void)
> >  	free_kvm_area();
> >  }
> >  
> > +static void vmx_set_mtf(struct kvm_vcpu *vcpu, bool enable)
> > +{
> > +	if (enable)
> > +		vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL,
> > +			      CPU_BASED_MONITOR_TRAP_FLAG);
> > +	else
> > +		vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
> > +				CPU_BASED_MONITOR_TRAP_FLAG);
> > +}
> > +
> >  static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int
> > msr,
> >  			      bool enable)
> >  {
> > @@ -7927,6 +7943,7 @@ static struct kvm_x86_ops vmx_x86_ops
> > __ro_after_init = {
> >  	.cpu_has_accelerated_tpr = report_flexpriority,
> >  	.has_emulated_msr = vmx_has_emulated_msr,
> >  
> > +	.set_mtf = vmx_set_mtf,
> >  	.msr_intercept = vmx_msr_intercept,
> >  	.cr3_write_exiting = vmx_cr3_write_exiting,
> >  	.desc_intercept = vmx_desc_intercept,
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 38aaddadb93a..65855340249a 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -7358,6 +7358,13 @@ static int inject_pending_event(struct
> > kvm_vcpu *vcpu, bool req_int_win)
> >  {
> >  	int r;
> >  
> > +	if (kvmi_vcpu_enabled_ss(vcpu))
> > +		/*
> > +		 * We cannot inject events during single-stepping.
> > +		 * Try again later.
> > +		 */
> > +		return -1;
> > +
> >  	/* try to reinject previous events if any */
> >  
> >  	if (vcpu->arch.exception.injected)
> > @@ -10134,6 +10141,18 @@ void kvm_control_cr3_write_exiting(struct
> > kvm_vcpu *vcpu, bool enable)
> >  }
> >  EXPORT_SYMBOL(kvm_control_cr3_write_exiting);
> >  
> > +void kvm_set_mtf(struct kvm_vcpu *vcpu, bool enable)
> > +{
> > +	kvm_x86_ops->set_mtf(vcpu, enable);
> > +}
> > +EXPORT_SYMBOL(kvm_set_mtf);
> > +
> > +void kvm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
> > +{
> > +	kvm_x86_ops->set_interrupt_shadow(vcpu, mask);
> > +}
> > +EXPORT_SYMBOL(kvm_set_interrupt_shadow);
> 
> Why do these wrappers exist, and why are they
> exported?  Introspection is
> built into kvm, any reason not to use kvm_x86_ops directly?  The most
> definitely don't need to be exported.
> 
> > +
> >  bool kvm_spt_fault(struct kvm_vcpu *vcpu)
> >  {
> >  	return kvm_x86_ops->spt_fault(vcpu);
> > diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
> > index 5d162b9e67f2..1dc90284dc3a 100644
> > --- a/include/linux/kvmi.h
> > +++ b/include/linux/kvmi.h
> > @@ -22,6 +22,8 @@ bool kvmi_queue_exception(struct kvm_vcpu *vcpu);
> >  void kvmi_trap_event(struct kvm_vcpu *vcpu);
> >  bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u8 descriptor,
> > u8 write);
> >  void kvmi_handle_requests(struct kvm_vcpu *vcpu);
> > +void kvmi_stop_ss(struct kvm_vcpu *vcpu);
> > +bool kvmi_vcpu_enabled_ss(struct kvm_vcpu *vcpu);
> 
> Spell out single step, and be consistent between single_step and
> singlestep.
> That applies to pretty much every variable and function unless doing
> so
> really makes the verbosity obnoxious.
> 
> >  void kvmi_init_emulate(struct kvm_vcpu *vcpu);
> >  void kvmi_activate_rep_complete(struct kvm_vcpu *vcpu);
> >  bool kvmi_bp_intercepted(struct kvm_vcpu *vcpu, u32 dbg);
> > @@ -44,6 +46,8 @@ static inline void kvmi_handle_requests(struct
> > kvm_vcpu *vcpu) { }
> >  static inline bool kvmi_hypercall_event(struct kvm_vcpu *vcpu) {
> > return false; }
> >  static inline bool kvmi_queue_exception(struct kvm_vcpu *vcpu) {
> > return true; }
> >  static inline void kvmi_trap_event(struct kvm_vcpu *vcpu) { }
> > +static inline void kvmi_stop_ss(struct kvm_vcpu *vcpu) { }
> > +static inline bool kvmi_vcpu_enabled_ss(struct kvm_vcpu *vcpu) {
> > return false; }
> >  static inline void kvmi_init_emulate(struct kvm_vcpu *vcpu) { }
> >  static inline void kvmi_activate_rep_complete(struct kvm_vcpu
> > *vcpu) { }
> >  static inline bool kvmi_bp_intercepted(struct kvm_vcpu *vcpu, u32
> > dbg)
> > diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
> > index d47a725a4045..a3a5af9080a9 100644
> > --- a/virt/kvm/kvmi.c
> > +++ b/virt/kvm/kvmi.c
> > @@ -1260,11 +1260,19 @@ void kvmi_run_jobs(struct kvm_vcpu *vcpu)
> >  	}
> >  }
> >  
> > +static bool need_to_wait_for_ss(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> > +	struct kvmi *ikvm = IKVM(vcpu->kvm);
> > +
> > +	return atomic_read(&ikvm->ss_active) && !ivcpu->ss_owner;
> > +}
> > +
> >  static bool need_to_wait(struct kvm_vcpu *vcpu)
> >  {
> >  	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> >  
> > -	return ivcpu->reply_waiting;
> > +	return ivcpu->reply_waiting || need_to_wait_for_ss(vcpu);
> >  }
> >  
> >  static bool done_waiting(struct kvm_vcpu *vcpu)
> > @@ -1572,6 +1580,141 @@ int kvmi_cmd_pause_vcpu(struct kvm_vcpu
> > *vcpu, bool wait)
> >  	return 0;
> >  }
> >  
> > +void kvmi_stop_ss(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> > +	struct kvm *kvm = vcpu->kvm;
> > +	struct kvmi *ikvm;
> > +	int i;
> > +
> > +	ikvm = kvmi_get(kvm);
> > +	if (!ikvm)
> > +		return;
> > +
> > +	if (unlikely(!ivcpu->ss_owner)) {
> > +		kvmi_warn(ikvm, "%s\n", __func__);
> > +		goto out;
> > +	}
> > +
> > +	for (i = ikvm->ss_level; i--;)
> > +		kvmi_set_gfn_access(kvm,
> > +				    ikvm->ss_context[i].gfn,
> > +				    ikvm->ss_context[i].old_access,
> > +				    ikvm-
> > >ss_context[i].old_write_bitmap);
> > +
> > +	ikvm->ss_level = 0;
> > +
> > +	kvmi_arch_stop_single_step(vcpu);
> > +
> > +	atomic_set(&ikvm->ss_active, false);
> > +	/*
> > +	 * Make ss_active update visible
> > +	 * before resuming all the other vCPUs.
> > +	 */
> > +	smp_mb__after_atomic();
> > +	kvm_make_all_cpus_request(kvm, 0);
> > +
> > +	ivcpu->ss_owner = false;
> > +
> > +out:
> > +	kvmi_put(kvm);
> > +}
> > +EXPORT_SYMBOL(kvmi_stop_ss);
> > +
> > +static bool kvmi_acquire_ss(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> > +	struct kvmi *ikvm = IKVM(vcpu->kvm);
> > +
> > +	if (ivcpu->ss_owner)
> > +		return true;
> > +
> > +	if (atomic_cmpxchg(&ikvm->ss_active, false, true) != false)
> > +		return false;
> > +
> > +	kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_INTROSPECTION |
> > +						KVM_REQUEST_WAIT);
> > +
> > +	ivcpu->ss_owner = true;
> > +
> > +	return true;
> > +}
> > +
> > +static bool kvmi_run_ss(struct kvm_vcpu *vcpu, gpa_t gpa, u8
> > access)
> > +{
> > +	struct kvmi *ikvm = IKVM(vcpu->kvm);
> > +	u8 old_access, new_access;
> > +	u32 old_write_bitmap;
> > +	gfn_t gfn = gpa_to_gfn(gpa);
> > +	int err;
> > +
> > +	kvmi_arch_start_single_step(vcpu);
> > +
> > +	err = kvmi_get_gfn_access(ikvm, gfn, &old_access,
> > &old_write_bitmap);
> > +	/* likely was removed from radix tree due to rwx */
> > +	if (err) {
> > +		kvmi_warn(ikvm, "%s: gfn 0x%llx not found in the radix
> > tree\n",
> > +			  __func__, gfn);
> > +		return true;
> > +	}
> > +
> > +	if (ikvm->ss_level == SINGLE_STEP_MAX_DEPTH - 1) {
> > +		kvmi_err(ikvm, "single step limit reached\n");
> > +		return false;
> > +	}
> > +
> > +	ikvm->ss_context[ikvm->ss_level].gfn = gfn;
> > +	ikvm->ss_context[ikvm->ss_level].old_access = old_access;
> > +	ikvm->ss_context[ikvm->ss_level].old_write_bitmap =
> > old_write_bitmap;
> > +	ikvm->ss_level++;
> > +
> > +	new_access = kvmi_arch_relax_page_access(old_access, access);
> > +
> > +	kvmi_set_gfn_access(vcpu->kvm, gfn, new_access,
> > old_write_bitmap);
> > +
> > +	return true;
> > +}
> > +
> > +bool kvmi_start_ss(struct kvm_vcpu *vcpu, gpa_t gpa, u8 access)
> > +{
> > +	bool ret = false;
> > +
> > +	while (!kvmi_acquire_ss(vcpu)) {
> > +		int err = kvmi_run_jobs_and_wait(vcpu);
> > +
> > +		if (err) {
> > +			kvmi_err(IKVM(vcpu->kvm), "kvmi_acquire_ss()
> > has failed\n");
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	if (kvmi_run_ss(vcpu, gpa, access))
> > +		ret = true;
> > +	else
> > +		kvmi_stop_ss(vcpu);
> > +
> > +out:
> > +	return ret;
> > +}
> > +
> > +bool kvmi_vcpu_enabled_ss(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> > +	struct kvmi *ikvm;
> > +	bool ret;
> > +
> > +	ikvm = kvmi_get(vcpu->kvm);
> > +	if (!ikvm)
> > +		return false;
> > +
> > +	ret = ivcpu->ss_owner;
> > +
> > +	kvmi_put(vcpu->kvm);
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(kvmi_vcpu_enabled_ss);
> > +
> >  static void kvmi_job_abort(struct kvm_vcpu *vcpu, void *ctx)
> >  {
> >  	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> > diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
> > index d7f9858d3e97..1550fe33ed48 100644
> > --- a/virt/kvm/kvmi_int.h
> > +++ b/virt/kvm/kvmi_int.h
> > @@ -126,6 +126,9 @@ struct kvmi_vcpu {
> >  		DECLARE_BITMAP(high, KVMI_NUM_MSR);
> >  	} msr_mask;
> >  
> > +	bool ss_owner;
> 
> Why is single-stepping mutually exclusive across all vCPUs?  Does
> that
> always have to be the case?
> 
> > +	bool ss_requested;
> > +
> >  	struct list_head job_list;
> >  	spinlock_t job_lock;
> >  
> > @@ -151,6 +154,15 @@ struct kvmi {
> >  	DECLARE_BITMAP(event_allow_mask, KVMI_NUM_EVENTS);
> >  	DECLARE_BITMAP(vm_ev_mask, KVMI_NUM_EVENTS);
> >  
> > +#define SINGLE_STEP_MAX_DEPTH 8
> > +	struct {
> > +		gfn_t gfn;
> > +		u8 old_access;
> > +		u32 old_write_bitmap;
> > +	} ss_context[SINGLE_STEP_MAX_DEPTH];
> > +	u8 ss_level;
> > +	atomic_t ss_active;
> 
> Good opportunity for an unnamed struct, e.g.
> 
> 	struct {
> 		struct single_step_context[...];
> 		bool owner;
> 		bool requested;
> 		u8 level
> 		atomic_t active;
> 	} single_step;
> 
> > +
> >  	struct {
> >  		bool initialized;
> >  		atomic_t enabled;
> > @@ -224,6 +236,7 @@ int kvmi_add_job(struct kvm_vcpu *vcpu,
> >  		 void *ctx, void (*free_fct)(void *ctx));
> >  void kvmi_handle_common_event_actions(struct kvm_vcpu *vcpu, u32
> > action,
> >  				      const char *str);
> > +bool kvmi_start_ss(struct kvm_vcpu *vcpu, gpa_t gpa, u8 access);
> >  
> >  /* arch */
> >  void kvmi_arch_update_page_tracking(struct kvm *kvm,
> > @@ -274,6 +287,9 @@ int kvmi_arch_cmd_inject_exception(struct
> > kvm_vcpu *vcpu, u8 vector,
> >  				   u64 address);
> >  int kvmi_arch_cmd_control_cr(struct kvm_vcpu *vcpu,
> >  			     const struct kvmi_control_cr *req);
> > +void kvmi_arch_start_single_step(struct kvm_vcpu *vcpu);
> > +void kvmi_arch_stop_single_step(struct kvm_vcpu *vcpu);
> > +u8 kvmi_arch_relax_page_access(u8 old, u8 new);
> >  int kvmi_arch_cmd_control_msr(struct kvm_vcpu *vcpu,
> >  			      const struct kvmi_control_msr *req);
> >  int kvmi_arch_cmd_get_mtrr_type(struct kvm_vcpu *vcpu, u64 gpa, u8
> > *type);
> 
> ________________________
> This email was scanned by Bitdefender


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 64/92] kvm: introspection: add single-stepping
  2019-08-14 12:36     ` Nicusor CITU
@ 2019-08-14 12:53       ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-08-14 12:53 UTC (permalink / raw)
  To: Nicusor CITU, Sean Christopherson, Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu, Jim Mattson,
	Joerg Roedel

On 14/08/19 14:36, Nicusor CITU wrote:
> Thank you for signaling this. This piece of code is leftover from the
> initial attempt to make single step running.
> Based on latest results, we do not actually need to change
> interruptibility during the singlestep. It is enough to enable the MTF
> and just suppress any interrupt injection (if any) before leaving the
> vcpu entering in guest.
> 

This is exactly what testcases are for...

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR
  2019-08-12 21:05   ` Sean Christopherson
@ 2019-08-15  6:36     ` Nicusor CITU
  2019-08-19 18:36       ` Sean Christopherson
  0 siblings, 1 reply; 158+ messages in thread
From: Nicusor CITU @ 2019-08-15  6:36 UTC (permalink / raw)
  To: Sean Christopherson, Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu

> > +	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr,
> > +				bool enable);
> 
> This should be toggle_wrmsr_intercept(), or toggle_msr_intercept()
> with a paramter to control RDMSR vs. WRMSR.

Ok, I can do that.


> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 6450c8c44771..0306c7ef3158 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7784,6 +7784,15 @@ static __exit void hardware_unsetup(void)
> >  	free_kvm_area();
> >  }
> >  
> > +static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int
> > msr,
> > +			      bool enable)
> > +{
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
> > +
> > +	vmx_set_intercept_for_msr(msr_bitmap, msr, MSR_TYPE_W, enable);
> > +}
> 
> Unless I overlooked a check, this will allow userspace to disable
> WRMSR interception for any MSR in the above range, i.e. userspace can
> use KVM to gain full write access to pretty much all the interesting
> MSRs. This needs to only disable interception if KVM had interception
> disabled before introspection started modifying state.

We only need to enable the MSR interception. We never disable it -
please see kvmi_arch_cmd_control_msr().


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
       [not found]     ` <1565694095.D172a51.28640.@15f23d3a749365d981e968181cce585d2dcb3ffa>
@ 2019-08-15 19:19       ` Jerome Glisse
  2019-08-15 20:16         ` Jerome Glisse
  0 siblings, 1 reply; 158+ messages in thread
From: Jerome Glisse @ 2019-08-15 19:19 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: Matthew Wilcox, kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu, Mircea Cîrjaliu

On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox <willy@infradead.org> wrote:
> > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > +++ b/include/linux/page-flags.h
> > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > >   */
> > >  #define PAGE_MAPPING_ANON	0x1
> > >  #define PAGE_MAPPING_MOVABLE	0x2
> > > +#define PAGE_MAPPING_REMOTE	0x4
> > 
> > Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> > Who's guaranteeing that?
> > 
> > This is an awfully big patch to the memory management code, buried in
> > the middle of a gigantic series which almost guarantees nobody would
> > look at it.  I call shenanigans.
> > 
> > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
> > >   * __page_set_anon_rmap - set up new anonymous rmap
> > >   * @page:	Page or Hugepage to add to rmap
> > >   * @vma:	VM area to add page to.
> > > - * @address:	User virtual address of the mapping	
> > > + * @address:	User virtual address of the mapping
> > 
> > And mixing in fluff changes like this is a real no-no.  Try again.
> > 
> 
> No bad intentions, just overzealous.
> I didn't want to hide anything from our patches.
> Once we advance with the introspection patches related to KVM we'll be
> back with the remote mapping patch, split and cleaned.

They are not bit left in struct page ! Looking at the patch it seems
you want to have your own pin count just for KVM. This is bad, we are
already trying to solve the GUP thing (see all various patchset about
GUP posted recently).

You need to rethink how you want to achieve this. Why not simply a
remote read()/write() into the process memory ie KVMI would call
an ioctl that allow to read or write into a remote process memory
like ptrace() but on steroid ...

Adding this whole big complex infrastructure without justification
of why we need to avoid round trip is just too much really.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-08-15 19:19       ` Jerome Glisse
@ 2019-08-15 20:16         ` Jerome Glisse
  2019-08-16 17:45           ` Jason Gunthorpe
  2019-08-23 12:39           ` Mircea CIRJALIU - MELIU
  0 siblings, 2 replies; 158+ messages in thread
From: Jerome Glisse @ 2019-08-15 20:16 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: Matthew Wilcox, kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu, Mircea Cîrjaliu

On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote:
> On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox <willy@infradead.org> wrote:
> > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > > +++ b/include/linux/page-flags.h
> > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > > >   */
> > > >  #define PAGE_MAPPING_ANON	0x1
> > > >  #define PAGE_MAPPING_MOVABLE	0x2
> > > > +#define PAGE_MAPPING_REMOTE	0x4
> > > 
> > > Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> > > Who's guaranteeing that?
> > > 
> > > This is an awfully big patch to the memory management code, buried in
> > > the middle of a gigantic series which almost guarantees nobody would
> > > look at it.  I call shenanigans.
> > > 
> > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
> > > >   * __page_set_anon_rmap - set up new anonymous rmap
> > > >   * @page:	Page or Hugepage to add to rmap
> > > >   * @vma:	VM area to add page to.
> > > > - * @address:	User virtual address of the mapping	
> > > > + * @address:	User virtual address of the mapping
> > > 
> > > And mixing in fluff changes like this is a real no-no.  Try again.
> > > 
> > 
> > No bad intentions, just overzealous.
> > I didn't want to hide anything from our patches.
> > Once we advance with the introspection patches related to KVM we'll be
> > back with the remote mapping patch, split and cleaned.
> 
> They are not bit left in struct page ! Looking at the patch it seems
> you want to have your own pin count just for KVM. This is bad, we are
> already trying to solve the GUP thing (see all various patchset about
> GUP posted recently).
> 
> You need to rethink how you want to achieve this. Why not simply a
> remote read()/write() into the process memory ie KVMI would call
> an ioctl that allow to read or write into a remote process memory
> like ptrace() but on steroid ...
> 
> Adding this whole big complex infrastructure without justification
> of why we need to avoid round trip is just too much really.

Thinking a bit more about this, you can achieve the same thing without
adding a single line to any mm code. Instead of having mmap with
PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device file
(i am assuming this is something you already have and can control
the mmap callback).

So now kernel side you have a vma with a vm_operations_struct under
your control this means that everything you want to block mm wise
from within the inspector process can be block through those call-
backs (find_special_page() specificaly for which you have to return
NULL all the time).

To mirror target process memory you can use hmm_mirror, when you
populate the inspector process page table you use insert_pfn()
(mmap of the kvm device file must mark this vma as PFNMAP).

By following the hmm_mirror API, anytime the target process has
a change in its page table (ie virtual address -> page) you will
get a callback and all you have to do is clear the page table
within the inspector process and flush tlb (use zap_page_range).

On page fault within the inspector process the fault callback of
vm_ops will get call and from there you call hmm_mirror following
its API.

Oh also mark the vma with VM_WIPEONFORK to avoid any issue if the
inspector process use fork() (you could support fork but then you
would need to mark the vma as SHARED and use unmap_mapping_pages
instead of zap_page_range).


There everything you want to do with already upstream mm code.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-08-15 20:16         ` Jerome Glisse
@ 2019-08-16 17:45           ` Jason Gunthorpe
  2019-08-23 12:39           ` Mircea CIRJALIU - MELIU
  1 sibling, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2019-08-16 17:45 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Adalbert Lazăr, Matthew Wilcox, kvm, linux-mm,
	virtualization, Paolo Bonzini, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu, Mircea Cîrjaliu

On Thu, Aug 15, 2019 at 04:16:30PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote:
> > On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> > > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox <willy@infradead.org> wrote:
> > > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > > > +++ b/include/linux/page-flags.h
> > > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > > > >   */
> > > > >  #define PAGE_MAPPING_ANON	0x1
> > > > >  #define PAGE_MAPPING_MOVABLE	0x2
> > > > > +#define PAGE_MAPPING_REMOTE	0x4
> > > > 
> > > > Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> > > > Who's guaranteeing that?
> > > > 
> > > > This is an awfully big patch to the memory management code, buried in
> > > > the middle of a gigantic series which almost guarantees nobody would
> > > > look at it.  I call shenanigans.
> > > > 
> > > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
> > > > >   * __page_set_anon_rmap - set up new anonymous rmap
> > > > >   * @page:	Page or Hugepage to add to rmap
> > > > >   * @vma:	VM area to add page to.
> > > > > - * @address:	User virtual address of the mapping	
> > > > > + * @address:	User virtual address of the mapping
> > > > 
> > > > And mixing in fluff changes like this is a real no-no.  Try again.
> > > > 
> > > 
> > > No bad intentions, just overzealous.
> > > I didn't want to hide anything from our patches.
> > > Once we advance with the introspection patches related to KVM we'll be
> > > back with the remote mapping patch, split and cleaned.
> > 
> > They are not bit left in struct page ! Looking at the patch it seems
> > you want to have your own pin count just for KVM. This is bad, we are
> > already trying to solve the GUP thing (see all various patchset about
> > GUP posted recently).
> > 
> > You need to rethink how you want to achieve this. Why not simply a
> > remote read()/write() into the process memory ie KVMI would call
> > an ioctl that allow to read or write into a remote process memory
> > like ptrace() but on steroid ...
> > 
> > Adding this whole big complex infrastructure without justification
> > of why we need to avoid round trip is just too much really.
> 
> Thinking a bit more about this, you can achieve the same thing without
> adding a single line to any mm code. Instead of having mmap with
> PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device file
> (i am assuming this is something you already have and can control
> the mmap callback).
> 
> So now kernel side you have a vma with a vm_operations_struct under
> your control this means that everything you want to block mm wise
> from within the inspector process can be block through those call-
> backs (find_special_page() specificaly for which you have to return
> NULL all the time).

I'm actually aware of a couple of use cases that would like to
mirror the VA of one process into another. One big one in the HPC
world is the out of tree 'xpmem' still in wide use today. xpmem is
basically what Jerome described above.

If you do an approach like Jerome describes it would be nice if was a
general facility and not buried in kvm.

I know past xpmem adventures ran into trouble with locking/etc - ie
getting the mm_struct of the victim seemed a bit hard for some reason,
but maybe that could be done with a FD pass 'ioctl(I AM THE VICITM)' ?

Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR
  2019-08-15  6:36     ` Nicusor CITU
@ 2019-08-19 18:36       ` Sean Christopherson
  2019-08-20  8:44         ` Nicusor CITU
  0 siblings, 1 reply; 158+ messages in thread
From: Sean Christopherson @ 2019-08-19 18:36 UTC (permalink / raw)
  To: Nicusor CITU
  Cc: Adalbert Lazăr, kvm, linux-mm, virtualization,
	Paolo Bonzini, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu

On Thu, Aug 15, 2019 at 06:36:44AM +0000, Nicusor CITU wrote:
> > > +	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr,
> > > +				bool enable);
> > 
> > This should be toggle_wrmsr_intercept(), or toggle_msr_intercept()
> > with a paramter to control RDMSR vs. WRMSR.
> 
> Ok, I can do that.
> 
> 
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 6450c8c44771..0306c7ef3158 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -7784,6 +7784,15 @@ static __exit void hardware_unsetup(void)
> > >  	free_kvm_area();
> > >  }
> > >  
> > > +static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int
> > > msr,
> > > +			      bool enable)
> > > +{
> > > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > +	unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;

Is KVMI intended to play nice with nested virtualization?  Unconditionally
updating vmcs01.msr_bitmap is correct regardless of whether the vCPU is in
L1 or L2, but if the vCPU is currently in L2 then the effective bitmap,
i.e. vmcs02.msr_bitmap, won't be updated until the next nested VM-Enter.

> > > +
> > > +	vmx_set_intercept_for_msr(msr_bitmap, msr, MSR_TYPE_W, enable);
> > > +}
> > 
> > Unless I overlooked a check, this will allow userspace to disable
> > WRMSR interception for any MSR in the above range, i.e. userspace can
> > use KVM to gain full write access to pretty much all the interesting
> > MSRs. This needs to only disable interception if KVM had interception
> > disabled before introspection started modifying state.
> 
> We only need to enable the MSR interception. We never disable it -
> please see kvmi_arch_cmd_control_msr().

In that case, drop @enable and use enable_wrmsr_intercept() or something
along those lines for kvm_x86_ops instead of toggle_wrmsr_intercept().

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR
  2019-08-09 16:00 ` [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR Adalbert Lazăr
  2019-08-12 21:05   ` Sean Christopherson
@ 2019-08-19 18:52   ` Sean Christopherson
  1 sibling, 0 replies; 158+ messages in thread
From: Sean Christopherson @ 2019-08-19 18:52 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu

On Fri, Aug 09, 2019 at 07:00:10PM +0300, Adalbert Lazăr wrote:
> +int kvmi_arch_cmd_control_msr(struct kvm_vcpu *vcpu,
> +			      const struct kvmi_control_msr *req)
> +{
> +	int err;
> +
> +	if (req->padding1 || req->padding2)
> +		return -KVM_EINVAL;
> +
> +	err = msr_control(vcpu, req->msr, req->enable);
> +
> +	if (!err && req->enable)

This needs a comment explaining that it intentionally calls into arch
code only for the enable case so as to avoid having to deal with tracking
whether or not it's safe to disable interception.  At first (and second)
glance it look like KVM is silently ignoring the @enable=false case.

> +		kvm_arch_msr_intercept(vcpu, req->msr, req->enable);

Renaming to kvm_arch_enable_msr_intercept() would also help communicate
that KVMI can't be used to disable msr interception.  The function can
always be renamed if someone takes on the task of enhancing the arch code
to handling disabling interception.

> +
> +	return err;
> +}

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR
  2019-08-19 18:36       ` Sean Christopherson
@ 2019-08-20  8:44         ` Nicusor CITU
  2019-08-20 11:43           ` Mihai Donțu
  0 siblings, 1 reply; 158+ messages in thread
From: Nicusor CITU @ 2019-08-20  8:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Adalbert Lazăr, kvm, linux-mm, virtualization,
	Paolo Bonzini, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C, Mihai Donțu

> > > > +static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned
> > > > int
> > > > msr,
> > > > +			      bool enable)
> > > > +{
> > > > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > > +	unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
> 
> Is KVMI intended to play nice with nested
> virtualization?  Unconditionally
> updating vmcs01.msr_bitmap is correct regardless of whether the vCPU
> is in
> L1 or L2, but if the vCPU is currently in L2 then the effective
> bitmap,
> i.e. vmcs02.msr_bitmap, won't be updated until the next nested VM-
> Enter.

Our initial proof of concept was running with success in nested
virtualization. But most of our tests were done on bare-metal.
We do however intend to make it fully functioning on nested systems
too.

Even thought, from KVMI point of view, the MSR interception
configuration would be just fine if it gets updated before the vcpu is
actually entering to nested VM.


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR
  2019-08-20  8:44         ` Nicusor CITU
@ 2019-08-20 11:43           ` Mihai Donțu
  2019-08-21 15:18             ` Sean Christopherson
  0 siblings, 1 reply; 158+ messages in thread
From: Mihai Donțu @ 2019-08-20 11:43 UTC (permalink / raw)
  To: Nicusor CITU, Sean Christopherson
  Cc: Adalbert Lazăr, kvm, linux-mm, virtualization,
	Paolo Bonzini, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C

On Tue, 2019-08-20 at 08:44 +0000, Nicusor CITU wrote:
> > > > > +static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned
> > > > > int
> > > > > msr,
> > > > > +			      bool enable)
> > > > > +{
> > > > > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > > > +	unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
> > 
> > Is KVMI intended to play nice with nested virtualization? Unconditionally
> > updating vmcs01.msr_bitmap is correct regardless of whether the vCPU
> > is in L1 or L2, but if the vCPU is currently in L2 then the effective
> > bitmap, i.e. vmcs02.msr_bitmap, won't be updated until the next nested VM-
> > Enter.
> 
> Our initial proof of concept was running with success in nested
> virtualization. But most of our tests were done on bare-metal.
> We do however intend to make it fully functioning on nested systems
> too.
> 
> Even thought, from KVMI point of view, the MSR interception
> configuration would be just fine if it gets updated before the vcpu is
> actually entering to nested VM.
> 

I believe Sean is referring here to the case where the guest being
introspected is a hypervisor (eg. Windows 10 with device guard).

Even though we are looking at how to approach this scenario, the
introspection tools we have built will refuse to attach to a
hypervisor.

Regards,

-- 
Mihai Donțu



^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR
  2019-08-20 11:43           ` Mihai Donțu
@ 2019-08-21 15:18             ` Sean Christopherson
  0 siblings, 0 replies; 158+ messages in thread
From: Sean Christopherson @ 2019-08-21 15:18 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Nicusor CITU, Adalbert Lazăr, kvm, linux-mm, virtualization,
	Paolo Bonzini, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Zhang, Yu C

On Tue, Aug 20, 2019 at 02:43:32PM +0300, Mihai Donțu wrote:
> On Tue, 2019-08-20 at 08:44 +0000, Nicusor CITU wrote:
> > > > > > +static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned
> > > > > > int
> > > > > > msr,
> > > > > > +			      bool enable)
> > > > > > +{
> > > > > > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > > > > +	unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
> > > 
> > > Is KVMI intended to play nice with nested virtualization? Unconditionally
> > > updating vmcs01.msr_bitmap is correct regardless of whether the vCPU
> > > is in L1 or L2, but if the vCPU is currently in L2 then the effective
> > > bitmap, i.e. vmcs02.msr_bitmap, won't be updated until the next nested VM-
> > > Enter.
> > 
> > Our initial proof of concept was running with success in nested
> > virtualization. But most of our tests were done on bare-metal.
> > We do however intend to make it fully functioning on nested systems
> > too.
> > 
> > Even thought, from KVMI point of view, the MSR interception
> > configuration would be just fine if it gets updated before the vcpu is
> > actually entering to nested VM.
> > 
> 
> I believe Sean is referring here to the case where the guest being
> introspected is a hypervisor (eg. Windows 10 with device guard).

Yep.

> Even though we are looking at how to approach this scenario, the
> introspection tools we have built will refuse to attach to a
> hypervisor.

In that case, it's probably a good idea to make KVMI mutually exclusive
with nested virtualization.  Doing so should, in theory, simplify the
implementation and expedite upstreaming, e.g. reviewers don't have to
nitpick edge cases related to nested virt.  My only hesitation in
disabling KVMI when nested virt is enabled is that it could make it much
more difficult to (re)enable the combination in the future.

^ permalink raw reply	[flat|nested] 158+ messages in thread

* RE: DANGER WILL ROBINSON, DANGER
  2019-08-15 20:16         ` Jerome Glisse
  2019-08-16 17:45           ` Jason Gunthorpe
@ 2019-08-23 12:39           ` Mircea CIRJALIU - MELIU
  2019-09-05 18:09             ` Jerome Glisse
  1 sibling, 1 reply; 158+ messages in thread
From: Mircea CIRJALIU - MELIU @ 2019-08-23 12:39 UTC (permalink / raw)
  To: Jerome Glisse, Adalbert Lazăr
  Cc: Matthew Wilcox, kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

> On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote:
> > On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> > > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox <willy@infradead.org>
> wrote:
> > > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > > > +++ b/include/linux/page-flags.h
> > > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > > > >   */
> > > > >  #define PAGE_MAPPING_ANON	0x1
> > > > >  #define PAGE_MAPPING_MOVABLE	0x2
> > > > > +#define PAGE_MAPPING_REMOTE	0x4
> > > >
> > > > Uh.  How do you know page->mapping would otherwise have bit 2
> clear?
> > > > Who's guaranteeing that?
> > > >
> > > > This is an awfully big patch to the memory management code, buried
> > > > in the middle of a gigantic series which almost guarantees nobody
> > > > would look at it.  I call shenanigans.
> > > >
> > > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page
> *page, struct vm_area_struct *vma)
> > > > >   * __page_set_anon_rmap - set up new anonymous rmap
> > > > >   * @page:	Page or Hugepage to add to rmap
> > > > >   * @vma:	VM area to add page to.
> > > > > - * @address:	User virtual address of the mapping
> > > > > + * @address:	User virtual address of the mapping
> > > >
> > > > And mixing in fluff changes like this is a real no-no.  Try again.
> > > >
> > >
> > > No bad intentions, just overzealous.
> > > I didn't want to hide anything from our patches.
> > > Once we advance with the introspection patches related to KVM we'll
> > > be back with the remote mapping patch, split and cleaned.
> >
> > They are not bit left in struct page ! Looking at the patch it seems
> > you want to have your own pin count just for KVM. This is bad, we are
> > already trying to solve the GUP thing (see all various patchset about
> > GUP posted recently).
> >
> > You need to rethink how you want to achieve this. Why not simply a
> > remote read()/write() into the process memory ie KVMI would call an
> > ioctl that allow to read or write into a remote process memory like
> > ptrace() but on steroid ...
> >
> > Adding this whole big complex infrastructure without justification of
> > why we need to avoid round trip is just too much really.
> 
> Thinking a bit more about this, you can achieve the same thing without
> adding a single line to any mm code. Instead of having mmap with
> PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device
> file (i am assuming this is something you already have and can control the
> mmap callback).
> 
> So now kernel side you have a vma with a vm_operations_struct under your
> control this means that everything you want to block mm wise from within
> the inspector process can be block through those call- backs
> (find_special_page() specificaly for which you have to return NULL all the
> time).
> 
> To mirror target process memory you can use hmm_mirror, when you
> populate the inspector process page table you use insert_pfn() (mmap of
> the kvm device file must mark this vma as PFNMAP).
> 
> By following the hmm_mirror API, anytime the target process has a change in
> its page table (ie virtual address -> page) you will get a callback and all you
> have to do is clear the page table within the inspector process and flush tlb
> (use zap_page_range).
> 
> On page fault within the inspector process the fault callback of vm_ops will
> get call and from there you call hmm_mirror following its API.
> 
> Oh also mark the vma with VM_WIPEONFORK to avoid any issue if the
> inspector process use fork() (you could support fork but then you would
> need to mark the vma as SHARED and use unmap_mapping_pages instead of
> zap_page_range).
> 
> 
> There everything you want to do with already upstream mm code.

I'm the author of remote mapping, so I owe everybody some explanations.
My requirement was to map pages from one QEMU process to another QEMU 
process (our inspector process works in a virtual machine of its own). So I had 
to implement a KSM-like page sharing between processes, where an anon page
from the target QEMU's working memory is promoted to a remote page and 
mapped in the inspector QEMU's working memory (both anon VMAs). 
The extra page flag is for differentiating the page for rmap walking.

The mapping requests come at PAGE_SIZE granularity for random addresses 
within the target/inspector QEMUs, so I couldn't do any linear mapping that
would keep things simpler. 

I have an extra patch that does remote mapping by mirroring an entire VMA
from the target process by way of a device file. This thing creates a separate 
mirror VMA in my inspector process (at the moment a QEMU), but then I 
bumped into the KVM hva->gpa mapping, which makes it hard to override 
mappings with addresses outside memslot associated VMAs.

Mircea

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-08-23 12:39           ` Mircea CIRJALIU - MELIU
@ 2019-09-05 18:09             ` Jerome Glisse
  2019-09-09 17:00               ` Paolo Bonzini
  0 siblings, 1 reply; 158+ messages in thread
From: Jerome Glisse @ 2019-09-05 18:09 UTC (permalink / raw)
  To: Mircea CIRJALIU - MELIU
  Cc: Adalbert Lazăr, Matthew Wilcox, kvm, linux-mm,
	virtualization, Paolo Bonzini, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On Fri, Aug 23, 2019 at 12:39:21PM +0000, Mircea CIRJALIU - MELIU wrote:
> > On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote:
> > > On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> > > > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox <willy@infradead.org>
> > wrote:
> > > > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > > > > +++ b/include/linux/page-flags.h
> > > > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > > > > >   */
> > > > > >  #define PAGE_MAPPING_ANON	0x1
> > > > > >  #define PAGE_MAPPING_MOVABLE	0x2
> > > > > > +#define PAGE_MAPPING_REMOTE	0x4
> > > > >
> > > > > Uh.  How do you know page->mapping would otherwise have bit 2
> > clear?
> > > > > Who's guaranteeing that?
> > > > >
> > > > > This is an awfully big patch to the memory management code, buried
> > > > > in the middle of a gigantic series which almost guarantees nobody
> > > > > would look at it.  I call shenanigans.
> > > > >
> > > > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page
> > *page, struct vm_area_struct *vma)
> > > > > >   * __page_set_anon_rmap - set up new anonymous rmap
> > > > > >   * @page:	Page or Hugepage to add to rmap
> > > > > >   * @vma:	VM area to add page to.
> > > > > > - * @address:	User virtual address of the mapping
> > > > > > + * @address:	User virtual address of the mapping
> > > > >
> > > > > And mixing in fluff changes like this is a real no-no.  Try again.
> > > > >
> > > >
> > > > No bad intentions, just overzealous.
> > > > I didn't want to hide anything from our patches.
> > > > Once we advance with the introspection patches related to KVM we'll
> > > > be back with the remote mapping patch, split and cleaned.
> > >
> > > They are not bit left in struct page ! Looking at the patch it seems
> > > you want to have your own pin count just for KVM. This is bad, we are
> > > already trying to solve the GUP thing (see all various patchset about
> > > GUP posted recently).
> > >
> > > You need to rethink how you want to achieve this. Why not simply a
> > > remote read()/write() into the process memory ie KVMI would call an
> > > ioctl that allow to read or write into a remote process memory like
> > > ptrace() but on steroid ...
> > >
> > > Adding this whole big complex infrastructure without justification of
> > > why we need to avoid round trip is just too much really.
> > 
> > Thinking a bit more about this, you can achieve the same thing without
> > adding a single line to any mm code. Instead of having mmap with
> > PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device
> > file (i am assuming this is something you already have and can control the
> > mmap callback).
> > 
> > So now kernel side you have a vma with a vm_operations_struct under your
> > control this means that everything you want to block mm wise from within
> > the inspector process can be block through those call- backs
> > (find_special_page() specificaly for which you have to return NULL all the
> > time).
> > 
> > To mirror target process memory you can use hmm_mirror, when you
> > populate the inspector process page table you use insert_pfn() (mmap of
> > the kvm device file must mark this vma as PFNMAP).
> > 
> > By following the hmm_mirror API, anytime the target process has a change in
> > its page table (ie virtual address -> page) you will get a callback and all you
> > have to do is clear the page table within the inspector process and flush tlb
> > (use zap_page_range).
> > 
> > On page fault within the inspector process the fault callback of vm_ops will
> > get call and from there you call hmm_mirror following its API.
> > 
> > Oh also mark the vma with VM_WIPEONFORK to avoid any issue if the
> > inspector process use fork() (you could support fork but then you would
> > need to mark the vma as SHARED and use unmap_mapping_pages instead of
> > zap_page_range).
> > 
> > 
> > There everything you want to do with already upstream mm code.
> 
> I'm the author of remote mapping, so I owe everybody some explanations.
> My requirement was to map pages from one QEMU process to another QEMU 
> process (our inspector process works in a virtual machine of its own). So I had 
> to implement a KSM-like page sharing between processes, where an anon page
> from the target QEMU's working memory is promoted to a remote page and 
> mapped in the inspector QEMU's working memory (both anon VMAs). 
> The extra page flag is for differentiating the page for rmap walking.
> 
> The mapping requests come at PAGE_SIZE granularity for random addresses 
> within the target/inspector QEMUs, so I couldn't do any linear mapping that
> would keep things simpler. 
> 
> I have an extra patch that does remote mapping by mirroring an entire VMA
> from the target process by way of a device file. This thing creates a separate 
> mirror VMA in my inspector process (at the moment a QEMU), but then I 
> bumped into the KVM hva->gpa mapping, which makes it hard to override 
> mappings with addresses outside memslot associated VMAs.

Not sure i understand, you are saying that the solution i outline above
does not work ? If so then i think you are wrong, in the above solution
the importing process mmap a device file and the resulting vma is then
populated using insert_pfn() and constantly keep synchronize with the
target process through mirroring which means that you never have to look
at the struct page ... you can mirror any kind of memory from the remote
process.

Am i miss-understanding something here ?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-09-05 18:09             ` Jerome Glisse
@ 2019-09-09 17:00               ` Paolo Bonzini
  2019-09-10  7:49                 ` Mircea CIRJALIU - MELIU
  0 siblings, 1 reply; 158+ messages in thread
From: Paolo Bonzini @ 2019-09-09 17:00 UTC (permalink / raw)
  To: Jerome Glisse, Mircea CIRJALIU - MELIU
  Cc: Adalbert Lazăr, Matthew Wilcox, kvm, linux-mm,
	virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On 05/09/19 20:09, Jerome Glisse wrote:
> Not sure i understand, you are saying that the solution i outline above
> does not work ? If so then i think you are wrong, in the above solution
> the importing process mmap a device file and the resulting vma is then
> populated using insert_pfn() and constantly keep synchronize with the
> target process through mirroring which means that you never have to look
> at the struct page ... you can mirror any kind of memory from the remote
> process.

If insert_pfn in turn calls MMU notifiers for the target VMA (which
would be the KVM MMU notifier), then that would work.  Though I guess it
would be possible to call MMU notifier update callbacks around the call
to insert_pfn.

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* RE: DANGER WILL ROBINSON, DANGER
  2019-09-09 17:00               ` Paolo Bonzini
@ 2019-09-10  7:49                 ` Mircea CIRJALIU - MELIU
  2019-10-02 19:27                   ` Jerome Glisse
  0 siblings, 1 reply; 158+ messages in thread
From: Mircea CIRJALIU - MELIU @ 2019-09-10  7:49 UTC (permalink / raw)
  To: Paolo Bonzini, Jerome Glisse
  Cc: Adalbert Lazăr, Matthew Wilcox, kvm, linux-mm,
	virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

> On 05/09/19 20:09, Jerome Glisse wrote:
> > Not sure i understand, you are saying that the solution i outline
> > above does not work ? If so then i think you are wrong, in the above
> > solution the importing process mmap a device file and the resulting
> > vma is then populated using insert_pfn() and constantly keep
> > synchronize with the target process through mirroring which means that
> > you never have to look at the struct page ... you can mirror any kind
> > of memory from the remote process.
> 
> If insert_pfn in turn calls MMU notifiers for the target VMA (which would be
> the KVM MMU notifier), then that would work.  Though I guess it would be
> possible to call MMU notifier update callbacks around the call to insert_pfn.

Can't do that.
First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on
the target VMA. It's also static, so I'll have to access it thru vmf_insert_pfn()
or vmf_insert_mixed().

Our model (the importing process is encapsulated in another VM) forces us
to mirror certain pages from the anon VMA backing one VM's system RAM to 
the other VM's anon VMA. 

Using the functions above means setting VM_PFNMAP|VM_MIXEDMAP on 
the target anon VMA, but I guess this breaks the VMA. Is this recommended?

Then, mapping anon pages from one VMA to another without fixing the 
refcount and the mapcount breaks the daemons that think they're working 
on a pure anon VMA (kcompactd, khugepaged).

Mircea

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 69/92] kvm: x86: keep the page protected if tracked by the introspection tool
  2019-08-09 16:00 ` [RFC PATCH v6 69/92] kvm: x86: keep the page protected if tracked by the introspection tool Adalbert Lazăr
@ 2019-09-10 14:26   ` Konrad Rzeszutek Wilk
  2019-09-10 16:28     ` Adalbert Lazăr
  0 siblings, 1 reply; 158+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-09-10 14:26 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Tamas K Lengyel, Mathieu Tarral, Samuel Laurén,
	Patrick Colp, Jan Kiszka, Stefan Hajnoczi, Weijiang Yang, Zhang,
	Yu C, Mihai Donțu

On Fri, Aug 09, 2019 at 07:00:24PM +0300, Adalbert Lazăr wrote:
> This patch might be obsolete thanks to single-stepping.

sooo should it be skipped from this large patchset to easy
review?

> 
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> ---
>  arch/x86/kvm/x86.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2c06de73a784..06f44ce8ed07 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6311,7 +6311,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
>  		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
>  		spin_unlock(&vcpu->kvm->mmu_lock);
>  
> -		if (indirect_shadow_pages)
> +		if (indirect_shadow_pages
> +		    && !kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa)))
>  			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
>  
>  		return true;
> @@ -6322,7 +6323,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
>  	 * and it failed try to unshadow page and re-enter the
>  	 * guest to let CPU execute the instruction.
>  	 */
> -	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
> +	if (!kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa)))
> +		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
>  
>  	/*
>  	 * If the access faults on its page table, it can not
> @@ -6374,6 +6376,9 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
>  	if (!vcpu->arch.mmu->direct_map)
>  		gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
>  
> +	if (kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa)))
> +		return false;
> +
>  	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
>  
>  	return true;

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [RFC PATCH v6 69/92] kvm: x86: keep the page protected if tracked by the introspection tool
  2019-09-10 14:26   ` Konrad Rzeszutek Wilk
@ 2019-09-10 16:28     ` Adalbert Lazăr
  0 siblings, 0 replies; 158+ messages in thread
From: Adalbert Lazăr @ 2019-09-10 16:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: kvm, linux-mm, virtualization, Paolo Bonzini,
	Radim Krčmář,
	Tamas K Lengyel, Mathieu Tarral, Samuel Laurén,
	Patrick Colp, Jan Kiszka, Stefan Hajnoczi, Weijiang Yang, Yu C,
	Mihai Donțu

On Tue, 10 Sep 2019 10:26:42 -0400, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Fri, Aug 09, 2019 at 07:00:24PM +0300, Adalbert Lazăr wrote:
> > This patch might be obsolete thanks to single-stepping.
> 
> sooo should it be skipped from this large patchset to easy
> review?

I'll add a couple of warning messages to check if this patch is still
needed, in order to skip it from the next submission (which will be smaller:)

However, on AMD, single-stepping is not an option.

Thanks,
Adalbert

> 
> > 
> > Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> > ---
> >  arch/x86/kvm/x86.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 2c06de73a784..06f44ce8ed07 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -6311,7 +6311,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
> >  		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> >  		spin_unlock(&vcpu->kvm->mmu_lock);
> >  
> > -		if (indirect_shadow_pages)
> > +		if (indirect_shadow_pages
> > +		    && !kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa)))
> >  			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
> >  
> >  		return true;
> > @@ -6322,7 +6323,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
> >  	 * and it failed try to unshadow page and re-enter the
> >  	 * guest to let CPU execute the instruction.
> >  	 */
> > -	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
> > +	if (!kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa)))
> > +		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
> >  
> >  	/*
> >  	 * If the access faults on its page table, it can not
> > @@ -6374,6 +6376,9 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
> >  	if (!vcpu->arch.mmu->direct_map)
> >  		gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
> >  
> > +	if (kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa)))
> > +		return false;
> > +
> >  	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
> >  
> >  	return true;

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-10-02 19:27                   ` Jerome Glisse
@ 2019-10-02 13:46                     ` Paolo Bonzini
  2019-10-02 14:15                       ` Jerome Glisse
  0 siblings, 1 reply; 158+ messages in thread
From: Paolo Bonzini @ 2019-10-02 13:46 UTC (permalink / raw)
  To: Jerome Glisse, Mircea CIRJALIU - MELIU
  Cc: Adalbert Lazăr, Matthew Wilcox, kvm, linux-mm,
	virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On 02/10/19 21:27, Jerome Glisse wrote:
> On Tue, Sep 10, 2019 at 07:49:51AM +0000, Mircea CIRJALIU - MELIU wrote:
>>> On 05/09/19 20:09, Jerome Glisse wrote:
>>>> Not sure i understand, you are saying that the solution i outline
>>>> above does not work ? If so then i think you are wrong, in the above
>>>> solution the importing process mmap a device file and the resulting
>>>> vma is then populated using insert_pfn() and constantly keep
>>>> synchronize with the target process through mirroring which means that
>>>> you never have to look at the struct page ... you can mirror any kind
>>>> of memory from the remote process.
>>>
>>> If insert_pfn in turn calls MMU notifiers for the target VMA (which would be
>>> the KVM MMU notifier), then that would work.  Though I guess it would be
>>> possible to call MMU notifier update callbacks around the call to insert_pfn.
>>
>> Can't do that.
>> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on
>> the target VMA. It's also static, so I'll have to access it thru vmf_insert_pfn()
>> or vmf_insert_mixed().
> 
> Why would you need to target mmu notifier on target vma ?

If the mapping of the source VMA changes, mirroring can update the
target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
dismantles its own existing page tables (so that they can be recreated
with the new mapping from the source VMA)?

Thanks,

Paolo

> You do not need
> that. The workflow is:
> 
>     userspace:
>         ptr = mmap(/dev/kvm-mirroring-device, virtual_addresse_of_target)
> 
> Then when the mirroring process access ptr it triggers page fault that
> endup in the vm_operation_struct->fault() which is just doing:
> 
>     kernel-kvm-mirroring-function:
>         kvm_mirror_page_fault(struct vm_fault *vmf) {
>             struct kvm_mirror_struct *kvmms;
> 
>             kvmms = kvm_mirror_struct_from_file(vmf->vma->vm_file);
>             ...
>         again:
>             hmm_range_register(&range);
>             hmm_range_snapshot(&range);
>             take_lock(kvmms->update);
>             if (!hmm_range_valid(&range)) {
>                 vm_insert_pfn();
>                 drop_lock(kvmms->update);
>                 hmm_range_unregister(&range);
>                 return VM_FAULT_NOPAGE;
>             }
>             drop_lock(kvmms->update);
>             goto again;
>         }
> 
> The notifier callback:
>         kvmms_notifier_start() {
>             take_lock(kvmms->update);
>             clear_pte(start, end);
>             drop_lock(kvmms->update);
>         }
> 
>>
>> Our model (the importing process is encapsulated in another VM) forces us
>> to mirror certain pages from the anon VMA backing one VM's system RAM to 
>> the other VM's anon VMA. 
> 
> The mirror does not have to be an anon vma it can very well be a
> device vma ie mmap of a device file. I do not see any reasons why
> the mirror need to be an anon vma. Please explain why.
> 
>>
>> Using the functions above means setting VM_PFNMAP|VM_MIXEDMAP on 
>> the target anon VMA, but I guess this breaks the VMA. Is this recommended?
> 
> The mirror vma should not be an anon vma.
> 
>>
>> Then, mapping anon pages from one VMA to another without fixing the 
>> refcount and the mapcount breaks the daemons that think they're working 
>> on a pure anon VMA (kcompactd, khugepaged).
> 
> Note here the target vma ie the mirroring one is a mmap of device file
> and thus is skip by all of the above (kcompactd, khugepaged, ...) it is
> fully ignore by core mm.
> 
> Thus you do not need to fix the refcount in any way. If any of the core
> mm try to reclaim memory from the original vma then you will get mmu
> notifier callbacks and all you have to do is clear the page table of your
> device vma.
> 
> I did exactly that as a tools in the past and it works just fine with
> no change to core mm whatsoever.
> 
> Cheers,
> Jérôme
> 


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-10-02 13:46                     ` Paolo Bonzini
@ 2019-10-02 14:15                       ` Jerome Glisse
  2019-10-02 16:18                         ` Paolo Bonzini
  0 siblings, 1 reply; 158+ messages in thread
From: Jerome Glisse @ 2019-10-02 14:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Mircea CIRJALIU - MELIU, Adalbert Lazăr, Matthew Wilcox,
	kvm, linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On Wed, Oct 02, 2019 at 03:46:30PM +0200, Paolo Bonzini wrote:
> On 02/10/19 21:27, Jerome Glisse wrote:
> > On Tue, Sep 10, 2019 at 07:49:51AM +0000, Mircea CIRJALIU - MELIU wrote:
> >>> On 05/09/19 20:09, Jerome Glisse wrote:
> >>>> Not sure i understand, you are saying that the solution i outline
> >>>> above does not work ? If so then i think you are wrong, in the above
> >>>> solution the importing process mmap a device file and the resulting
> >>>> vma is then populated using insert_pfn() and constantly keep
> >>>> synchronize with the target process through mirroring which means that
> >>>> you never have to look at the struct page ... you can mirror any kind
> >>>> of memory from the remote process.
> >>>
> >>> If insert_pfn in turn calls MMU notifiers for the target VMA (which would be
> >>> the KVM MMU notifier), then that would work.  Though I guess it would be
> >>> possible to call MMU notifier update callbacks around the call to insert_pfn.
> >>
> >> Can't do that.
> >> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on
> >> the target VMA. It's also static, so I'll have to access it thru vmf_insert_pfn()
> >> or vmf_insert_mixed().
> > 
> > Why would you need to target mmu notifier on target vma ?
> 
> If the mapping of the source VMA changes, mirroring can update the
> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
> dismantles its own existing page tables (so that they can be recreated
> with the new mapping from the source VMA)?
> 

So just to make sure i follow we have:
      - qemu process on host with anonymous vma
            -> host cpu page table
      - kvm which maps host anonymous vma to guest
            -> kvm guest page table
      - kvm inspector process which mirror vma from qemu process
            -> inspector process page table

AFAIK the KVM notifier's will clear the kvm guest page table whenever
necessary (through kvm_mmu_notifier_invalidate_range_start). This is
what ensure that KVM's dismatles its own mapping, it abides to mmu-
notifier callbacks. If you did not you would have bugs (at least i
expect so). Am i wrong here ?

The mirroring kernel driver would also register the notifier against
the quemu process and would also abide to notifier callbacks.

What you want to maintain at all times is that none of the actors
above ever look at different page for the same virtual address (ie
one looking at older page while another look at new page).

This is where you have helper like HMM that make sure that you can
not populate the mirroring vma while a notifier is on going. Which
means that everything is serialize on the notifier.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-10-02 14:15                       ` Jerome Glisse
@ 2019-10-02 16:18                         ` Paolo Bonzini
  2019-10-02 17:04                           ` Jerome Glisse
  0 siblings, 1 reply; 158+ messages in thread
From: Paolo Bonzini @ 2019-10-02 16:18 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Mircea CIRJALIU - MELIU, Adalbert Lazăr, Matthew Wilcox,
	kvm, linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On 02/10/19 16:15, Jerome Glisse wrote:
>>> Why would you need to target mmu notifier on target vma ?
>> If the mapping of the source VMA changes, mirroring can update the
>> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
>> dismantles its own existing page tables (so that they can be recreated
>> with the new mapping from the source VMA)?
>>
> So just to make sure i follow we have:
>       - qemu process on host with anonymous vma
>             -> host cpu page table
>       - kvm which maps host anonymous vma to guest
>             -> kvm guest page table
>       - kvm inspector process which mirror vma from qemu process
>             -> inspector process page table
> 
> AFAIK the KVM notifier's will clear the kvm guest page table whenever
> necessary (through kvm_mmu_notifier_invalidate_range_start). This is
> what ensure that KVM's dismatles its own mapping, it abides to mmu-
> notifier callbacks. If you did not you would have bugs (at least i
> expect so). Am i wrong here ?

The KVM inspector process is also (or can be) a QEMU that will have to
create its own KVM guest page table.

So if a page in the source VMA is unmapped we want:

- the source KVM to invalidate its guest page table (done by the KVM MMU
notifier)

- the target VMA to be invalidated (easy using mirroring)

- the target KVM to invalidate its guest page table, as a result of
invalidation of the target VMA

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-10-02 16:18                         ` Paolo Bonzini
@ 2019-10-02 17:04                           ` Jerome Glisse
  2019-10-02 20:10                             ` Paolo Bonzini
  0 siblings, 1 reply; 158+ messages in thread
From: Jerome Glisse @ 2019-10-02 17:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Mircea CIRJALIU - MELIU, Adalbert Lazăr, Matthew Wilcox,
	kvm, linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote:
> On 02/10/19 16:15, Jerome Glisse wrote:
> >>> Why would you need to target mmu notifier on target vma ?
> >> If the mapping of the source VMA changes, mirroring can update the
> >> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
> >> dismantles its own existing page tables (so that they can be recreated
> >> with the new mapping from the source VMA)?
> >>
> > So just to make sure i follow we have:
> >       - qemu process on host with anonymous vma
> >             -> host cpu page table
> >       - kvm which maps host anonymous vma to guest
> >             -> kvm guest page table
> >       - kvm inspector process which mirror vma from qemu process
> >             -> inspector process page table
> > 
> > AFAIK the KVM notifier's will clear the kvm guest page table whenever
> > necessary (through kvm_mmu_notifier_invalidate_range_start). This is
> > what ensure that KVM's dismatles its own mapping, it abides to mmu-
> > notifier callbacks. If you did not you would have bugs (at least i
> > expect so). Am i wrong here ?
> 
> The KVM inspector process is also (or can be) a QEMU that will have to
> create its own KVM guest page table.

Ok missed that part, thank you for explaining

> 
> So if a page in the source VMA is unmapped we want:
> 
> - the source KVM to invalidate its guest page table (done by the KVM MMU
> notifier)
> 
> - the target VMA to be invalidated (easy using mirroring)
> 
> - the target KVM to invalidate its guest page table, as a result of
> invalidation of the target VMA

You can do the target KVM invalidation inside the mirroring invalidation
code.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-09-10  7:49                 ` Mircea CIRJALIU - MELIU
@ 2019-10-02 19:27                   ` Jerome Glisse
  2019-10-02 13:46                     ` Paolo Bonzini
  0 siblings, 1 reply; 158+ messages in thread
From: Jerome Glisse @ 2019-10-02 19:27 UTC (permalink / raw)
  To: Mircea CIRJALIU - MELIU
  Cc: Paolo Bonzini, Adalbert Lazăr, Matthew Wilcox, kvm,
	linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On Tue, Sep 10, 2019 at 07:49:51AM +0000, Mircea CIRJALIU - MELIU wrote:
> > On 05/09/19 20:09, Jerome Glisse wrote:
> > > Not sure i understand, you are saying that the solution i outline
> > > above does not work ? If so then i think you are wrong, in the above
> > > solution the importing process mmap a device file and the resulting
> > > vma is then populated using insert_pfn() and constantly keep
> > > synchronize with the target process through mirroring which means that
> > > you never have to look at the struct page ... you can mirror any kind
> > > of memory from the remote process.
> > 
> > If insert_pfn in turn calls MMU notifiers for the target VMA (which would be
> > the KVM MMU notifier), then that would work.  Though I guess it would be
> > possible to call MMU notifier update callbacks around the call to insert_pfn.
> 
> Can't do that.
> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on
> the target VMA. It's also static, so I'll have to access it thru vmf_insert_pfn()
> or vmf_insert_mixed().

Why would you need to target mmu notifier on target vma ? You do not need
that. The workflow is:

    userspace:
        ptr = mmap(/dev/kvm-mirroring-device, virtual_addresse_of_target)

Then when the mirroring process access ptr it triggers page fault that
endup in the vm_operation_struct->fault() which is just doing:

    kernel-kvm-mirroring-function:
        kvm_mirror_page_fault(struct vm_fault *vmf) {
            struct kvm_mirror_struct *kvmms;

            kvmms = kvm_mirror_struct_from_file(vmf->vma->vm_file);
            ...
        again:
            hmm_range_register(&range);
            hmm_range_snapshot(&range);
            take_lock(kvmms->update);
            if (!hmm_range_valid(&range)) {
                vm_insert_pfn();
                drop_lock(kvmms->update);
                hmm_range_unregister(&range);
                return VM_FAULT_NOPAGE;
            }
            drop_lock(kvmms->update);
            goto again;
        }

The notifier callback:
        kvmms_notifier_start() {
            take_lock(kvmms->update);
            clear_pte(start, end);
            drop_lock(kvmms->update);
        }

> 
> Our model (the importing process is encapsulated in another VM) forces us
> to mirror certain pages from the anon VMA backing one VM's system RAM to 
> the other VM's anon VMA. 

The mirror does not have to be an anon vma it can very well be a
device vma ie mmap of a device file. I do not see any reasons why
the mirror need to be an anon vma. Please explain why.

> 
> Using the functions above means setting VM_PFNMAP|VM_MIXEDMAP on 
> the target anon VMA, but I guess this breaks the VMA. Is this recommended?

The mirror vma should not be an anon vma.

> 
> Then, mapping anon pages from one VMA to another without fixing the 
> refcount and the mapcount breaks the daemons that think they're working 
> on a pure anon VMA (kcompactd, khugepaged).

Note here the target vma ie the mirroring one is a mmap of device file
and thus is skip by all of the above (kcompactd, khugepaged, ...) it is
fully ignore by core mm.

Thus you do not need to fix the refcount in any way. If any of the core
mm try to reclaim memory from the original vma then you will get mmu
notifier callbacks and all you have to do is clear the page table of your
device vma.

I did exactly that as a tools in the past and it works just fine with
no change to core mm whatsoever.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-10-02 17:04                           ` Jerome Glisse
@ 2019-10-02 20:10                             ` Paolo Bonzini
  2019-10-03 15:42                               ` Jerome Glisse
  2019-10-03 16:36                               ` Mircea CIRJALIU - MELIU
  0 siblings, 2 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-10-02 20:10 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Mircea CIRJALIU - MELIU, Adalbert Lazăr, Matthew Wilcox,
	kvm, linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On 02/10/19 19:04, Jerome Glisse wrote:
> On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote:
>>>> If the mapping of the source VMA changes, mirroring can update the
>>>> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
>>>> dismantles its own existing page tables (so that they can be recreated
>>>> with the new mapping from the source VMA)?
>>
>> The KVM inspector process is also (or can be) a QEMU that will have to
>> create its own KVM guest page table.  So if a page in the source VMA is
>> unmapped we want:
>>
>> - the source KVM to invalidate its guest page table (done by the KVM MMU
>> notifier)
>>
>> - the target VMA to be invalidated (easy using mirroring)
>>
>> - the target KVM to invalidate its guest page table, as a result of
>> invalidation of the target VMA
> 
> You can do the target KVM invalidation inside the mirroring invalidation
> code.

Why should the source and target KVMs behave differently?  If the source
invalidates its guest page table via MMU notifiers, so should the target.

The KVM MMU notifier exists so that nothing (including mirroring) needs
to know that there is KVM on the other side.  Any interaction between
KVM page tables and VMAs must be mediated by MMU notifiers, anything
else is unacceptable.

If it is possible to invoke the MMU notifiers around the calls to
insert_pfn, that of course would be perfect.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-10-02 20:10                             ` Paolo Bonzini
@ 2019-10-03 15:42                               ` Jerome Glisse
  2019-10-03 15:50                                 ` Paolo Bonzini
  2019-10-03 16:36                               ` Mircea CIRJALIU - MELIU
  1 sibling, 1 reply; 158+ messages in thread
From: Jerome Glisse @ 2019-10-03 15:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Mircea CIRJALIU - MELIU, Adalbert Lazăr, Matthew Wilcox,
	kvm, linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On Wed, Oct 02, 2019 at 10:10:18PM +0200, Paolo Bonzini wrote:
> On 02/10/19 19:04, Jerome Glisse wrote:
> > On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote:
> >>>> If the mapping of the source VMA changes, mirroring can update the
> >>>> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
> >>>> dismantles its own existing page tables (so that they can be recreated
> >>>> with the new mapping from the source VMA)?
> >>
> >> The KVM inspector process is also (or can be) a QEMU that will have to
> >> create its own KVM guest page table.  So if a page in the source VMA is
> >> unmapped we want:
> >>
> >> - the source KVM to invalidate its guest page table (done by the KVM MMU
> >> notifier)
> >>
> >> - the target VMA to be invalidated (easy using mirroring)
> >>
> >> - the target KVM to invalidate its guest page table, as a result of
> >> invalidation of the target VMA
> > 
> > You can do the target KVM invalidation inside the mirroring invalidation
> > code.
> 
> Why should the source and target KVMs behave differently?  If the source
> invalidates its guest page table via MMU notifiers, so should the target.
> 
> The KVM MMU notifier exists so that nothing (including mirroring) needs
> to know that there is KVM on the other side.  Any interaction between
> KVM page tables and VMAs must be mediated by MMU notifiers, anything
> else is unacceptable.
> 
> If it is possible to invoke the MMU notifiers around the calls to
> insert_pfn, that of course would be perfect.

Ok and yes you can do that exactly ie inside the mmu notifier callback
from the target. For instance it is as easy as:
    target_mirror_notifier_start_callback(start, end) {
        struct kvm_mirror_struct *kvmms = from_mmun(...);
        unsigned long target_foff, size;

        size = end - start;
        target_foff = kvmms_convert_mirror_address(start);
        take_lock(kvmms->mirror_fault_exclusion_lock);
        unmap_mapping_range(kvmms->address_space, target_foff, size, 1);
        drop_lock(kvmms->mirror_fault_exclusion_lock);
    }

All that is needed is to make sure that vm_normal_page() will see those
pte (inside the process that is mirroring the other process) as special
which is the case either because insert_pfn() mark the pte as special or
the kvm device driver which control the vm_operation struct set a
find_special_page() callback that always return NULL, or the vma has
either VM_PFNMAP or VM_MIXEDMAP set (which is the case with insert_pfn).

So you can keep the existing kvm code unmodified.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-10-03 15:42                               ` Jerome Glisse
@ 2019-10-03 15:50                                 ` Paolo Bonzini
  2019-10-03 16:42                                   ` Mircea CIRJALIU - MELIU
  0 siblings, 1 reply; 158+ messages in thread
From: Paolo Bonzini @ 2019-10-03 15:50 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Mircea CIRJALIU - MELIU, Adalbert Lazăr, Matthew Wilcox,
	kvm, linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On 03/10/19 17:42, Jerome Glisse wrote:
> All that is needed is to make sure that vm_normal_page() will see those
> pte (inside the process that is mirroring the other process) as special
> which is the case either because insert_pfn() mark the pte as special or
> the kvm device driver which control the vm_operation struct set a
> find_special_page() callback that always return NULL, or the vma has
> either VM_PFNMAP or VM_MIXEDMAP set (which is the case with insert_pfn).
> 
> So you can keep the existing kvm code unmodified.

Great, thanks.  And KVM is already able to handle VM_PFNMAP/VM_MIXEDMAP,
so that should work.

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* RE: DANGER WILL ROBINSON, DANGER
  2019-10-02 20:10                             ` Paolo Bonzini
  2019-10-03 15:42                               ` Jerome Glisse
@ 2019-10-03 16:36                               ` Mircea CIRJALIU - MELIU
  1 sibling, 0 replies; 158+ messages in thread
From: Mircea CIRJALIU - MELIU @ 2019-10-03 16:36 UTC (permalink / raw)
  To: Paolo Bonzini, Jerome Glisse
  Cc: Adalbert Lazăr, Matthew Wilcox, kvm, linux-mm,
	virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

> The KVM MMU notifier exists so that nothing (including mirroring) needs to
> know that there is KVM on the other side.  Any interaction between KVM
> page tables and VMAs must be mediated by MMU notifiers, anything else is
> unacceptable.
> 
> If it is possible to invoke the MMU notifiers around the calls to insert_pfn,
> that of course would be perfect.

Looks to me like a work-around.
Any reason why insert_pfn() can't do set_pte_at_notify() so it triggers the KVM MMU notifier instead?

^ permalink raw reply	[flat|nested] 158+ messages in thread

* RE: DANGER WILL ROBINSON, DANGER
  2019-10-03 15:50                                 ` Paolo Bonzini
@ 2019-10-03 16:42                                   ` Mircea CIRJALIU - MELIU
  2019-10-03 18:31                                     ` Jerome Glisse
  0 siblings, 1 reply; 158+ messages in thread
From: Mircea CIRJALIU - MELIU @ 2019-10-03 16:42 UTC (permalink / raw)
  To: Paolo Bonzini, Jerome Glisse
  Cc: Adalbert Lazăr, Matthew Wilcox, kvm, linux-mm,
	virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

> On 03/10/19 17:42, Jerome Glisse wrote:
> > All that is needed is to make sure that vm_normal_page() will see
> > those pte (inside the process that is mirroring the other process) as
> > special which is the case either because insert_pfn() mark the pte as
> > special or the kvm device driver which control the vm_operation struct
> > set a
> > find_special_page() callback that always return NULL, or the vma has
> > either VM_PFNMAP or VM_MIXEDMAP set (which is the case with
> insert_pfn).
> >
> > So you can keep the existing kvm code unmodified.
> 
> Great, thanks.  And KVM is already able to handle
> VM_PFNMAP/VM_MIXEDMAP, so that should work.

This means setting VM_PFNMAP/VM_MIXEDMAP on the anon VMA that acts as the VM's system RAM.
Will it have any side effects?

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-10-03 16:42                                   ` Mircea CIRJALIU - MELIU
@ 2019-10-03 18:31                                     ` Jerome Glisse
  2019-10-03 19:38                                       ` Paolo Bonzini
  0 siblings, 1 reply; 158+ messages in thread
From: Jerome Glisse @ 2019-10-03 18:31 UTC (permalink / raw)
  To: Mircea CIRJALIU - MELIU
  Cc: Paolo Bonzini, Adalbert Lazăr, Matthew Wilcox, kvm,
	linux-mm, virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On Thu, Oct 03, 2019 at 04:42:20PM +0000, Mircea CIRJALIU - MELIU wrote:
> > On 03/10/19 17:42, Jerome Glisse wrote:
> > > All that is needed is to make sure that vm_normal_page() will see
> > > those pte (inside the process that is mirroring the other process) as
> > > special which is the case either because insert_pfn() mark the pte as
> > > special or the kvm device driver which control the vm_operation struct
> > > set a
> > > find_special_page() callback that always return NULL, or the vma has
> > > either VM_PFNMAP or VM_MIXEDMAP set (which is the case with
> > insert_pfn).
> > >
> > > So you can keep the existing kvm code unmodified.
> > 
> > Great, thanks.  And KVM is already able to handle
> > VM_PFNMAP/VM_MIXEDMAP, so that should work.
> 
> This means setting VM_PFNMAP/VM_MIXEDMAP on the anon VMA that acts as the VM's system RAM.
> Will it have any side effects?

You do not set it up on the anonymous vma but on the mmap of the
kvm device file, the resulting vma is under the control of the
kvm device file and is not an anonymous vma but a "device" special
vma.

So in summary, the source qemu process has anonymous vma (regular
libc malloc for instance). The introspector qemu process which
mirror the the source qemu use mmap on /dev/kvm (assuming you can
reuse the kvm device file for this otherwise you can introduce a
new kvm device file). The resulting mmap inside the introspector
qemu process is a vma which has vma->vm_file pointing to the kvm
device file and has VM_PFNMAP or VM_MIXEDMAP (i think you want the
former). On architecture with ARCH_SPECIAL_PTE the pte will be
mark as special when using insert_pfn() on other architecture you
can either rely on VM_PFNMAP/VM_MIXEDMAP flag or set a specific
find_special_page() callbacks in vm_ops.


I am at a conference right now but i will put an example of what
i mean next week.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-10-03 18:31                                     ` Jerome Glisse
@ 2019-10-03 19:38                                       ` Paolo Bonzini
  2019-10-04  9:41                                         ` Mircea CIRJALIU - MELIU
  0 siblings, 1 reply; 158+ messages in thread
From: Paolo Bonzini @ 2019-10-03 19:38 UTC (permalink / raw)
  To: Jerome Glisse, Mircea CIRJALIU - MELIU
  Cc: Adalbert Lazăr, Matthew Wilcox, kvm, linux-mm,
	virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On 03/10/19 20:31, Jerome Glisse wrote:
> So in summary, the source qemu process has anonymous vma (regular
> libc malloc for instance). The introspector qemu process which
> mirror the the source qemu use mmap on /dev/kvm (assuming you can
> reuse the kvm device file for this otherwise you can introduce a
> new kvm device file). 

It should be a new device, something like /dev/kvmmem.  BitDefender's
RFC patches already have the right userspace API, that was not an issue.

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

* RE: DANGER WILL ROBINSON, DANGER
  2019-10-03 19:38                                       ` Paolo Bonzini
@ 2019-10-04  9:41                                         ` Mircea CIRJALIU - MELIU
  2019-10-04 11:46                                           ` Paolo Bonzini
  0 siblings, 1 reply; 158+ messages in thread
From: Mircea CIRJALIU - MELIU @ 2019-10-04  9:41 UTC (permalink / raw)
  To: Paolo Bonzini, Jerome Glisse
  Cc: Adalbert Lazăr, Matthew Wilcox, kvm, linux-mm,
	virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

> On 03/10/19 20:31, Jerome Glisse wrote:
> > So in summary, the source qemu process has anonymous vma (regular libc
> > malloc for instance). The introspector qemu process which mirror the
> > the source qemu use mmap on /dev/kvm (assuming you can reuse the
> kvm
> > device file for this otherwise you can introduce a new kvm device
> > file).
> 
> It should be a new device, something like /dev/kvmmem.  BitDefender's RFC
> patches already have the right userspace API, that was not an issue.

I get it so far. I have a patch that does mirroring in a separate VMA.
We create an extra VMA with VM_PFNMAP/VM_MIXEDMAP that mirrors the 
source VMA in the other QEMU and is refreshed by the device MMU notifier.

This is a simple choice for an introspector process that runs on the same host 
as the source QEMU. But how do I make the new VMA accessible as memory 
to the guest VM inside the introspector QEMU? I was thinking of 2 solutions:

Create a new memslot based on the mirror VMA, hotplug it into the guest as
new memory device (is this possible?) and have a guest-side driver allocate 
pages from that area.

or

Redirect (some) GFN->HVA translations into the new VMA based on a table 
of addresses required by the introspector process.

Mircea

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: DANGER WILL ROBINSON, DANGER
  2019-10-04  9:41                                         ` Mircea CIRJALIU - MELIU
@ 2019-10-04 11:46                                           ` Paolo Bonzini
  0 siblings, 0 replies; 158+ messages in thread
From: Paolo Bonzini @ 2019-10-04 11:46 UTC (permalink / raw)
  To: Mircea CIRJALIU - MELIU, Jerome Glisse
  Cc: Adalbert Lazăr, Matthew Wilcox, kvm, linux-mm,
	virtualization, Radim Krčmář,
	Konrad Rzeszutek Wilk, Tamas K Lengyel, Mathieu Tarral,
	Samuel Laurén, Patrick Colp, Jan Kiszka, Stefan Hajnoczi,
	Weijiang Yang, Yu C, Mihai Donțu

On 04/10/19 11:41, Mircea CIRJALIU - MELIU wrote:
> I get it so far. I have a patch that does mirroring in a separate VMA.
> We create an extra VMA with VM_PFNMAP/VM_MIXEDMAP that mirrors the 
> source VMA in the other QEMU and is refreshed by the device MMU notifier.

So for example on the host you'd have a new ioctl on the kvm file
descriptor.  You pass a size and you get back a file descriptor for that
guest's physical memory, which is mmap-able up to the size you specified
in the ioctl.

In turn, the file descriptor would have ioctls to map/unmap ranges of
the guest memory into its mmap-able range.  Accessing an unmapped range
produces a SIGSEGV.

When asked via the QEMU monitor, QEMU will create the file descriptor
and pass it back via SCM_RIGHTS.  The management application can then
use it to hotplug memory into the destination...

> Create a new memslot based on the mirror VMA, hotplug it into the guest as
> new memory device (is this possible?) and have a guest-side driver allocate 
> pages from that area.

... using the existing ivshmem device, whose BAR can be accessed and
mmap-ed from the guest via sysfs.  In other words, the hotplugging will
use the file descriptor returned by QEMU when creating the ivshmem device.

We then need an additional mechanism to invoke the map/unmap ioctls from
the guest.  Without writing a guest-side driver it is possible to:

- pass a socket into the "create guest physical memory view" ioctl
above.  KVM will then associate that KVMI socket with the newly created
file descriptor.

- use KVMI messages to that socket to map/unmap sections of memory

> Redirect (some) GFN->HVA translations into the new VMA based on a table 
> of addresses required by the introspector process.

That would be tricky because there are multiple paths (gfn_to_page,
gfn_to_pfn, etc.).

There is some complication in this because the new device has to be
plumbed at multiple levels (KVM, QEMU, libvirt).  But it seems like a
very easily separated piece of code (except for the KVMI socket part,
which can be added later), so I suggest that you contribute the KVM
parts first.

Paolo

^ permalink raw reply	[flat|nested] 158+ messages in thread

end of thread, other threads:[~2019-10-04 11:47 UTC | newest]

Thread overview: 158+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-09 15:59 [RFC PATCH v6 00/92] VM introspection Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem) Adalbert Lazăr
2019-08-12 20:20   ` Sean Christopherson
2019-08-13  9:11     ` Paolo Bonzini
     [not found]     ` <5d52a5ae.1c69fb81.5c260.1573SMTPIN_ADDED_BROKEN@mx.google.com>
2019-08-13 12:09       ` Paolo Bonzini
2019-08-13 15:01         ` Sean Christopherson
2019-08-13 21:03           ` Paolo Bonzini
     [not found]           ` <5d53d8d1.1c69fb81.7d32.0bedSMTPIN_ADDED_BROKEN@mx.google.com>
2019-08-14 10:37             ` Paolo Bonzini
2019-08-09 15:59 ` [RFC PATCH v6 02/92] kvm: introspection: add basic ioctls (hook/unhook) Adalbert Lazăr
2019-08-13  8:44   ` Paolo Bonzini
2019-08-09 15:59 ` [RFC PATCH v6 03/92] kvm: introspection: add permission access ioctls Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 04/92] kvm: introspection: add the read/dispatch message function Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 05/92] kvm: introspection: add KVMI_GET_VERSION Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 06/92] kvm: introspection: add KVMI_CONTROL_CMD_RESPONSE Adalbert Lazăr
2019-08-13  9:15   ` Paolo Bonzini
2019-08-09 15:59 ` [RFC PATCH v6 07/92] kvm: introspection: honor the reply option when handling the KVMI_GET_VERSION command Adalbert Lazăr
2019-08-13  9:16   ` Paolo Bonzini
2019-08-09 15:59 ` [RFC PATCH v6 08/92] kvm: introspection: add KVMI_CHECK_COMMAND and KVMI_CHECK_EVENT Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 09/92] kvm: introspection: add KVMI_GET_GUEST_INFO Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 10/92] kvm: introspection: add KVMI_CONTROL_VM_EVENTS Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 11/92] kvm: introspection: add vCPU related data Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 12/92] kvm: introspection: add a jobs list to every introspected vCPU Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 13/92] kvm: introspection: make the vCPU wait even when its jobs list is empty Adalbert Lazăr
2019-08-13  8:43   ` Paolo Bonzini
2019-08-09 15:59 ` [RFC PATCH v6 14/92] kvm: introspection: handle introspection commands before returning to guest Adalbert Lazăr
2019-08-13  8:26   ` Paolo Bonzini
     [not found]     ` <5d52c10e.1c69fb81.26904.fd34SMTPIN_ADDED_BROKEN@mx.google.com>
2019-08-13 14:45       ` Paolo Bonzini
2019-08-09 15:59 ` [RFC PATCH v6 15/92] kvm: introspection: handle vCPU related introspection commands Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 16/92] kvm: introspection: handle events and event replies Adalbert Lazăr
2019-08-13  8:55   ` Paolo Bonzini
2019-08-09 15:59 ` [RFC PATCH v6 17/92] kvm: introspection: introduce event actions Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 18/92] kvm: introspection: add KVMI_EVENT_UNHOOK Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 19/92] kvm: introspection: add KVMI_EVENT_CREATE_VCPU Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 20/92] kvm: introspection: add KVMI_GET_VCPU_INFO Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 21/92] kvm: page track: add track_create_slot() callback Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 22/92] kvm: x86: provide all page tracking hooks with the guest virtual address Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 23/92] kvm: page track: add support for preread, prewrite and preexec Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 24/92] kvm: x86: wire in the preread/prewrite/preexec page trackers Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 25/92] kvm: x86: intercept the write access on sidt and other emulated instructions Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 26/92] kvm: x86: add kvm_mmu_nested_pagefault() Adalbert Lazăr
2019-08-13  8:12   ` Paolo Bonzini
2019-08-09 15:59 ` [RFC PATCH v6 27/92] kvm: introspection: use page track Adalbert Lazăr
2019-08-13  9:06   ` Paolo Bonzini
2019-08-09 15:59 ` [RFC PATCH v6 28/92] kvm: x86: consult the page tracking from kvm_mmu_get_page() and __direct_map() Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 29/92] kvm: introspection: add KVMI_CONTROL_EVENTS Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 30/92] kvm: x86: add kvm_spt_fault() Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 31/92] kvm: introspection: add KVMI_EVENT_PF Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 32/92] kvm: introspection: add KVMI_GET_PAGE_ACCESS Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 33/92] kvm: introspection: add KVMI_SET_PAGE_ACCESS Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 34/92] Documentation: Introduce EPT based Subpage Protection Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 35/92] KVM: VMX: Add control flags for SPP enabling Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 36/92] KVM: VMX: Implement functions for SPPT paging setup Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 37/92] KVM: VMX: Introduce SPP access bitmap and operation functions Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 38/92] KVM: VMX: Add init/set/get functions for SPP Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 39/92] KVM: VMX: Introduce SPP user-space IOCTLs Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 40/92] KVM: VMX: Handle SPP induced vmexit and page fault Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 41/92] KVM: MMU: Enable Lazy mode SPPT setup Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 42/92] KVM: MMU: Handle host memory remapping and reclaim Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 43/92] kvm: introspection: add KVMI_CONTROL_SPP Adalbert Lazăr
2019-08-09 15:59 ` [RFC PATCH v6 44/92] kvm: introspection: extend the internal database of tracked pages with write_bitmap info Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 45/92] kvm: introspection: add KVMI_GET_PAGE_WRITE_BITMAP Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 46/92] kvm: introspection: add KVMI_SET_PAGE_WRITE_BITMAP Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 47/92] kvm: introspection: add KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 48/92] kvm: add kvm_vcpu_kick_and_wait() Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 49/92] kvm: introspection: add KVMI_PAUSE_VCPU and KVMI_EVENT_PAUSE_VCPU Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 50/92] kvm: introspection: add KVMI_GET_REGISTERS Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 51/92] kvm: introspection: add KVMI_SET_REGISTERS Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 52/92] kvm: introspection: add KVMI_GET_CPUID Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 53/92] kvm: introspection: add KVMI_INJECT_EXCEPTION + KVMI_EVENT_TRAP Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 54/92] kvm: introspection: add KVMI_CONTROL_CR and KVMI_EVENT_CR Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 55/92] kvm: introspection: add KVMI_CONTROL_MSR and KVMI_EVENT_MSR Adalbert Lazăr
2019-08-12 21:05   ` Sean Christopherson
2019-08-15  6:36     ` Nicusor CITU
2019-08-19 18:36       ` Sean Christopherson
2019-08-20  8:44         ` Nicusor CITU
2019-08-20 11:43           ` Mihai Donțu
2019-08-21 15:18             ` Sean Christopherson
2019-08-19 18:52   ` Sean Christopherson
2019-08-09 16:00 ` [RFC PATCH v6 56/92] kvm: x86: block any attempt to disable MSR interception if tracked by introspection Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 57/92] kvm: introspection: add KVMI_GET_XSAVE Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 58/92] kvm: introspection: add KVMI_GET_MTRR_TYPE Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 59/92] kvm: introspection: add KVMI_EVENT_XSETBV Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 60/92] kvm: x86: add kvm_arch_vcpu_set_guest_debug() Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 61/92] kvm: introspection: add KVMI_EVENT_BREAKPOINT Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 62/92] kvm: introspection: add KVMI_EVENT_HYPERCALL Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 63/92] kvm: introspection: add KVMI_EVENT_DESCRIPTOR Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 64/92] kvm: introspection: add single-stepping Adalbert Lazăr
2019-08-12 20:50   ` Sean Christopherson
2019-08-14 12:36     ` Nicusor CITU
2019-08-14 12:53       ` Paolo Bonzini
2019-08-09 16:00 ` [RFC PATCH v6 65/92] kvm: introspection: add KVMI_EVENT_SINGLESTEP Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 66/92] kvm: introspection: add custom input when single-stepping a vCPU Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 67/92] kvm: introspection: use single stepping on unimplemented instructions Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 68/92] kvm: x86: emulate a guest page table walk on SPT violations due to A/D bit updates Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 69/92] kvm: x86: keep the page protected if tracked by the introspection tool Adalbert Lazăr
2019-09-10 14:26   ` Konrad Rzeszutek Wilk
2019-09-10 16:28     ` Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 70/92] kvm: x86: filter out access rights only when " Adalbert Lazăr
2019-08-13  9:08   ` Paolo Bonzini
2019-08-09 16:00 ` [RFC PATCH v6 71/92] mm: add support for remote mapping Adalbert Lazăr
2019-08-09 16:24   ` DANGER WILL ROBINSON, DANGER Matthew Wilcox
2019-08-13  9:29     ` Paolo Bonzini
2019-08-13 11:24       ` Matthew Wilcox
2019-08-13 12:02         ` Paolo Bonzini
     [not found]     ` <1565694095.D172a51.28640.@15f23d3a749365d981e968181cce585d2dcb3ffa>
2019-08-15 19:19       ` Jerome Glisse
2019-08-15 20:16         ` Jerome Glisse
2019-08-16 17:45           ` Jason Gunthorpe
2019-08-23 12:39           ` Mircea CIRJALIU - MELIU
2019-09-05 18:09             ` Jerome Glisse
2019-09-09 17:00               ` Paolo Bonzini
2019-09-10  7:49                 ` Mircea CIRJALIU - MELIU
2019-10-02 19:27                   ` Jerome Glisse
2019-10-02 13:46                     ` Paolo Bonzini
2019-10-02 14:15                       ` Jerome Glisse
2019-10-02 16:18                         ` Paolo Bonzini
2019-10-02 17:04                           ` Jerome Glisse
2019-10-02 20:10                             ` Paolo Bonzini
2019-10-03 15:42                               ` Jerome Glisse
2019-10-03 15:50                                 ` Paolo Bonzini
2019-10-03 16:42                                   ` Mircea CIRJALIU - MELIU
2019-10-03 18:31                                     ` Jerome Glisse
2019-10-03 19:38                                       ` Paolo Bonzini
2019-10-04  9:41                                         ` Mircea CIRJALIU - MELIU
2019-10-04 11:46                                           ` Paolo Bonzini
2019-10-03 16:36                               ` Mircea CIRJALIU - MELIU
2019-08-09 16:00 ` [RFC PATCH v6 72/92] kvm: introspection: add memory map/unmap support on the guest side Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 73/92] kvm: introspection: use remote mapping Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 74/92] kvm: x86: do not unconditionally patch the hypercall instruction during emulation Adalbert Lazăr
2019-08-13  9:20   ` Paolo Bonzini
     [not found]     ` <5d53f965.1c69fb81.cd952.035bSMTPIN_ADDED_BROKEN@mx.google.com>
2019-08-14 12:33       ` Paolo Bonzini
2019-08-09 16:00 ` [RFC PATCH v6 75/92] kvm: x86: disable gpa_available optimization in emulator_read_write_onepage() Adalbert Lazăr
2019-08-13  8:47   ` Paolo Bonzini
     [not found]     ` <5d52ca22.1c69fb81.4ceb8.e90bSMTPIN_ADDED_BROKEN@mx.google.com>
2019-08-13 14:35       ` Paolo Bonzini
2019-08-09 16:00 ` [RFC PATCH v6 76/92] kvm: x86: disable EPT A/D bits if introspection is present Adalbert Lazăr
2019-08-13  9:18   ` Paolo Bonzini
     [not found]     ` <0550f8d65bb97486e98d88255ea45d490da6b802.camel@bitdefender.com>
2019-08-13 21:05       ` Paolo Bonzini
2019-08-14  8:53         ` Mihai Donțu
2019-08-14 10:36           ` Paolo Bonzini
2019-08-09 16:00 ` [RFC PATCH v6 77/92] kvm: introspection: add trace functions Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 78/92] kvm: x86: add tracepoints for interrupt and exception injections Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 79/92] kvm: x86: emulate movsd xmm, m64 Adalbert Lazăr
2019-08-13  9:17   ` Paolo Bonzini
2019-08-09 16:00 ` [RFC PATCH v6 80/92] kvm: x86: emulate movss xmm, m32 Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 81/92] kvm: x86: emulate movq xmm, m64 Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 82/92] kvm: x86: emulate movq r, xmm Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 83/92] kvm: x86: emulate movd xmm, m32 Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 84/92] kvm: x86: enable the half part of movss, movsd, movups Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 85/92] kvm: x86: emulate lfence Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 86/92] kvm: x86: emulate xorpd xmm2/m128, xmm1 Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 87/92] kvm: x86: emulate xorps xmm/m128, xmm Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 88/92] kvm: x86: emulate fst/fstp m64fp Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 89/92] kvm: x86: make lock cmpxchg r, r/m atomic Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 90/92] kvm: x86: emulate lock cmpxchg8b atomically Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 91/92] kvm: x86: emulate lock cmpxchg16b m128 Adalbert Lazăr
2019-08-09 16:00 ` [RFC PATCH v6 92/92] kvm: x86: fallback to the single-step on multipage CMPXCHG emulation Adalbert Lazăr
2019-08-12 18:23 ` [RFC PATCH v6 00/92] VM introspection Sean Christopherson
2019-08-12 21:40 ` Sean Christopherson
2019-08-13  9:34 ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).