* [RFC PATCH v4 00/18] VM introspection
@ 2017-12-18 19:06 Adalbert Lazăr
  2017-12-18 19:06   ` Adalbert Lazăr
                   ` (19 more replies)
  0 siblings, 20 replies; 79+ messages in thread
From: Adalbert Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

This patch series proposes a VM introspection subsystem for KVM (KVMI).

The previous RFC can be read here: https://marc.info/?l=kvm&m=150514457912721

These patches were tested on kvm/master,
commit 43aabca38aa9668eee3c3c1206207034614c0901 (Merge tag 'kvm-arm-fixes-for-v4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD).

In this iteration we refactored the code based on the feedback received
from Paolo and others.

The handshake
-------------
We no longer listen on a vsock in the kernel, accepting introspectors
that control all the other VMs. Instead, QEMU (ie. every introspected
guest) initiates the connection with an introspection tool (running on
the same host, in another VM, etc.) and passes control to KVM, where
the in-kernel mechanism takes over.

The administrator has to choose which guests should be introspected,
by which introspectors, and which commands and events are allowed for
each guest. Currently, there is a bitmask for allowed commands/events,
but it seems to be too complicated. For example, being allowed to set
page accesses (eg. r--) but not being allowed to receive page fault
events (eg. -wx) doesn't make sense.

The memory mapping
------------------
Besides the read/write commands used to access guest memory, for
performance reasons we've implemented memory mapping for introspection
tools running in another guest (on the same host, like page sharing
between guests, but without copy-on-write): the KVMI_GET_MAP_TOKEN
command is used to obtain a token, which is then passed with a hypercall
from the introspecting guest to KVMI.

While this didn't have a high priority, somehow the stars aligned and we
have it.
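
As a rough illustration only (not code from these patches; remote_gpa,
local_gpa and the use of __pa() for the token pointer are assumptions),
the introspecting guest's kernel would issue the mapping hypercall along
these lines:

	struct kvmi_map_mem_token token; /* from a KVMI_GET_MAP_TOKEN reply */
	long err;

	/* a0: token, a1: gpa in the introspected guest, a2: local gpa */
	err = kvm_hypercall3(KVM_HC_MEM_MAP, __pa(&token),
			     remote_gpa, local_gpa);
	if (err)
		/* mapping failed (eg. stale token or invalid gpa) */;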

Page tracking
-------------
The current page tracking mechanism in KVM supports tracking write
accesses (after the write operation has taken place). We've extended it
with preread, prewrite and preexec tracking.

We also added a notification for when a new memory slot is being
created (see track_create_slot()).

Pause VM
--------
We've removed the commands to pause/resume the VM. Having a "pause vCPU"
command and a "paused vCPU" event seems to be enough for now.

Not implemented yet
-------------------

There are a few things documented, but not implemented yet: virtualized
exceptions, single-stepping and EPT views.

We are also working on accommodating SPP (Sub-Page Protection).

We hope to make our repositories (kernel, QEMU,
userland/simple-introspector) public in a couple of days and we're
looking forward to adding unit tests.

Changes since v3:
  - move the accept/handshake worker to QEMU
  - extend and use the 'page_track' infrastructure to intercept page
    accesses during emulation
  - remove the 0x40000000-0x40001fff range from monitored MSR-s
  - make small changes to the wire protocol (error codes, padding, names)
  - simplify KVMI_PAUSE_VCPU
  - add new commands: KVMI_GET_MAP_TOKEN, KVMI_GET_XSAVE
  - add pat to KVMI_EVENT
  - document KVM_HC_MEM_MAP and KVM_HC_MEM_UNMAP hypercalls

Changes since v2:
  - make small changes to the wire protocol (eg. use kvmi_error_code
    with every command reply, a few renames, etc.)
  - removed '_x86' from x86 specific structure names. Architecture
    specific structures will have the same name.
  - drop KVMI_GET_MTRR_TYPE and KVMI_GET_MTRRS (use KVMI_SET_REGISTERS)
  - drop KVMI_EVENT_ACTION_SET_REGS (use KVMI_SET_REGISTERS)
  - remove KVMI_MAP_PHYSICAL_PAGE_TO_GUEST and KVMI_UNMAP_PHYSICAL_PAGE_FROM_GUEST
    (to be replaced by a token+hypercall pair)
  - extend KVMI_GET_VERSION with allowed command/event masks
  - replace KVMI_PAUSE_GUEST/KVMI_UNPAUSE_GUEST with KVMI_PAUSE_VCPU
  - replace KVMI_SHUTDOWN_GUEST with KVMI_EVENT_ACTION_CRASH
  - replace KVMI_GET_XSAVE_INFO with KVMI_GET_CPUID
  - merge KVMI_INJECT_PAGE_FAULT and KVMI_INJECT_BREAKPOINT
    in KVMI_INJECT_EXCEPTION
  - replace event reply flags with ALLOW/SKIP/RETRY/CRASH actions
  - make KVMI_SET_REGISTERS work with vCPU events only
  - add EPT view support in KVMI_GET_PAGE_ACCESS/KVMI_SET_PAGE_ACCESS
  - add support for multiple pages in KVMI_GET_PAGE_ACCESS/KVMI_SET_PAGE_ACCESS
  - add (back) KVMI_READ_PHYSICAL/KVMI_WRITE_PHYSICAL
  - add KVMI_CONTROL_VE
  - add cstar to KVMI_EVENT
  - add new events: KVMI_EVENT_VCPU_PAUSED, KVMI_EVENT_CREATE_VCPU, 
    KVMI_EVENT_DESCRIPTOR_ACCESS, KVMI_EVENT_SINGLESTEP
  - add new sections: "Introspection capabilities", "Live migrations",
    "Guest snapshots with memory", "Memory access safety"
  - document the hypercall used by the KVMI_EVENT_HYPERCALL command
    (was KVMI_EVENT_USER_CALL)

Changes since v1:
  - add documentation and ABI [Paolo, Jan]
  - drop all the other patches for now [Paolo]
  - remove KVMI_GET_GUESTS, KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF,
    and let libvirt/qemu handle this [Stefan, Paolo]
  - change the license from LGPL to GPL [Jan]
  - remove KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL (not used anymore)
  - make the interface a little more consistent


Adalbert Lazar (18):
  kvm: add documentation and ABI/API headers for the VM introspection
    subsystem
  add memory map/unmap support for VM introspection on the guest side
  kvm: x86: add kvm_arch_msr_intercept()
  kvm: x86: add kvm_mmu_nested_guest_page_fault() and
    kvmi_mmu_fault_gla()
  kvm: x86: add kvm_arch_vcpu_set_regs()
  kvm: vmx: export the availability of EPT views
  kvm: page track: add support for preread, prewrite and preexec
  kvm: add the VM introspection subsystem
  kvm: hook in the VM introspection subsystem
  kvm: x86: handle the new vCPU request (KVM_REQ_INTROSPECTION)
  kvm: x86: hook in the page tracking
  kvm: x86: hook in kvmi_breakpoint_event()
  kvm: x86: hook in kvmi_descriptor_event()
  kvm: x86: hook in kvmi_cr_event()
  kvm: x86: hook in kvmi_xsetbv_event()
  kvm: x86: hook in kvmi_msr_event()
  kvm: x86: handle the introspection hypercalls
  kvm: x86: hook in kvmi_trap_event()

 Documentation/virtual/kvm/00-INDEX       |    2 +
 Documentation/virtual/kvm/hypercalls.txt |   66 ++
 Documentation/virtual/kvm/kvmi.rst       | 1323 ++++++++++++++++++++++++++++
 arch/x86/Kconfig                         |    9 +
 arch/x86/include/asm/kvm_emulate.h       |    1 +
 arch/x86/include/asm/kvm_host.h          |   13 +
 arch/x86/include/asm/kvm_page_track.h    |   24 +-
 arch/x86/include/asm/kvmi_guest.h        |   10 +
 arch/x86/include/asm/vmx.h               |    2 +
 arch/x86/include/uapi/asm/kvmi.h         |  213 +++++
 arch/x86/kernel/Makefile                 |    1 +
 arch/x86/kernel/kvmi_mem_guest.c         |   26 +
 arch/x86/kvm/Makefile                    |    1 +
 arch/x86/kvm/emulate.c                   |    9 +-
 arch/x86/kvm/mmu.c                       |  156 +++-
 arch/x86/kvm/mmu.h                       |    4 +
 arch/x86/kvm/page_track.c                |  129 ++-
 arch/x86/kvm/svm.c                       |   66 ++
 arch/x86/kvm/vmx.c                       |  109 ++-
 arch/x86/kvm/x86.c                       |  141 ++-
 include/linux/kvm_host.h                 |    5 +
 include/linux/kvmi.h                     |   32 +
 include/linux/mm.h                       |    3 +
 include/trace/events/kvmi.h              |  174 ++++
 include/uapi/linux/kvm.h                 |    8 +
 include/uapi/linux/kvm_para.h            |   10 +-
 include/uapi/linux/kvmi.h                |  150 ++++
 mm/internal.h                            |    5 -
 virt/kvm/kvm_main.c                      |   19 +
 virt/kvm/kvmi.c                          | 1410 ++++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h                      |  121 +++
 virt/kvm/kvmi_mem.c                      |  730 ++++++++++++++++
 virt/kvm/kvmi_mem_guest.c                |  379 ++++++++
 virt/kvm/kvmi_msg.c                      | 1134 ++++++++++++++++++++++++
 34 files changed, 6438 insertions(+), 47 deletions(-)
 create mode 100644 Documentation/virtual/kvm/kvmi.rst
 create mode 100644 arch/x86/include/asm/kvmi_guest.h
 create mode 100644 arch/x86/include/uapi/asm/kvmi.h
 create mode 100644 arch/x86/kernel/kvmi_mem_guest.c
 create mode 100644 include/linux/kvmi.h
 create mode 100644 include/trace/events/kvmi.h
 create mode 100644 include/uapi/linux/kvmi.h
 create mode 100644 virt/kvm/kvmi.c
 create mode 100644 virt/kvm/kvmi_int.h
 create mode 100644 virt/kvm/kvmi_mem.c
 create mode 100644 virt/kvm/kvmi_mem_guest.c
 create mode 100644 virt/kvm/kvmi_msg.c



* [RFC PATCH v4 01/18] kvm: add documentation and ABI/API headers for the VM introspection subsystem
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalbert Lazăr
@ 2017-12-18 19:06   ` Adalbert Lazăr
  2017-12-18 19:06   ` Adalbert Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalbert Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

This includes the new hypercalls and KVM_xxx return values.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/00-INDEX       |    2 +
 Documentation/virtual/kvm/hypercalls.txt |   66 ++
 Documentation/virtual/kvm/kvmi.rst       | 1323 ++++++++++++++++++++++++++++++
 arch/x86/include/uapi/asm/kvmi.h         |  213 +++++
 include/uapi/linux/kvm_para.h            |   10 +-
 include/uapi/linux/kvmi.h                |  150 ++++
 6 files changed, 1763 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/virtual/kvm/kvmi.rst
 create mode 100644 arch/x86/include/uapi/asm/kvmi.h
 create mode 100644 include/uapi/linux/kvmi.h

diff --git a/Documentation/virtual/kvm/00-INDEX b/Documentation/virtual/kvm/00-INDEX
index 69fe1a8b7ad1..49ea106ca86b 100644
--- a/Documentation/virtual/kvm/00-INDEX
+++ b/Documentation/virtual/kvm/00-INDEX
@@ -10,6 +10,8 @@ halt-polling.txt
 	- notes on halt-polling
 hypercalls.txt
 	- KVM hypercalls.
+kvmi.rst
+	- VM introspection.
 locking.txt
 	- notes on KVM locks.
 mmu.txt
diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt
index a890529c63ed..e0454fcd058f 100644
--- a/Documentation/virtual/kvm/hypercalls.txt
+++ b/Documentation/virtual/kvm/hypercalls.txt
@@ -121,3 +121,69 @@ compute the CLOCK_REALTIME for its clock, at the same instant.
 
 Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
 or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
+
+7. KVM_HC_XEN_HVM_OP
+--------------------
+
+Architecture: x86
+Status: active
+Purpose: To enable communication between a guest agent and a VMI application
+Usage:
+
+An event will be sent to the VMI application (see kvmi.rst) if the following
+registers, which differ between 32bit and 64bit, have the following values:
+
+       32bit       64bit     value
+       ---------------------------
+       ebx (a0)    rdi       KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT
+       ecx (a1)    rsi       0
+
+This specification copies Xen's { __HYPERVISOR_hvm_op,
+HVMOP_guest_request_vm_event } hypercall and can originate from kernel or
+userspace.
+
+It returns 0 if successful, or a negative POSIX.1 error code if it fails. The
+absence of an active VMI application is not signaled in any way.
+
+The following registers are clobbered:
+
+  * 32bit: edx, esi, edi, ebp
+  * 64bit: rdx, r10, r8, r9
+
+In particular, for KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT, the last two
+registers can be poisoned deliberately and cannot be used for passing
+information.
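+
+As a sketch only (not part of this patch), a guest agent built on the
+standard kvm_hypercall2() helper could trigger the event like this::
+
+	long err;
+
+	err = kvm_hypercall2(KVM_HC_XEN_HVM_OP,
+			     KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT, 0);
+	if (err < 0)
+		/* negative POSIX.1 error code */;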
+
+8. KVM_HC_MEM_MAP
+-----------------
+
+Architecture: x86
+Status: active
+Purpose: Map a guest physical page to another VM (the introspector).
+Usage:
+
+a0: pointer to a token obtained with a KVMI_GET_MAP_TOKEN command (see kvmi.rst)
+	struct kvmi_map_mem_token {
+		__u64 token[4];
+	};
+
+a1: guest physical address to be mapped
+
+a2: guest physical address from introspector that will be replaced
+
+Both guest physical addresses will end up pointing to the same physical page.
+
+Returns KVM_EFAULT in case of an error.
+
+9. KVM_HC_MEM_UNMAP
+-------------------
+
+Architecture: x86
+Status: active
+Purpose: Unmap a previously mapped page.
+Usage:
+
+a0: guest physical address from introspector
+
+The address will stop pointing to the introspected page and a new physical
+page will be allocated for this gpa.
diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
new file mode 100644
index 000000000000..e11435f33fe2
--- /dev/null
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -0,0 +1,1323 @@
+=========================================================
+KVMI - The kernel virtual machine introspection subsystem
+=========================================================
+
+The KVM introspection subsystem provides a facility for applications running
+on the host or in a separate VM to control the execution of other VMs
+(pause, resume, shutdown), query the state of the vCPUs (GPRs, MSRs etc.),
+alter the page access bits in the shadow page tables (only for the hardware
+backed ones, eg. Intel's EPT) and receive notifications when events of
+interest have taken place (shadow page table level faults, key MSR writes,
+hypercalls etc.). Some notifications can be responded to with an action
+(like preventing an MSR from being written), others are merely informative
+(like breakpoint events which can be used for execution tracing).
+With few exceptions, all events are optional. An application using this
+subsystem will explicitly register for them.
+
+The use case that led to the creation of this subsystem is monitoring
+the guest OS and, as such, the ABI/API is highly influenced by how the guest
+software (kernel, applications) sees the world. For example, some events
+provide information specific to the host CPU architecture
+(eg. MSR_IA32_SYSENTER_EIP) merely because it is leveraged by guest software
+to implement a critical feature (fast system calls).
+
+At the moment, the target audience for KVMI are security software authors
+who wish to perform forensics on newly discovered threats (exploits) or
+to implement another layer of security like preventing a large set of
+kernel rootkits simply by "locking" the kernel image in the shadow page
+tables (ie. enforce .text r-x, .rodata rw- etc.). It's the latter case that
+made KVMI a separate subsystem, even though many of these features are
+available in the device manager (eg. QEMU). The ability to build a security
+application that does not interfere (in terms of performance) with the
+guest software calls for a specialized interface that is designed for minimum
+overhead.
+
+API/ABI
+=======
+
+This chapter describes the VMI interface used to monitor and control local
+guests from a user application.
+
+Overview
+--------
+
+The interface is socket based, one connection for every VM. One end is in the
+host kernel while the other is held by the user application (introspection
+tool).
+
+The initial connection is established by an application running on the host
+(eg. QEMU) that connects to the introspection tool and, after a handshake, the
+socket is passed to the host kernel, making all further communication take
+place between it and the introspection tool. The initiating party (QEMU) can
+close its end so that any potential exploits cannot take hold of it.
+
+The socket protocol allows for commands and events to be multiplexed over
+the same connection. As such, it is possible for the introspection tool to
+receive an event while waiting for the result of a command. Also, it can
+send a command while the host kernel is waiting for a reply to an event.
+
+The kernel side of the socket communication is blocking and will wait for
+an answer from its peer indefinitely or until the guest is powered off
+(killed), at which point it will wake up and clean up properly. If the peer
+goes away, KVM will exit to user space and the device manager will try to
+reconnect. If it fails, the device manager will inform KVM to clean up and
+continue normal guest execution as if the introspection subsystem had never
+been used on that guest. Obviously, whether the guest can really continue
+normal execution depends on whether the introspection tool has made any
+modifications that require an active KVMI channel.
+
+All messages (commands or events) have a common header::
+
+	struct kvmi_msg_hdr {
+		__u16 id;
+		__u16 size;
+		__u32 seq;
+	};
+
+and all need a reply with the same kind of header, having the same
+sequence number (``seq``) and the same message id (``id``).
+
+Because messages (events) from different vCPU threads can be sent at the same
+time and the replies can come in any order, the receiver loop uses the
+sequence number (seq) to identify which reply belongs to which vCPU, in
+order to dispatch the message to the right thread waiting for it.
+
+After ``kvmi_msg_hdr``, ``id`` specific data of ``size`` bytes will
+follow.
+
+The message header and its data must be sent with one ``sendmsg()`` call
+to the socket. This simplifies the receiver loop and avoids
+the reconstruction of messages on the other side.
+
+The wire protocol uses the host native byte-order. The introspection tool
+must check this during the handshake and do the necessary conversion.
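+
+As a minimal sketch (not part of the ABI), and assuming a connected socket
+``fd`` plus a caller-managed sequence counter ``seq``, a command and its
+data can be sent with one ``sendmsg()`` call like this::
+
+	struct kvmi_get_guest_info cmd = { .vcpu = 0 };
+	struct kvmi_msg_hdr hdr = {
+		.id = KVMI_GET_GUEST_INFO,
+		.size = sizeof(cmd),
+		.seq = seq++,	/* matched against the reply header */
+	};
+	struct iovec iov[] = {
+		{ .iov_base = &hdr, .iov_len = sizeof(hdr) },
+		{ .iov_base = &cmd, .iov_len = sizeof(cmd) },
+	};
+	struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };
+
+	if (sendmsg(fd, &msg, 0) != sizeof(hdr) + sizeof(cmd))
+		/* socket error: close the connection and reconnect */;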
+
+A command reply begins with::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	}
+
+followed by the command specific data if the error code ``err`` is zero.
+
+The error code -KVM_ENOSYS (packed in a ``kvmi_error_code``) is returned for
+unsupported commands.
+
+The error code is related to the message processing. For all the other
+errors (socket errors, incomplete messages, wrong sequence numbers
+etc.) the socket must be closed. The device manager will be notified
+and it will reconnect.
+
+While all commands will have a reply as soon as possible, the replies
+to events will probably be delayed until a set of (new) commands
+completes::
+
+   Host kernel               Tool
+   -----------               ----
+   event 1 ->
+                             <- command 1
+   command 1 reply ->
+                             <- command 2
+   command 2 reply ->
+                             <- event 1 reply
+
+If both ends send a message at the same time::
+
+   Host kernel               Tool
+   -----------               ----
+   event X ->                <- command X
+
+the host kernel will reply to 'command X', regardless of the receive time
+(before or after the 'event X' was sent).
+
+As can be seen below, the wire protocol specifies occasional padding. This
+is to permit working with the data by directly using C structures or to round
+the structure size to a multiple of 8 bytes (64bit) to improve the copy
+operations that happen during ``recvmsg()`` or ``sendmsg()``. The members
+should have the native alignment of the host (4 bytes on x86). All padding
+must be initialized with zero otherwise the respective commands will fail
+with -KVM_EINVAL.
+
+To describe the commands/events, we reuse some conventions from api.txt:
+
+  - Architectures: which instruction set architectures provide this command/event
+
+  - Versions: which versions provide this command/event
+
+  - Parameters: incoming message data
+
+  - Returns: outgoing/reply message data
+
+Handshake
+---------
+
+Although this falls outside the scope of the introspection subsystem, below
+is a proposal of a handshake that can be used by implementors.
+
+Based on the system administration policies, the management tool
+(eg. libvirt) starts device managers (eg. QEMU) with some extra arguments:
+which introspector could monitor/control that specific guest (and how to
+connect to it) and which introspection commands/events are allowed.
+
+The device manager will connect to the introspection tool and wait for a
+cryptographic hash of a cookie that should be known by both peers. If the
+hash is correct (the destination has been "authenticated"), the device
+manager will send another cryptographic hash and random salt. The peer
+recomputes the hash of the cookie bytes including the salt and if they match,
+the device manager has been "authenticated" too. This is a rather crude
+system that makes it difficult for device manager exploits to trick the
+introspection tool into believing it's working OK.
+
+The cookie would normally be generated by a management tool (eg. libvirt)
+and made available to the device manager and to a properly authenticated
+client. It is the job of a third party to retrieve the cookie from the
+management application and pass it over a secure channel to the introspection
+tool.
+
+Once the basic "authentication" has taken place, the introspection tool
+can receive information on the guest (its UUID) and other flags (endianness
+or features supported by the host kernel).
+
+In the end, the device manager will pass the file handle (plus the allowed
+commands/events) to KVM, and forget about it. It will be notified by
+KVM when the introspection tool closes the file handle (in case of
+errors), and should reinitiate the handshake.
+
+Once the file handle reaches KVM, the introspection tool should use
+the *KVMI_GET_VERSION* command to get the API version, the commands and
+the events (see *KVMI_CONTROL_EVENTS*) which are allowed for this
+guest. The error code -KVM_EPERM will be returned if the introspection tool
+uses a command or enables an event which is not allowed.
+
+Live migrations
+---------------
+
+During a VMI session it is possible for the guest to be patched and for
+some of these patches to "talk" with the introspection tool. It thus becomes
+necessary to remove them before a live migration takes place.
+
+A live migration is normally performed by the device manager and as such it is
+the best source for migration notifications. In the case of QEMU, an
+introspection tool can use the same facility as the QEMU Guest Agent to be
+notified when a migration is about to begin. QEMU will need to wait for a
+limited amount of time (a few seconds) for a confirmation that it is OK to
+proceed. It does this only if a KVMI channel is active.
+
+The QEMU instance on the receiving end, if configured for KVMI, will need to
+establish a connection to the introspection tool after the migration has
+completed.
+
+Obviously, this creates a window in which the guest is not introspected. The
+user will need to be aware of this detail. Future introspection
+technologies can choose not to disconnect and instead transfer the necessary
+context to the introspection tool at the migration destination via a separate
+channel.
+
+Guest snapshots with memory
+---------------------------
+
+Just as for live migrations, before taking a snapshot with memory, the
+introspector might need to disconnect and reconnect after the snapshot
+operation has completed. This is because such snapshots can be restored long
+after the introspection tool was stopped or on a host that does not have KVMI
+enabled. Thus, if during the KVMI session the guest memory was patched, these
+changes will likely need to be undone.
+
+The same communication channel as QEMU Guest Agent can be used for the
+purpose of notifying a guest application when a memory snapshot is about to
+be created and also when the operation has completed.
+
+Memory access safety
+--------------------
+
+The KVMI API gives access to the entire guest physical address space but
+provides no information on which parts of it are system RAM and which are
+device-specific memory (DMA, emulated MMIO, reserved by a passthrough
+device etc.). It is up to the user to determine, using the guest operating
+system data structures, the areas that are safe to access (code, stack, heap
+etc.).
+
+Commands
+--------
+
+The following C structures are meant to be used directly when communicating
+over the wire. The peer that detects any size mismatch should simply close
+the connection and report the error.
+
+0. KVMI_GET_VERSION
+-------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: none
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_version_reply {
+		__u32 version;
+		__u32 commands;
+		__u32 events;
+		__u32 padding;
+	};
+
+Returns the introspection API version, the bit mask with allowed commands
+and the bit mask with allowed events (see *KVMI_CONTROL_EVENTS*).
+
+These two masks represent all the features allowed by the management tool
+(see **Handshake**) or supported by the host, with some exceptions: this command
+and the *KVMI_EVENT_PAUSE_VCPU* event.
+
+The host kernel and the userland can use the macros below to check if
+a command/event is allowed for a guest::
+
+	KVMI_ALLOWED_COMMAND(cmd_id, cmd_mask)
+	KVMI_ALLOWED_EVENT(event_id, event_mask)
+
+This command is always successful.
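+
+As an illustrative (non-normative) use of these masks, an introspection
+tool could gate its features right after parsing the reply shown above::
+
+	struct kvmi_get_version_reply r;
+
+	/* ... read the reply into 'r' ... */
+
+	if (!KVMI_ALLOWED_COMMAND(KVMI_GET_REGISTERS, r.commands))
+		/* avoid register reads, or close the session */;
+
+	if (!KVMI_ALLOWED_EVENT(KVMI_EVENT_BREAKPOINT, r.events))
+		/* do not try to enable breakpoint events */;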
+
+1. KVMI_GET_GUEST_INFO
+----------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_get_guest_info {
+		__u16 vcpu;
+		__u16 padding[3];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_guest_info_reply {
+		__u16 vcpu_count;
+		__u16 padding[3];
+		__u64 tsc_speed;
+	};
+
+Returns the number of online vCPUs and the TSC frequency (in HZ)
+if available.
+
+The parameter ``vcpu`` must be zero. It is required for consistency with
+all other commands and in the future it might be used to return true
+vCPU-specific information.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+
+2. KVMI_PAUSE_VCPU
+------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_pause_vcpu {
+		__u16 vcpu;
+		__u16 padding[3]; /* multiple of 8 bytes */
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Requests a pause for the specified vCPU. The vCPU thread will issue a
+*KVMI_EVENT_PAUSE_VCPU* event to let the introspection tool know it has
+entered the 'paused' state.
+
+If the command is issued while the vCPU was about to send an event, the
+*KVMI_EVENT_PAUSE_VCPU* event will be delayed until after the vCPU has
+received a response for its pending guest event.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_EBUSY - the vCPU thread has a pending pause request
+
+3. KVMI_GET_REGISTERS
+---------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_get_registers {
+		__u16 vcpu;
+		__u16 nmsrs;
+		__u16 padding[2];
+		__u32 msrs_idx[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_registers_reply {
+		__u32 mode;
+		__u32 padding;
+		struct kvm_regs regs;
+		struct kvm_sregs sregs;
+		struct kvm_msrs msrs;
+	};
+
+For the given vCPU and the ``nmsrs``-sized array of MSR indices,
+returns the current vCPU mode (in bytes: 2, 4 or 8), the general purpose
+registers, the special registers and the requested set of MSRs.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - one of the indicated MSR-s is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_ENOMEM - not enough memory to allocate the reply
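+
+A brief sketch (not part of the ABI) of building the variable-sized
+request; the MSR numbers used here are purely an example::
+
+	__u32 wanted[] = { 0x176 /* IA32_SYSENTER_EIP */,
+			   0xc0000082 /* LSTAR */ };
+	size_t size = sizeof(struct kvmi_get_registers) + sizeof(wanted);
+	struct kvmi_get_registers *req = calloc(1, size);
+
+	req->vcpu = 0;
+	req->nmsrs = 2;
+	memcpy(req->msrs_idx, wanted, sizeof(wanted));
+	/* send it after a kvmi_msg_hdr with .id = KVMI_GET_REGISTERS
+	 * and .size = size */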
+
+4. KVMI_SET_REGISTERS
+---------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_set_registers {
+		__u16 vcpu;
+		__u16 padding[3];
+		struct kvm_regs regs;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Sets the general purpose registers for the given vCPU. The changes become
+visible to other threads accessing the KVM vCPU structure after the event
+currently being handled is replied to.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+
+5. KVMI_GET_CPUID
+-----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_get_cpuid {
+		__u16 vcpu;
+		__u16 padding[3];
+		__u32 function;
+		__u32 index;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_cpuid_reply {
+		__u32 eax;
+		__u32 ebx;
+		__u32 ecx;
+		__u32 edx;
+	};
+
+Returns a CPUID leaf (as seen by the guest OS).
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_ENOENT - the selected leaf is not present or is invalid
+
+6. KVMI_GET_PAGE_ACCESS
+-----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_get_page_access {
+		__u16 vcpu;
+		__u16 count;
+		__u16 view;
+		__u16 padding;
+		__u64 gpa[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_page_access_reply {
+		__u8 access[0];
+	};
+
+Returns the spte access bits (rwx) for the specified vCPU and for an array of
+``count`` guest physical addresses.
+
+The valid access bits for *KVMI_GET_PAGE_ACCESS* and *KVMI_SET_PAGE_ACCESS*
+are::
+
+	KVMI_PAGE_ACCESS_R
+	KVMI_PAGE_ACCESS_W
+	KVMI_PAGE_ACCESS_X
+
+On Intel hardware with multiple EPT views, the ``view`` argument selects the
+EPT view (0 is primary). On all other hardware it must be zero.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the selected SPT view is invalid
+* -KVM_EINVAL - one of the specified gpa-s is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_ENOSYS - an SPT view was selected but the hardware has no support for
+  it
+* -KVM_ENOMEM - not enough memory to allocate the reply
+
+7. KVMI_SET_PAGE_ACCESS
+-----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_page_access_entry {
+		__u64 gpa;
+		__u8 access;
+		__u8 padding[7];
+	};
+
+	struct kvmi_set_page_access {
+		__u16 vcpu;
+		__u16 count;
+		__u16 view;
+		__u16 padding;
+		struct kvmi_page_access_entry entries[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Sets the spte access bits (rwx) for an array of ``count`` guest physical
+addresses.
+
+The command will fail with -KVM_EINVAL if any of the specified combinations
+of access bits is not supported.
+
+The command will make the changes in order and it will not stop on errors. The
+introspection tool should handle the rollback.
+
+In order to 'forget' an address, all the access bits ('rwx') must be set.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified access bits combination is invalid
+* -KVM_EINVAL - one of the specified gpa-s is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_ENOSYS - an SPT view was selected but the hardware has no support for
+  it
+* -KVM_ENOMEM - not enough memory to add the page tracking structures
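+
+As a sketch only (``text_gpa`` is a placeholder for a guest physical
+address the tool wants to make non-writable, eg. a kernel .text page)::
+
+	size_t size = sizeof(struct kvmi_set_page_access)
+			+ sizeof(struct kvmi_page_access_entry);
+	struct kvmi_set_page_access *req = calloc(1, size);
+
+	req->vcpu = 0;
+	req->count = 1;
+	req->view = 0;				/* primary EPT view */
+	req->entries[0].gpa = text_gpa;
+	req->entries[0].access = KVMI_PAGE_ACCESS_R | KVMI_PAGE_ACCESS_X;
+	/* wrap in a kvmi_msg_hdr (id KVMI_SET_PAGE_ACCESS) and send */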
+
+8. KVMI_INJECT_EXCEPTION
+------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_inject_exception {
+		__u16 vcpu;
+		__u8 nr;
+		__u8 has_error;
+		__u16 error_code;
+		__u16 padding;
+		__u64 address;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Injects a vCPU exception with or without an error code. In the case of a
+page fault exception, the guest virtual address has to be specified.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified exception number is invalid
+* -KVM_EINVAL - the specified address is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+
+9. KVMI_READ_PHYSICAL
+---------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_read_physical {
+		__u64 gpa;
+		__u64 size;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	__u8 data[0];
+
+Reads from the guest memory.
+
+Currently, the size must be non-zero and the read must be restricted to
+one page (offset + size <= PAGE_SIZE).
+
+:Errors:
+
+* -KVM_EINVAL - the specified gpa is invalid
+
+10. KVMI_WRITE_PHYSICAL
+-----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_write_physical {
+		__u64 gpa;
+		__u64 size;
+		__u8  data[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Writes into the guest memory.
+
+Currently, the size must be non-zero and the write must be restricted to
+one page (offset + size <= PAGE_SIZE).
+
+:Errors:
+
+* -KVM_EINVAL - the specified gpa is invalid
+
+11. KVMI_CONTROL_EVENTS
+-----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_control_events {
+		__u16 vcpu;
+		__u16 padding;
+		__u32 events;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Enables/disables vCPU introspection events, by setting or clearing one or
+more of the following bits::
+
+	KVMI_EVENT_CR
+	KVMI_EVENT_MSR
+	KVMI_EVENT_XSETBV
+	KVMI_EVENT_BREAKPOINT
+	KVMI_EVENT_HYPERCALL
+	KVMI_EVENT_PAGE_FAULT
+	KVMI_EVENT_TRAP
+	KVMI_EVENT_SINGLESTEP
+	KVMI_EVENT_DESCRIPTOR
+
+For example:
+
+	``events = KVMI_EVENT_BREAKPOINT | KVMI_EVENT_PAGE_FAULT``
+
+will enable the breakpoint and page fault events and disable all the others.
+
+When an event is enabled, the introspection tool is notified and it
+must return a reply: allow, skip, etc. (see 'Events' below).
+
+The *KVMI_EVENT_PAUSE_VCPU* event is always allowed.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified mask of events is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_EPERM - access to one or more events specified in the events mask is
+  restricted by the host
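+
+A minimal sketch (not part of the ABI) matching the example above, which
+enables only breakpoint and page fault events for vCPU 0::
+
+	struct kvmi_control_events cmd = {
+		.vcpu = 0,
+		.events = KVMI_EVENT_BREAKPOINT | KVMI_EVENT_PAGE_FAULT,
+	};
+	/* wrap in a kvmi_msg_hdr (id KVMI_CONTROL_EVENTS) and send;
+	 * bits left clear disable the corresponding events */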
+
+12. KVMI_CONTROL_CR
+-------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_control_cr {
+		__u16 vcpu;
+		__u8 enable;
+		__u8 padding;
+		__u32 cr;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Enables/disables introspection for a specific control register and must
+be used in addition to *KVMI_CONTROL_EVENTS* with the *KVMI_EVENT_CR* bit
+set.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified control register is not part of the CR0, CR3
+   or CR4 set
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+
+13. KVMI_CONTROL_MSR
+--------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_control_msr {
+		__u16 vcpu;
+		__u8 enable;
+		__u8 padding;
+		__u32 msr;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Enables/disables introspection for a specific MSR and must be used
+in addition to *KVMI_CONTROL_EVENTS* with the *KVMI_EVENT_MSR* bit set.
+
+Currently, only MSRs within the following two ranges are supported. Trying
+to control events for any other register will fail with -KVM_EINVAL::
+
+	0          ... 0x00001fff
+	0xc0000000 ... 0xc0001fff
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified MSR is invalid
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+
+14. KVMI_CONTROL_VE
+-------------------
+
+:Architecture: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_control_ve {
+		__u16 vcpu;
+		__u16 count;
+		__u8 enable;
+		__u8 padding[3];
+		__u64 gpa[0]
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+On hardware supporting virtualized exceptions, this command can control
+the #VE bit for the listed guest physical addresses. If #VE is not
+supported the command returns -KVM_ENOSYS.
+
+Check the bitmask obtained with *KVMI_GET_VERSION* to see in advance if the
+command is supported.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - one of the specified gpa-s is invalid
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_ENOSYS - the hardware does not support #VE
+
+.. note::
+
+  Virtualized exceptions are designed such that they can be controlled by
+  the guest itself and used to (among other things) accelerate network
+  operations. Since this will obviously interfere with VMI, the guest
+  is denied access to #VE while the introspection channel is active.
+
+15. KVMI_GET_MAP_TOKEN
+----------------------
+
+:Architecture: all
+:Versions: >= 1
+:Parameters: none
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_map_token_reply {
+		struct kvmi_map_mem_token token;
+	};
+
+Where::
+
+	struct kvmi_map_mem_token {
+		__u64 token[4];
+	};
+
+Requests a token for a memory map operation.
+
+On this command, the host generates a random token to be used (once)
+to map a physical page from the introspected guest. The introspector
+could use the token with the KVM_INTRO_MEM_MAP ioctl (on /dev/kvmmem)
+to map a guest physical page to one of its memory pages. The ioctl,
+in turn, will use the KVM_HC_MEM_MAP hypercall (see hypercalls.txt).
+
+The guest kernel exposing /dev/kvmmem keeps a list of all the mappings
+(to all the guests introspected by the tool) in order to unmap them
+(using the KVM_HC_MEM_UNMAP hypercall) when /dev/kvmmem is closed or on
+demand (using the KVM_INTRO_MEM_UNMAP ioctl).
+
+:Errors:
+
+* -KVM_ENOMEM - not enough memory to allocate the token
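+
+A rough sketch (error handling elided; ``remote_gpa`` and ``local_gva``
+are placeholders) of how the introspector guest's userland might consume
+the token via /dev/kvmmem::
+
+	struct kvmi_get_map_token_reply r;
+	struct kvmi_mem_map map;
+	int fd = open("/dev/kvmmem", O_RDWR);
+
+	/* ... send KVMI_GET_MAP_TOKEN over the socket, read 'r' ... */
+
+	map.token = r.token;
+	map.gpa = remote_gpa;	/* page of the introspected guest */
+	map.gva = local_gva;	/* local address that will alias that page */
+	ioctl(fd, KVM_INTRO_MEM_MAP, &map);
+	/* unmap later with KVM_INTRO_MEM_UNMAP, or by closing the fd */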
+
+16. KVMI_GET_XSAVE
+------------------
+
+:Architecture: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_get_xsave {
+		__u16 vcpu;
+		__u16 padding[3];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_xsave_reply {
+		__u32 region[0];
+	};
+
+Returns a buffer containing the XSAVE area. Currently, the size of
+``kvm_xsave`` is used, but it could change. The userspace should get
+the buffer size from the message size.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_ENOMEM - not enough memory to allocate the reply
+
+Events
+======
+
+All vCPU events are sent using the *KVMI_EVENT* message id. No event
+will be sent (except for *KVMI_EVENT_PAUSE_VCPU*) unless enabled
+with a *KVMI_CONTROL_EVENTS* command.
+
+The message data begins with a common structure, having the vCPU id,
+its mode (in bytes: 2, 4 or 8) and the event::
+
+	struct kvmi_event {
+		__u32 event;
+		__u16 vcpu;
+		__u8 mode;
+		__u8 padding;
+		/* arch specific data */
+	}
+
+On x86 the structure looks like this::
+
+	struct kvmi_event {
+		__u32 event;
+		__u16 vcpu;
+		__u8 mode;
+		__u8 padding;
+		struct kvm_regs regs;
+		struct kvm_sregs sregs;
+		struct {
+			__u64 sysenter_cs;
+			__u64 sysenter_esp;
+			__u64 sysenter_eip;
+			__u64 efer;
+			__u64 star;
+			__u64 lstar;
+			__u64 cstar;
+			__u64 pat;
+		} msrs;
+	};
+
+It contains information about the vCPU state at the time of the event.
+
+The replies to events have the *KVMI_EVENT_REPLY* message id and begin
+with a common structure::
+
+	struct kvmi_event_reply {
+		__u32 action;
+		__u32 padding;
+	};
+
+
+All events accept the KVMI_EVENT_ACTION_CRASH action, which stops the
+guest ungracefully but as soon as possible.
+
+Most of the events accept the KVMI_EVENT_ACTION_CONTINUE action, which
+lets the instruction that caused the event continue (unless specified
+otherwise).
+
+Some of the events accept the KVMI_EVENT_ACTION_RETRY action, to continue
+by re-entering the guest.
+
+Specific data can follow these common structures.
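+
+As a non-normative sketch of the reply path, assuming the event header
+and ``struct kvmi_event`` have just been received into ``hdr`` and ``ev``::
+
+	struct kvmi_event_reply common = {
+		.action = KVMI_EVENT_ACTION_CONTINUE,
+	};
+	struct kvmi_msg_hdr reply_hdr = {
+		.id = KVMI_EVENT_REPLY,
+		.size = sizeof(common),	/* plus any event specific reply data */
+		.seq = hdr.seq,		/* must match the event's sequence number */
+	};
+	/* send reply_hdr + common (+ specific data) with one sendmsg() */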
+
+0. KVMI_EVENT_PAUSE_VCPU
+------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Actions: CRASH, RETRY
+:Parameters:
+
+::
+
+	struct kvmi_event
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply
+
+This event is sent in response to a *KVMI_PAUSE_VCPU* command, unless it
+is canceled by another *KVMI_PAUSE_VCPU* command (with ``cancel`` set to 1).
+
+This event cannot be disabled via *KVMI_CONTROL_EVENTS*.
+
+1. KVMI_EVENT_CR
+----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_cr {
+		__u16 cr;
+		__u16 padding[3];
+		__u64 old_value;
+		__u64 new_value;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply;
+	struct kvmi_event_cr_reply {
+		__u64 new_val;
+	};
+
+This event is sent when a control register is going to be changed and the
+introspection has been enabled for this event and for this specific
+register (see *KVMI_CONTROL_EVENTS* and *KVMI_CONTROL_CR*).
+
+``kvmi_event``, the control register number, the old value and the new value
+are sent to the introspector. The *CONTINUE* action will set the ``new_val``.
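+
+For instance (a sketch, with ``cr`` being the received ``kvmi_event_cr``),
+a tool that wants to silently reject the write can reply with *CONTINUE*
+while keeping the old value::
+
+	struct kvmi_event_reply common = {
+		.action = KVMI_EVENT_ACTION_CONTINUE,
+	};
+	struct kvmi_event_cr_reply cr_reply = {
+		.new_val = cr.old_value,	/* discard the attempted change */
+	};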
+
+2. KVMI_EVENT_MSR
+-----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_msr {
+		__u32 msr;
+		__u32 padding;
+		__u64 old_value;
+		__u64 new_value;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply;
+	struct kvmi_event_msr_reply {
+		__u64 new_val;
+	};
+
+This event is sent when a model specific register is going to be changed
+and the introspection has been enabled for this event and for this specific
+register (see *KVMI_CONTROL_EVENTS* and *KVMI_CONTROL_MSR*).
+
+``kvmi_event``, the MSR number, the old value and the new value are
+sent to the introspector. The *CONTINUE* action will set the ``new_val``.
+
+3. KVMI_EVENT_XSETBV
+--------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply;
+
+This event is sent when the extended control register XCR0 was
+modified and the introspection has been enabled for this event
+(see *KVMI_CONTROL_EVENTS*).
+
+``kvmi_event`` is sent to the introspector.
+
+4. KVMI_EVENT_BREAKPOINT
+------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH, RETRY
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_breakpoint {
+		__u64 gpa;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply;
+
+This event is sent when a breakpoint was reached and the introspection has
+been enabled for this event (see *KVMI_CONTROL_EVENTS*).
+
+Some of these breakpoints could have been injected by the introspector,
+placed in the slack space of various functions and used as notification
+for when the OS or an application has reached a certain state or is
+trying to perform a certain operation (like creating a process).
+
+``kvmi_event`` and the guest physical address are sent to the introspector.
+
+The *RETRY* action is used by the introspector for its own breakpoints.
+
+5. KVMI_EVENT_HYPERCALL
+-----------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply
+
+This event is sent on a specific user hypercall when the introspection has
+been enabled for this event (see *KVMI_CONTROL_EVENTS*).
+
+The hypercall number must be ``KVM_HC_XEN_HVM_OP`` with the
+``KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT`` sub-function
+(see hypercalls.txt).
+
+It is used by the code residing inside the introspected guest to call the
+introspection tool and to report certain details about its operation. For
+example, a classic antimalware remediation tool can report what it has
+found during a scan.
+
+6. KVMI_EVENT_PAGE_FAULT
+------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH, RETRY
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_page_fault {
+		__u64 gva;
+		__u64 gpa;
+		__u32 mode;
+		__u32 padding;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply;
+	struct kvmi_event_page_fault_reply {
+		__u8 trap_access;
+		__u8 padding[3];
+		__u32 ctx_size;
+		__u8 ctx_data[256];
+	};
+
+This event is sent when a hypervisor page fault occurs due to a failed
+permission check in the shadow page tables, the introspection has
+been enabled for this event (see *KVMI_CONTROL_EVENTS*) and the event was
+generated for a page in which the introspector has shown interest
+(ie. has previously touched it by adjusting the spte permissions).
+
+The shadow page tables can be used by the introspection tool to guarantee
+the purpose of memory areas inside the guest (code, rodata, stack, heap
+etc.). Each attempt at an operation unfitting for a certain memory
+range (eg. execute code in heap) triggers a page fault and gives the
+introspection tool the chance to audit the code attempting the operation.
+
+``kvmi_event``, guest virtual address, guest physical address and the
+exit qualification (mode) are sent to the introspector.
+
+The *CONTINUE* action will continue the page fault handling via emulation
+(with custom input if ``ctx_size`` > 0). The use of custom input is
+to trick the guest software into believing it has read certain data,
+in order to hide the content of certain memory areas (eg. hide injected
+code from integrity checkers). If ``trap_access`` is not zero, the REP
+prefixed instruction should be emulated just once.
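+
+A sketch of such a reply (``clean_bytes`` is a hypothetical buffer holding
+the data the guest should believe it has read)::
+
+	struct kvmi_event_page_fault_reply pf = { 0 };
+
+	pf.ctx_size = sizeof(clean_bytes);	/* at most 256 bytes */
+	memcpy(pf.ctx_data, clean_bytes, sizeof(clean_bytes));
+	/* send the usual kvmi_event_reply (CONTINUE) followed by 'pf' */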
+
+7. KVMI_EVENT_TRAP
+------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_trap {
+		__u32 vector;
+		__u32 type;
+		__u32 error_code;
+		__u32 padding;
+		__u64 cr2;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply;
+
+This event is sent if a trap will be delivered to the guest (page fault,
+breakpoint, etc.) and the introspection has been enabled for this event
+(see *KVMI_CONTROL_EVENTS*).
+
+It is used to inform the introspector of all pending traps giving
+it a chance to determine if it should try again later in case a
+previous *KVMI_INJECT_EXCEPTION* command or a breakpoint/retry (see
+*KVMI_EVENT_BREAKPOINT*) has been overwritten by an interrupt picked up
+during guest reentry.
+
+``kvmi_event``, exception/interrupt number (vector), exception/interrupt
+type, exception code (``error_code``) and CR2 are sent to the introspector.
+
+8. KVMI_EVENT_CREATE_VCPU
+-------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply
+
+This event is sent when a new vCPU is created and the introspection has
+been enabled for this event (see *KVMI_CONTROL_EVENTS*).
+
+9. KVMI_EVENT_SINGLESTEP
+------------------------
+
+:Architecture: all
+:Versions: >= 1
+:Actions: CONTINUE, CRASH, RETRY
+:Parameters:
+
+::
+
+	struct kvmi_event
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply
+
+This event is generated as a result of enabling guest single stepping (see
+*KVMI_CONTROL_EVENTS*).
+
+The *CONTINUE* action disables the single-stepping.
+
+10. KVMI_EVENT_DESCRIPTOR
+-------------------------
+
+:Architecture: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event
+	struct kvmi_event_descriptor {
+		union {
+			struct {
+				__u32 instr_info;
+				__u32 padding;
+				__u64 exit_qualification;
+			} vmx;
+			struct {
+				__u64 exit_info;
+				__u64 padding;
+			} svm;
+		} arch;
+		__u8 descriptor;
+		__u8 write;
+		__u8 padding[6];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply
+
+This event is generated as a result of enabling descriptor access events
+(see *KVMI_CONTROL_EVENTS*).
+
+``kvmi_event_descriptor`` contains the relevant event information.
+
+``kvmi_event_descriptor.descriptor`` can be one of::
+
+	KVMI_DESC_IDTR
+	KVMI_DESC_GDTR
+	KVMI_DESC_LDTR
+	KVMI_DESC_TR
+
+``kvmi_event_descriptor.write`` is 1 if the descriptor was written, 0
+otherwise.
diff --git a/arch/x86/include/uapi/asm/kvmi.h b/arch/x86/include/uapi/asm/kvmi.h
new file mode 100644
index 000000000000..d7ae53c1f22f
--- /dev/null
+++ b/arch/x86/include/uapi/asm/kvmi.h
@@ -0,0 +1,213 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _ASM_X86_KVMI_H
+#define _ASM_X86_KVMI_H
+
+/*
+ * KVMI x86 specific structures and definitions
+ *
+ */
+
+#include <asm/kvm.h>
+#include <linux/types.h>
+
+#define KVMI_EVENT_CR          (1 << 1)	/* control register was modified */
+#define KVMI_EVENT_MSR         (1 << 2)	/* model specific reg. was modified */
+#define KVMI_EVENT_XSETBV      (1 << 3)	/* ext. control register was modified */
+#define KVMI_EVENT_BREAKPOINT  (1 << 4)	/* breakpoint was reached */
+#define KVMI_EVENT_HYPERCALL   (1 << 5)	/* user hypercall */
+#define KVMI_EVENT_PAGE_FAULT  (1 << 6)	/* hyp. page fault was encountered */
+#define KVMI_EVENT_TRAP        (1 << 7)	/* trap was injected */
+#define KVMI_EVENT_DESCRIPTOR  (1 << 8)	/* descriptor table access */
+#define KVMI_EVENT_CREATE_VCPU (1 << 9)
+#define KVMI_EVENT_PAUSE_VCPU  (1 << 10)
+
+/* TODO: find a way to split the events between common and arch dependent */
+
+#define KVMI_EVENT_ACTION_CONTINUE (1 << 0)
+#define KVMI_EVENT_ACTION_RETRY    (1 << 1)
+#define KVMI_EVENT_ACTION_CRASH    (1 << 2)
+
+#define KVMI_KNOWN_EVENTS (KVMI_EVENT_CR | \
+			   KVMI_EVENT_MSR | \
+			   KVMI_EVENT_XSETBV | \
+			   KVMI_EVENT_BREAKPOINT | \
+			   KVMI_EVENT_HYPERCALL | \
+			   KVMI_EVENT_PAGE_FAULT | \
+			   KVMI_EVENT_TRAP | \
+			   KVMI_EVENT_CREATE_VCPU | \
+			   KVMI_EVENT_PAUSE_VCPU | \
+			   KVMI_EVENT_DESCRIPTOR)
+
+#define KVMI_ALLOWED_EVENT(event_id, event_mask)                       \
+		((!(event_id)) || (                                    \
+			(event_id)                                     \
+				& ((event_mask) & KVMI_KNOWN_EVENTS)))
+
+#define KVMI_PAGE_ACCESS_R (1 << 0)
+#define KVMI_PAGE_ACCESS_W (1 << 1)
+#define KVMI_PAGE_ACCESS_X (1 << 2)
+
+struct kvmi_event_cr {
+	__u16 cr;
+	__u16 padding[3];
+	__u64 old_value;
+	__u64 new_value;
+};
+
+struct kvmi_event_msr {
+	__u32 msr;
+	__u32 padding;
+	__u64 old_value;
+	__u64 new_value;
+};
+
+struct kvmi_event_breakpoint {
+	__u64 gpa;
+};
+
+struct kvmi_event_page_fault {
+	__u64 gva;
+	__u64 gpa;
+	__u32 mode;
+	__u32 padding;
+};
+
+struct kvmi_event_trap {
+	__u32 vector;
+	__u32 type;
+	__u32 error_code;
+	__u32 padding;
+	__u64 cr2;
+};
+
+#define KVMI_DESC_IDTR	1
+#define KVMI_DESC_GDTR	2
+#define KVMI_DESC_LDTR	3
+#define KVMI_DESC_TR	4
+
+struct kvmi_event_descriptor {
+	union {
+		struct {
+			__u32 instr_info;
+			__u32 padding;
+			__u64 exit_qualification;
+		} vmx;
+		struct {
+			__u64 exit_info;
+			__u64 padding;
+		} svm;
+	} arch;
+	__u8 descriptor;
+	__u8 write;
+	__u8 padding[6];
+};
+
+struct kvmi_event {
+	__u32 event;
+	__u16 vcpu;
+	__u8 mode;		/* 2, 4 or 8 */
+	__u8 padding;
+	struct kvm_regs regs;
+	struct kvm_sregs sregs;
+	struct {
+		__u64 sysenter_cs;
+		__u64 sysenter_esp;
+		__u64 sysenter_eip;
+		__u64 efer;
+		__u64 star;
+		__u64 lstar;
+		__u64 cstar;
+		__u64 pat;
+	} msrs;
+};
+
+struct kvmi_event_cr_reply {
+	__u64 new_val;
+};
+
+struct kvmi_event_msr_reply {
+	__u64 new_val;
+};
+
+struct kvmi_event_page_fault_reply {
+	__u8 trap_access;
+	__u8 padding[3];
+	__u32 ctx_size;
+	__u8 ctx_data[256];
+};
+
+struct kvmi_control_cr {
+	__u16 vcpu;
+	__u8 enable;
+	__u8 padding;
+	__u32 cr;
+};
+
+struct kvmi_control_msr {
+	__u16 vcpu;
+	__u8 enable;
+	__u8 padding;
+	__u32 msr;
+};
+
+struct kvmi_guest_info {
+	__u16 vcpu_count;
+	__u16 padding1;
+	__u32 padding2;
+	__u64 tsc_speed;
+};
+
+struct kvmi_inject_exception {
+	__u16 vcpu;
+	__u8 nr;
+	__u8 has_error;
+	__u16 error_code;
+	__u16 padding;
+	__u64 address;
+};
+
+struct kvmi_get_registers {
+	__u16 vcpu;
+	__u16 nmsrs;
+	__u16 padding[2];
+	__u32 msrs_idx[0];
+};
+
+struct kvmi_get_registers_reply {
+	__u32 mode;
+	__u32 padding;
+	struct kvm_regs regs;
+	struct kvm_sregs sregs;
+	struct kvm_msrs msrs;
+};
+
+struct kvmi_set_registers {
+	__u16 vcpu;
+	__u16 padding[3];
+	struct kvm_regs regs;
+};
+
+struct kvmi_get_cpuid {
+	__u16 vcpu;
+	__u16 padding[3];
+	__u32 function;
+	__u32 index;
+};
+
+struct kvmi_get_cpuid_reply {
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+};
+
+struct kvmi_get_xsave {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+struct kvmi_get_xsave_reply {
+	__u32 region[0];
+};
+
+#endif /* _ASM_X86_KVMI_H */
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index dcf629dd2889..34fd3d3108c6 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -10,12 +10,17 @@
  * - kvm_para_available
  */
 
-/* Return values for hypercalls */
+/* Return values for hypercalls and VM introspection */
 #define KVM_ENOSYS		1000
 #define KVM_EFAULT		EFAULT
 #define KVM_E2BIG		E2BIG
 #define KVM_EPERM		EPERM
 #define KVM_EOPNOTSUPP		95
+#define KVM_EAGAIN		11
+#define KVM_EBUSY		EBUSY
+#define KVM_EINVAL		EINVAL
+#define KVM_ENOENT		ENOENT
+#define KVM_ENOMEM		ENOMEM
 
 #define KVM_HC_VAPIC_POLL_IRQ		1
 #define KVM_HC_MMU_OP			2
@@ -26,6 +31,9 @@
 #define KVM_HC_MIPS_EXIT_VM		7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
 #define KVM_HC_CLOCK_PAIRING		9
+#define KVM_HC_MEM_MAP			32
+#define KVM_HC_MEM_UNMAP		33
+#define KVM_HC_XEN_HVM_OP		34 /* Xen's __HYPERVISOR_hvm_op */
 
 /*
  * hypercalls use architecture specific
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
new file mode 100644
index 000000000000..b1da800541ac
--- /dev/null
+++ b/include/uapi/linux/kvmi.h
@@ -0,0 +1,150 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __KVMI_H_INCLUDED__
+#define __KVMI_H_INCLUDED__
+
+/*
+ * KVMI specific structures and definitions
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <asm/kvmi.h>
+
+#define KVMI_VERSION 0x00000001
+
+#define KVMI_GET_VERSION                  1
+#define KVMI_PAUSE_VCPU                   2
+#define KVMI_GET_GUEST_INFO               3
+#define KVMI_GET_REGISTERS                6
+#define KVMI_SET_REGISTERS                7
+#define KVMI_GET_PAGE_ACCESS              10
+#define KVMI_SET_PAGE_ACCESS              11
+#define KVMI_INJECT_EXCEPTION             12
+#define KVMI_READ_PHYSICAL                13
+#define KVMI_WRITE_PHYSICAL               14
+#define KVMI_GET_MAP_TOKEN                15
+#define KVMI_CONTROL_EVENTS               17
+#define KVMI_CONTROL_CR                   18
+#define KVMI_CONTROL_MSR                  19
+#define KVMI_EVENT                        23
+#define KVMI_EVENT_REPLY                  24
+#define KVMI_GET_CPUID                    25
+#define KVMI_GET_XSAVE                    26
+
+/* TODO: find a way to split the commands between common and arch dependent */
+
+#define KVMI_KNOWN_COMMANDS (-1) /* TODO: fix me */
+
+#define KVMI_ALLOWED_COMMAND(cmd_id, cmd_mask)                         \
+		((!(cmd_id)) || (                                      \
+			(1 << ((cmd_id)-1))                            \
+				& ((cmd_mask) & KVMI_KNOWN_COMMANDS)))
+struct kvmi_msg_hdr {
+	__u16 id;
+	__u16 size;
+	__u32 seq;
+};
+
+#define KVMI_MAX_MSG_SIZE (sizeof(struct kvmi_msg_hdr) \
+			+ (1 << FIELD_SIZEOF(struct kvmi_msg_hdr, size)*8) \
+			- 1)
+
+struct kvmi_error_code {
+	__s32 err;
+	__u32 padding;
+};
+
+struct kvmi_get_version_reply {
+	__u32 version;
+	__u32 commands;
+	__u32 events;
+	__u32 padding;
+};
+
+struct kvmi_get_guest_info {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+struct kvmi_get_guest_info_reply {
+	__u16 vcpu_count;
+	__u16 padding[3];
+	__u64 tsc_speed;
+};
+
+struct kvmi_pause_vcpu {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+struct kvmi_event_reply {
+	__u32 action;
+	__u32 padding;
+};
+
+struct kvmi_control_events {
+	__u16 vcpu;
+	__u16 padding;
+	__u32 events;
+};
+
+struct kvmi_get_page_access {
+	__u16 vcpu;
+	__u16 count;
+	__u16 view;
+	__u16 padding;
+	__u64 gpa[0];
+};
+
+struct kvmi_get_page_access_reply {
+	__u8 access[0];
+};
+
+struct kvmi_page_access_entry {
+	__u64 gpa;
+	__u8 access;
+	__u8 padding[7];
+};
+
+struct kvmi_set_page_access {
+	__u16 vcpu;
+	__u16 count;
+	__u16 view;
+	__u16 padding;
+	struct kvmi_page_access_entry entries[0];
+};
+
+struct kvmi_read_physical {
+	__u64 gpa;
+	__u64 size;
+};
+
+struct kvmi_write_physical {
+	__u64 gpa;
+	__u64 size;
+	__u8  data[0];
+};
+
+struct kvmi_map_mem_token {
+	__u64 token[4];
+};
+
+struct kvmi_get_map_token_reply {
+	struct kvmi_map_mem_token token;
+};
+
+/* Map other guest's gpa to local gva */
+struct kvmi_mem_map {
+	struct kvmi_map_mem_token token;
+	__u64 gpa;
+	__u64 gva;
+};
+
+/*
+ * ioctls for /dev/kvmmem
+ */
+#define KVM_INTRO_MEM_MAP	_IOW('i', 0x01, struct kvmi_mem_map)
+#define KVM_INTRO_MEM_UNMAP	_IOW('i', 0x02, unsigned long)
+
+#endif /* __KVMI_H_INCLUDED__ */

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 01/18] kvm: add documentation and ABI/API headers for the VM introspection subsystem
@ 2017-12-18 19:06   ` Adalber Lazăr
  0 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

This includes the new hypercalls and KVM_xxx return values.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/00-INDEX       |    2 +
 Documentation/virtual/kvm/hypercalls.txt |   66 ++
 Documentation/virtual/kvm/kvmi.rst       | 1323 ++++++++++++++++++++++++++++++
 arch/x86/include/uapi/asm/kvmi.h         |  213 +++++
 include/uapi/linux/kvm_para.h            |   10 +-
 include/uapi/linux/kvmi.h                |  150 ++++
 6 files changed, 1763 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/virtual/kvm/kvmi.rst
 create mode 100644 arch/x86/include/uapi/asm/kvmi.h
 create mode 100644 include/uapi/linux/kvmi.h

diff --git a/Documentation/virtual/kvm/00-INDEX b/Documentation/virtual/kvm/00-INDEX
index 69fe1a8b7ad1..49ea106ca86b 100644
--- a/Documentation/virtual/kvm/00-INDEX
+++ b/Documentation/virtual/kvm/00-INDEX
@@ -10,6 +10,8 @@ halt-polling.txt
 	- notes on halt-polling
 hypercalls.txt
 	- KVM hypercalls.
+kvmi.rst
+	- VM introspection.
 locking.txt
 	- notes on KVM locks.
 mmu.txt
diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt
index a890529c63ed..e0454fcd058f 100644
--- a/Documentation/virtual/kvm/hypercalls.txt
+++ b/Documentation/virtual/kvm/hypercalls.txt
@@ -121,3 +121,69 @@ compute the CLOCK_REALTIME for its clock, at the same instant.
 
 Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
 or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
+
+7. KVM_HC_XEN_HVM_OP
+--------------------
+
+Architecture: x86
+Status: active
+Purpose: To enable communication between a guest agent and a VMI application
+Usage:
+
+An event will be sent to the VMI application (see kvmi.rst) if the following
+registers, which differ between 32bit and 64bit, have the following values:
+
+       32bit       64bit     value
+       ---------------------------
+       ebx (a0)    rdi       KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT
+       ecx (a1)    rsi       0
+
+This specification copies Xen's { __HYPERVISOR_hvm_op,
+HVMOP_guest_request_vm_event } hypercall and can originate from kernel or
+userspace.
+
+It returns 0 if successful, or a negative POSIX.1 error code if it fails. The
+absence of an active VMI application is not signaled in any way.
+
+The following registers are clobbered:
+
+  * 32bit: edx, esi, edi, ebp
+  * 64bit: rdx, r10, r8, r9
+
+In particular, for KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT, the last two
+registers can be poisoned deliberately and cannot be used for passing
+information.
+
+8. KVM_HC_MEM_MAP
+-----------------
+
+Architecture: x86
+Status: active
+Purpose: Map a guest physical page to another VM (the introspector).
+Usage:
+
+a0: pointer to a token obtained with a KVMI_GET_MAP_TOKEN command (see kvmi.rst)
+	struct kvmi_map_mem_token {
+		__u64 token[4];
+	};
+
+a1: guest physical address to be mapped
+
+a2: guest physical address from introspector that will be replaced
+
+Both guest physical addresses will end up pointing to the same physical page.
+
+Returns KVM_EFAULT in case of an error.
+
+9. KVM_HC_MEM_UNMAP
+-------------------
+
+Architecture: x86
+Status: active
+Purpose: Unmap a previously mapped page.
+Usage:
+
+a0: guest physical address from introspector
+
+The address will stop pointing to the introspected page and a new physical
+page will be allocated for this gpa.
diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
new file mode 100644
index 000000000000..e11435f33fe2
--- /dev/null
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -0,0 +1,1323 @@
+=========================================================
+KVMI - The kernel virtual machine introspection subsystem
+=========================================================
+
+The KVM introspection subsystem provides a facility for applications running
+on the host or in a separate VM to control the execution of other VM-s
+(pause, resume, shutdown), query the state of the vCPUs (GPRs, MSRs etc.),
+alter the page access bits in the shadow page tables (only for the hardware
+backed ones, eg. Intel's EPT) and receive notifications when events of
+interest have taken place (shadow page table level faults, key MSR writes,
+hypercalls etc.). Some notifications can be responded to with an action
+(like preventing an MSR from being written), while others are merely informative
+(like breakpoint events which can be used for execution tracing).
+With few exceptions, all events are optional. An application using this
+subsystem will explicitly register for them.
+
+The use case that led to the creation of this subsystem is monitoring
+the guest OS and, as such, the ABI/API is highly influenced by how the guest
+software (kernel, applications) sees the world. For example, some events
+provide information specific to the host CPU architecture
+(eg. MSR_IA32_SYSENTER_EIP) merely because it is leveraged by guest software
+to implement a critical feature (fast system calls).
+
+At the moment, the target audience for KVMI are security software authors
+that wish to perform forensics on newly discovered threats (exploits) or
+to implement another layer of security like preventing a large set of
+kernel rootkits simply by "locking" the kernel image in the shadow page
+tables (ie. enforce .text r-x, .rodata r-- etc.). It's the latter case that
+made KVMI a separate subsystem, even though many of these features are
+available in the device manager (eg. QEMU). The ability to build a security
+application that does not interfere (in terms of performance) with the
+guest software asks for a specialized interface that is designed for minimum
+overhead.
+
+API/ABI
+=======
+
+This chapter describes the VMI interface used to monitor and control local
+guests from a user application.
+
+Overview
+--------
+
+The interface is socket based, one connection for every VM. One end is in the
+host kernel while the other is held by the user application (introspection
+tool).
+
+The initial connection is established by an application running on the host
+(eg. QEMU) that connects to the introspection tool and after a handshake the
+socket is passed to the host kernel making all further communication take
+place between it and the introspection tool. The initiating party (QEMU) can
+close its end so that any potential exploits cannot take a hold of it.
+
+The socket protocol allows for commands and events to be multiplexed over
+the same connection. As such, it is possible for the introspection tool to
+receive an event while waiting for the result of a command. Also, it can
+send a command while the host kernel is waiting for a reply to an event.
+
+The kernel side of the socket communication is blocking and will wait for
+an answer from its peer indefinitely or until the guest is powered off
+(killed) at which point it will wake up and properly cleanup. If the peer
+goes away, KVM will exit to user space and the device manager will try and
+reconnect. If it fails, the device manager will inform KVM to cleanup and
+continue normal guest execution as if the introspection subsystem has never
+been used on that guest. Obviously, whether the guest can really continue
+normal execution depends on whether the introspection tool has made any
+modifications that require an active KVMI channel.
+
+All messages (commands or events) have a common header::
+
+	struct kvmi_msg_hdr {
+		__u16 id;
+		__u16 size;
+		__u32 seq;
+	};
+
+and all need a reply with the same kind of header, having the same
+sequence number (``seq``) and the same message id (``id``).
+
+Because different vCPU threads can send messages at the same
+time and the replies can come in any order, the receiver loop uses the
+sequence number (seq) to identify which reply belongs to which vCPU, in
+order to dispatch the message to the right thread waiting for it.
+
+After ``kvmi_msg_hdr``, ``id`` specific data of ``size`` bytes will
+follow.
+
+The message header and its data must be sent with one ``sendmsg()`` call
+to the socket. This simplifies the receiver loop and avoids
+the reconstruction of messages on the other side.
+
+The wire protocol uses the host native byte-order. The introspection tool
+must check this during the handshake and do the necessary conversion.
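+
+As an illustration only (not part of the ABI), an introspection tool could
+send a command and its data with a single call; the ``send_command`` helper
+below is merely a sketch, with error handling reduced to a minimum::
+
+	#include <sys/socket.h>
+	#include <sys/uio.h>
+	#include <linux/kvmi.h>
+
+	static int send_command(int fd, __u16 id, __u32 seq,
+				const void *data, __u16 data_size)
+	{
+		struct kvmi_msg_hdr hdr = {
+			.id = id,
+			.size = data_size,
+			.seq = seq,
+		};
+		struct iovec iov[] = {
+			{ .iov_base = &hdr, .iov_len = sizeof(hdr) },
+			{ .iov_base = (void *)data, .iov_len = data_size },
+		};
+		struct msghdr msg = {
+			.msg_iov = iov,
+			.msg_iovlen = data_size ? 2 : 1,
+		};
+		/* the header and the data must go out in one sendmsg() call */
+		ssize_t n = sendmsg(fd, &msg, 0);
+
+		return n == (ssize_t)(sizeof(hdr) + data_size) ? 0 : -1;
+	}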
+
+A command reply begins with::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	}
+
+followed by the command specific data if the error code ``err`` is zero.
+
+The error code -KVM_ENOSYS (packed in a ``kvmi_error_code``) is returned for
+unsupported commands.
+
+The error code is related to the message processing. For all the other
+errors (socket errors, incomplete messages, wrong sequence numbers
+etc.) the socket must be closed. The device manager will be notified
+and it will reconnect.
+
+While all commands will have a reply as soon as possible, the replies
+to events will probably be delayed until a set of (new) commands
+completes::
+
+   Host kernel               Tool
+   -----------               ----
+   event 1 ->
+                             <- command 1
+   command 1 reply ->
+                             <- command 2
+   command 2 reply ->
+                             <- event 1 reply
+
+If both ends send a message at the same time::
+
+   Host kernel               Tool
+   -----------               ----
+   event X ->                <- command X
+
+the host kernel will reply to 'command X', regardless of the receive time
+(before or after the 'event X' was sent).
+
+As can be seen below, the wire protocol specifies occasional padding. This
+is to permit working with the data by directly using C structures or to round
+the structure size to a multiple of 8 bytes (64bit) to improve the copy
+operations that happen during ``recvmsg()`` or ``sendmsg()``. The members
+should have the native alignment of the host (4 bytes on x86). All padding
+must be initialized with zero otherwise the respective commands will fail
+with -KVM_EINVAL.
+
+To describe the commands/events, we reuse some conventions from api.txt:
+
+  - Architectures: which instruction set architectures provide this command/event
+
+  - Versions: which versions provide this command/event
+
+  - Parameters: incoming message data
+
+  - Returns: outgoing/reply message data
+
+Handshake
+---------
+
+Although this falls outside the scope of the introspection subsystem, below
+is a proposal of a handshake that can be used by implementors.
+
+Based on the system administration policies, the management tool
+(eg. libvirt) starts device managers (eg. QEMU) with some extra arguments:
+what introspector could monitor/control that specific guest (and how to
+connect to it) and what introspection commands/events are allowed.
+
+The device manager will connect to the introspection tool and wait for a
+cryptographic hash of a cookie that should be known by both peers. If the
+hash is correct (the destination has been "authenticated"), the device
+manager will send another cryptographic hash and random salt. The peer
+recomputes the hash of the cookie bytes including the salt and if they match,
+the device manager has been "authenticated" too. This is a rather crude
+system that makes it difficult for device manager exploits to trick the
+introspection tool into believing it is working OK.
+
+The cookie would normally be generated by a management tool (eg. libvirt)
+and made available to the device manager and to a properly authenticated
+client. It is the job of a third party to retrieve the cookie from the
+management application and pass it over a secure channel to the introspection
+tool.
+
+Once the basic "authentication" has taken place, the introspection tool
+can receive information on the guest (its UUID) and other flags (endianness
+or features supported by the host kernel).
+
+In the end, the device manager will pass the file handle (plus the allowed
+commands/events) to KVM, and forget about it. It will be notified by
+KVM when the introspection tool closes the file handle (in case of
+errors), and should reinitiate the handshake.
+
+Once the file handle reaches KVM, the introspection tool should use
+the *KVMI_GET_VERSION* command to get the API version, the commands and
+the events (see *KVMI_CONTROL_EVENTS*) which are allowed for this
+guest. The error code -KVM_EPERM will be returned if the introspection tool
+uses a command or enables an event which is not allowed.
+
+Live migrations
+---------------
+
+During a VMI session it is possible for the guest to be patched and for
+some of these patches to "talk" with the introspection tool. It thus becomes
+necessary to remove them before a live migration takes place.
+
+A live migration is normally performed by the device manager and, as such, it
+is the best source for migration notifications. In the case of QEMU, an
+introspection tool can use the same facility as the QEMU Guest Agent to be
+notified when a migration is about to begin. QEMU will need to wait for a
+limited amount of time (a few seconds) for a confirmation that it is OK to
+proceed. It does this only if a KVMI channel is active.
+
+The QEMU instance on the receiving end, if configured for KVMI, will need to
+establish a connection to the introspection tool after the migration has
+completed.
+
+Obviously, this creates a window in which the guest is not introspected. The
+user will need to be aware of this detail. Future introspection
+technologies can choose not to disconnect and instead transfer the necessary
+context to the introspection tool at the migration destination via a separate
+channel.
+
+Guest snapshots with memory
+---------------------------
+
+Just as for live migrations, before taking a snapshot with memory, the
+introspector might need to disconnect and reconnect after the snapshot
+operation has completed. This is because such snapshots can be restored long
+after the introspection tool was stopped or on a host that does not have KVMI
+enabled. Thus, if during the KVMI session the guest memory was patched, these
+changes will likely need to be undone.
+
+The same communication channel as QEMU Guest Agent can be used for the
+purpose of notifying a guest application when a memory snapshot is about to
+be created and also when the operation has completed.
+
+Memory access safety
+--------------------
+
+The KVMI API gives access to the entire guest physical address space but
+provides no information on which parts of it are system RAM and which are
+device-specific memory (DMA, emulated MMIO, reserved by a passthrough
+device etc.). It is up to the user to determine, using the guest operating
+system data structures, the areas that are safe to access (code, stack, heap
+etc.).
+
+Commands
+--------
+
+The following C structures are meant to be used directly when communicating
+over the wire. The peer that detects any size mismatch should simply close
+the connection and report the error.
+
+0. KVMI_GET_VERSION
+-------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: none
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_version_reply {
+		__u32 version;
+		__u32 commands;
+		__u32 events;
+		__u32 padding;
+	};
+
+Returns the introspection API version, the bit mask with allowed commands
+and the bit mask with allowed events (see *KVMI_CONTROL_EVENTS*).
+
+These two masks represent all the features allowed by the management tool
+(see **Handshake**) or supported by the host, with some exceptions: this command
+and the *KVMI_EVENT_PAUSE_VCPU* event.
+
+The host kernel and the userland can use the macros below to check if
+a command/event is allowed for a guest::
+
+	KVMI_ALLOWED_COMMAND(cmd_id, cmd_mask)
+	KVMI_ALLOWED_EVENT(event_id, event_mask)
+
+This command is always successful.
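+
+For example (only a sketch, with ``rpl`` holding the ``kvmi_get_version_reply``
+received above), the introspection tool can check a command or an event before
+using it::
+
+	if (!KVMI_ALLOWED_COMMAND(KVMI_GET_REGISTERS, rpl.commands))
+		return -KVM_EPERM;	/* command not allowed for this guest */
+
+	if (!KVMI_ALLOWED_EVENT(KVMI_EVENT_BREAKPOINT, rpl.events))
+		return -KVM_EPERM;	/* event not allowed for this guest */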
+
+1. KVMI_GET_GUEST_INFO
+----------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_get_guest_info {
+		__u16 vcpu;
+		__u16 padding[3];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_guest_info_reply {
+		__u16 vcpu_count;
+		__u16 padding[3];
+		__u64 tsc_speed;
+	};
+
+Returns the number of online vCPUs and the TSC frequency (in HZ)
+if available.
+
+The parameter ``vcpu`` must be zero. It is required for consistency with
+all other commands and in the future it might be used to return true
+vCPU-specific information.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+
+2. KVMI_PAUSE_VCPU
+------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_pause_vcpu {
+		__u16 vcpu;
+		__u16 padding[3]; /* multiple of 8 bytes */
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Requests a pause for the specified vCPU. The vCPU thread will issue a
+*KVMI_EVENT_PAUSE_VCPU* event to let the introspection tool know it has
+entered the 'paused' state.
+
+If the command is issued while the vCPU was about to send an event, the
+*KVMI_EVENT_PAUSE_VCPU* event will be delayed until after the vCPU has
+received a response for its pending guest event.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_EBUSY - the vCPU thread has a pending pause request
+
+3. KVMI_GET_REGISTERS
+---------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_get_registers {
+		__u16 vcpu;
+		__u16 nmsrs;
+		__u16 padding[2];
+		__u32 msrs_idx[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_registers_reply {
+		__u32 mode;
+		__u32 padding;
+		struct kvm_regs regs;
+		struct kvm_sregs sregs;
+		struct kvm_msrs msrs;
+	};
+
+For the given vCPU and the ``nmsrs``-sized array of MSR registers,
+returns the current vCPU mode (in bytes: 2, 4 or 8), the general purpose
+registers, the special registers and the requested set of MSRs.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - one of the indicated MSR-s is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_ENOMEM - not enough memory to allocate the reply
+
+4. KVMI_SET_REGISTERS
+---------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_set_registers {
+		__u16 vcpu;
+		__u16 padding[3];
+		struct kvm_regs regs;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Sets the general purpose registers for the given vCPU. The changes become
+visible to other threads accessing the KVM vCPU structure after the event
+currently being handled is replied to.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+
+5. KVMI_GET_CPUID
+-----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_get_cpuid {
+		__u16 vcpu;
+		__u16 padding[3];
+		__u32 function;
+		__u32 index;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_cpuid_reply {
+		__u32 eax;
+		__u32 ebx;
+		__u32 ecx;
+		__u32 edx;
+	};
+
+Returns a CPUID leaf (as seen by the guest OS).
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_ENOENT - the selected leaf is not present or is invalid
+
+6. KVMI_GET_PAGE_ACCESS
+-----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_get_page_access {
+		__u16 vcpu;
+		__u16 count;
+		__u16 view;
+		__u16 padding;
+		__u64 gpa[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_page_access_reply {
+		__u8 access[0];
+	};
+
+Returns the spte access bits (rwx) for the specified vCPU and for an array of
+``count`` guest physical addresses.
+
+The valid access bits for *KVMI_GET_PAGE_ACCESS* and *KVMI_SET_PAGE_ACCESS*
+are::
+
+	KVMI_PAGE_ACCESS_R
+	KVMI_PAGE_ACCESS_W
+	KVMI_PAGE_ACCESS_X
+
+On Intel hardware with multiple EPT views, the ``view`` argument selects the
+EPT view (0 is primary). On all other hardware it must be zero.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the selected SPT view is invalid
+* -KVM_EINVAL - one of the specified gpa-s is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_ENOSYS - an SPT view was selected but the hardware has no support for
+  it
+* -KVM_ENOMEM - not enough memory to allocate the reply
+
+7. KVMI_SET_PAGE_ACCESS
+-----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_page_access_entry {
+		__u64 gpa;
+		__u8 access;
+		__u8 padding[7];
+	};
+
+	struct kvmi_set_page_access {
+		__u16 vcpu;
+		__u16 count;
+		__u16 view;
+		__u16 padding;
+		struct kvmi_page_access_entry entries[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Sets the spte access bits (rwx) for an array of ``count`` guest physical
+addresses.
+
+The command will fail with -KVM_EINVAL if any of the specified combination
+of access bits is not supported.
+
+The command will make the changes in order and it will not stop on errors. The
+introspection tool should handle the rollback.
+
+In order to 'forget' an address, all the access bits ('rwx') must be set.
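+
+For illustration only, a request restricting one page to read/execute (the
+kernel image "locking" example from the introduction) could be built like
+this, with ``text_gpa`` being a hypothetical guest physical address::
+
+	size_t len = sizeof(struct kvmi_set_page_access)
+			+ 1 * sizeof(struct kvmi_page_access_entry);
+	/* calloc() also zeroes the padding, as required */
+	struct kvmi_set_page_access *req = calloc(1, len);
+
+	req->count = 1;
+	req->entries[0].gpa = text_gpa;
+	req->entries[0].access = KVMI_PAGE_ACCESS_R | KVMI_PAGE_ACCESS_X;
+	/* req (len bytes) is the payload of a KVMI_SET_PAGE_ACCESS message */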
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified access bits combination is invalid
+* -KVM_EINVAL - one of the specified gpa-s is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_ENOSYS - an SPT view was selected but the hardware has no support for
+   it
+* -KVM_ENOMEM - not enough memory to add the page tracking structures
+
+8. KVMI_INJECT_EXCEPTION
+------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_inject_exception {
+		__u16 vcpu;
+		__u8 nr;
+		__u8 has_error;
+		__u16 error_code;
+		__u16 padding;
+		__u64 address;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Injects a vCPU exception, with or without an error code. In the case of a
+page fault exception, the guest virtual address has to be specified.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified exception number is invalid
+* -KVM_EINVAL - the specified address is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+
+9. KVMI_READ_PHYSICAL
+---------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_read_physical {
+		__u64 gpa;
+		__u64 size;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	__u8 data[0];
+
+Reads from the guest memory.
+
+Currently, the size must be non-zero and the read must be restricted to
+one page (offset + size <= PAGE_SIZE).
+
+:Errors:
+
+* -KVM_EINVAL - the specified gpa is invalid
+
+10. KVMI_WRITE_PHYSICAL
+-----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_write_physical {
+		__u64 gpa;
+		__u64 size;
+		__u8  data[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Writes into the guest memory.
+
+Currently, the size must be non-zero and the write must be restricted to
+one page (offset + size <= PAGE_SIZE).
+
+:Errors:
+
+* -KVM_EINVAL - the specified gpa is invalid
+
+11. KVMI_CONTROL_EVENTS
+-----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_control_events {
+		__u16 vcpu;
+		__u16 padding;
+		__u32 events;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Enables/disables vCPU introspection events, by setting or clearing one or
+more of the following bits::
+
+	KVMI_EVENT_CR
+	KVMI_EVENT_MSR
+	KVMI_EVENT_XSETBV
+	KVMI_EVENT_BREAKPOINT
+	KVMI_EVENT_HYPERCALL
+	KVMI_EVENT_PAGE_FAULT
+	KVMI_EVENT_TRAP
+	KVMI_EVENT_SINGLESTEP
+	KVMI_EVENT_DESCRIPTOR
+
+For example:
+
+	``events = KVMI_EVENT_BREAKPOINT | KVMI_EVENT_PAGE_FAULT``
+
+will disable all events but breakpoints and page faults.
+
+When an event is enabled, the introspection tool is notified and it
+must return a reply: allow, skip, etc. (see 'Events' below).
+
+The *KVMI_EVENT_PAUSE_VCPU* event is always allowed.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified mask of events is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_EPERM - access to one or more events specified in the events mask is
+  restricted by the host
+
+12. KVMI_CONTROL_CR
+-------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_control_cr {
+		__u16 vcpu;
+		__u8 enable;
+		__u8 padding;
+		__u32 cr;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Enables/disables introspection for a specific control register and must
+be used in addition to *KVMI_CONTROL_EVENTS* with the *KVMI_EVENT_CR* bit
+set.
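+
+For example (a sketch with illustrative values), watching CR3 loads on vCPU 0
+takes two commands, one to allow CR events and one to select the register::
+
+	struct kvmi_control_events ev = {
+		.vcpu = 0,
+		.events = KVMI_EVENT_CR,	/* or-ed with other events in use */
+	};
+	struct kvmi_control_cr cr = {
+		.vcpu = 0,
+		.enable = 1,
+		.cr = 3,
+	};
+
+	/* each structure is sent as the payload of its own command message */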
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified control register is not part of the CR0, CR3
+   or CR4 set
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+
+13. KVMI_CONTROL_MSR
+--------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_control_msr {
+		__u16 vcpu;
+		__u8 enable;
+		__u8 padding;
+		__u32 msr;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+Enables/disables introspection for a specific MSR and must be used
+in addition to *KVMI_CONTROL_EVENTS* with the *KVMI_EVENT_MSR* bit set.
+
+Currently, only MSRs within the following two ranges are supported. Trying
+to control events for any other register will fail with -KVM_EINVAL::
+
+	0          ... 0x00001fff
+	0xc0000000 ... 0xc0001fff
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - the specified MSR is invalid
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+
+14. KVMI_CONTROL_VE
+-------------------
+
+:Architecture: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_control_ve {
+		__u16 vcpu;
+		__u16 count;
+		__u8 enable;
+		__u8 padding[3];
+		__u64 gpa[0];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code
+
+On hardware supporting virtualized exceptions, this command can control
+the #VE bit for the listed guest physical addresses. If #VE is not
+supported the command returns -KVM_ENOSYS.
+
+Check the bitmask obtained with *KVMI_GET_VERSION* to see in advance if the
+command is supported.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EINVAL - one of the specified gpa-s is invalid
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_ENOSYS - the hardware does not support #VE
+
+.. note::
+
+  Virtualized exceptions are designed such that they can be controlled by
+  the guest itself and used to (among other things) accelerate network
+  operations. Since this will obviously interfere with VMI, the guest
+  is denied access to VE while the introspection channel is active.
+
+15. KVMI_GET_MAP_TOKEN
+----------------------
+
+:Architecture: all
+:Versions: >= 1
+:Parameters: none
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_map_token_reply {
+		struct kvmi_map_mem_token token;
+	};
+
+Where::
+
+	struct kvmi_map_mem_token {
+		__u64 token[4];
+	};
+
+Requests a token for a memory map operation.
+
+On this command, the host generates a random token to be used (once)
+to map a physical page from the introspected guest. The introspector
+could use the token with the KVM_INTRO_MEM_MAP ioctl (on /dev/kvmmem)
+to map a guest physical page to one of its memory pages. The ioctl,
+in turn, will use the KVM_HC_MEM_MAP hypercall (see hypercalls.txt).
+
+The guest kernel exposing /dev/kvmmem keeps a list with all the mappings
+(to all the guests introspected by the tool) in order to unmap them
+(using the KVM_HC_MEM_UNMAP hypercall) when /dev/kvmmem is closed or on
+demand (using the KVM_INTRO_MEM_UNMAP ioctl).
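+
+A rough sketch of the introspector side, where ``token`` was received in a
+*KVMI_GET_MAP_TOKEN* reply, and ``remote_gpa`` and the page-aligned, already
+faulted-in local buffer ``buf`` are illustrative::
+
+	struct kvmi_mem_map map = {
+		.token = token,
+		.gpa = remote_gpa,
+		.gva = (__u64)(unsigned long)buf,
+	};
+	int fd = open("/dev/kvmmem", O_RDWR);
+
+	if (fd >= 0 && ioctl(fd, KVM_INTRO_MEM_MAP, &map) == 0) {
+		/* buf now shows the introspected page; inspect it, then undo */
+		ioctl(fd, KVM_INTRO_MEM_UNMAP, (unsigned long)buf);
+	}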
+
+:Errors:
+
+* -KVM_ENOMEM - not enough memory to allocate the token
+
+16. KVMI_GET_XSAVE
+------------------
+
+:Architecture: x86
+:Versions: >= 1
+:Parameters:
+
+::
+
+	struct kvmi_get_xsave {
+		__u16 vcpu;
+		__u16 padding[3];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_error_code;
+	struct kvmi_get_xsave_reply {
+		__u32 region[0];
+	};
+
+Returns a buffer containing the XSAVE area. Currently, the size of
+``kvm_xsave`` is used, but it could change. Userspace should get
+the buffer size from the message size.
+
+:Errors:
+
+* -KVM_EINVAL - the selected vCPU is invalid
+* -KVM_EAGAIN - the selected vCPU can't be introspected yet
+* -KVM_EBUSY - the selected vCPU has another queued command
+* -KVM_ENOMEM - not enough memory to allocate the reply
+
+Events
+======
+
+All vCPU events are sent using the *KVMI_EVENT* message id. No event
+will be sent (except for *KVMI_EVENT_PAUSE_VCPU*) unless enabled
+with a *KVMI_CONTROL_EVENTS* command.
+
+The message data begins with a common structure, having the vCPU id,
+its mode (in bytes: 2, 4 or 8) and the event::
+
+	struct kvmi_event {
+		__u32 event;
+		__u16 vcpu;
+		__u8 mode;
+		__u8 padding;
+		/* arch specific data */
+	}
+
+On x86 the structure looks like this::
+
+	struct kvmi_event {
+		__u32 event;
+		__u16 vcpu;
+		__u8 mode;
+		__u8 padding;
+		struct kvm_regs regs;
+		struct kvm_sregs sregs;
+		struct {
+			__u64 sysenter_cs;
+			__u64 sysenter_esp;
+			__u64 sysenter_eip;
+			__u64 efer;
+			__u64 star;
+			__u64 lstar;
+			__u64 cstar;
+			__u64 pat;
+		} msrs;
+	};
+
+It contains information about the vCPU state at the time of the event.
+
+The replies to events have the *KVMI_EVENT_REPLY* message id and begin
+with a common structure::
+
+	struct kvmi_event_reply {
+		__u32 action;
+		__u32 padding;
+	};
+
+
+All events accept the KVMI_EVENT_ACTION_CRASH action, which stops the
+guest ungracefully but as soon as possible.
+
+Most of the events accept the KVMI_EVENT_ACTION_CONTINUE action, which
+lets the instruction that caused the event to continue (unless specified
+otherwise).
+
+Some of the events accept the KVMI_EVENT_ACTION_RETRY action, to continue
+by re-entering the guest.
+
+Specific data can follow these common structures.
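+
+As a sketch only (``event_hdr`` being the header of the just received event
+message), a reply that lets the guest continue would look like::
+
+	struct kvmi_msg_hdr hdr = {
+		.id = KVMI_EVENT_REPLY,
+		.seq = event_hdr.seq,	/* must match the event's sequence number */
+		.size = sizeof(struct kvmi_event_reply),
+	};
+	struct kvmi_event_reply rpl = {
+		.action = KVMI_EVENT_ACTION_CONTINUE,
+	};
+
+	/* hdr and rpl go out in one sendmsg(); event specific reply data,
+	 * when present, is appended and counted in hdr.size */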
+
+0. KVMI_EVENT_PAUSE_VCPU
+------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Actions: CRASH, RETRY
+:Parameters:
+
+::
+
+	struct kvmi_event
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply
+
+This event is sent in response to a *KVMI_PAUSE_VCPU* command, unless it
+is canceled by another *KVMI_PAUSE_VCPU* command (with ``cancel`` set to 1).
+
+This event cannot be disabled via *KVMI_CONTROL_EVENTS*.
+
+1. KVMI_EVENT_CR
+----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_cr {
+		__u16 cr;
+		__u16 padding[3];
+		__u64 old_value;
+		__u64 new_value;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply;
+	struct kvmi_event_cr_reply {
+		__u64 new_val;
+	};
+
+This event is sent when a control register is going to be changed and the
+introspection has been enabled for this event and for this specific
+register (see *KVMI_CONTROL_EVENTS* and *KVMI_CONTROL_CR*).
+
+``kvmi_event``, the control register number, the old value and the new value
+are sent to the introspector. The *CONTINUE* action will set the register to
+the ``new_val`` returned in the reply.
+
+2. KVMI_EVENT_MSR
+-----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_msr {
+		__u32 msr;
+		__u32 padding;
+		__u64 old_value;
+		__u64 new_value;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply;
+	struct kvmi_event_msr_reply {
+		__u64 new_val;
+	};
+
+This event is sent when a model specific register is going to be changed
+and the introspection has been enabled for this event and for this specific
+register (see *KVMI_CONTROL_EVENTS* and *KVMI_CONTROL_MSR*).
+
+``kvmi_event``, the MSR number, the old value and the new value are
+sent to the introspector. The *CONTINUE* action will set the MSR to the
+``new_val`` returned in the reply.
+
+3. KVMI_EVENT_XSETBV
+--------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply;
+
+This event is sent when the extended control register XCR0 was
+modified and the introspection has been enabled for this event
+(see *KVMI_CONTROL_EVENTS*).
+
+``kvmi_event`` is sent to the introspector.
+
+4. KVMI_EVENT_BREAKPOINT
+------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH, RETRY
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_breakpoint {
+		__u64 gpa;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply;
+
+This event is sent when a breakpoint was reached and the introspection has
+been enabled for this event (see *KVMI_CONTROL_EVENTS*).
+
+Some of these breakpoints could have been injected by the introspector,
+placed in the slack space of various functions and used as notification
+for when the OS or an application has reached a certain state or is
+trying to perform a certain operation (like creating a process).
+
+``kvmi_event`` and the guest physical address are sent to the introspector.
+
+The *RETRY* action is used by the introspector for its own breakpoints.
+
+5. KVMI_EVENT_HYPERCALL
+-----------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply
+
+This event is sent on a specific user hypercall when the introspection has
+been enabled for this event (see *KVMI_CONTROL_EVENTS*).
+
+The hypercall number must be ``KVM_HC_XEN_HVM_OP`` with the
+``KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT`` sub-function
+(see hypercalls.txt).
+
+It is used by the code residing inside the introspected guest to call the
+introspection tool and to report certain details about its operation. For
+example, a classic antimalware remediation tool can report what it has
+found during a scan.
+
+6. KVMI_EVENT_PAGE_FAULT
+------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH, RETRY
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_page_fault {
+		__u64 gva;
+		__u64 gpa;
+		__u32 mode;
+		__u32 padding;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply;
+	struct kvmi_event_page_fault_reply {
+		__u8 trap_access;
+		__u8 padding[3];
+		__u32 ctx_size;
+		__u8 ctx_data[256];
+	};
+
+This event is sent when a hypervisor page fault occurs due to a failed
+permission check in the shadow page tables, the introspection has
+been enabled for this event (see *KVMI_CONTROL_EVENTS*) and the event was
+generated for a page in which the introspector has shown interest
+(ie. has previously touched it by adjusting the spte permissions).
+
+The shadow page tables can be used by the introspection tool to guarantee
+the purpose of memory areas inside the guest (code, rodata, stack, heap
+etc.). Each attempt at an operation unfitting for a certain memory
+range (eg. execute code in heap) triggers a page fault and gives the
+introspection tool the chance to audit the code attempting the operation.
+
+``kvmi_event``, guest virtual address, guest physical address and the
+exit qualification (mode) are sent to the introspector.
+
+The *CONTINUE* action will continue the page fault handling via emulation
+(with custom input if ``ctx_size`` > 0). The use of custom input is
+to trick the guest software into believing it has read certain data,
+in order to hide the content of certain memory areas (eg. hide injected
+code from integrity checkers). If ``trap_access`` is not zero, the REP
+prefixed instruction should be emulated just once.
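+
+As a sketch only, with ``fake_bytes`` standing for the data the tool wants the
+guest to read instead of the real memory content::
+
+	struct kvmi_event_page_fault_reply pf = {
+		.ctx_size = sizeof(fake_bytes),	/* must not exceed 256 bytes */
+	};
+
+	memcpy(pf.ctx_data, fake_bytes, sizeof(fake_bytes));
+	/* pf follows kvmi_event_reply (action = CONTINUE) in the reply message */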
+
+7. KVMI_EVENT_TRAP
+------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_trap {
+		__u32 vector;
+		__u32 type;
+		__u32 error_code;
+		__u32 padding;
+		__u64 cr2;
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply;
+
+This event is sent if a trap will be delivered to the guest (page fault,
+breakpoint, etc.) and the introspection has been enabled for this event
+(see *KVMI_CONTROL_EVENTS*).
+
+It is used to inform the introspector of all pending traps, giving
+it a chance to determine if it should try again later in case a
+previous *KVMI_INJECT_EXCEPTION* command or a breakpoint/retry (see
+*KVMI_EVENT_BREAKPOINT*) has been overwritten by an interrupt picked up
+during guest reentry.
+
+``kvmi_event``, exception/interrupt number (vector), exception/interrupt
+type, exception code (``error_code``) and CR2 are sent to the introspector.
+
+8. KVMI_EVENT_CREATE_VCPU
+-------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply
+
+This event is sent when a new vCPU is created and the introspection has
+been enabled for this event (see *KVMI_CONTROL_EVENTS*).
+
+9. KVMI_EVENT_SINGLESTEP
+------------------------
+
+:Architecture: all
+:Versions: >= 1
+:Actions: CONTINUE, CRASH, RETRY
+:Parameters:
+
+::
+
+	struct kvmi_event
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply
+
+This event is generated as a result of enabling guest single stepping (see
+*KVMI_CONTROL_EVENTS*).
+
+The *CONTINUE* action disables the single-stepping.
+
+10. KVMI_EVENT_DESCRIPTOR
+-------------------------
+
+:Architecture: x86
+:Versions: >= 1
+:Actions: CONTINUE, CRASH
+:Parameters:
+
+::
+
+	struct kvmi_event;
+	struct kvmi_event_descriptor {
+		union {
+			struct {
+				__u32 instr_info;
+				__u32 padding;
+				__u64 exit_qualification;
+			} vmx;
+			struct {
+				__u64 exit_info;
+				__u64 padding;
+			} svm;
+		} arch;
+		__u8 descriptor;
+		__u8 write;
+		__u8 padding[6];
+	};
+
+:Returns:
+
+::
+
+	struct kvmi_event_reply
+
+This event is generated as a result of enabling descriptor access events
+(see *KVMI_CONTROL_EVENTS*).
+
+``kvmi_event_descriptor`` contains the relevant event information.
+
+``kvmi_event_descriptor.descriptor`` can be one of::
+
+	KVMI_DESC_IDTR
+	KVMI_DESC_GDTR
+	KVMI_DESC_LDTR
+	KVMI_DESC_TR
+
+``kvmi_event_descriptor.write`` is 1 if the descriptor was written, 0
+otherwise.
diff --git a/arch/x86/include/uapi/asm/kvmi.h b/arch/x86/include/uapi/asm/kvmi.h
new file mode 100644
index 000000000000..d7ae53c1f22f
--- /dev/null
+++ b/arch/x86/include/uapi/asm/kvmi.h
@@ -0,0 +1,213 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _ASM_X86_KVMI_H
+#define _ASM_X86_KVMI_H
+
+/*
+ * KVMI x86 specific structures and definitions
+ *
+ */
+
+#include <asm/kvm.h>
+#include <linux/types.h>
+
+#define KVMI_EVENT_CR          (1 << 1)	/* control register was modified */
+#define KVMI_EVENT_MSR         (1 << 2)	/* model specific reg. was modified */
+#define KVMI_EVENT_XSETBV      (1 << 3)	/* ext. control register was modified */
+#define KVMI_EVENT_BREAKPOINT  (1 << 4)	/* breakpoint was reached */
+#define KVMI_EVENT_HYPERCALL   (1 << 5)	/* user hypercall */
+#define KVMI_EVENT_PAGE_FAULT  (1 << 6)	/* hyp. page fault was encountered */
+#define KVMI_EVENT_TRAP        (1 << 7)	/* trap was injected */
+#define KVMI_EVENT_DESCRIPTOR  (1 << 8)	/* descriptor table access */
+#define KVMI_EVENT_CREATE_VCPU (1 << 9)
+#define KVMI_EVENT_PAUSE_VCPU  (1 << 10)
+
+/* TODO: find a way to split the events between common and arch dependent */
+
+#define KVMI_EVENT_ACTION_CONTINUE (1 << 0)
+#define KVMI_EVENT_ACTION_RETRY    (1 << 1)
+#define KVMI_EVENT_ACTION_CRASH    (1 << 2)
+
+#define KVMI_KNOWN_EVENTS (KVMI_EVENT_CR | \
+			   KVMI_EVENT_MSR | \
+			   KVMI_EVENT_XSETBV | \
+			   KVMI_EVENT_BREAKPOINT | \
+			   KVMI_EVENT_HYPERCALL | \
+			   KVMI_EVENT_PAGE_FAULT | \
+			   KVMI_EVENT_TRAP | \
+			   KVMI_EVENT_CREATE_VCPU | \
+			   KVMI_EVENT_PAUSE_VCPU | \
+			   KVMI_EVENT_DESCRIPTOR)
+
+#define KVMI_ALLOWED_EVENT(event_id, event_mask)                       \
+		((!(event_id)) || (                                    \
+			(event_id)                                     \
+				& ((event_mask) & KVMI_KNOWN_EVENTS)))
+
+#define KVMI_PAGE_ACCESS_R (1 << 0)
+#define KVMI_PAGE_ACCESS_W (1 << 1)
+#define KVMI_PAGE_ACCESS_X (1 << 2)
+
+struct kvmi_event_cr {
+	__u16 cr;
+	__u16 padding[3];
+	__u64 old_value;
+	__u64 new_value;
+};
+
+struct kvmi_event_msr {
+	__u32 msr;
+	__u32 padding;
+	__u64 old_value;
+	__u64 new_value;
+};
+
+struct kvmi_event_breakpoint {
+	__u64 gpa;
+};
+
+struct kvmi_event_page_fault {
+	__u64 gva;
+	__u64 gpa;
+	__u32 mode;
+	__u32 padding;
+};
+
+struct kvmi_event_trap {
+	__u32 vector;
+	__u32 type;
+	__u32 error_code;
+	__u32 padding;
+	__u64 cr2;
+};
+
+#define KVMI_DESC_IDTR	1
+#define KVMI_DESC_GDTR	2
+#define KVMI_DESC_LDTR	3
+#define KVMI_DESC_TR	4
+
+struct kvmi_event_descriptor {
+	union {
+		struct {
+			__u32 instr_info;
+			__u32 padding;
+			__u64 exit_qualification;
+		} vmx;
+		struct {
+			__u64 exit_info;
+			__u64 padding;
+		} svm;
+	} arch;
+	__u8 descriptor;
+	__u8 write;
+	__u8 padding[6];
+};
+
+struct kvmi_event {
+	__u32 event;
+	__u16 vcpu;
+	__u8 mode;		/* 2, 4 or 8 */
+	__u8 padding;
+	struct kvm_regs regs;
+	struct kvm_sregs sregs;
+	struct {
+		__u64 sysenter_cs;
+		__u64 sysenter_esp;
+		__u64 sysenter_eip;
+		__u64 efer;
+		__u64 star;
+		__u64 lstar;
+		__u64 cstar;
+		__u64 pat;
+	} msrs;
+};
+
+struct kvmi_event_cr_reply {
+	__u64 new_val;
+};
+
+struct kvmi_event_msr_reply {
+	__u64 new_val;
+};
+
+struct kvmi_event_page_fault_reply {
+	__u8 trap_access;
+	__u8 padding[3];
+	__u32 ctx_size;
+	__u8 ctx_data[256];
+};
+
+struct kvmi_control_cr {
+	__u16 vcpu;
+	__u8 enable;
+	__u8 padding;
+	__u32 cr;
+};
+
+struct kvmi_control_msr {
+	__u16 vcpu;
+	__u8 enable;
+	__u8 padding;
+	__u32 msr;
+};
+
+struct kvmi_guest_info {
+	__u16 vcpu_count;
+	__u16 padding1;
+	__u32 padding2;
+	__u64 tsc_speed;
+};
+
+struct kvmi_inject_exception {
+	__u16 vcpu;
+	__u8 nr;
+	__u8 has_error;
+	__u16 error_code;
+	__u16 padding;
+	__u64 address;
+};
+
+struct kvmi_get_registers {
+	__u16 vcpu;
+	__u16 nmsrs;
+	__u16 padding[2];
+	__u32 msrs_idx[0];
+};
+
+struct kvmi_get_registers_reply {
+	__u32 mode;
+	__u32 padding;
+	struct kvm_regs regs;
+	struct kvm_sregs sregs;
+	struct kvm_msrs msrs;
+};
+
+struct kvmi_set_registers {
+	__u16 vcpu;
+	__u16 padding[3];
+	struct kvm_regs regs;
+};
+
+struct kvmi_get_cpuid {
+	__u16 vcpu;
+	__u16 padding[3];
+	__u32 function;
+	__u32 index;
+};
+
+struct kvmi_get_cpuid_reply {
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+};
+
+struct kvmi_get_xsave {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+struct kvmi_get_xsave_reply {
+	__u32 region[0];
+};
+
+#endif /* _ASM_X86_KVMI_H */
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index dcf629dd2889..34fd3d3108c6 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -10,12 +10,17 @@
  * - kvm_para_available
  */
 
-/* Return values for hypercalls */
+/* Return values for hypercalls and VM introspection */
 #define KVM_ENOSYS		1000
 #define KVM_EFAULT		EFAULT
 #define KVM_E2BIG		E2BIG
 #define KVM_EPERM		EPERM
 #define KVM_EOPNOTSUPP		95
+#define KVM_EAGAIN		11
+#define KVM_EBUSY		EBUSY
+#define KVM_EINVAL		EINVAL
+#define KVM_ENOENT		ENOENT
+#define KVM_ENOMEM		ENOMEM
 
 #define KVM_HC_VAPIC_POLL_IRQ		1
 #define KVM_HC_MMU_OP			2
@@ -26,6 +31,9 @@
 #define KVM_HC_MIPS_EXIT_VM		7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
 #define KVM_HC_CLOCK_PAIRING		9
+#define KVM_HC_MEM_MAP			32
+#define KVM_HC_MEM_UNMAP		33
+#define KVM_HC_XEN_HVM_OP		34 /* Xen's __HYPERVISOR_hvm_op */
 
 /*
  * hypercalls use architecture specific
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
new file mode 100644
index 000000000000..b1da800541ac
--- /dev/null
+++ b/include/uapi/linux/kvmi.h
@@ -0,0 +1,150 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __KVMI_H_INCLUDED__
+#define __KVMI_H_INCLUDED__
+
+/*
+ * KVMI specific structures and definitions
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <asm/kvmi.h>
+
+#define KVMI_VERSION 0x00000001
+
+#define KVMI_GET_VERSION                  1
+#define KVMI_PAUSE_VCPU                   2
+#define KVMI_GET_GUEST_INFO               3
+#define KVMI_GET_REGISTERS                6
+#define KVMI_SET_REGISTERS                7
+#define KVMI_GET_PAGE_ACCESS              10
+#define KVMI_SET_PAGE_ACCESS              11
+#define KVMI_INJECT_EXCEPTION             12
+#define KVMI_READ_PHYSICAL                13
+#define KVMI_WRITE_PHYSICAL               14
+#define KVMI_GET_MAP_TOKEN                15
+#define KVMI_CONTROL_EVENTS               17
+#define KVMI_CONTROL_CR                   18
+#define KVMI_CONTROL_MSR                  19
+#define KVMI_EVENT                        23
+#define KVMI_EVENT_REPLY                  24
+#define KVMI_GET_CPUID                    25
+#define KVMI_GET_XSAVE                    26
+
+/* TODO: find a way to split the commands between common and arch dependent */
+
+#define KVMI_KNOWN_COMMANDS (-1) /* TODO: fix me */
+
+#define KVMI_ALLOWED_COMMAND(cmd_id, cmd_mask)                         \
+		((!(cmd_id)) || (                                      \
+			(1 << ((cmd_id)-1))                            \
+				& ((cmd_mask) & KVMI_KNOWN_COMMANDS)))
+struct kvmi_msg_hdr {
+	__u16 id;
+	__u16 size;
+	__u32 seq;
+};
+
+#define KVMI_MAX_MSG_SIZE (sizeof(struct kvmi_msg_hdr) \
+			+ (1 << FIELD_SIZEOF(struct kvmi_msg_hdr, size)*8) \
+			- 1)
+
+struct kvmi_error_code {
+	__s32 err;
+	__u32 padding;
+};
+
+struct kvmi_get_version_reply {
+	__u32 version;
+	__u32 commands;
+	__u32 events;
+	__u32 padding;
+};
+
+struct kvmi_get_guest_info {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+struct kvmi_get_guest_info_reply {
+	__u16 vcpu_count;
+	__u16 padding[3];
+	__u64 tsc_speed;
+};
+
+struct kvmi_pause_vcpu {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+struct kvmi_event_reply {
+	__u32 action;
+	__u32 padding;
+};
+
+struct kvmi_control_events {
+	__u16 vcpu;
+	__u16 padding;
+	__u32 events;
+};
+
+struct kvmi_get_page_access {
+	__u16 vcpu;
+	__u16 count;
+	__u16 view;
+	__u16 padding;
+	__u64 gpa[0];
+};
+
+struct kvmi_get_page_access_reply {
+	__u8 access[0];
+};
+
+struct kvmi_page_access_entry {
+	__u64 gpa;
+	__u8 access;
+	__u8 padding[7];
+};
+
+struct kvmi_set_page_access {
+	__u16 vcpu;
+	__u16 count;
+	__u16 view;
+	__u16 padding;
+	struct kvmi_page_access_entry entries[0];
+};
+
+struct kvmi_read_physical {
+	__u64 gpa;
+	__u64 size;
+};
+
+struct kvmi_write_physical {
+	__u64 gpa;
+	__u64 size;
+	__u8  data[0];
+};
+
+struct kvmi_map_mem_token {
+	__u64 token[4];
+};
+
+struct kvmi_get_map_token_reply {
+	struct kvmi_map_mem_token token;
+};
+
+/* Map other guest's gpa to local gva */
+struct kvmi_mem_map {
+	struct kvmi_map_mem_token token;
+	__u64 gpa;
+	__u64 gva;
+};
+
+/*
+ * ioctls for /dev/kvmmem
+ */
+#define KVM_INTRO_MEM_MAP	_IOW('i', 0x01, struct kvmi_mem_map)
+#define KVM_INTRO_MEM_UNMAP	_IOW('i', 0x02, unsigned long)
+
+#endif /* __KVMI_H_INCLUDED__ */


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 02/18] add memory map/unmap support for VM introspection on the guest side
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar,
	Mircea Cîrjaliu

From: Adalbert Lazar <alazar@bitdefender.com>

An introspection tool running in a dedicated VM can use the new device
(/dev/kvmmem) to map memory from other introspected VM-s.

Two ioctl operations are supported:
  - KVM_INTRO_MEM_MAP/struct kvmi_mem_map
  - KVM_INTRO_MEM_UNMAP/unsigned long

In order to map an introspected gpa to the local gva, the process using
this device needs to obtain a token from the host KVMI subsystem (see
Documentation/virtual/kvm/kvmi.rst - KVMI_GET_MAP_TOKEN).

Both operations use hypercalls (KVM_HC_MEM_MAP, KVM_HC_MEM_UNMAP)
to pass the requests to the host kernel/KVMI (see hypercalls.txt).

Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
---
 arch/x86/Kconfig                  |   9 +
 arch/x86/include/asm/kvmi_guest.h |  10 +
 arch/x86/kernel/Makefile          |   1 +
 arch/x86/kernel/kvmi_mem_guest.c  |  26 +++
 virt/kvm/kvmi_mem_guest.c         | 379 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 425 insertions(+)
 create mode 100644 arch/x86/include/asm/kvmi_guest.h
 create mode 100644 arch/x86/kernel/kvmi_mem_guest.c
 create mode 100644 virt/kvm/kvmi_mem_guest.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8eed3f94bfc7..6e2548f4d44c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -782,6 +782,15 @@ config KVM_DEBUG_FS
 	  Statistics are displayed in debugfs filesystem. Enabling this option
 	  may incur significant overhead.
 
+config KVMI_MEM_GUEST
+	bool "KVM Memory Introspection support on Guest"
+	depends on KVM_GUEST
+	default n
+	---help---
+	  This option enables functions and hypercalls for security applications
+	  running in a separate VM to control the execution of other VM-s, query
+	  the state of the vCPU-s (GPR-s, MSR-s etc.).
+
 config PARAVIRT_TIME_ACCOUNTING
 	bool "Paravirtual steal time accounting"
 	depends on PARAVIRT
diff --git a/arch/x86/include/asm/kvmi_guest.h b/arch/x86/include/asm/kvmi_guest.h
new file mode 100644
index 000000000000..c7ed53a938e0
--- /dev/null
+++ b/arch/x86/include/asm/kvmi_guest.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVMI_GUEST_H__
+#define __KVMI_GUEST_H__
+
+long kvmi_arch_map_hc(struct kvmi_map_mem_token *tknp,
+	gpa_t req_gpa, gpa_t map_gpa);
+long kvmi_arch_unmap_hc(gpa_t map_gpa);
+
+
+#endif
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 81bb565f4497..fdb54b65e46e 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -111,6 +111,7 @@ obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
+obj-$(CONFIG_KVMI_MEM_GUEST)	+= kvmi_mem_guest.o ../../../virt/kvm/kvmi_mem_guest.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/kvmi_mem_guest.c b/arch/x86/kernel/kvmi_mem_guest.c
new file mode 100644
index 000000000000..c4e2613f90f3
--- /dev/null
+++ b/arch/x86/kernel/kvmi_mem_guest.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection guest implementation
+ *
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ * Author:
+ *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
+ */
+
+#include <uapi/linux/kvmi.h>
+#include <uapi/linux/kvm_para.h>
+#include <linux/kvm_types.h>
+#include <asm/kvm_para.h>
+
+long kvmi_arch_map_hc(struct kvmi_map_mem_token *tknp,
+		       gpa_t req_gpa, gpa_t map_gpa)
+{
+	return kvm_hypercall3(KVM_HC_MEM_MAP, (unsigned long)tknp,
+			      req_gpa, map_gpa);
+}
+
+long kvmi_arch_unmap_hc(gpa_t map_gpa)
+{
+	return kvm_hypercall1(KVM_HC_MEM_UNMAP, map_gpa);
+}
diff --git a/virt/kvm/kvmi_mem_guest.c b/virt/kvm/kvmi_mem_guest.c
new file mode 100644
index 000000000000..118c22ca47c5
--- /dev/null
+++ b/virt/kvm/kvmi_mem_guest.c
@@ -0,0 +1,379 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection guest implementation
+ *
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ * Author:
+ *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/types.h>
+#include <linux/kvm_host.h>
+#include <linux/kvm_para.h>
+#include <linux/uaccess.h>
+#include <linux/slab.h>
+#include <linux/rmap.h>
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <uapi/linux/kvmi.h>
+#include <asm/kvmi_guest.h>
+
+#define ASSERT(exp) BUG_ON(!(exp))
+
+
+static struct list_head file_list;
+static spinlock_t file_lock;
+
+struct file_map {
+	struct list_head file_list;
+	struct file *file;
+	struct list_head map_list;
+	struct mutex lock;
+	int active;	/* for tearing down */
+};
+
+struct page_map {
+	struct list_head map_list;
+	__u64 gpa;
+	unsigned long vaddr;
+	unsigned long paddr;
+};
+
+
+static int kvm_dev_open(struct inode *inodep, struct file *filp)
+{
+	struct file_map *fmp;
+
+	pr_debug("kvmi: file %016lx opened by process %s\n",
+		 (unsigned long) filp, current->comm);
+
+	/* link the file 1:1 with such a structure */
+	fmp = kmalloc(sizeof(struct file_map), GFP_KERNEL);
+	if (fmp == NULL)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&fmp->file_list);
+	fmp->file = filp;
+	filp->private_data = fmp;
+	INIT_LIST_HEAD(&fmp->map_list);
+	mutex_init(&fmp->lock);
+	fmp->active = 1;
+
+	/* add the entry to the global list */
+	spin_lock(&file_lock);
+	list_add_tail(&fmp->file_list, &file_list);
+	spin_unlock(&file_lock);
+
+	return 0;
+}
+
+/* actually does the mapping of a page */
+static long _do_mapping(struct kvmi_mem_map *map_req, struct page_map *pmap)
+{
+	unsigned long paddr;
+	struct vm_area_struct *vma = NULL;
+	struct page *page;
+	long result;
+
+	pr_debug("kvmi: mapping remote GPA %016llx into %016llx\n",
+		 map_req->gpa, map_req->gva);
+
+	/* check access to memory location */
+	if (!access_ok(VERIFY_READ, map_req->gva, PAGE_SIZE)) {
+		pr_err("kvmi: invalid virtual address for mapping\n");
+		return -EINVAL;
+	}
+
+	down_read(&current->mm->mmap_sem);
+
+	/* find the page to be replaced */
+	vma = find_vma(current->mm, map_req->gva);
+	if (!vma) {
+		result = -ENOENT;
+		pr_err("kvmi: find_vma() found no VMA for the given address\n");
+		goto out;
+	}
+
+	page = follow_page(vma, map_req->gva, 0);
+	if (IS_ERR_OR_NULL(page)) {
+		result = page ? PTR_ERR(page) : -ENOENT;
+		pr_err("kvmi: follow_page() failed with result %ld\n", result);
+		goto out;
+	}
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(page, "page to map_req into");
+
+	WARN(is_zero_pfn(page_to_pfn(page)), "zero-page still mapped");
+
+	/* get the physical address and store it in page_map */
+	paddr = page_to_phys(page);
+	pr_debug("kvmi: page phys addr %016lx\n", paddr);
+	pmap->paddr = paddr;
+
+	/* last thing to do is host mapping */
+	result = kvmi_arch_map_hc(&map_req->token, map_req->gpa, paddr);
+	if (IS_ERR_VALUE(result)) {
+		pr_err("kvmi: HC failed with result %ld\n", result);
+		goto out;
+	}
+
+out:
+	up_read(&current->mm->mmap_sem);
+
+	return result;
+}
+
+/* actually does the unmapping of a page */
+static long _do_unmapping(unsigned long paddr)
+{
+	long result;
+
+	pr_debug("kvmi: unmapping request for phys addr %016lx\n", paddr);
+
+	/* local GPA uniquely identifies the mapping on the host */
+	result = kvmi_arch_unmap_hc(paddr);
+	if (IS_ERR_VALUE(result))
+		pr_warn("kvmi: HC failed with result %ld\n", result);
+
+	return result;
+}
+
+static long kvm_dev_ioctl_map(struct file_map *fmp, struct kvmi_mem_map *map)
+{
+	struct page_map *pmp;
+	long result = 0;
+
+	if (!access_ok(VERIFY_READ, map->gva, PAGE_SIZE))
+		return -EINVAL;
+	if (!access_ok(VERIFY_WRITE, map->gva, PAGE_SIZE))
+		return -EINVAL;
+
+	/* prepare list entry */
+	pmp = kmalloc(sizeof(struct page_map), GFP_KERNEL);
+	if (pmp == NULL)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&pmp->map_list);
+	pmp->gpa = map->gpa;
+	pmp->vaddr = map->gva;
+
+	/* acquire the file mapping */
+	mutex_lock(&fmp->lock);
+
+	/* check if other thread is closing the file */
+	if (!fmp->active) {
+		result = -ENODEV;
+		pr_warn("kvmi: unable to map, file is being closed\n");
+		goto out_err;
+	}
+
+	/* do the actual mapping */
+	result = _do_mapping(map, pmp);
+	if (IS_ERR_VALUE(result))
+		goto out_err;
+
+	/* link to list */
+	list_add_tail(&pmp->map_list, &fmp->map_list);
+
+	/* all fine */
+	result = 0;
+	goto out_finalize;
+
+out_err:
+	kfree(pmp);
+
+out_finalize:
+	mutex_unlock(&fmp->lock);
+
+	return result;
+}
+
+static long kvm_dev_ioctl_unmap(struct file_map *fmp, unsigned long vaddr)
+{
+	struct list_head *cur;
+	struct page_map *pmp;
+	bool found = false;
+
+	/* acquire the file */
+	mutex_lock(&fmp->lock);
+
+	/* check if other thread is closing the file */
+	if (!fmp->active) {
+		mutex_unlock(&fmp->lock);
+		pr_warn("kvmi: unable to unmap, file is being closed\n");
+		return -ENODEV;
+	}
+
+	/* check that this address belongs to us */
+	list_for_each(cur, &fmp->map_list) {
+		pmp = list_entry(cur, struct page_map, map_list);
+
+		/* found */
+		if (pmp->vaddr == vaddr) {
+			found = true;
+			break;
+		}
+	}
+
+	/* not found ? */
+	if (!found) {
+		mutex_unlock(&fmp->lock);
+		pr_err("kvmi: address %016lx not mapped\n", vaddr);
+		return -ENOENT;
+	}
+
+	/* decouple guest mapping */
+	list_del(&pmp->map_list);
+	mutex_unlock(&fmp->lock);
+
+	/* unmap & ignore result */
+	_do_unmapping(pmp->paddr);
+
+	/* free guest mapping */
+	kfree(pmp);
+
+	return 0;
+}
+
+static long kvm_dev_ioctl(struct file *filp,
+			  unsigned int ioctl, unsigned long arg)
+{
+	void __user *argp = (void __user *) arg;
+	struct file_map *fmp;
+	long result;
+
+	/* minor check */
+	fmp = filp->private_data;
+	ASSERT(fmp->file == filp);
+
+	switch (ioctl) {
+	case KVM_INTRO_MEM_MAP: {
+		struct kvmi_mem_map map;
+
+		result = -EFAULT;
+		if (copy_from_user(&map, argp, sizeof(map)))
+			break;
+
+		result = kvm_dev_ioctl_map(fmp, &map);
+		if (IS_ERR_VALUE(result))
+			break;
+
+		result = 0;
+		break;
+	}
+	case KVM_INTRO_MEM_UNMAP: {
+		unsigned long vaddr = (unsigned long) arg;
+
+		result = kvm_dev_ioctl_unmap(fmp, vaddr);
+		if (IS_ERR_VALUE(result))
+			break;
+
+		result = 0;
+		break;
+	}
+	default:
+		pr_err("kvmi: ioctl %d not implemented\n", ioctl);
+		result = -ENOTTY;
+	}
+
+	return result;
+}
+
+static int kvm_dev_release(struct inode *inodep, struct file *filp)
+{
+	int result = 0;
+	struct file_map *fmp;
+	struct list_head *cur, *next;
+	struct page_map *pmp;
+
+	pr_debug("kvmi: file %016lx closed by process %s\n",
+		 (unsigned long) filp, current->comm);
+
+	/* acquire the file */
+	fmp = filp->private_data;
+	mutex_lock(&fmp->lock);
+
+	/* mark for teardown */
+	fmp->active = 0;
+
+	/* release mappings taken on this instance of the file */
+	list_for_each_safe(cur, next, &fmp->map_list) {
+		pmp = list_entry(cur, struct page_map, map_list);
+
+		/* unmap address */
+		_do_unmapping(pmp->paddr);
+
+		/* decouple & free guest mapping */
+		list_del(&pmp->map_list);
+		kfree(pmp);
+	}
+
+	/* done processing this file mapping */
+	mutex_unlock(&fmp->lock);
+
+	/* decouple file mapping */
+	spin_lock(&file_lock);
+	list_del(&fmp->file_list);
+	spin_unlock(&file_lock);
+
+	/* free it */
+	kfree(fmp);
+
+	return result;
+}
+
+
+static const struct file_operations kvmmem_ops = {
+	.open		= kvm_dev_open,
+	.unlocked_ioctl = kvm_dev_ioctl,
+	.compat_ioctl   = kvm_dev_ioctl,
+	.release	= kvm_dev_release,
+};
+
+static struct miscdevice kvm_mem_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "kvmmem",
+	.fops = &kvmmem_ops,
+};
+
+int __init kvm_intro_guest_init(void)
+{
+	int result;
+
+	if (!kvm_para_available()) {
+		pr_err("kvmi: paravirt not available\n");
+		return -EPERM;
+	}
+
+	result = misc_register(&kvm_mem_dev);
+	if (result) {
+		pr_err("kvmi: misc device register failed: %d\n", result);
+		return result;
+	}
+
+	INIT_LIST_HEAD(&file_list);
+	spin_lock_init(&file_lock);
+
+	pr_info("kvmi: guest introspection device created\n");
+
+	return 0;
+}
+
+void kvm_intro_guest_exit(void)
+{
+	misc_deregister(&kvm_mem_dev);
+}
+
+module_init(kvm_intro_guest_init)
+module_exit(kvm_intro_guest_exit)

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 03/18] kvm: x86: add kvm_arch_msr_intercept()
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

This function is used by the introspection subsystem to enable/disable
the interception of a given MSR.

The patch adds back the __vmx_enable_intercept_for_msr() function
removed by commit 40d8338d095e
("KVM: VMX: remove functions that enable msr intercepts").

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++++
 arch/x86/kvm/svm.c              | 11 +++++++++
 arch/x86/kvm/vmx.c              | 53 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |  7 ++++++
 4 files changed, 75 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 516798431328..8842d8e1e4ee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1079,6 +1079,8 @@ struct kvm_x86_ops {
 	int (*pre_enter_smm)(struct kvm_vcpu *vcpu, char *smstate);
 	int (*pre_leave_smm)(struct kvm_vcpu *vcpu, u64 smbase);
 	int (*enable_smi_window)(struct kvm_vcpu *vcpu);
+
+	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr, bool enable);
 };
 
 struct kvm_arch_async_pf {
@@ -1451,4 +1453,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 		unsigned long start, unsigned long end);
 
+void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
+				bool enable);
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index eb714f1cdf7e..5f7482851223 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5505,6 +5505,15 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static void svm_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
+				bool enable)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	u32 *msrpm = svm->msrpm;
+
+	set_msr_interception(msrpm, msr, enable, enable);
+}
+
 static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -5620,6 +5629,8 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.pre_enter_smm = svm_pre_enter_smm,
 	.pre_leave_smm = svm_pre_leave_smm,
 	.enable_smi_window = enable_smi_window,
+
+	.msr_intercept = svm_msr_intercept,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8eba631c4dbd..9c984bbe263e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -12069,6 +12069,57 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
+						u32 msr, int type)
+{
+	int f = sizeof(unsigned long);
+
+	if (!cpu_has_vmx_msr_bitmap())
+		return;
+
+	/*
+	 * See Intel PRM Vol. 3, 24.6.9 (MSR-Bitmap Address). Early manuals
+	 * have the write-low and read-high bitmap offsets the wrong way round.
+	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
+	 */
+	if (msr <= 0x1fff) {
+		if (type & MSR_TYPE_R)
+			/* read-low */
+			__set_bit(msr, msr_bitmap + 0x000 / f);
+
+		if (type & MSR_TYPE_W)
+			/* write-low */
+			__set_bit(msr, msr_bitmap + 0x800 / f);
+
+	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
+		msr &= 0x1fff;
+		if (type & MSR_TYPE_R)
+			/* read-high */
+			__set_bit(msr, msr_bitmap + 0x400 / f);
+
+		if (type & MSR_TYPE_W)
+			/* write-high */
+			__set_bit(msr, msr_bitmap + 0xc00 / f);
+
+	}
+}
+
+static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
+				bool enabled)
+{
+	if (enabled) {
+		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode, msr,
+					       MSR_TYPE_W);
+		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy, msr,
+					       MSR_TYPE_W);
+	} else {
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr,
+						MSR_TYPE_W);
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr,
+						MSR_TYPE_W);
+	}
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -12199,6 +12250,8 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.pre_enter_smm = vmx_pre_enter_smm,
 	.pre_leave_smm = vmx_pre_leave_smm,
 	.enable_smi_window = enable_smi_window,
+
+	.msr_intercept = vmx_msr_intercept,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1cec2c62a0b0..e1a3c2c6ec08 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8871,6 +8871,13 @@ bool kvm_vector_hashing_enabled(void)
 }
 EXPORT_SYMBOL_GPL(kvm_vector_hashing_enabled);
 
+void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
+				bool enable)
+{
+	kvm_x86_ops->msr_intercept(vcpu, msr, enable);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_msr_intercept);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 04/18] kvm: x86: add kvm_mmu_nested_guest_page_fault() and kvmi_mmu_fault_gla()
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

These are helper functions used by the VM introspection subsystem on
the page fault (#PF) handling path.
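
A rough sketch of how the introspection code might consume these helpers
when handling an EPT/NPT fault (the kvmi_* names below are assumptions,
not part of this patch):

  #include <linux/kvm_host.h>

  /* assumed reporting helper provided elsewhere by the introspection code */
  bool kvmi_page_fault_event(struct kvm_vcpu *vcpu, gpa_t gpa, u64 gla);

  static bool kvmi_track_pf(struct kvm_vcpu *vcpu, gpa_t gpa)
  {
          /* a fault taken while walking the guest page tables is of no
           * interest to the introspection tool */
          if (kvm_mmu_nested_guest_page_fault(vcpu))
                  return true;

          /* ~0 means the hardware did not report a guest linear address */
          return kvmi_page_fault_event(vcpu, gpa, kvm_mmu_fault_gla(vcpu));
  }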

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |  7 +++++++
 arch/x86/include/asm/vmx.h      |  2 ++
 arch/x86/kvm/mmu.c              | 10 ++++++++++
 arch/x86/kvm/svm.c              |  8 ++++++++
 arch/x86/kvm/vmx.c              |  9 +++++++++
 5 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8842d8e1e4ee..239eb628f8fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -692,6 +692,9 @@ struct kvm_vcpu_arch {
 	/* set at EPT violation at this point */
 	unsigned long exit_qualification;
 
+	/* #PF translated error code from EPT/NPT exit reason */
+	u64 error_code;
+
 	/* pv related host specific info */
 	struct {
 		bool pv_unhalted;
@@ -1081,6 +1084,7 @@ struct kvm_x86_ops {
 	int (*enable_smi_window)(struct kvm_vcpu *vcpu);
 
 	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr, bool enable);
+	u64 (*fault_gla)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
@@ -1455,4 +1459,7 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 
 void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
 				bool enable);
+u64 kvm_mmu_fault_gla(struct kvm_vcpu *vcpu);
+bool kvm_mmu_nested_guest_page_fault(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 8b6780751132..7036125349dd 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -530,6 +530,7 @@ struct vmx_msr_entry {
 #define EPT_VIOLATION_READABLE_BIT	3
 #define EPT_VIOLATION_WRITABLE_BIT	4
 #define EPT_VIOLATION_EXECUTABLE_BIT	5
+#define EPT_VIOLATION_GLA_VALID_BIT	7
 #define EPT_VIOLATION_GVA_TRANSLATED_BIT 8
 #define EPT_VIOLATION_ACC_READ		(1 << EPT_VIOLATION_ACC_READ_BIT)
 #define EPT_VIOLATION_ACC_WRITE		(1 << EPT_VIOLATION_ACC_WRITE_BIT)
@@ -537,6 +538,7 @@ struct vmx_msr_entry {
 #define EPT_VIOLATION_READABLE		(1 << EPT_VIOLATION_READABLE_BIT)
 #define EPT_VIOLATION_WRITABLE		(1 << EPT_VIOLATION_WRITABLE_BIT)
 #define EPT_VIOLATION_EXECUTABLE	(1 << EPT_VIOLATION_EXECUTABLE_BIT)
+#define EPT_VIOLATION_GLA_VALID		(1 << EPT_VIOLATION_GLA_VALID_BIT)
 #define EPT_VIOLATION_GVA_TRANSLATED	(1 << EPT_VIOLATION_GVA_TRANSLATED_BIT)
 
 /*
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c4deb1f34faa..55fcb0292724 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5530,3 +5530,13 @@ void kvm_mmu_module_exit(void)
 	unregister_shrinker(&mmu_shrinker);
 	mmu_audit_disable();
 }
+
+u64 kvm_mmu_fault_gla(struct kvm_vcpu *vcpu)
+{
+	return kvm_x86_ops->fault_gla(vcpu);
+}
+
+bool kvm_mmu_nested_guest_page_fault(struct kvm_vcpu *vcpu)
+{
+	return !!(vcpu->arch.error_code & PFERR_GUEST_PAGE_MASK);
+}
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 5f7482851223..f41e4d7008d7 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2145,6 +2145,8 @@ static int pf_interception(struct vcpu_svm *svm)
 	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u64 error_code = svm->vmcb->control.exit_info_1;
 
+	svm->vcpu.arch.error_code = error_code;
+
 	return kvm_handle_page_fault(&svm->vcpu, error_code, fault_address,
 			svm->vmcb->control.insn_bytes,
 			svm->vmcb->control.insn_len);
@@ -5514,6 +5516,11 @@ static void svm_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
 	set_msr_interception(msrpm, msr, enable, enable);
 }
 
+static u64 svm_fault_gla(struct kvm_vcpu *vcpu)
+{
+	return ~0ull;
+}
+
 static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -5631,6 +5638,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.enable_smi_window = enable_smi_window,
 
 	.msr_intercept = svm_msr_intercept,
+	.fault_gla = svm_fault_gla
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9c984bbe263e..5487e0242030 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6541,6 +6541,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
 
 	vcpu->arch.exit_qualification = exit_qualification;
+	vcpu->arch.error_code = error_code;
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
@@ -12120,6 +12121,13 @@ static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
 	}
 }
 
+static u64 vmx_fault_gla(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->arch.exit_qualification & EPT_VIOLATION_GLA_VALID)
+		return vmcs_readl(GUEST_LINEAR_ADDRESS);
+	return ~0ul;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -12252,6 +12260,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.enable_smi_window = enable_smi_window,
 
 	.msr_intercept = vmx_msr_intercept,
+	.fault_gla = vmx_fault_gla
 };
 
 static int __init vmx_init(void)

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 05/18] kvm: x86: add kvm_arch_vcpu_set_regs()
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

This is a version of kvm_arch_vcpu_ioctl_set_regs() which does not
reset the vCPU's pending exception state.
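
As a sketch of the intended use (the event-reply context is an
assumption), the introspection code can apply registers received from
the introspection tool without disturbing a queued exception:

  #include <linux/kvm_host.h>

  static void kvmi_apply_reply_regs(struct kvm_vcpu *vcpu,
                                    const struct kvm_regs *reply_regs)
  {
          struct kvm_regs regs = *reply_regs;

          /* unlike the KVM_SET_REGS ioctl path, a pending exception
           * queued for the guest is left untouched */
          kvm_arch_vcpu_set_regs(vcpu, &regs);
  }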

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c       | 34 ++++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h |  1 +
 2 files changed, 35 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e1a3c2c6ec08..4b0c3692386d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7389,6 +7389,40 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	return 0;
 }
 
+/*
+ * Similar to kvm_arch_vcpu_ioctl_set_regs() but it does not reset
+ * the exceptions
+ */
+void kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
+{
+	vcpu->arch.emulate_regs_need_sync_from_vcpu = true;
+	vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
+
+	kvm_register_write(vcpu, VCPU_REGS_RAX, regs->rax);
+	kvm_register_write(vcpu, VCPU_REGS_RBX, regs->rbx);
+	kvm_register_write(vcpu, VCPU_REGS_RCX, regs->rcx);
+	kvm_register_write(vcpu, VCPU_REGS_RDX, regs->rdx);
+	kvm_register_write(vcpu, VCPU_REGS_RSI, regs->rsi);
+	kvm_register_write(vcpu, VCPU_REGS_RDI, regs->rdi);
+	kvm_register_write(vcpu, VCPU_REGS_RSP, regs->rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RBP, regs->rbp);
+#ifdef CONFIG_X86_64
+	kvm_register_write(vcpu, VCPU_REGS_R8, regs->r8);
+	kvm_register_write(vcpu, VCPU_REGS_R9, regs->r9);
+	kvm_register_write(vcpu, VCPU_REGS_R10, regs->r10);
+	kvm_register_write(vcpu, VCPU_REGS_R11, regs->r11);
+	kvm_register_write(vcpu, VCPU_REGS_R12, regs->r12);
+	kvm_register_write(vcpu, VCPU_REGS_R13, regs->r13);
+	kvm_register_write(vcpu, VCPU_REGS_R14, regs->r14);
+	kvm_register_write(vcpu, VCPU_REGS_R15, regs->r15);
+#endif
+
+	kvm_rip_write(vcpu, regs->rip);
+	kvm_set_rflags(vcpu, regs->rflags);
+
+	kvm_make_request(KVM_REQ_EVENT, vcpu);
+}
+
 void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
 {
 	struct kvm_segment cs;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6bdd4b9f6611..68e4d756f5c9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -767,6 +767,7 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
 
 int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
 int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
+void kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
 int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 				  struct kvm_sregs *sregs);
 int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 06/18] kvm: vmx: export the availability of EPT views
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

This is used to validate the KVMI_GET_PAGE_ACCESS and KVMI_SET_PAGE_ACCESS
commands when the guest introspection tool selects a different EPT view.
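
A minimal sketch of the intended validation (the command handler and the
error code choice are assumptions, not part of this patch):

  #include <linux/errno.h>
  #include <linux/kvm_host.h>

  static int kvmi_check_ept_view(u16 view)
  {
          /* only the default view (0) works without EPTP switching */
          if (view && !kvm_eptp_switching_supported)
                  return -EOPNOTSUPP;

          return 0;
  }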

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/vmx.c              | 2 ++
 arch/x86/kvm/x86.c              | 3 +++
 3 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 239eb628f8fb..2cf03ed181e6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1162,6 +1162,7 @@ extern u64  kvm_max_tsc_scaling_ratio;
 extern u64  kvm_default_tsc_scaling_ratio;
 
 extern u64 kvm_mce_cap_supported;
+extern bool kvm_eptp_switching_supported;
 
 enum emulation_result {
 	EMULATE_DONE,         /* no further processing */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5487e0242030..093a2e1f7ea6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6898,6 +6898,8 @@ static __init int hardware_setup(void)
 		kvm_x86_ops->cancel_hv_timer = NULL;
 	}
 
+	kvm_eptp_switching_supported = cpu_has_vmx_vmfunc();
+
 	kvm_set_posted_intr_wakeup_handler(wakeup_handler);
 
 	kvm_mce_cap_supported |= MCG_LMCE_P;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b0c3692386d..e7db70ac1f82 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -138,6 +138,9 @@ module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
 static bool __read_mostly vector_hashing = true;
 module_param(vector_hashing, bool, S_IRUGO);
 
+bool __read_mostly kvm_eptp_switching_supported;
+EXPORT_SYMBOL_GPL(kvm_eptp_switching_supported);
+
 #define KVM_NR_SHARED_MSRS 16
 
 struct kvm_shared_msrs_global {

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 07/18] kvm: page track: add support for preread, prewrite and preexec
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

These callbacks return a boolean value. If false, the emulation should
stop and the instruction should be re-executed in the guest. The preread
callback can also provide the bytes needed by the read operation.

kvm_page_track_create_memslot() was extended in order to track GFNs
as soon as the memory slots are created.
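
For illustration, an introspection module would hook the new callbacks
through the existing notifier mechanism (the kvmi_* names are
assumptions, not part of this patch):

  #include <linux/kvm_host.h>
  #include <asm/kvm_page_track.h>

  /* assumed policy helper provided elsewhere by the introspection code */
  bool kvmi_is_gpa_writable(struct kvm_vcpu *vcpu, gpa_t gpa);

  static bool kvmi_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa,
                                  const u8 *new, int bytes,
                                  struct kvm_page_track_notifier_node *node)
  {
          /* false = stop emulation and re-execute the instruction in guest */
          return kvmi_is_gpa_writable(vcpu, gpa);
  }

  static struct kvm_page_track_notifier_node kvmi_track_node = {
          .track_prewrite = kvmi_track_prewrite,
  };

  /* registered with kvm_page_track_register_notifier(kvm, &kvmi_track_node) */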

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
---
 arch/x86/include/asm/kvm_page_track.h |  24 +++++-
 arch/x86/kvm/mmu.c                    | 143 ++++++++++++++++++++++++++++++----
 arch/x86/kvm/mmu.h                    |   4 +
 arch/x86/kvm/page_track.c             | 129 ++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c                    |   2 +-
 5 files changed, 281 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index 172f9749dbb2..77adc7f43754 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -3,8 +3,11 @@
 #define _ASM_X86_KVM_PAGE_TRACK_H
 
 enum kvm_page_track_mode {
+	KVM_PAGE_TRACK_PREREAD,
+	KVM_PAGE_TRACK_PREWRITE,
 	KVM_PAGE_TRACK_WRITE,
-	KVM_PAGE_TRACK_MAX,
+	KVM_PAGE_TRACK_PREEXEC,
+	KVM_PAGE_TRACK_MAX
 };
 
 /*
@@ -22,6 +25,13 @@ struct kvm_page_track_notifier_head {
 struct kvm_page_track_notifier_node {
 	struct hlist_node node;
 
+	bool (*track_preread)(struct kvm_vcpu *vcpu, gpa_t gpa, u8 *new,
+			      int bytes,
+			      struct kvm_page_track_notifier_node *node,
+			      bool *data_ready);
+	bool (*track_prewrite)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+			       int bytes,
+			       struct kvm_page_track_notifier_node *node);
 	/*
 	 * It is called when guest is writing the write-tracked page
 	 * and write emulation is finished at that time.
@@ -34,6 +44,11 @@ struct kvm_page_track_notifier_node {
 	 */
 	void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 			    int bytes, struct kvm_page_track_notifier_node *node);
+	bool (*track_preexec)(struct kvm_vcpu *vcpu, gpa_t gpa,
+			      struct kvm_page_track_notifier_node *node);
+	void (*track_create_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
+				  unsigned long npages,
+				  struct kvm_page_track_notifier_node *node);
 	/*
 	 * It is called when memory slot is being moved or removed
 	 * users can drop write-protection for the pages in that memory slot
@@ -51,7 +66,7 @@ void kvm_page_track_cleanup(struct kvm *kvm);
 
 void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
 				 struct kvm_memory_slot *dont);
-int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+int kvm_page_track_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 				  unsigned long npages);
 
 void kvm_slot_page_track_add_page(struct kvm *kvm,
@@ -69,7 +84,12 @@ kvm_page_track_register_notifier(struct kvm *kvm,
 void
 kvm_page_track_unregister_notifier(struct kvm *kvm,
 				   struct kvm_page_track_notifier_node *n);
+bool kvm_page_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, u8 *new,
+			    int bytes, bool *data_ready);
+bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+			     int bytes);
 void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 			  int bytes);
+bool kvm_page_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa);
 void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
 #endif
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 55fcb0292724..19dc17b00db2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1014,9 +1014,13 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	slot = __gfn_to_memslot(slots, gfn);
 
 	/* the non-leaf shadow pages are keeping readonly. */
-	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
-		return kvm_slot_page_track_add_page(kvm, slot, gfn,
-						    KVM_PAGE_TRACK_WRITE);
+	if (sp->role.level > PT_PAGE_TABLE_LEVEL) {
+		kvm_slot_page_track_add_page(kvm, slot, gfn,
+					     KVM_PAGE_TRACK_PREWRITE);
+		kvm_slot_page_track_add_page(kvm, slot, gfn,
+					     KVM_PAGE_TRACK_WRITE);
+		return;
+	}
 
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
 }
@@ -1031,9 +1035,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	gfn = sp->gfn;
 	slots = kvm_memslots_for_spte_role(kvm, sp->role);
 	slot = __gfn_to_memslot(slots, gfn);
-	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
-		return kvm_slot_page_track_remove_page(kvm, slot, gfn,
-						       KVM_PAGE_TRACK_WRITE);
+	if (sp->role.level > PT_PAGE_TABLE_LEVEL) {
+		kvm_slot_page_track_remove_page(kvm, slot, gfn,
+						KVM_PAGE_TRACK_PREWRITE);
+		kvm_slot_page_track_remove_page(kvm, slot, gfn,
+						KVM_PAGE_TRACK_WRITE);
+		return;
+	}
 
 	kvm_mmu_gfn_allow_lpage(slot, gfn);
 }
@@ -1416,6 +1424,29 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
 	return mmu_spte_update(sptep, spte);
 }
 
+static bool spte_read_protect(u64 *sptep)
+{
+	u64 spte = *sptep;
+
+	rmap_printk("rmap_read_protect: spte %p %llx\n", sptep, *sptep);
+
+	/* TODO: verify if the CPU supports EPT-execute-only */
+	spte = spte & ~(PT_WRITABLE_MASK | PT_PRESENT_MASK);
+
+	return mmu_spte_update(sptep, spte);
+}
+
+static bool spte_exec_protect(u64 *sptep, bool pt_protect)
+{
+	u64 spte = *sptep;
+
+	rmap_printk("rmap_exec_protect: spte %p %llx\n", sptep, *sptep);
+
+	spte = spte & ~PT_USER_MASK;
+
+	return mmu_spte_update(sptep, spte);
+}
+
 static bool __rmap_write_protect(struct kvm *kvm,
 				 struct kvm_rmap_head *rmap_head,
 				 bool pt_protect)
@@ -1430,6 +1461,34 @@ static bool __rmap_write_protect(struct kvm *kvm,
 	return flush;
 }
 
+static bool __rmap_read_protect(struct kvm *kvm,
+				struct kvm_rmap_head *rmap_head,
+				bool pt_protect)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	bool flush = false;
+
+	for_each_rmap_spte(rmap_head, &iter, sptep)
+		flush |= spte_read_protect(sptep);
+
+	return flush;
+}
+
+static bool __rmap_exec_protect(struct kvm *kvm,
+				struct kvm_rmap_head *rmap_head,
+				bool pt_protect)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	bool flush = false;
+
+	for_each_rmap_spte(rmap_head, &iter, sptep)
+		flush |= spte_exec_protect(sptep, pt_protect);
+
+	return flush;
+}
+
 static bool spte_clear_dirty(u64 *sptep)
 {
 	u64 spte = *sptep;
@@ -1600,6 +1659,36 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	return write_protected;
 }
 
+bool kvm_mmu_slot_gfn_read_protect(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, u64 gfn)
+{
+	struct kvm_rmap_head *rmap_head;
+	int i;
+	bool read_protected = false;
+
+	for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
+		rmap_head = __gfn_to_rmap(gfn, i, slot);
+		read_protected |= __rmap_read_protect(kvm, rmap_head, true);
+	}
+
+	return read_protected;
+}
+
+bool kvm_mmu_slot_gfn_exec_protect(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, u64 gfn)
+{
+	struct kvm_rmap_head *rmap_head;
+	int i;
+	bool exec_protected = false;
+
+	for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
+		rmap_head = __gfn_to_rmap(gfn, i, slot);
+		exec_protected |= __rmap_exec_protect(kvm, rmap_head, true);
+	}
+
+	return exec_protected;
+}
+
 static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
 {
 	struct kvm_memory_slot *slot;
@@ -2688,7 +2777,8 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 {
 	struct kvm_mmu_page *sp;
 
-	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
+	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) ||
+	    kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
 		return true;
 
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
@@ -2953,6 +3043,21 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 	__direct_pte_prefetch(vcpu, sp, sptep);
 }
 
+static unsigned int kvm_mmu_page_track_acc(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	unsigned int acc = ACC_ALL;
+
+	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREREAD))
+		acc &= ~ACC_USER_MASK;
+	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) ||
+	    kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
+		acc &= ~ACC_WRITE_MASK;
+	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREEXEC))
+		acc &= ~ACC_EXEC_MASK;
+
+	return acc;
+}
+
 static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
 			int level, gfn_t gfn, kvm_pfn_t pfn, bool prefault)
 {
@@ -2966,7 +3071,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
 
 	for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
 		if (iterator.level == level) {
-			emulate = mmu_set_spte(vcpu, iterator.sptep, ACC_ALL,
+			unsigned int acc = kvm_mmu_page_track_acc(vcpu, gfn);
+
+			emulate = mmu_set_spte(vcpu, iterator.sptep, acc,
 					       write, level, gfn, pfn, prefault,
 					       map_writable);
 			direct_pte_prefetch(vcpu, iterator.sptep);
@@ -3713,15 +3820,21 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
 	if (unlikely(error_code & PFERR_RSVD_MASK))
 		return false;
 
-	if (!(error_code & PFERR_PRESENT_MASK) ||
-	      !(error_code & PFERR_WRITE_MASK))
+	if (!(error_code & PFERR_PRESENT_MASK))
 		return false;
 
 	/*
-	 * guest is writing the page which is write tracked which can
+	 * guest is reading/writing/fetching the page which is
+	 * read/write/execute tracked which can
 	 * not be fixed by page fault handler.
 	 */
-	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
+	if (((error_code & PFERR_USER_MASK)
+		&& kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREREAD))
+	    || ((error_code & PFERR_WRITE_MASK)
+		&& (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE)
+		 || kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE)))
+	    || ((error_code & PFERR_FETCH_MASK)
+		&& kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREEXEC)))
 		return true;
 
 	return false;
@@ -4942,7 +5055,11 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 	 * and resume the guest.
 	 */
 	if (vcpu->arch.mmu.direct_map &&
-	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
+	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
+	    !kvm_page_track_is_active(vcpu, gpa_to_gfn(cr2), KVM_PAGE_TRACK_PREREAD) &&
+	    !kvm_page_track_is_active(vcpu, gpa_to_gfn(cr2), KVM_PAGE_TRACK_PREWRITE) &&
+	    !kvm_page_track_is_active(vcpu, gpa_to_gfn(cr2), KVM_PAGE_TRACK_WRITE) &&
+	    !kvm_page_track_is_active(vcpu, gpa_to_gfn(cr2), KVM_PAGE_TRACK_PREEXEC)) {
 		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
 		return 1;
 	}
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 5b408c0ad612..57c947752490 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -193,5 +193,9 @@ void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    struct kvm_memory_slot *slot, u64 gfn);
+bool kvm_mmu_slot_gfn_read_protect(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, u64 gfn);
+bool kvm_mmu_slot_gfn_exec_protect(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, u64 gfn);
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 #endif
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 01c1371f39f8..8bf6581d25d5 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -34,10 +34,13 @@ void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
 		}
 }
 
-int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+int kvm_page_track_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 				  unsigned long npages)
 {
-	int  i;
+	struct kvm_page_track_notifier_head *head;
+	struct kvm_page_track_notifier_node *n;
+	int idx;
+	int i;
 
 	for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
 		slot->arch.gfn_track[i] = kvzalloc(npages *
@@ -46,6 +49,17 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
 			goto track_free;
 	}
 
+	head = &kvm->arch.track_notifier_head;
+
+	if (hlist_empty(&head->track_notifier_list))
+		return 0;
+
+	idx = srcu_read_lock(&head->track_srcu);
+	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
+		if (n->track_create_slot)
+			n->track_create_slot(kvm, slot, npages, n);
+	srcu_read_unlock(&head->track_srcu, idx);
+
 	return 0;
 
 track_free:
@@ -86,7 +100,7 @@ static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
  * @kvm: the guest instance we are interested in.
  * @slot: the @gfn belongs to.
  * @gfn: the guest page.
- * @mode: tracking mode, currently only write track is supported.
+ * @mode: tracking mode.
  */
 void kvm_slot_page_track_add_page(struct kvm *kvm,
 				  struct kvm_memory_slot *slot, gfn_t gfn,
@@ -104,9 +118,16 @@ void kvm_slot_page_track_add_page(struct kvm *kvm,
 	 */
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
 
-	if (mode == KVM_PAGE_TRACK_WRITE)
+	if (mode == KVM_PAGE_TRACK_PREWRITE || mode == KVM_PAGE_TRACK_WRITE) {
 		if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn))
 			kvm_flush_remote_tlbs(kvm);
+	} else if (mode == KVM_PAGE_TRACK_PREREAD) {
+		if (kvm_mmu_slot_gfn_read_protect(kvm, slot, gfn))
+			kvm_flush_remote_tlbs(kvm);
+	} else if (mode == KVM_PAGE_TRACK_PREEXEC) {
+		if (kvm_mmu_slot_gfn_exec_protect(kvm, slot, gfn))
+			kvm_flush_remote_tlbs(kvm);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_slot_page_track_add_page);
 
@@ -121,7 +142,7 @@ EXPORT_SYMBOL_GPL(kvm_slot_page_track_add_page);
  * @kvm: the guest instance we are interested in.
  * @slot: the @gfn belongs to.
  * @gfn: the guest page.
- * @mode: tracking mode, currently only write track is supported.
+ * @mode: tracking mode.
  */
 void kvm_slot_page_track_remove_page(struct kvm *kvm,
 				     struct kvm_memory_slot *slot, gfn_t gfn,
@@ -214,6 +235,75 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
 }
 EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier);
 
+/*
+ * Notify the node that a read access is about to happen. Returning false
+ * doesn't stop the other nodes from being called, but it will stop
+ * the emulation.
+ *
+ * It is up to each node to figure out whether the accessed page is one
+ * it is interested in.
+ *
+ * The nodes will always be in conflict if they track the same page:
+ * - accepting a read won't guarantee that the next node will not override
+ *   the data (filling new/bytes and setting data_ready)
+ * - filling new/bytes with custom data won't guarantee that the next node
+ *   will not override that
+ */
+bool kvm_page_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, u8 *new,
+			    int bytes, bool *data_ready)
+{
+	struct kvm_page_track_notifier_head *head;
+	struct kvm_page_track_notifier_node *n;
+	int idx;
+	bool ret = true;
+
+	*data_ready = false;
+
+	head = &vcpu->kvm->arch.track_notifier_head;
+
+	if (hlist_empty(&head->track_notifier_list))
+		return ret;
+
+	idx = srcu_read_lock(&head->track_srcu);
+	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
+		if (n->track_preread)
+			if (!n->track_preread(vcpu, gpa, new, bytes, n,
+					       data_ready))
+				ret = false;
+	srcu_read_unlock(&head->track_srcu, idx);
+	return ret;
+}
+
+/*
+ * Notify the node that a write access is about to happen. Returning false
+ * doesn't stop the other nodes from being called, but it will stop
+ * the emulation.
+ *
+ * It is up to each node to figure out whether the written page is one
+ * it is interested in.
+ */
+bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+			     int bytes)
+{
+	struct kvm_page_track_notifier_head *head;
+	struct kvm_page_track_notifier_node *n;
+	int idx;
+	bool ret = true;
+
+	head = &vcpu->kvm->arch.track_notifier_head;
+
+	if (hlist_empty(&head->track_notifier_list))
+		return ret;
+
+	idx = srcu_read_lock(&head->track_srcu);
+	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
+		if (n->track_prewrite)
+			if (!n->track_prewrite(vcpu, gpa, new, bytes, n))
+				ret = false;
+	srcu_read_unlock(&head->track_srcu, idx);
+	return ret;
+}
+
 /*
  * Notify the node that write access is intercepted and write emulation is
  * finished at this time.
@@ -240,6 +330,35 @@ void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 	srcu_read_unlock(&head->track_srcu, idx);
 }
 
+/*
+ * Notify the node that an instruction is about to be executed.
+ * Returning false doesn't stop the other nodes from being called,
+ * but it will stop the emulation of that instruction.
+ *
+ * It is up to each node to figure out whether the executed page is one
+ * it is interested in.
+ */
+bool kvm_page_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+	struct kvm_page_track_notifier_head *head;
+	struct kvm_page_track_notifier_node *n;
+	int idx;
+	bool ret = true;
+
+	head = &vcpu->kvm->arch.track_notifier_head;
+
+	if (hlist_empty(&head->track_notifier_list))
+		return ret;
+
+	idx = srcu_read_lock(&head->track_srcu);
+	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
+		if (n->track_preexec)
+			if (!n->track_preexec(vcpu, gpa, n))
+				ret = false;
+	srcu_read_unlock(&head->track_srcu, idx);
+	return ret;
+}
+
 /*
  * Notify the node that memory slot is being removed or moved so that it can
  * drop write-protection for the pages in the memory slot.
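
To make the track_preread() contract above concrete, a hypothetical callback
could hand replacement bytes back to the emulator roughly as below (mirroring
what the introspection code later in this series does); demo_preread and
demo_ctx are illustrative names only:

#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/kvm_host.h>
#include <asm/kvm_page_track.h>

static u8 demo_ctx[64];		/* illustrative buffer of prepared bytes */
static int demo_ctx_size;

/* Plugged into a kvm_page_track_notifier_node via .track_preread. */
static bool demo_preread(struct kvm_vcpu *vcpu, gpa_t gpa, u8 *new,
			 int bytes, struct kvm_page_track_notifier_node *node,
			 bool *data_ready)
{
	if (demo_ctx_size > 0) {
		int n = min_t(int, bytes, demo_ctx_size);

		memcpy(new, demo_ctx, n);	/* feed custom data to the emulated read */
		*data_ready = true;		/* tell the caller the buffer was filled */
	}
	return true;				/* let the emulation continue */
}
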
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e7db70ac1f82..74839859c0fd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8421,7 +8421,7 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 		}
 	}
 
-	if (kvm_page_track_create_memslot(slot, npages))
+	if (kvm_page_track_create_memslot(kvm, slot, npages))
 		goto out_free;
 
 	return 0;


^ permalink raw reply related	[flat|nested] 79+ messages in thread


* [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar,
	Nicușor Cîțu, Mircea Cîrjaliu,
	Marian Rotariu

From: Adalbert Lazar <alazar@bitdefender.com>

This subsystem is split into three source files:
 - kvmi_msg.c - ABI and socket related functions
 - kvmi_mem.c - handle map/unmap requests from the introspector
 - kvmi.c - everything else

The new data used by this subsystem is attached to the 'kvm' and
'kvm_vcpu' structures as opaque pointers (to 'kvmi' and 'kvmi_vcpu'
structures).

Besides the KVMI system, this patch exports the
kvm_vcpu_ioctl_x86_get_xsave() and the mm_find_pmd() functions,
adds a new vCPU request (KVM_REQ_INTROSPECTION) and a new VM ioctl
(KVM_INTROSPECTION) used to pass the connection file handle from QEMU.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/Makefile           |    1 +
 arch/x86/kvm/x86.c              |    4 +-
 include/linux/kvm_host.h        |    4 +
 include/linux/kvmi.h            |   32 +
 include/linux/mm.h              |    3 +
 include/trace/events/kvmi.h     |  174 +++++
 include/uapi/linux/kvm.h        |    8 +
 mm/internal.h                   |    5 -
 virt/kvm/kvmi.c                 | 1410 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h             |  121 ++++
 virt/kvm/kvmi_mem.c             |  730 ++++++++++++++++++++
 virt/kvm/kvmi_msg.c             | 1134 +++++++++++++++++++++++++++++++
 13 files changed, 3620 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/kvmi.h
 create mode 100644 include/trace/events/kvmi.h
 create mode 100644 virt/kvm/kvmi.c
 create mode 100644 virt/kvm/kvmi_int.h
 create mode 100644 virt/kvm/kvmi_mem.c
 create mode 100644 virt/kvm/kvmi_msg.c
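
As a usage sketch only (not part of the patch), a VMM holding an
already-connected socket to the introspection tool could pass it to KVM with
the new ioctl roughly as follows, assuming a kernel carrying the uapi change
from this patch; vm_fd, sock_fd and the all-ones allow masks are illustrative:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int hook_introspection(int vm_fd, int sock_fd)
{
	struct kvm_introspection kvmi = {
		.fd = sock_fd,		/* connection to the introspection tool */
		.commands = ~0u,	/* command allow mask */
		.events = ~0u,		/* event allow mask */
	};

	if (ioctl(vm_fd, KVM_INTROSPECTION, &kvmi) < 0) {
		perror("KVM_INTROSPECTION");
		return -1;
	}
	return 0;
}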

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2cf03ed181e6..1e9e49eaee3b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -73,6 +73,7 @@
 #define KVM_REQ_HV_RESET		KVM_ARCH_REQ(20)
 #define KVM_REQ_HV_EXIT			KVM_ARCH_REQ(21)
 #define KVM_REQ_HV_STIMER		KVM_ARCH_REQ(22)
+#define KVM_REQ_INTROSPECTION           KVM_ARCH_REQ(23)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index dc4f2fdf5e57..ab6225563526 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -9,6 +9,7 @@ CFLAGS_vmx.o := -I.
 KVM := ../../../virt/kvm
 
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
+				$(KVM)/kvmi.o $(KVM)/kvmi_msg.o $(KVM)/kvmi_mem.o \
 				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 74839859c0fd..cdfc7200a018 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3346,8 +3346,8 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
 	}
 }
 
-static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
-					 struct kvm_xsave *guest_xsave)
+void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
+				  struct kvm_xsave *guest_xsave)
 {
 	if (boot_cpu_has(X86_FEATURE_XSAVE)) {
 		memset(guest_xsave, 0, sizeof(struct kvm_xsave));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 68e4d756f5c9..eae0598e18a5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -274,6 +274,7 @@ struct kvm_vcpu {
 	bool preempted;
 	struct kvm_vcpu_arch arch;
 	struct dentry *debugfs_dentry;
+	void *kvmi;
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
@@ -446,6 +447,7 @@ struct kvm {
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
+	void *kvmi;
 };
 
 #define kvm_err(fmt, ...) \
@@ -779,6 +781,8 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 					struct kvm_guest_debug *dbg);
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run);
+void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
+				  struct kvm_xsave *guest_xsave);
 
 int kvm_arch_init(void *opaque);
 void kvm_arch_exit(void);
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
new file mode 100644
index 000000000000..7fac1d23f67c
--- /dev/null
+++ b/include/linux/kvmi.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVMI_H__
+#define __KVMI_H__
+
+#define kvmi_is_present() 1
+
+int kvmi_init(void);
+void kvmi_uninit(void);
+void kvmi_destroy_vm(struct kvm *kvm);
+bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu);
+void kvmi_vcpu_init(struct kvm_vcpu *vcpu);
+void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu);
+bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
+		   unsigned long old_value, unsigned long *new_value);
+bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr);
+void kvmi_xsetbv_event(struct kvm_vcpu *vcpu);
+bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva);
+bool kvmi_is_agent_hypercall(struct kvm_vcpu *vcpu);
+void kvmi_hypercall_event(struct kvm_vcpu *vcpu);
+bool kvmi_lost_exception(struct kvm_vcpu *vcpu);
+void kvmi_trap_event(struct kvm_vcpu *vcpu);
+bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u32 info,
+			   unsigned long exit_qualification,
+			   unsigned char descriptor, unsigned char write);
+void kvmi_flush_mem_access(struct kvm *kvm);
+void kvmi_handle_request(struct kvm_vcpu *vcpu);
+int kvmi_host_mem_map(struct kvm_vcpu *vcpu, gva_t tkn_gva,
+			     gpa_t req_gpa, gpa_t map_gpa);
+int kvmi_host_mem_unmap(struct kvm_vcpu *vcpu, gpa_t map_gpa);
+
+
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ea818ff739cd..b659c7436789 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1115,6 +1115,9 @@ void page_address_init(void);
 #define page_address_init()  do { } while(0)
 #endif
 
+/* rmap.c */
+extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
+
 extern void *page_rmapping(struct page *page);
 extern struct anon_vma *page_anon_vma(struct page *page);
 extern struct address_space *page_mapping(struct page *page);
diff --git a/include/trace/events/kvmi.h b/include/trace/events/kvmi.h
new file mode 100644
index 000000000000..dc36fd3b30dc
--- /dev/null
+++ b/include/trace/events/kvmi.h
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM kvmi
+
+#if !defined(_TRACE_KVMI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_KVMI_H
+
+#include <linux/tracepoint.h>
+
+#ifndef __TRACE_KVMI_STRUCTURES
+#define __TRACE_KVMI_STRUCTURES
+
+#undef EN
+#define EN(x) { x, #x }
+
+static const struct trace_print_flags kvmi_msg_id_symbol[] = {
+	EN(KVMI_GET_VERSION),
+	EN(KVMI_PAUSE_VCPU),
+	EN(KVMI_GET_GUEST_INFO),
+	EN(KVMI_GET_REGISTERS),
+	EN(KVMI_SET_REGISTERS),
+	EN(KVMI_GET_PAGE_ACCESS),
+	EN(KVMI_SET_PAGE_ACCESS),
+	EN(KVMI_INJECT_EXCEPTION),
+	EN(KVMI_READ_PHYSICAL),
+	EN(KVMI_WRITE_PHYSICAL),
+	EN(KVMI_GET_MAP_TOKEN),
+	EN(KVMI_CONTROL_EVENTS),
+	EN(KVMI_CONTROL_CR),
+	EN(KVMI_CONTROL_MSR),
+	EN(KVMI_EVENT),
+	EN(KVMI_EVENT_REPLY),
+	EN(KVMI_GET_CPUID),
+	EN(KVMI_GET_XSAVE),
+	{-1, NULL}
+};
+
+static const struct trace_print_flags kvmi_event_id_symbol[] = {
+	EN(KVMI_EVENT_CR),
+	EN(KVMI_EVENT_MSR),
+	EN(KVMI_EVENT_XSETBV),
+	EN(KVMI_EVENT_BREAKPOINT),
+	EN(KVMI_EVENT_HYPERCALL),
+	EN(KVMI_EVENT_PAGE_FAULT),
+	EN(KVMI_EVENT_TRAP),
+	EN(KVMI_EVENT_DESCRIPTOR),
+	EN(KVMI_EVENT_CREATE_VCPU),
+	EN(KVMI_EVENT_PAUSE_VCPU),
+	{-1, NULL}
+};
+
+static const struct trace_print_flags kvmi_action_symbol[] = {
+	{KVMI_EVENT_ACTION_CONTINUE, "continue"},
+	{KVMI_EVENT_ACTION_RETRY, "retry"},
+	{KVMI_EVENT_ACTION_CRASH, "crash"},
+	{-1, NULL}
+};
+
+#endif /* __TRACE_KVMI_STRUCTURES */
+
+TRACE_EVENT(
+	kvmi_msg_dispatch,
+	TP_PROTO(__u16 id, __u16 size),
+	TP_ARGS(id, size),
+	TP_STRUCT__entry(
+		__field(__u16, id)
+		__field(__u16, size)
+	),
+	TP_fast_assign(
+		__entry->id = id;
+		__entry->size = size;
+	),
+	TP_printk("%s size %u",
+		  trace_print_symbols_seq(p, __entry->id, kvmi_msg_id_symbol),
+		  __entry->size)
+);
+
+TRACE_EVENT(
+	kvmi_send_event,
+	TP_PROTO(__u32 id),
+	TP_ARGS(id),
+	TP_STRUCT__entry(
+		__field(__u32, id)
+	),
+	TP_fast_assign(
+		__entry->id = id;
+	),
+	TP_printk("%s",
+		trace_print_symbols_seq(p, __entry->id, kvmi_event_id_symbol))
+);
+
+#define KVMI_ACCESS_PRINTK() ({                                         \
+	const char *saved_ptr = trace_seq_buffer_ptr(p);		\
+	static const char * const access_str[] = {			\
+		"---", "r--", "-w-", "rw-", "--x", "r-x", "-wx", "rwx"  \
+	};							        \
+	trace_seq_printf(p, "%s", access_str[__entry->access & 7]);	\
+	saved_ptr;							\
+})
+
+TRACE_EVENT(
+	kvmi_set_mem_access,
+	TP_PROTO(__u64 gfn, __u8 access, int err),
+	TP_ARGS(gfn, access, err),
+	TP_STRUCT__entry(
+		__field(__u64, gfn)
+		__field(__u8, access)
+		__field(int, err)
+	),
+	TP_fast_assign(
+		__entry->gfn = gfn;
+		__entry->access = access;
+		__entry->err = err;
+	),
+	TP_printk("gfn %llx %s %s %d",
+		  __entry->gfn, KVMI_ACCESS_PRINTK(),
+		  __entry->err ? "failed" : "succeeded", __entry->err)
+);
+
+TRACE_EVENT(
+	kvmi_apply_mem_access,
+	TP_PROTO(__u64 gfn, __u8 access, int err),
+	TP_ARGS(gfn, access, err),
+	TP_STRUCT__entry(
+		__field(__u64, gfn)
+		__field(__u8, access)
+		__field(int, err)
+	),
+	TP_fast_assign(
+		__entry->gfn = gfn;
+		__entry->access = access;
+		__entry->err = err;
+	),
+	TP_printk("gfn %llx %s flush %s %d",
+		  __entry->gfn, KVMI_ACCESS_PRINTK(),
+		  __entry->err ? "failed" : "succeeded", __entry->err)
+);
+
+TRACE_EVENT(
+	kvmi_event_page_fault,
+	TP_PROTO(__u64 gpa, __u64 gva, __u8 access, __u64 old_rip,
+		 __u32 action, __u64 new_rip, __u32 ctx_size),
+	TP_ARGS(gpa, gva, access, old_rip, action, new_rip, ctx_size),
+	TP_STRUCT__entry(
+		__field(__u64, gpa)
+		__field(__u64, gva)
+		__field(__u8, access)
+		__field(__u64, old_rip)
+		__field(__u32, action)
+		__field(__u64, new_rip)
+		__field(__u32, ctx_size)
+	),
+	TP_fast_assign(
+		__entry->gpa = gpa;
+		__entry->gva = gva;
+		__entry->access = access;
+		__entry->old_rip = old_rip;
+		__entry->action = action;
+		__entry->new_rip = new_rip;
+		__entry->ctx_size = ctx_size;
+	),
+	TP_printk("gpa %llx %s gva %llx rip %llx -> %s rip %llx ctx %u",
+		  __entry->gpa,
+		  KVMI_ACCESS_PRINTK(),
+		  __entry->gva,
+		  __entry->old_rip,
+		  trace_print_symbols_seq(p, __entry->action,
+					  kvmi_action_symbol),
+		  __entry->new_rip, __entry->ctx_size)
+);
+
+#endif /* _TRACE_KVMI_H */
+
+#include <trace/define_trace.h>
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 496e59a2738b..6b7c4469b808 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1359,6 +1359,14 @@ struct kvm_s390_ucas_mapping {
 #define KVM_S390_GET_CMMA_BITS      _IOWR(KVMIO, 0xb8, struct kvm_s390_cmma_log)
 #define KVM_S390_SET_CMMA_BITS      _IOW(KVMIO, 0xb9, struct kvm_s390_cmma_log)
 
+struct kvm_introspection {
+	int fd;
+	__u32 padding;
+	__u32 commands;
+	__u32 events;
+};
+#define KVM_INTROSPECTION      _IOW(KVMIO, 0xff, struct kvm_introspection)
+
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
 #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..9d363c802305 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -92,11 +92,6 @@ extern unsigned long highest_memmap_pfn;
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
 
-/*
- * in mm/rmap.c:
- */
-extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
-
 /*
  * in mm/page_alloc.c
  */
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
new file mode 100644
index 000000000000..c4cdaeddac45
--- /dev/null
+++ b/virt/kvm/kvmi.c
@@ -0,0 +1,1410 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection
+ *
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ */
+#include <linux/mmu_context.h>
+#include <linux/random.h>
+#include <uapi/linux/kvmi.h>
+#include <uapi/asm/kvmi.h>
+#include "../../arch/x86/kvm/x86.h"
+#include "../../arch/x86/kvm/mmu.h"
+#include <asm/vmx.h>
+#include "cpuid.h"
+#include "kvmi_int.h"
+#include <asm/kvm_page_track.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/kvmi.h>
+
+struct kvmi_mem_access {
+	struct list_head link;
+	gfn_t gfn;
+	u8 access;
+	bool active[KVM_PAGE_TRACK_MAX];
+	struct kvm_memory_slot *slot;
+};
+
+static void wakeup_events(struct kvm *kvm);
+static bool kvmi_page_fault_event(struct kvm_vcpu *vcpu, unsigned long gpa,
+			   unsigned long gva, u8 access);
+
+static struct workqueue_struct *wq;
+
+static const u8 full_access = KVMI_PAGE_ACCESS_R |
+			      KVMI_PAGE_ACCESS_W | KVMI_PAGE_ACCESS_X;
+
+static const struct {
+	unsigned int allow_bit;
+	enum kvm_page_track_mode track_mode;
+} track_modes[] = {
+	{ KVMI_PAGE_ACCESS_R, KVM_PAGE_TRACK_PREREAD },
+	{ KVMI_PAGE_ACCESS_W, KVM_PAGE_TRACK_PREWRITE },
+	{ KVMI_PAGE_ACCESS_X, KVM_PAGE_TRACK_PREEXEC },
+};
+
+void kvmi_make_request(struct kvmi_vcpu *ivcpu, int req)
+{
+	set_bit(req, &ivcpu->requests);
+	/* Make sure the bit is set when the worker wakes up */
+	smp_wmb();
+	up(&ivcpu->sem_requests);
+}
+
+void kvmi_clear_request(struct kvmi_vcpu *ivcpu, int req)
+{
+	clear_bit(req, &ivcpu->requests);
+}
+
+int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	/*
+	 * This vcpu is already stopped, executing this command
+	 * as a result of the REQ_CMD bit being set
+	 * (see kvmi_handle_request).
+	 */
+	if (ivcpu->pause)
+		return -KVM_EBUSY;
+
+	ivcpu->pause = true;
+
+	return 0;
+}
+
+static void kvmi_apply_mem_access(struct kvm *kvm,
+				  struct kvm_memory_slot *slot,
+				  struct kvmi_mem_access *m)
+{
+	int idx, k;
+
+	if (!slot) {
+		slot = gfn_to_memslot(kvm, m->gfn);
+		if (!slot)
+			return;
+	}
+
+	idx = srcu_read_lock(&kvm->srcu);
+
+	spin_lock(&kvm->mmu_lock);
+
+	for (k = 0; k < ARRAY_SIZE(track_modes); k++) {
+		unsigned int allow_bit = track_modes[k].allow_bit;
+		enum kvm_page_track_mode mode = track_modes[k].track_mode;
+
+		if (m->access & allow_bit) {
+			if (m->active[mode] && m->slot == slot) {
+				kvm_slot_page_track_remove_page(kvm, slot,
+								m->gfn, mode);
+				m->active[mode] = false;
+				m->slot = NULL;
+			}
+		} else if (!m->active[mode] || m->slot != slot) {
+			kvm_slot_page_track_add_page(kvm, slot, m->gfn, mode);
+			m->active[mode] = true;
+			m->slot = slot;
+		}
+	}
+
+	spin_unlock(&kvm->mmu_lock);
+
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
+int kvmi_set_mem_access(struct kvm *kvm, u64 gpa, u8 access)
+{
+	struct kvmi_mem_access *m;
+	struct kvmi_mem_access *__m;
+	struct kvmi *ikvm = IKVM(kvm);
+	gfn_t gfn = gpa_to_gfn(gpa);
+
+	if (kvm_is_error_hva(gfn_to_hva_safe(kvm, gfn)))
+		kvm_err("Invalid gpa %llx (or memslot not available yet)", gpa);
+
+	m = kzalloc(sizeof(struct kvmi_mem_access), GFP_KERNEL);
+	if (!m)
+		return -KVM_ENOMEM;
+
+	INIT_LIST_HEAD(&m->link);
+	m->gfn = gfn;
+	m->access = access;
+
+	mutex_lock(&ikvm->access_tree_lock);
+	__m = radix_tree_lookup(&ikvm->access_tree, m->gfn);
+	if (__m) {
+		__m->access = m->access;
+		if (list_empty(&__m->link))
+			list_add_tail(&__m->link, &ikvm->access_list);
+	} else {
+		radix_tree_insert(&ikvm->access_tree, m->gfn, m);
+		list_add_tail(&m->link, &ikvm->access_list);
+		m = NULL;
+	}
+	mutex_unlock(&ikvm->access_tree_lock);
+
+	kfree(m);
+
+	return 0;
+}
+
+static bool kvmi_test_mem_access(struct kvm *kvm, unsigned long gpa,
+				 u8 access)
+{
+	struct kvmi_mem_access *m;
+	struct kvmi *ikvm = IKVM(kvm);
+
+	if (!ikvm)
+		return false;
+
+	mutex_lock(&ikvm->access_tree_lock);
+	m = radix_tree_lookup(&ikvm->access_tree, gpa_to_gfn(gpa));
+	mutex_unlock(&ikvm->access_tree_lock);
+
+	/*
+	 * We want to be notified only for violations involving access
+	 * bits that we've specifically cleared
+	 */
+	if (m && ((~m->access) & access))
+		return true;
+
+	return false;
+}
+
+static struct kvmi_mem_access *
+kvmi_get_mem_access_unlocked(struct kvm *kvm, const gfn_t gfn)
+{
+	return radix_tree_lookup(&IKVM(kvm)->access_tree, gfn);
+}
+
+static bool is_introspected(struct kvmi *ikvm)
+{
+	return (ikvm && ikvm->sock);
+}
+
+void kvmi_flush_mem_access(struct kvm *kvm)
+{
+	struct kvmi *ikvm = IKVM(kvm);
+
+	if (!ikvm)
+		return;
+
+	mutex_lock(&ikvm->access_tree_lock);
+	while (!list_empty(&ikvm->access_list)) {
+		struct kvmi_mem_access *m =
+			list_first_entry(&ikvm->access_list,
+					 struct kvmi_mem_access, link);
+
+		list_del_init(&m->link);
+
+		kvmi_apply_mem_access(kvm, NULL, m);
+
+		if (m->access == full_access) {
+			radix_tree_delete(&ikvm->access_tree, m->gfn);
+			kfree(m);
+		}
+	}
+	mutex_unlock(&ikvm->access_tree_lock);
+}
+
+static void kvmi_free_mem_access(struct kvm *kvm)
+{
+	void **slot;
+	struct radix_tree_iter iter;
+	struct kvmi *ikvm = IKVM(kvm);
+
+	mutex_lock(&ikvm->access_tree_lock);
+	radix_tree_for_each_slot(slot, &ikvm->access_tree, &iter, 0) {
+		struct kvmi_mem_access *m = *slot;
+
+		m->access = full_access;
+		kvmi_apply_mem_access(kvm, NULL, m);
+
+		radix_tree_delete(&ikvm->access_tree, m->gfn);
+		kfree(*slot);
+	}
+	mutex_unlock(&ikvm->access_tree_lock);
+}
+
+static unsigned long *msr_mask(struct kvmi *ikvm, unsigned int *msr)
+{
+	switch (*msr) {
+	case 0 ... 0x1fff:
+		return ikvm->msr_mask.low;
+	case 0xc0000000 ... 0xc0001fff:
+		*msr &= 0x1fff;
+		return ikvm->msr_mask.high;
+	}
+	return NULL;
+}
+
+static bool test_msr_mask(struct kvmi *ikvm, unsigned int msr)
+{
+	unsigned long *mask = msr_mask(ikvm, &msr);
+
+	if (!mask)
+		return false;
+	if (!test_bit(msr, mask))
+		return false;
+
+	return true;
+}
+
+static int msr_control(struct kvmi *ikvm, unsigned int msr, bool enable)
+{
+	unsigned long *mask = msr_mask(ikvm, &msr);
+
+	if (!mask)
+		return -KVM_EINVAL;
+	if (enable)
+		set_bit(msr, mask);
+	else
+		clear_bit(msr, mask);
+	return 0;
+}
+
+unsigned int kvmi_vcpu_mode(const struct kvm_vcpu *vcpu,
+				   const struct kvm_sregs *sregs)
+{
+	unsigned int mode = 0;
+
+	if (is_long_mode((struct kvm_vcpu *) vcpu)) {
+		if (sregs->cs.l)
+			mode = 8;
+		else if (!sregs->cs.db)
+			mode = 2;
+		else
+			mode = 4;
+	} else if (sregs->cr0 & X86_CR0_PE) {
+		if (!sregs->cs.db)
+			mode = 2;
+		else
+			mode = 4;
+	} else if (!sregs->cs.db)
+		mode = 2;
+	else
+		mode = 4;
+
+	return mode;
+}
+
+static int maybe_delayed_init(void)
+{
+	if (wq)
+		return 0;
+
+	wq = alloc_workqueue("kvmi", WQ_CPU_INTENSIVE, 0);
+	if (!wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+int kvmi_init(void)
+{
+	return 0;
+}
+
+static void work_cb(struct work_struct *work)
+{
+	struct kvmi *ikvm = container_of(work, struct kvmi, work);
+	struct kvm   *kvm = ikvm->kvm;
+
+	while (kvmi_msg_process(ikvm))
+		;
+
+	/* We are no longer interested in any kind of events */
+	atomic_set(&ikvm->event_mask, 0);
+
+	/* Clean-up for the next kvmi_hook() call */
+	ikvm->cr_mask = 0;
+	memset(&ikvm->msr_mask, 0, sizeof(ikvm->msr_mask));
+
+	wakeup_events(kvm);
+
+	/* Restore the spte access rights */
+	/* Shouldn't wait for reconnection? */
+	kvmi_free_mem_access(kvm);
+
+	complete_all(&ikvm->finished);
+}
+
+static void __alloc_vcpu_kvmi(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = kzalloc(sizeof(struct kvmi_vcpu), GFP_KERNEL);
+
+	if (ivcpu) {
+		sema_init(&ivcpu->sem_requests, 0);
+
+		/*
+		 * Make sure the ivcpu is initialized
+		 * before making it visible.
+		 */
+		smp_wmb();
+
+		vcpu->kvmi = ivcpu;
+
+		kvmi_make_request(ivcpu, REQ_INIT);
+		kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
+	}
+}
+
+void kvmi_vcpu_init(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm = IKVM(vcpu->kvm);
+
+	if (is_introspected(ikvm)) {
+		mutex_lock(&vcpu->kvm->lock);
+		__alloc_vcpu_kvmi(vcpu);
+		mutex_unlock(&vcpu->kvm->lock);
+	}
+}
+
+void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu)
+{
+	kfree(IVCPU(vcpu));
+}
+
+static bool __alloc_kvmi(struct kvm *kvm)
+{
+	struct kvmi *ikvm = kzalloc(sizeof(struct kvmi), GFP_KERNEL);
+
+	if (ikvm) {
+		INIT_LIST_HEAD(&ikvm->access_list);
+		mutex_init(&ikvm->access_tree_lock);
+		INIT_RADIX_TREE(&ikvm->access_tree, GFP_KERNEL);
+		rwlock_init(&ikvm->sock_lock);
+		init_completion(&ikvm->finished);
+		INIT_WORK(&ikvm->work, work_cb);
+
+		kvm->kvmi = ikvm;
+		ikvm->kvm = kvm; /* work_cb */
+	}
+
+	return (ikvm != NULL);
+}
+
+static bool alloc_kvmi(struct kvm *kvm)
+{
+	bool done;
+
+	mutex_lock(&kvm->lock);
+	done = (
+		maybe_delayed_init() == 0    &&
+		IKVM(kvm)            == NULL &&
+		__alloc_kvmi(kvm)    == true
+	);
+	mutex_unlock(&kvm->lock);
+
+	return done;
+}
+
+static void alloc_all_kvmi_vcpu(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		if (!IKVM(vcpu))
+			__alloc_vcpu_kvmi(vcpu);
+	mutex_unlock(&kvm->lock);
+}
+
+static bool setup_socket(struct kvm *kvm, struct kvm_introspection *qemu)
+{
+	struct kvmi *ikvm = IKVM(kvm);
+
+	if (is_introspected(ikvm)) {
+		kvm_err("Guest already introspected\n");
+		return false;
+	}
+
+	if (!kvmi_msg_init(ikvm, qemu->fd))
+		return false;
+
+	ikvm->cmd_allow_mask = -1; /* TODO: qemu->commands; */
+	ikvm->event_allow_mask = -1; /* TODO: qemu->events; */
+
+	alloc_all_kvmi_vcpu(kvm);
+	queue_work(wq, &ikvm->work);
+
+	return true;
+}
+
+/*
+ * When called from outside a page fault handler, this call should
+ * return ~0ull
+ */
+static u64 kvmi_mmu_fault_gla(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+	u64 gla;
+	u64 gla_val;
+	u64 v;
+
+	if (!vcpu->arch.gpa_available)
+		return ~0ull;
+
+	gla = kvm_mmu_fault_gla(vcpu);
+	if (gla == ~0ull)
+		return gla;
+	gla_val = gla;
+
+	/* Handle the potential overflow by returning ~0ull */
+	if (vcpu->arch.gpa_val > gpa) {
+		v = vcpu->arch.gpa_val - gpa;
+		if (v > gla)
+			gla = ~0ull;
+		else
+			gla -= v;
+	} else {
+		v = gpa - vcpu->arch.gpa_val;
+		if (v > (U64_MAX - gla))
+			gla = ~0ull;
+		else
+			gla += v;
+	}
+
+	return gla;
+}
+
+static bool kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       u8 *new,
+			       int bytes,
+			       struct kvm_page_track_notifier_node *node,
+			       bool *data_ready)
+{
+	u64 gla;
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	bool ret = true;
+
+	if (kvm_mmu_nested_guest_page_fault(vcpu))
+		return ret;
+	gla = kvmi_mmu_fault_gla(vcpu, gpa);
+	ret = kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_R);
+	if (ivcpu && ivcpu->ctx_size > 0) {
+		int s = min_t(int, bytes, ivcpu->ctx_size);
+
+		memcpy(new, ivcpu->ctx_data, s);
+		ivcpu->ctx_size = 0;
+
+		if (*data_ready)
+			kvm_err("Override custom data");
+
+		*data_ready = true;
+	}
+
+	return ret;
+}
+
+static bool kvmi_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa,
+				const u8 *new,
+				int bytes,
+				struct kvm_page_track_notifier_node *node)
+{
+	u64 gla;
+
+	if (kvm_mmu_nested_guest_page_fault(vcpu))
+		return true;
+	gla = kvmi_mmu_fault_gla(vcpu, gpa);
+	return kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_W);
+}
+
+static bool kvmi_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa,
+				struct kvm_page_track_notifier_node *node)
+{
+	u64 gla;
+
+	if (kvm_mmu_nested_guest_page_fault(vcpu))
+		return true;
+	gla = kvmi_mmu_fault_gla(vcpu, gpa);
+
+	return kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_X);
+}
+
+static void kvmi_track_create_slot(struct kvm *kvm,
+				   struct kvm_memory_slot *slot,
+				   unsigned long npages,
+				   struct kvm_page_track_notifier_node *node)
+{
+	struct kvmi *ikvm = IKVM(kvm);
+	gfn_t start = slot->base_gfn;
+	const gfn_t end = start + npages;
+
+	if (!ikvm)
+		return;
+
+	mutex_lock(&ikvm->access_tree_lock);
+
+	while (start < end) {
+		struct kvmi_mem_access *m;
+
+		m = kvmi_get_mem_access_unlocked(kvm, start);
+		if (m)
+			kvmi_apply_mem_access(kvm, slot, m);
+		start++;
+	}
+
+	mutex_unlock(&ikvm->access_tree_lock);
+}
+
+static void kvmi_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
+				  struct kvm_page_track_notifier_node *node)
+{
+	struct kvmi *ikvm = IKVM(kvm);
+	gfn_t start = slot->base_gfn;
+	const gfn_t end = start + slot->npages;
+
+	if (!ikvm)
+		return;
+
+	mutex_lock(&ikvm->access_tree_lock);
+
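+	/*
+	 * Temporarily restore the full access rights in the SPTEs of this
+	 * slot; the recorded restrictions are kept, so they can be
+	 * re-applied if the slot is recreated (see kvmi_track_create_slot()).
+	 */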
+	while (start < end) {
+		struct kvmi_mem_access *m;
+
+		m = kvmi_get_mem_access_unlocked(kvm, start);
+		if (m) {
+			u8 prev_access = m->access;
+
+			m->access = full_access;
+			kvmi_apply_mem_access(kvm, slot, m);
+			m->access = prev_access;
+		}
+		start++;
+	}
+
+	mutex_unlock(&ikvm->access_tree_lock);
+}
+
+static struct kvm_page_track_notifier_node kptn_node = {
+	.track_preread = kvmi_track_preread,
+	.track_prewrite = kvmi_track_prewrite,
+	.track_preexec = kvmi_track_preexec,
+	.track_create_slot = kvmi_track_create_slot,
+	.track_flush_slot = kvmi_track_flush_slot
+};
+
+bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu)
+{
+	kvm_info("Hooking vm with fd: %d\n", qemu->fd);
+
+	kvm_page_track_register_notifier(kvm, &kptn_node);
+
+	return (alloc_kvmi(kvm) && setup_socket(kvm, qemu));
+}
+
+void kvmi_destroy_vm(struct kvm *kvm)
+{
+	struct kvmi *ikvm = IKVM(kvm);
+
+	if (ikvm) {
+		kvmi_msg_uninit(ikvm);
+
+		mutex_destroy(&ikvm->access_tree_lock);
+		kfree(ikvm);
+	}
+
+	kvmi_mem_destroy_vm(kvm);
+}
+
+void kvmi_uninit(void)
+{
+	if (wq) {
+		destroy_workqueue(wq);
+		wq = NULL;
+	}
+}
+
+void kvmi_get_msrs(struct kvm_vcpu *vcpu, struct kvmi_event *event)
+{
+	struct msr_data msr;
+
+	msr.host_initiated = true;
+
+	msr.index = MSR_IA32_SYSENTER_CS;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.sysenter_cs = msr.data;
+
+	msr.index = MSR_IA32_SYSENTER_ESP;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.sysenter_esp = msr.data;
+
+	msr.index = MSR_IA32_SYSENTER_EIP;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.sysenter_eip = msr.data;
+
+	msr.index = MSR_EFER;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.efer = msr.data;
+
+	msr.index = MSR_STAR;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.star = msr.data;
+
+	msr.index = MSR_LSTAR;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.lstar = msr.data;
+
+	msr.index = MSR_CSTAR;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.cstar = msr.data;
+
+	msr.index = MSR_IA32_CR_PAT;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.pat = msr.data;
+}
+
+static bool is_event_enabled(struct kvm *kvm, int event_bit)
+{
+	struct kvmi *ikvm = IKVM(kvm);
+
+	return (ikvm && (atomic_read(&ikvm->event_mask) & event_bit));
+}
+
+static int kvmi_vcpu_kill(int sig, struct kvm_vcpu *vcpu)
+{
+	int err = -ESRCH;
+	struct pid *pid;
+	struct siginfo siginfo[1] = { };
+
+	rcu_read_lock();
+	pid = rcu_dereference(vcpu->pid);
+	if (pid)
+		err = kill_pid_info(sig, siginfo, pid);
+	rcu_read_unlock();
+
+	return err;
+}
+
+static void kvmi_vm_shutdown(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		kvmi_vcpu_kill(SIGTERM, vcpu);
+	}
+	mutex_unlock(&kvm->lock);
+}
+
+/* TODO: Do we need a return code ? */
+static void handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action)
+{
+	switch (action) {
+	case KVMI_EVENT_ACTION_CRASH:
+		kvmi_vm_shutdown(vcpu->kvm);
+		break;
+
+	default:
+		kvm_err("Unsupported event action: %d\n", action);
+	}
+}
+
+bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
+		   unsigned long old_value, unsigned long *new_value)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u64 ret_value;
+	u32 action;
+
+	if (!is_event_enabled(kvm, KVMI_EVENT_CR))
+		return true;
+	if (!test_bit(cr, &IKVM(kvm)->cr_mask))
+		return true;
+	if (old_value == *new_value)
+		return true;
+
+	action = kvmi_msg_send_cr(vcpu, cr, old_value, *new_value, &ret_value);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		*new_value = ret_value;
+		return true;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	return false;
+}
+
+bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u64 ret_value;
+	u32 action;
+	struct msr_data old_msr = { .host_initiated = true,
+				    .index = msr->index };
+
+	if (msr->host_initiated)
+		return true;
+	if (!is_event_enabled(kvm, KVMI_EVENT_MSR))
+		return true;
+	if (!test_msr_mask(IKVM(kvm), msr->index))
+		return true;
+	if (kvm_get_msr(vcpu, &old_msr))
+		return true;
+	if (old_msr.data == msr->data)
+		return true;
+
+	action = kvmi_msg_send_msr(vcpu, msr->index, old_msr.data, msr->data,
+				   &ret_value);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		msr->data = ret_value;
+		return true;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	return false;
+}
+
+void kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_XSETBV))
+		return;
+
+	action = kvmi_msg_send_xsetbv(vcpu);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+}
+
+bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva)
+{
+	u32 action;
+	u64 gpa;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_BREAKPOINT))
+		/* qemu will automatically reinject the breakpoint */
+		return false;
+
+	gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
+
+	if (gpa == UNMAPPED_GVA)
+		kvm_err("%s: invalid gva: %llx", __func__, gva);
+
+	action = kvmi_msg_send_bp(vcpu, gpa);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	case KVMI_EVENT_ACTION_RETRY:
+		/* rip was most likely adjusted past the INT 3 instruction */
+		return true;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	/* qemu will automatically reinject the breakpoint */
+	return false;
+}
+EXPORT_SYMBOL(kvmi_breakpoint_event);
+
+#define KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT 24
+bool kvmi_is_agent_hypercall(struct kvm_vcpu *vcpu)
+{
+	unsigned long subfunc1, subfunc2;
+	bool longmode = is_64_bit_mode(vcpu);
+	unsigned long nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
+
+	if (longmode) {
+		subfunc1 = kvm_register_read(vcpu, VCPU_REGS_RDI);
+		subfunc2 = kvm_register_read(vcpu, VCPU_REGS_RSI);
+	} else {
+		nr &= 0xFFFFFFFF;
+		subfunc1 = kvm_register_read(vcpu, VCPU_REGS_RBX);
+		subfunc1 &= 0xFFFFFFFF;
+		subfunc2 = kvm_register_read(vcpu, VCPU_REGS_RCX);
+		subfunc2 &= 0xFFFFFFFF;
+	}
+
+	return (nr == KVM_HC_XEN_HVM_OP
+		&& subfunc1 == KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT
+		&& subfunc2 == 0);
+}
+
+void kvmi_hypercall_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_HYPERCALL)
+			|| !kvmi_is_agent_hypercall(vcpu))
+		return;
+
+	action = kvmi_msg_send_hypercall(vcpu);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+}
+
+bool kvmi_page_fault_event(struct kvm_vcpu *vcpu, unsigned long gpa,
+			   unsigned long gva, u8 access)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvmi_vcpu *ivcpu;
+	bool trap_access, ret = true;
+	u32 ctx_size;
+	u64 old_rip;
+	u32 action;
+
+	if (!is_event_enabled(kvm, KVMI_EVENT_PAGE_FAULT))
+		return true;
+
+	/* Have we shown interest in this page? */
+	if (!kvmi_test_mem_access(kvm, gpa, access))
+		return true;
+
+	ivcpu    = IVCPU(vcpu);
+	ctx_size = sizeof(ivcpu->ctx_data);
+	old_rip  = kvm_rip_read(vcpu);
+
+	if (!kvmi_msg_send_pf(vcpu, gpa, gva, access, &action,
+			      &trap_access,
+			      ivcpu->ctx_data, &ctx_size))
+		goto out;
+
+	ivcpu->ctx_size = 0;
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		ivcpu->ctx_size = ctx_size;
+		break;
+	case KVMI_EVENT_ACTION_RETRY:
+		ret = false;
+		break;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	/* TODO: trap_access -> don't REPeat the instruction */
+out:
+	trace_kvmi_event_page_fault(gpa, gva, access, old_rip, action,
+				    kvm_rip_read(vcpu), ctx_size);
+	return ret;
+}
+
+bool kvmi_lost_exception(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	if (!ivcpu || !ivcpu->exception.injected)
+		return false;
+
+	ivcpu->exception.injected = 0;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_TRAP))
+		return false;
+
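+	/*
+	 * If the exception queued with KVMI_INJECT_EXCEPTION is still
+	 * pending/injected with the same vector and error code, it was not
+	 * overridden.
+	 */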
+	if ((vcpu->arch.exception.injected || vcpu->arch.exception.pending)
+		&& vcpu->arch.exception.nr == ivcpu->exception.nr
+		&& vcpu->arch.exception.error_code
+			== ivcpu->exception.error_code)
+		return false;
+
+	return true;
+}
+
+void kvmi_trap_event(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	u32 vector, type, err;
+	u32 action;
+
+	if (vcpu->arch.exception.pending) {
+		vector = vcpu->arch.exception.nr;
+		err = vcpu->arch.exception.error_code;
+
+		if (kvm_exception_is_soft(vector))
+			type = INTR_TYPE_SOFT_EXCEPTION;
+		else
+			type = INTR_TYPE_HARD_EXCEPTION;
+	} else if (vcpu->arch.interrupt.pending) {
+		vector = vcpu->arch.interrupt.nr;
+		err = 0;
+
+		if (vcpu->arch.interrupt.soft)
+			type = INTR_TYPE_SOFT_INTR;
+		else
+			type = INTR_TYPE_EXT_INTR;
+	} else {
+		vector = 0;
+		type = 0;
+		err = 0;
+	}
+
+	kvm_err("New exception nr %d/%d err %x/%x addr %lx",
+		vector, ivcpu->exception.nr,
+		err, ivcpu->exception.error_code,
+		vcpu->arch.cr2);
+
+	action = kvmi_msg_send_trap(vcpu, vector, type, err, vcpu->arch.cr2);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+}
+
+bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u32 info,
+			   unsigned long exit_qualification,
+			   unsigned char descriptor, unsigned char write)
+{
+	u32 action;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_DESCRIPTOR))
+		return true;
+
+	action = kvmi_msg_send_descriptor(vcpu, info, exit_qualification,
+					  descriptor, write);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		return true;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	return false; /* TODO: double check this */
+}
+EXPORT_SYMBOL(kvmi_descriptor_event);
+
+static bool kvmi_create_vcpu_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_CREATE_VCPU))
+		return false;
+
+	action = kvmi_msg_send_create_vcpu(vcpu);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	return true;
+}
+
+static bool kvmi_pause_vcpu_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+
+	IVCPU(vcpu)->pause = false;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_PAUSE_VCPU))
+		return false;
+
+	action = kvmi_msg_send_pause_vcpu(vcpu);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	return true;
+}
+
+/* TODO: refactor this function to avoid recursive calls and the semaphore. */
+void kvmi_handle_request(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	while (ivcpu->ev_rpl_waiting
+		|| READ_ONCE(ivcpu->requests)) {
+
+		down(&ivcpu->sem_requests);
+
+		if (test_bit(REQ_INIT, &ivcpu->requests)) {
+			/*
+			 * kvmi_create_vcpu_event() may call this function
+			 * again and won't return unless there is no more work
+			 * to be done. The while condition will be evaluated
+			 * to false, but we explicitly exit the loop to avoid
+			 * surprising the reader more than we already did.
+			 */
+			kvmi_clear_request(ivcpu, REQ_INIT);
+			if (kvmi_create_vcpu_event(vcpu))
+				break;
+		} else if (test_bit(REQ_CMD, &ivcpu->requests)) {
+			kvmi_msg_handle_vcpu_cmd(vcpu);
+			/* it will clear the REQ_CMD bit */
+			if (ivcpu->pause && !ivcpu->ev_rpl_waiting) {
+				/* Same warnings as with REQ_INIT. */
+				if (kvmi_pause_vcpu_event(vcpu))
+					break;
+			}
+		} else if (test_bit(REQ_REPLY, &ivcpu->requests)) {
+			kvmi_clear_request(ivcpu, REQ_REPLY);
+			ivcpu->ev_rpl_waiting = false;
+			if (ivcpu->have_delayed_regs) {
+				kvm_arch_vcpu_set_regs(vcpu,
+							&ivcpu->delayed_regs);
+				ivcpu->have_delayed_regs = false;
+			}
+			if (ivcpu->pause) {
+				/* Same warnings as with REQ_INIT. */
+				if (kvmi_pause_vcpu_event(vcpu))
+					break;
+			}
+		} else if (test_bit(REQ_CLOSE, &ivcpu->requests)) {
+			kvmi_clear_request(ivcpu, REQ_CLOSE);
+			break;
+		} else {
+			kvm_err("Unexpected request");
+		}
+	}
+
+	kvmi_flush_mem_access(vcpu->kvm);
+	/* TODO: merge with kvmi_set_mem_access() */
+}
+
+int kvmi_cmd_get_cpuid(struct kvm_vcpu *vcpu, u32 function, u32 index,
+		       u32 *eax, u32 *ebx, u32 *ecx, u32 *edx)
+{
+	struct kvm_cpuid_entry2 *e;
+
+	e = kvm_find_cpuid_entry(vcpu, function, index);
+	if (!e)
+		return -KVM_ENOENT;
+
+	*eax = e->eax;
+	*ebx = e->ebx;
+	*ecx = e->ecx;
+	*edx = e->edx;
+
+	return 0;
+}
+
+int kvmi_cmd_get_guest_info(struct kvm_vcpu *vcpu, u16 *vcpu_cnt, u64 *tsc)
+{
+	/*
+	 * Should we switch vcpu_cnt to unsigned int?
+	 * If not, we should limit this to max u16 - 1
+	 */
+	*vcpu_cnt = atomic_read(&vcpu->kvm->online_vcpus);
+	if (kvm_has_tsc_control)
+		*tsc = 1000ul * vcpu->arch.virtual_tsc_khz;
+	else
+		*tsc = 0;
+
+	return 0;
+}
+
+static int get_first_vcpu(struct kvm *kvm, struct kvm_vcpu **vcpu)
+{
+	struct kvm_vcpu *v;
+
+	if (!atomic_read(&kvm->online_vcpus))
+		return -KVM_EINVAL;
+
+	v = kvm_get_vcpu(kvm, 0);
+
+	if (!v)
+		return -KVM_EINVAL;
+
+	*vcpu = v;
+
+	return 0;
+}
+
+int kvmi_cmd_get_registers(struct kvm_vcpu *vcpu, u32 *mode,
+			   struct kvm_regs *regs,
+			   struct kvm_sregs *sregs, struct kvm_msrs *msrs)
+{
+	struct kvm_msr_entry  *msr = msrs->entries;
+	unsigned int	       n   = msrs->nmsrs;
+
+	kvm_arch_vcpu_ioctl_get_regs(vcpu, regs);
+	kvm_arch_vcpu_ioctl_get_sregs(vcpu, sregs);
+	*mode = kvmi_vcpu_mode(vcpu, sregs);
+
+	for (; n--; msr++) {
+		struct msr_data m   = { .index = msr->index };
+		int		err = kvm_get_msr(vcpu, &m);
+
+		if (err)
+			return -KVM_EINVAL;
+
+		msr->data = m.data;
+	}
+
+	return 0;
+}
+
+int kvmi_cmd_set_registers(struct kvm_vcpu *vcpu, const struct kvm_regs *regs)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	if (ivcpu->ev_rpl_waiting) {
+		memcpy(&ivcpu->delayed_regs, regs, sizeof(ivcpu->delayed_regs));
+		ivcpu->have_delayed_regs = true;
+	} else
+		kvm_err("Drop KVMI_SET_REGISTERS");
+	return 0;
+}
+
+int kvmi_cmd_get_page_access(struct kvm_vcpu *vcpu, u64 gpa, u8 *access)
+{
+	struct kvmi *ikvm = IKVM(vcpu->kvm);
+	struct kvmi_mem_access *m;
+
+	mutex_lock(&ikvm->access_tree_lock);
+	m = kvmi_get_mem_access_unlocked(vcpu->kvm, gpa_to_gfn(gpa));
+	*access = m ? m->access : full_access;
+	mutex_unlock(&ikvm->access_tree_lock);
+
+	return 0;
+}
+
+static bool is_vector_valid(u8 vector)
+{
+	return true;
+}
+
+static bool is_gva_valid(struct kvm_vcpu *vcpu, u64 gva)
+{
+	return true;
+}
+
+int kvmi_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
+			      bool error_code_valid, u16 error_code,
+			      u64 address)
+{
+	struct x86_exception e = {
+		.vector = vector,
+		.error_code_valid = error_code_valid,
+		.error_code = error_code,
+		.address = address,
+	};
+
+	if (!(is_vector_valid(vector) && is_gva_valid(vcpu, address)))
+		return -KVM_EINVAL;
+
+	if (e.vector == PF_VECTOR)
+		kvm_inject_page_fault(vcpu, &e);
+	else if (e.error_code_valid)
+		kvm_queue_exception_e(vcpu, e.vector, e.error_code);
+	else
+		kvm_queue_exception(vcpu, e.vector);
+
+	if (IVCPU(vcpu)->exception.injected)
+		kvm_err("Override exception");
+
+	IVCPU(vcpu)->exception.injected = 1;
+	IVCPU(vcpu)->exception.nr = e.vector;
+	IVCPU(vcpu)->exception.error_code = error_code_valid ? error_code : 0;
+
+	return 0;
+}
+
+unsigned long gfn_to_hva_safe(struct kvm *kvm, gfn_t gfn)
+{
+	unsigned long hva;
+
+	mutex_lock(&kvm->slots_lock);
+	hva = gfn_to_hva(kvm, gfn);
+	mutex_unlock(&kvm->slots_lock);
+
+	return hva;
+}
+
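+/*
+ * get_user_pages_remote() wrapper that takes mmap_sem and releases it,
+ * unless get_user_pages_remote() has already dropped it (locked == 0).
+ */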
+static long get_user_pages_remote_unlocked(struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long nr_pages,
+					   unsigned int gup_flags,
+					   struct page **pages)
+{
+	long ret;
+	struct task_struct *tsk = NULL;
+	struct vm_area_struct **vmas = NULL;
+	int locked = 1;
+
+	down_read(&mm->mmap_sem);
+	ret =
+	    get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
+				  vmas, &locked);
+	if (locked)
+		up_read(&mm->mmap_sem);
+	return ret;
+}
+
+int kvmi_cmd_read_physical(struct kvm *kvm, u64 gpa, u64 size, int (*send)(
+				   struct kvmi *, const struct kvmi_msg_hdr *,
+				   int err, const void *buf, size_t),
+				   const struct kvmi_msg_hdr *ctx)
+{
+	int err, ec;
+	unsigned long hva;
+	struct page *page = NULL;
+	void *ptr_page = NULL, *ptr = NULL;
+	size_t ptr_size = 0;
+	struct kvm_vcpu *vcpu;
+
+	ec = get_first_vcpu(kvm, &vcpu);
+
+	if (ec)
+		goto out;
+
+	hva = gfn_to_hva_safe(kvm, gpa_to_gfn(gpa));
+
+	if (kvm_is_error_hva(hva)) {
+		ec = -KVM_EINVAL;
+		goto out;
+	}
+
+	if (get_user_pages_remote_unlocked(kvm->mm, hva, 1, 0, &page) != 1) {
+		ec = -KVM_EINVAL;
+		goto out;
+	}
+
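+	/*
+	 * Keep the page mapped until the reply is sent, so the data can be
+	 * sent directly from the kernel mapping, without an extra copy.
+	 */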
+	ptr_page = kmap_atomic(page);
+
+	ptr = ptr_page + (gpa & ~PAGE_MASK);
+	ptr_size = size;
+
+out:
+	err = send(IKVM(kvm), ctx, ec, ptr, ptr_size);
+
+	if (ptr_page)
+		kunmap_atomic(ptr_page);
+	if (page)
+		put_page(page);
+	return err;
+}
+
+int kvmi_cmd_write_physical(struct kvm *kvm, u64 gpa, u64 size, const void *buf)
+{
+	int err;
+	unsigned long hva;
+	struct page *page;
+	void *ptr;
+	struct kvm_vcpu *vcpu;
+
+	err = get_first_vcpu(kvm, &vcpu);
+
+	if (err)
+		return err;
+
+	hva = gfn_to_hva_safe(kvm, gpa_to_gfn(gpa));
+
+	if (kvm_is_error_hva(hva))
+		return -KVM_EINVAL;
+
+	if (get_user_pages_remote_unlocked(kvm->mm, hva, 1, FOLL_WRITE,
+			&page) != 1)
+		return -KVM_EINVAL;
+
+	ptr = kmap_atomic(page);
+
+	memcpy(ptr + (gpa & ~PAGE_MASK), buf, size);
+
+	kunmap_atomic(ptr);
+	put_page(page);
+
+	return 0;
+}
+
+int kvmi_cmd_alloc_token(struct kvm *kvm, struct kvmi_map_mem_token *token)
+{
+	int err = 0;
+
+	/* create random token */
+	get_random_bytes(token, sizeof(struct kvmi_map_mem_token));
+
+	/* store token in HOST database */
+	if (kvmi_store_token(kvm, token))
+		err = -KVM_ENOMEM;
+
+	return err;
+}
+
+int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, u32 events)
+{
+	int err = 0;
+
+	if (events & ~KVMI_KNOWN_EVENTS)
+		return -KVM_EINVAL;
+
+	if (events & KVMI_EVENT_BREAKPOINT) {
+		if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_BREAKPOINT)) {
+			struct kvm_guest_debug dbg = { };
+
+			dbg.control =
+			    KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP;
+
+			err = kvm_arch_vcpu_ioctl_set_guest_debug(vcpu, &dbg);
+		}
+	}
+
+	if (!err)
+		atomic_set(&IKVM(vcpu->kvm)->event_mask, events);
+
+	return err;
+}
+
+int kvmi_cmd_control_cr(struct kvmi *ikvm, bool enable, u32 cr)
+{
+	switch (cr) {
+	case 0:
+	case 3:
+	case 4:
+		if (enable)
+			set_bit(cr, &ikvm->cr_mask);
+		else
+			clear_bit(cr, &ikvm->cr_mask);
+		return 0;
+
+	default:
+		return -KVM_EINVAL;
+	}
+}
+
+int kvmi_cmd_control_msr(struct kvm *kvm, bool enable, u32 msr)
+{
+	struct kvm_vcpu *vcpu;
+	int err;
+
+	err = get_first_vcpu(kvm, &vcpu);
+	if (err)
+		return err;
+
+	err = msr_control(IKVM(kvm), msr, enable);
+
+	if (!err)
+		kvm_arch_msr_intercept(vcpu, msr, enable);
+
+	return err;
+}
+
+void wakeup_events(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvmi_make_request(IVCPU(vcpu), REQ_CLOSE);
+	mutex_unlock(&kvm->lock);
+}
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
new file mode 100644
index 000000000000..5976b98f11cb
--- /dev/null
+++ b/virt/kvm/kvmi_int.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVMI_INT_H__
+#define __KVMI_INT_H__
+
+#include <linux/types.h>
+#include <linux/kvm_host.h>
+
+#include <uapi/linux/kvmi.h>
+
+#define IVCPU(vcpu) ((struct kvmi_vcpu *)((vcpu)->kvmi))
+
+struct kvmi_vcpu {
+	u8 ctx_data[256];
+	u32 ctx_size;
+	struct semaphore sem_requests;
+	unsigned long requests;
+	/* TODO: get this ~64KB buffer from a cache */
+	u8 msg_buf[KVMI_MAX_MSG_SIZE];
+	struct kvmi_event_reply ev_rpl;
+	void *ev_rpl_ptr;
+	size_t ev_rpl_size;
+	size_t ev_rpl_received;
+	u32 ev_seq;
+	bool ev_rpl_waiting;
+	struct {
+		u16 error_code;
+		u8 nr;
+		bool injected;
+	} exception;
+	struct kvm_regs delayed_regs;
+	bool have_delayed_regs;
+	bool pause;
+};
+
+#define IKVM(kvm) ((struct kvmi *)((kvm)->kvmi))
+
+struct kvmi {
+	atomic_t event_mask;
+	unsigned long cr_mask;
+	struct {
+		unsigned long low[BITS_TO_LONGS(8192)];
+		unsigned long high[BITS_TO_LONGS(8192)];
+	} msr_mask;
+	struct radix_tree_root access_tree;
+	struct mutex access_tree_lock;
+	struct list_head access_list;
+	struct work_struct work;
+	struct socket *sock;
+	rwlock_t sock_lock;
+	struct completion finished;
+	struct kvm *kvm;
+	/* TODO: get this ~64KB buffer from a cache */
+	u8 msg_buf[KVMI_MAX_MSG_SIZE];
+	u32 cmd_allow_mask;
+	u32 event_allow_mask;
+};
+
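+/* requests handled by the vCPU thread in kvmi_handle_request() */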
+#define REQ_INIT   0
+#define REQ_CMD    1
+#define REQ_REPLY  2
+#define REQ_CLOSE  3
+
+/* kvmi_msg.c */
+bool kvmi_msg_init(struct kvmi *ikvm, int fd);
+bool kvmi_msg_process(struct kvmi *ikvm);
+void kvmi_msg_uninit(struct kvmi *ikvm);
+void kvmi_msg_handle_vcpu_cmd(struct kvm_vcpu *vcpu);
+u32 kvmi_msg_send_cr(struct kvm_vcpu *vcpu, u32 cr, u64 old_value,
+		     u64 new_value, u64 *ret_value);
+u32 kvmi_msg_send_msr(struct kvm_vcpu *vcpu, u32 msr, u64 old_value,
+		      u64 new_value, u64 *ret_value);
+u32 kvmi_msg_send_xsetbv(struct kvm_vcpu *vcpu);
+u32 kvmi_msg_send_bp(struct kvm_vcpu *vcpu, u64 gpa);
+u32 kvmi_msg_send_hypercall(struct kvm_vcpu *vcpu);
+bool kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u32 mode,
+		      u32 *action, bool *trap_access, u8 *ctx,
+		      u32 *ctx_size);
+u32 kvmi_msg_send_trap(struct kvm_vcpu *vcpu, u32 vector, u32 type,
+		       u32 error_code, u64 cr2);
+u32 kvmi_msg_send_descriptor(struct kvm_vcpu *vcpu, u32 info,
+			     u64 exit_qualification, u8 descriptor, u8 write);
+u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu);
+u32 kvmi_msg_send_pause_vcpu(struct kvm_vcpu *vcpu);
+
+/* kvmi.c */
+int kvmi_cmd_get_guest_info(struct kvm_vcpu *vcpu, u16 *vcpu_cnt, u64 *tsc);
+int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu);
+int kvmi_cmd_get_registers(struct kvm_vcpu *vcpu, u32 *mode,
+			   struct kvm_regs *regs, struct kvm_sregs *sregs,
+			   struct kvm_msrs *msrs);
+int kvmi_cmd_set_registers(struct kvm_vcpu *vcpu, const struct kvm_regs *regs);
+int kvmi_cmd_get_page_access(struct kvm_vcpu *vcpu, u64 gpa, u8 *access);
+int kvmi_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
+			      bool error_code_valid, u16 error_code,
+			      u64 address);
+int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, u32 events);
+int kvmi_cmd_get_cpuid(struct kvm_vcpu *vcpu, u32 function, u32 index,
+		       u32 *eax, u32 *ebx, u32 *ecx, u32 *edx);
+int kvmi_cmd_read_physical(struct kvm *kvm, u64 gpa, u64 size,
+			   int (*send)(struct kvmi *,
+					const struct kvmi_msg_hdr*,
+					int err, const void *buf, size_t),
+			   const struct kvmi_msg_hdr *ctx);
+int kvmi_cmd_write_physical(struct kvm *kvm, u64 gpa, u64 size,
+			    const void *buf);
+int kvmi_cmd_alloc_token(struct kvm *kvm, struct kvmi_map_mem_token *token);
+int kvmi_cmd_control_cr(struct kvmi *ikvm, bool enable, u32 cr);
+int kvmi_cmd_control_msr(struct kvm *kvm, bool enable, u32 msr);
+int kvmi_set_mem_access(struct kvm *kvm, u64 gpa, u8 access);
+void kvmi_make_request(struct kvmi_vcpu *ivcpu, int req);
+void kvmi_clear_request(struct kvmi_vcpu *ivcpu, int req);
+unsigned int kvmi_vcpu_mode(const struct kvm_vcpu *vcpu,
+			    const struct kvm_sregs *sregs);
+void kvmi_get_msrs(struct kvm_vcpu *vcpu, struct kvmi_event *event);
+unsigned long gfn_to_hva_safe(struct kvm *kvm, gfn_t gfn);
+void kvmi_mem_destroy_vm(struct kvm *kvm);
+
+/* kvmi_mem.c */
+int kvmi_store_token(struct kvm *kvm, struct kvmi_map_mem_token *token);
+
+#endif
diff --git a/virt/kvm/kvmi_mem.c b/virt/kvm/kvmi_mem.c
new file mode 100644
index 000000000000..c766357678e6
--- /dev/null
+++ b/virt/kvm/kvmi_mem.c
@@ -0,0 +1,730 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection memory mapping implementation
+ *
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ * Author:
+ *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/kvm_host.h>
+#include <linux/rmap.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/spinlock.h>
+#include <linux/printk.h>
+#include <linux/kvmi.h>
+#include <linux/huge_mm.h>
+
+#include <uapi/linux/kvmi.h>
+
+#include "kvmi_int.h"
+
+
+static struct list_head mapping_list;
+static spinlock_t mapping_lock;
+
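+/*
+ * One entry for every guest page remapped from an introspected VM
+ * ('machine', at 'req_gpa') into the introspector's guest (at 'map_gpa').
+ */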
+struct host_map {
+	struct list_head mapping_list;
+	gpa_t map_gpa;
+	struct kvm *machine;
+	gpa_t req_gpa;
+};
+
+
+static struct list_head token_list;
+static spinlock_t token_lock;
+
+struct token_entry {
+	struct list_head token_list;
+	struct kvmi_map_mem_token token;
+	struct kvm *kvm;
+};
+
+
+int kvmi_store_token(struct kvm *kvm, struct kvmi_map_mem_token *token)
+{
+	struct token_entry *tep;
+
+	print_hex_dump_debug("kvmi: new token ", DUMP_PREFIX_NONE,
+			     32, 1, token, sizeof(struct kvmi_map_mem_token),
+			     false);
+
+	tep = kmalloc(sizeof(struct token_entry), GFP_KERNEL);
+	if (tep == NULL)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&tep->token_list);
+	memcpy(&tep->token, token, sizeof(struct kvmi_map_mem_token));
+	tep->kvm = kvm;
+
+	spin_lock(&token_lock);
+	list_add_tail(&tep->token_list, &token_list);
+	spin_unlock(&token_lock);
+
+	return 0;
+}
+
+static struct kvm *find_machine_at(struct kvm_vcpu *vcpu, gva_t tkn_gva)
+{
+	long result;
+	gpa_t tkn_gpa;
+	struct kvmi_map_mem_token token;
+	struct list_head *cur;
+	struct token_entry *tep, *found = NULL;
+	struct kvm *target_kvm = NULL;
+
+	/* the machine token is passed as a pointer (GVA) */
+	tkn_gpa = kvm_mmu_gva_to_gpa_system(vcpu, tkn_gva, NULL);
+	if (tkn_gpa == UNMAPPED_GVA)
+		return NULL;
+
+	/* copy token to local address space */
+	result = kvm_read_guest(vcpu->kvm, tkn_gpa, &token, sizeof(token));
+	if (IS_ERR_VALUE(result)) {
+		kvm_err("kvmi: failed copying token from user\n");
+		return ERR_PTR(result);
+	}
+
+	/* consume token & find the VM */
+	spin_lock(&token_lock);
+	list_for_each(cur, &token_list) {
+		tep = list_entry(cur, struct token_entry, token_list);
+
+		if (!memcmp(&token, &tep->token, sizeof(token))) {
+			list_del(&tep->token_list);
+			found = tep;
+			break;
+		}
+	}
+	spin_unlock(&token_lock);
+
+	if (found != NULL) {
+		target_kvm = found->kvm;
+		kfree(found);
+	}
+
+	return target_kvm;
+}
+
+static void remove_vm_token(struct kvm *kvm)
+{
+	struct list_head *cur, *next;
+	struct token_entry *tep;
+
+	spin_lock(&token_lock);
+	list_for_each_safe(cur, next, &token_list) {
+		tep = list_entry(cur, struct token_entry, token_list);
+
+		if (tep->kvm == kvm) {
+			list_del(&tep->token_list);
+			kfree(tep);
+		}
+	}
+	spin_unlock(&token_lock);
+
+}
+
+
+static int add_to_list(gpa_t map_gpa, struct kvm *machine, gpa_t req_gpa)
+{
+	struct host_map *map;
+
+	map = kmalloc(sizeof(struct host_map), GFP_KERNEL);
+	if (map == NULL)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&map->mapping_list);
+	map->map_gpa = map_gpa;
+	map->machine = machine;
+	map->req_gpa = req_gpa;
+
+	spin_lock(&mapping_lock);
+	list_add_tail(&map->mapping_list, &mapping_list);
+	spin_unlock(&mapping_lock);
+
+	return 0;
+}
+
+static struct host_map *extract_from_list(gpa_t map_gpa)
+{
+	struct list_head *cur;
+	struct host_map *map;
+
+	spin_lock(&mapping_lock);
+	list_for_each(cur, &mapping_list) {
+		map = list_entry(cur, struct host_map, mapping_list);
+
+		/* found - extract and return */
+		if (map->map_gpa == map_gpa) {
+			list_del(&map->mapping_list);
+			spin_unlock(&mapping_lock);
+
+			return map;
+		}
+	}
+	spin_unlock(&mapping_lock);
+
+	return NULL;
+}
+
+static void remove_vm_from_list(struct kvm *kvm)
+{
+	struct list_head *cur, *next;
+	struct host_map *map;
+
+	spin_lock(&mapping_lock);
+
+	list_for_each_safe(cur, next, &mapping_list) {
+		map = list_entry(cur, struct host_map, mapping_list);
+
+		if (map->machine == kvm) {
+			list_del(&map->mapping_list);
+			kfree(map);
+		}
+	}
+
+	spin_unlock(&mapping_lock);
+}
+
+static void remove_entry(struct host_map *map)
+{
+	kfree(map);
+}
+
+
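+/*
+ * Split the VMA so that the page at 'addr' gets a VMA of its own,
+ * needed before redirecting its rmap to the remote anon_vma.
+ */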
+static struct vm_area_struct *isolate_page_vma(struct vm_area_struct *vma,
+					       unsigned long addr)
+{
+	int result;
+
+	/* corner case */
+	if (vma_pages(vma) == 1)
+		return vma;
+
+	if (addr != vma->vm_start) {
+		/* first split only if address in the middle */
+		result = split_vma(vma->vm_mm, vma, addr, false);
+		if (IS_ERR_VALUE((long)result))
+			return ERR_PTR((long)result);
+
+		vma = find_vma(vma->vm_mm, addr);
+		if (vma == NULL)
+			return ERR_PTR(-ENOENT);
+
+		/* corner case (again) */
+		if (vma_pages(vma) == 1)
+			return vma;
+	}
+
+	result = split_vma(vma->vm_mm, vma, addr + PAGE_SIZE, true);
+	if (IS_ERR_VALUE((long)result))
+		return ERR_PTR((long)result);
+
+	vma = find_vma(vma->vm_mm, addr);
+	if (vma == NULL)
+		return ERR_PTR(-ENOENT);
+
+	BUG_ON(vma_pages(vma) != 1);
+
+	return vma;
+}
+
+static int redirect_rmap(struct vm_area_struct *req_vma, struct page *req_page,
+			 struct vm_area_struct *map_vma)
+{
+	int result;
+
+	unlink_anon_vmas(map_vma);
+
+	result = anon_vma_fork(map_vma, req_vma);
+	if (IS_ERR_VALUE((long)result))
+		goto out;
+
+	page_dup_rmap(req_page, false);
+
+out:
+	return result;
+}
+
+static int host_map_fix_ptes(struct vm_area_struct *map_vma, hva_t map_hva,
+			     struct page *req_page, struct page *map_page)
+{
+	struct mm_struct *map_mm = map_vma->vm_mm;
+
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	pte_t newpte;
+
+	unsigned long mmun_start;
+	unsigned long mmun_end;
+
+	/* classic replace_page() code */
+	pmd = mm_find_pmd(map_mm, map_hva);
+	if (!pmd)
+		return -EFAULT;
+
+	mmun_start = map_hva;
+	mmun_end = map_hva + PAGE_SIZE;
+	mmu_notifier_invalidate_range_start(map_mm, mmun_start, mmun_end);
+
+	ptep = pte_offset_map_lock(map_mm, pmd, map_hva, &ptl);
+
+	/* create new PTE based on requested page */
+	newpte = mk_pte(req_page, map_vma->vm_page_prot);
+	newpte = pte_set_flags(newpte, pte_flags(*ptep));
+
+	flush_cache_page(map_vma, map_hva, pte_pfn(*ptep));
+	ptep_clear_flush_notify(map_vma, map_hva, ptep);
+	set_pte_at_notify(map_mm, map_hva, ptep, newpte);
+
+	pte_unmap_unlock(ptep, ptl);
+
+	mmu_notifier_invalidate_range_end(map_mm, mmun_start, mmun_end);
+
+	return 0;
+}
+
+static void discard_page(struct page *map_page)
+{
+	lock_page(map_page);
+	/* TODO: should put_anon_vma() be called here? */
+	page_remove_rmap(map_page, false);
+	if (!page_mapped(map_page))
+		try_to_free_swap(map_page);
+	unlock_page(map_page);
+	put_page(map_page);
+}
+
+static void kvmi_split_huge_pmd(struct vm_area_struct *req_vma,
+				hva_t req_hva, struct page *req_page)
+{
+	bool tail = false;
+
+	/* move reference count from compound head... */
+	if (PageTail(req_page)) {
+		tail = true;
+		put_page(req_page);
+	}
+
+	if (PageCompound(req_page))
+		split_huge_pmd_address(req_vma, req_hva, false, NULL);
+
+	/* ... to the actual page, after splitting */
+	if (tail)
+		get_page(req_page);
+}
+
+static int kvmi_map_action(struct mm_struct *req_mm, hva_t req_hva,
+			   struct mm_struct *map_mm, hva_t map_hva)
+{
+	struct vm_area_struct *req_vma;
+	struct page *req_page = NULL;
+
+	struct vm_area_struct *map_vma;
+	struct page *map_page;
+
+	long nrpages;
+	int result = 0;
+
+	/* VMAs will be modified */
+	down_write(&req_mm->mmap_sem);
+	down_write(&map_mm->mmap_sem);
+
+	/* get host page corresponding to requested address */
+	nrpages = get_user_pages_remote(NULL, req_mm,
+		req_hva, 1, 0,
+		&req_page, &req_vma, NULL);
+	if (nrpages == 0) {
+		kvm_err("kvmi: no page for req_hva %016lx\n", req_hva);
+		result = -ENOENT;
+		goto out_err;
+	} else if (IS_ERR_VALUE(nrpages)) {
+		result = nrpages;
+		kvm_err("kvmi: get_user_pages_remote() failed with result %d\n",
+			result);
+		goto out_err;
+	}
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(req_page, "req_page before remap");
+
+	/* find (not get) local page corresponding to target address */
+	map_vma = find_vma(map_mm, map_hva);
+	if (map_vma == NULL) {
+		kvm_err("kvmi: no local VMA found for remapping\n");
+		result = -ENOENT;
+		goto out_err;
+	}
+
+	map_page = follow_page(map_vma, map_hva, 0);
+	if (IS_ERR(map_page)) {
+		result = PTR_ERR(map_page);
+		kvm_debug("kvmi: follow_page() failed with result %d\n",
+			result);
+		goto out_err;
+	} else if (map_page == NULL) {
+		result = -ENOENT;
+		kvm_debug("kvmi: follow_page() returned no page\n");
+		goto out_err;
+	}
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(map_page, "map_page before remap");
+
+	/* split local VMA for rmap redirecting */
+	map_vma = isolate_page_vma(map_vma, map_hva);
+	if (IS_ERR(map_vma)) {
+		result = PTR_ERR(map_vma);
+		kvm_debug("kvmi: isolate_page_vma() failed with result %d\n",
+			result);
+		goto out_err;
+	}
+
+	/* split remote huge page */
+	kvmi_split_huge_pmd(req_vma, req_hva, req_page);
+
+	/* re-link VMAs */
+	result = redirect_rmap(req_vma, req_page, map_vma);
+	if (IS_ERR_VALUE((long)result))
+		goto out_err;
+
+	/* also redirect page tables */
+	result = host_map_fix_ptes(map_vma, map_hva, req_page, map_page);
+	if (IS_ERR_VALUE((long)result))
+		goto out_err;
+
+	/* the old page will be discarded */
+	discard_page(map_page);
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(map_page, "map_page after being discarded");
+
+	/* done */
+	goto out_finalize;
+
+out_err:
+	/* get_user_pages_remote() incremented page reference count */
+	if (req_page != NULL)
+		put_page(req_page);
+
+out_finalize:
+	/* release semaphores in reverse order */
+	up_write(&map_mm->mmap_sem);
+	up_write(&req_mm->mmap_sem);
+
+	return result;
+}
+
+int kvmi_host_mem_map(struct kvm_vcpu *vcpu, gva_t tkn_gva,
+	gpa_t req_gpa, gpa_t map_gpa)
+{
+	int result = 0;
+	struct kvm *target_kvm;
+
+	gfn_t req_gfn;
+	hva_t req_hva;
+	struct mm_struct *req_mm;
+
+	gfn_t map_gfn;
+	hva_t map_hva;
+	struct mm_struct *map_mm = vcpu->kvm->mm;
+
+	kvm_debug("kvmi: mapping request req_gpa %016llx, map_gpa %016llx\n",
+		  req_gpa, map_gpa);
+
+	/* get the struct kvm * corresponding to the token */
+	target_kvm = find_machine_at(vcpu, tkn_gva);
+	if (IS_ERR(target_kvm))
+		return PTR_ERR(target_kvm);
+	else if (target_kvm == NULL) {
+		kvm_err("kvmi: unable to find target machine\n");
+		return -ENOENT;
+	}
+	kvm_get_kvm(target_kvm);
+	req_mm = target_kvm->mm;
+
+	/* translate source addresses */
+	req_gfn = gpa_to_gfn(req_gpa);
+	req_hva = gfn_to_hva_safe(target_kvm, req_gfn);
+	if (kvm_is_error_hva(req_hva)) {
+		kvm_err("kvmi: invalid req HVA %016lx\n", req_hva);
+		result = -EFAULT;
+		goto out;
+	}
+
+	kvm_debug("kvmi: req_gpa %016llx, req_gfn %016llx, req_hva %016lx\n",
+		  req_gpa, req_gfn, req_hva);
+
+	/* translate destination addresses */
+	map_gfn = gpa_to_gfn(map_gpa);
+	map_hva = gfn_to_hva_safe(vcpu->kvm, map_gfn);
+	if (kvm_is_error_hva(map_hva)) {
+		kvm_err("kvmi: invalid map HVA %016lx\n", map_hva);
+		result = -EFAULT;
+		goto out;
+	}
+
+	kvm_debug("kvmi: map_gpa %016llx, map_gfn %016llx, map_hva %016lx\n",
+		map_gpa, map_gfn, map_hva);
+
+	/* go to step 2 */
+	result = kvmi_map_action(req_mm, req_hva, map_mm, map_hva);
+	if (IS_ERR_VALUE((long)result))
+		goto out;
+
+	/* add mapping to list */
+	result = add_to_list(map_gpa, target_kvm, req_gpa);
+	if (IS_ERR_VALUE((long)result))
+		goto out;
+
+	/* all fine */
+	kvm_debug("kvmi: mapping of req_gpa %016llx successful\n", req_gpa);
+
+out:
+	/* drop the reference taken with kvm_get_kvm() */
+	kvm_put_kvm(target_kvm);
+
+	return result;
+}
+
+
+static int restore_rmap(struct vm_area_struct *map_vma, hva_t map_hva,
+			struct page *req_page, struct page *new_page)
+{
+	int result;
+
+	/* decouple links to anon_vmas */
+	unlink_anon_vmas(map_vma);
+	map_vma->anon_vma = NULL;
+
+	/* allocate new anon_vma */
+	result = anon_vma_prepare(map_vma);
+	if (IS_ERR_VALUE((long)result))
+		return result;
+
+	lock_page(new_page);
+	page_add_new_anon_rmap(new_page, map_vma, map_hva, false);
+	unlock_page(new_page);
+
+	/* decrease req_page mapcount */
+	atomic_dec(&req_page->_mapcount);
+
+	return 0;
+}
+
+static int host_unmap_fix_ptes(struct vm_area_struct *map_vma, hva_t map_hva,
+			       struct page *new_page)
+{
+	struct mm_struct *map_mm = map_vma->vm_mm;
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	pte_t newpte;
+
+	unsigned long mmun_start;
+	unsigned long mmun_end;
+
+	/* page replacing code */
+	pmd = mm_find_pmd(map_mm, map_hva);
+	if (!pmd)
+		return -EFAULT;
+
+	mmun_start = map_hva;
+	mmun_end = map_hva + PAGE_SIZE;
+	mmu_notifier_invalidate_range_start(map_mm, mmun_start, mmun_end);
+
+	ptep = pte_offset_map_lock(map_mm, pmd, map_hva, &ptl);
+
+	newpte = mk_pte(new_page, map_vma->vm_page_prot);
+	newpte = pte_set_flags(newpte, pte_flags(*ptep));
+
+	/* clear cache & MMU notifier entries */
+	flush_cache_page(map_vma, map_hva, pte_pfn(*ptep));
+	ptep_clear_flush_notify(map_vma, map_hva, ptep);
+	set_pte_at_notify(map_mm, map_hva, ptep, newpte);
+
+	pte_unmap_unlock(ptep, ptl);
+
+	mmu_notifier_invalidate_range_end(map_mm, mmun_start, mmun_end);
+
+	return 0;
+}
+
+static int kvmi_unmap_action(struct mm_struct *req_mm,
+			     struct mm_struct *map_mm, hva_t map_hva)
+{
+	struct vm_area_struct *map_vma;
+	struct page *req_page = NULL;
+	struct page *new_page = NULL;
+
+	int result;
+
+	/* VMAs will be modified */
+	down_write(&req_mm->mmap_sem);
+	down_write(&map_mm->mmap_sem);
+
+	/* find destination VMA for mapping */
+	map_vma = find_vma(map_mm, map_hva);
+	if (map_vma == NULL) {
+		result = -ENOENT;
+		kvm_err("kvmi: no local VMA found for unmapping\n");
+		goto out_err;
+	}
+
+	/* find (not get) page mapped to destination address */
+	req_page = follow_page(map_vma, map_hva, 0);
+	if (IS_ERR(req_page)) {
+		result = PTR_ERR(req_page);
+		kvm_err("kvmi: follow_page() failed with result %d\n", result);
+		goto out_err;
+	} else if (req_page == NULL) {
+		result = -ENOENT;
+		kvm_err("kvmi: follow_page() returned no page\n");
+		goto out_err;
+	}
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(req_page, "req_page before decoupling");
+
+	/* Returns NULL when no page can be allocated. */
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, map_vma, map_hva);
+	if (new_page == NULL) {
+		result = -ENOMEM;
+		goto out_err;
+	}
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(new_page, "new_page after allocation");
+
+	/* should fix the rmap tree */
+	result = restore_rmap(map_vma, map_hva, req_page, new_page);
+	if (IS_ERR_VALUE((long)result))
+		goto out_err;
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(req_page, "req_page after decoupling");
+
+	/* page table fixing here */
+	result = host_unmap_fix_ptes(map_vma, map_hva, new_page);
+	if (IS_ERR_VALUE((long)result))
+		goto out_err;
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(new_page, "new_page after unmapping");
+
+	goto out_finalize;
+
+out_err:
+	if (new_page != NULL)
+		put_page(new_page);
+
+out_finalize:
+	/* drop the reference taken by get_user_pages_remote() at map time */
+	if (req_page != NULL) {
+		put_page(req_page);
+
+		if (IS_ENABLED(CONFIG_DEBUG_VM))
+			dump_page(req_page, "req_page after release");
+	}
+
+	/* release semaphores in reverse order */
+	up_write(&map_mm->mmap_sem);
+	up_write(&req_mm->mmap_sem);
+
+	return result;
+}
+
+int kvmi_host_mem_unmap(struct kvm_vcpu *vcpu, gpa_t map_gpa)
+{
+	struct kvm *target_kvm;
+	struct mm_struct *req_mm;
+
+	struct host_map *map;
+	int result;
+
+	gfn_t map_gfn;
+	hva_t map_hva;
+	struct mm_struct *map_mm = vcpu->kvm->mm;
+
+	kvm_debug("kvmi: unmap request for map_gpa %016llx\n", map_gpa);
+
+	/* get the struct kvm * corresponding to map_gpa */
+	map = extract_from_list(map_gpa);
+	if (map == NULL) {
+		kvm_err("kvmi: map_gpa %016llx not mapped\n", map_gpa);
+		return -ENOENT;
+	}
+	target_kvm = map->machine;
+	kvm_get_kvm(target_kvm);
+	req_mm = target_kvm->mm;
+
+	kvm_debug("kvmi: req_gpa %016llx of machine %016lx mapped in map_gpa %016llx\n",
+		  map->req_gpa, (unsigned long) map->machine, map->map_gpa);
+
+	/* address where we did the remapping */
+	map_gfn = gpa_to_gfn(map_gpa);
+	map_hva = gfn_to_hva_safe(vcpu->kvm, map_gfn);
+	if (kvm_is_error_hva(map_hva)) {
+		result = -EFAULT;
+		kvm_err("kvmi: invalid HVA %016lx\n", map_hva);
+		goto out;
+	}
+
+	kvm_debug("kvmi: map_gpa %016llx, map_gfn %016llx, map_hva %016lx\n",
+		  map_gpa, map_gfn, map_hva);
+
+	/* go to step 2 */
+	result = kvmi_unmap_action(req_mm, map_mm, map_hva);
+	if (IS_ERR_VALUE((long)result))
+		goto out;
+
+	kvm_debug("kvmi: unmap of map_gpa %016llx successful\n", map_gpa);
+
+out:
+	kvm_put_kvm(target_kvm);
+
+	/* remove entry whatever happens above */
+	remove_entry(map);
+
+	return result;
+}
+
+void kvmi_mem_destroy_vm(struct kvm *kvm)
+{
+	kvm_debug("kvmi: machine %016lx was torn down\n",
+		(unsigned long) kvm);
+
+	remove_vm_from_list(kvm);
+	remove_vm_token(kvm);
+}
+
+
+int kvm_intro_host_init(void)
+{
+	/* token database */
+	INIT_LIST_HEAD(&token_list);
+	spin_lock_init(&token_lock);
+
+	/* mapping database */
+	INIT_LIST_HEAD(&mapping_list);
+	spin_lock_init(&mapping_lock);
+
+	kvm_info("kvmi: initialized host memory introspection\n");
+
+	return 0;
+}
+
+void kvm_intro_host_exit(void)
+{
+	/* nothing to do yet */
+}
+
+module_init(kvm_intro_host_init)
+module_exit(kvm_intro_host_exit)
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
new file mode 100644
index 000000000000..b1b20eb6332d
--- /dev/null
+++ b/virt/kvm/kvmi_msg.c
@@ -0,0 +1,1134 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection
+ *
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ */
+#include <linux/file.h>
+#include <linux/net.h>
+#include <linux/kvm_host.h>
+#include <linux/kvmi.h>
+#include <asm/virtext.h>
+
+#include <uapi/linux/kvmi.h>
+#include <uapi/asm/kvmi.h>
+
+#include "kvmi_int.h"
+
+#include <trace/events/kvmi.h>
+
+/*
+ * TODO: break these call paths
+ *   kvmi.c        work_cb
+ *   kvmi_msg.c    kvmi_dispatch_message
+ *   kvmi.c        kvmi_cmd_... / kvmi_make_request
+ *   kvmi_msg.c    kvmi_msg_reply
+ *
+ *   kvmi.c        kvmi_X_event
+ *   kvmi_msg.c    kvmi_send_event
+ *   kvmi.c        kvmi_handle_request
+ */
+
+/* TODO: move some of the code to arch/x86 */
+
+static atomic_t seq_ev = ATOMIC_INIT(0);
+
+static u32 new_seq(void)
+{
+	return atomic_inc_return(&seq_ev);
+}
+
+static const char *event_str(unsigned int e)
+{
+	switch (e) {
+	case KVMI_EVENT_CR:
+		return "CR";
+	case KVMI_EVENT_MSR:
+		return "MSR";
+	case KVMI_EVENT_XSETBV:
+		return "XSETBV";
+	case KVMI_EVENT_BREAKPOINT:
+		return "BREAKPOINT";
+	case KVMI_EVENT_HYPERCALL:
+		return "HYPERCALL";
+	case KVMI_EVENT_PAGE_FAULT:
+		return "PAGE_FAULT";
+	case KVMI_EVENT_TRAP:
+		return "TRAP";
+	case KVMI_EVENT_DESCRIPTOR:
+		return "DESCRIPTOR";
+	case KVMI_EVENT_CREATE_VCPU:
+		return "CREATE_VCPU";
+	case KVMI_EVENT_PAUSE_VCPU:
+		return "PAUSE_VCPU";
+	default:
+		return "EVENT?";
+	}
+}
+
+static const char * const msg_IDs[] = {
+	[KVMI_GET_VERSION]      = "KVMI_GET_VERSION",
+	[KVMI_GET_GUEST_INFO]   = "KVMI_GET_GUEST_INFO",
+	[KVMI_PAUSE_VCPU]       = "KVMI_PAUSE_VCPU",
+	[KVMI_GET_REGISTERS]    = "KVMI_GET_REGISTERS",
+	[KVMI_SET_REGISTERS]    = "KVMI_SET_REGISTERS",
+	[KVMI_GET_PAGE_ACCESS]  = "KVMI_GET_PAGE_ACCESS",
+	[KVMI_SET_PAGE_ACCESS]  = "KVMI_SET_PAGE_ACCESS",
+	[KVMI_INJECT_EXCEPTION] = "KVMI_INJECT_EXCEPTION",
+	[KVMI_READ_PHYSICAL]    = "KVMI_READ_PHYSICAL",
+	[KVMI_WRITE_PHYSICAL]   = "KVMI_WRITE_PHYSICAL",
+	[KVMI_GET_MAP_TOKEN]    = "KVMI_GET_MAP_TOKEN",
+	[KVMI_CONTROL_EVENTS]   = "KVMI_CONTROL_EVENTS",
+	[KVMI_CONTROL_CR]       = "KVMI_CONTROL_CR",
+	[KVMI_CONTROL_MSR]      = "KVMI_CONTROL_MSR",
+	[KVMI_EVENT]            = "KVMI_EVENT",
+	[KVMI_EVENT_REPLY]      = "KVMI_EVENT_REPLY",
+	[KVMI_GET_CPUID]        = "KVMI_GET_CPUID",
+	[KVMI_GET_XSAVE]        = "KVMI_GET_XSAVE",
+};
+
+static size_t sizeof_get_registers(const void *r)
+{
+	const struct kvmi_get_registers *req = r;
+
+	return sizeof(*req) + sizeof(req->msrs_idx[0]) * req->nmsrs;
+}
+
+static size_t sizeof_get_page_access(const void *r)
+{
+	const struct kvmi_get_page_access *req = r;
+
+	return sizeof(*req) + sizeof(req->gpa[0]) * req->count;
+}
+
+static size_t sizeof_set_page_access(const void *r)
+{
+	const struct kvmi_set_page_access *req = r;
+
+	return sizeof(*req) + sizeof(req->entries[0]) * req->count;
+}
+
+static size_t sizeof_write_physical(const void *r)
+{
+	const struct kvmi_write_physical *req = r;
+
+	return sizeof(*req) + req->size;
+}
+
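+/*
+ * The minimum size of each command message. Variable-sized commands also
+ * provide a callback that computes the full size from the fixed part.
+ */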
+static const struct {
+	size_t size;
+	size_t (*cbk_full_size)(const void *msg);
+} msg_bytes[] = {
+	[KVMI_GET_VERSION]      = { 0, NULL },
+	[KVMI_GET_GUEST_INFO]   = { sizeof(struct kvmi_get_guest_info), NULL },
+	[KVMI_PAUSE_VCPU]       = { sizeof(struct kvmi_pause_vcpu), NULL },
+	[KVMI_GET_REGISTERS]    = { sizeof(struct kvmi_get_registers),
+						sizeof_get_registers },
+	[KVMI_SET_REGISTERS]    = { sizeof(struct kvmi_set_registers), NULL },
+	[KVMI_GET_PAGE_ACCESS]  = { sizeof(struct kvmi_get_page_access),
+						sizeof_get_page_access },
+	[KVMI_SET_PAGE_ACCESS]  = { sizeof(struct kvmi_set_page_access),
+						sizeof_set_page_access },
+	[KVMI_INJECT_EXCEPTION] = { sizeof(struct kvmi_inject_exception),
+					NULL },
+	[KVMI_READ_PHYSICAL]    = { sizeof(struct kvmi_read_physical), NULL },
+	[KVMI_WRITE_PHYSICAL]   = { sizeof(struct kvmi_write_physical),
+						sizeof_write_physical },
+	[KVMI_GET_MAP_TOKEN]    = { 0, NULL },
+	[KVMI_CONTROL_EVENTS]   = { sizeof(struct kvmi_control_events), NULL },
+	[KVMI_CONTROL_CR]       = { sizeof(struct kvmi_control_cr), NULL },
+	[KVMI_CONTROL_MSR]      = { sizeof(struct kvmi_control_msr), NULL },
+	[KVMI_GET_CPUID]        = { sizeof(struct kvmi_get_cpuid), NULL },
+	[KVMI_GET_XSAVE]        = { sizeof(struct kvmi_get_xsave), NULL },
+};
+
+static int kvmi_sock_read(struct kvmi *ikvm, void *buf, size_t size)
+{
+	struct kvec i = {
+		.iov_base = buf,
+		.iov_len = size,
+	};
+	struct msghdr m = { };
+	int rc;
+
+	read_lock(&ikvm->sock_lock);
+
+	if (likely(ikvm->sock))
+		rc = kernel_recvmsg(ikvm->sock, &m, &i, 1, size, MSG_WAITALL);
+	else
+		rc = -EPIPE;
+
+	if (rc > 0)
+		print_hex_dump_debug("read: ", DUMP_PREFIX_NONE, 32, 1,
+					buf, rc, false);
+
+	read_unlock(&ikvm->sock_lock);
+
+	if (unlikely(rc != size)) {
+		kvm_err("kernel_recvmsg: %d\n", rc);
+		if (rc >= 0)
+			rc = -EPIPE;
+		return rc;
+	}
+
+	return 0;
+}
+
+static int kvmi_sock_write(struct kvmi *ikvm, struct kvec *i, size_t n,
+			   size_t size)
+{
+	struct msghdr m = { };
+	int rc, k;
+
+	read_lock(&ikvm->sock_lock);
+
+	if (likely(ikvm->sock))
+		rc = kernel_sendmsg(ikvm->sock, &m, i, n, size);
+	else
+		rc = -EPIPE;
+
+	for (k = 0; k < n; k++)
+		print_hex_dump_debug("write: ", DUMP_PREFIX_NONE, 32, 1,
+				     i[k].iov_base, i[k].iov_len, false);
+
+	read_unlock(&ikvm->sock_lock);
+
+	if (unlikely(rc != size)) {
+		kvm_err("kernel_sendmsg: %d\n", rc);
+		if (rc >= 0)
+			rc = -EPIPE;
+		return rc;
+	}
+
+	return 0;
+}
+
+static const char *id2str(int i)
+{
+	return (i < ARRAY_SIZE(msg_IDs) && msg_IDs[i] ? msg_IDs[i] : "unknown");
+}
+
+static struct kvmi_vcpu *kvmi_vcpu_waiting_for_reply(struct kvm *kvm, u32 seq)
+{
+	struct kvmi_vcpu *found = NULL;
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	mutex_lock(&kvm->lock);
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		/* kvmi_send_event */
+		smp_rmb();
+		if (READ_ONCE(IVCPU(vcpu)->ev_rpl_waiting)
+		    && seq == IVCPU(vcpu)->ev_seq) {
+			found = IVCPU(vcpu);
+			break;
+		}
+	}
+
+	mutex_unlock(&kvm->lock);
+
+	return found;
+}
+
+static bool kvmi_msg_dispatch_reply(struct kvmi *ikvm,
+				    const struct kvmi_msg_hdr *msg)
+{
+	struct kvmi_vcpu *ivcpu;
+	int err;
+
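+	/*
+	 * Match the reply with the vCPU waiting for it, using the sequence
+	 * number saved when the event was sent.
+	 */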
+	ivcpu = kvmi_vcpu_waiting_for_reply(ikvm->kvm, msg->seq);
+	if (!ivcpu) {
+		kvm_err("%s: unexpected event reply (seq=%u)\n", __func__,
+			msg->seq);
+		return false;
+	}
+
+	if (msg->size == sizeof(ivcpu->ev_rpl) + ivcpu->ev_rpl_size) {
+		err = kvmi_sock_read(ikvm, &ivcpu->ev_rpl,
+					sizeof(ivcpu->ev_rpl));
+		if (!err && ivcpu->ev_rpl_size)
+			err = kvmi_sock_read(ikvm, ivcpu->ev_rpl_ptr,
+						ivcpu->ev_rpl_size);
+	} else {
+		kvm_err("%s: invalid event reply size (max=%zu, recv=%u, expected=%zu)\n",
+			__func__, ivcpu->ev_rpl_size, msg->size,
+			sizeof(ivcpu->ev_rpl) + ivcpu->ev_rpl_size);
+		err = -1;
+	}
+
+	ivcpu->ev_rpl_received = err ? -1 : ivcpu->ev_rpl_size;
+
+	kvmi_make_request(ivcpu, REQ_REPLY);
+
+	return (err == 0);
+}
+
+static bool consume_sock_bytes(struct kvmi *ikvm, size_t n)
+{
+	while (n) {
+		u8 buf[256];
+		size_t chunk = min(n, sizeof(buf));
+
+		if (kvmi_sock_read(ikvm, buf, chunk) != 0)
+			return false;
+
+		n -= chunk;
+	}
+
+	return true;
+}
+
+static int kvmi_msg_reply(struct kvmi *ikvm,
+			  const struct kvmi_msg_hdr *msg,
+			  int err, const void *rpl, size_t rpl_size)
+{
+	struct kvmi_error_code ec;
+	struct kvmi_msg_hdr h;
+	struct kvec vec[3] = {
+		{.iov_base = &h,           .iov_len = sizeof(h) },
+		{.iov_base = &ec,          .iov_len = sizeof(ec)},
+		{.iov_base = (void *) rpl, .iov_len = rpl_size  },
+	};
+	size_t size = sizeof(h) + sizeof(ec) + (err ? 0 : rpl_size);
+	size_t n = err ? ARRAY_SIZE(vec)-1 : ARRAY_SIZE(vec);
+
+	memset(&h, 0, sizeof(h));
+	h.id = msg->id;
+	h.seq = msg->seq;
+	h.size = size - sizeof(h);
+
+	memset(&ec, 0, sizeof(ec));
+	ec.err = err;
+
+	return kvmi_sock_write(ikvm, vec, n, size);
+}
+
+static int kvmi_msg_vcpu_reply(struct kvm_vcpu *vcpu,
+				const struct kvmi_msg_hdr *msg,
+				int err, const void *rpl, size_t size)
+{
+	/*
+	 * As soon as we reply to this vCPU command, we can get another one,
+	 * and we must signal that the incoming buffer (ivcpu->msg_buf)
+	 * is ready by clearing this bit/request.
+	 */
+	kvmi_clear_request(IVCPU(vcpu), REQ_CMD);
+
+	return kvmi_msg_reply(IKVM(vcpu->kvm), msg, err, rpl, size);
+}
+
+bool kvmi_msg_init(struct kvmi *ikvm, int fd)
+{
+	struct socket *sock;
+	int r;
+
+	sock = sockfd_lookup(fd, &r);
+
+	if (!sock) {
+		kvm_err("Invalid file handle: %d\n", fd);
+		return false;
+	}
+
+	WRITE_ONCE(ikvm->sock, sock);
+
+	return true;
+}
+
+void kvmi_msg_uninit(struct kvmi *ikvm)
+{
+	kvm_info("Wake up the receiving thread\n");
+
+	read_lock(&ikvm->sock_lock);
+
+	if (ikvm->sock)
+		kernel_sock_shutdown(ikvm->sock, SHUT_RDWR);
+
+	read_unlock(&ikvm->sock_lock);
+
+	kvm_info("Wait for the receiving thread to complete\n");
+	wait_for_completion(&ikvm->finished);
+}
+
+static int handle_get_version(struct kvmi *ikvm,
+			      const struct kvmi_msg_hdr *msg, const void *req)
+{
+	struct kvmi_get_version_reply rpl;
+
+	memset(&rpl, 0, sizeof(rpl));
+	rpl.version = KVMI_VERSION;
+
+	return kvmi_msg_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
+}
+
+static struct kvm_vcpu *kvmi_get_vcpu(struct kvmi *ikvm, int vcpu_id)
+{
+	struct kvm *kvm = ikvm->kvm;
+
+	if (vcpu_id >= atomic_read(&kvm->online_vcpus))
+		return NULL;
+
+	return kvm_get_vcpu(kvm, vcpu_id);
+}
+
+static bool invalid_page_access(u64 gpa, u64 size)
+{
+	u64 off = gpa & ~PAGE_MASK;
+
+	return (size == 0 || size > PAGE_SIZE || off + size > PAGE_SIZE);
+}
+
+static int handle_read_physical(struct kvmi *ikvm,
+				const struct kvmi_msg_hdr *msg,
+				const void *_req)
+{
+	const struct kvmi_read_physical *req = _req;
+
+	if (invalid_page_access(req->gpa, req->size))
+		return -EINVAL;
+
+	return kvmi_cmd_read_physical(ikvm->kvm, req->gpa, req->size,
+				      kvmi_msg_reply, msg);
+}
+
+static int handle_write_physical(struct kvmi *ikvm,
+				 const struct kvmi_msg_hdr *msg,
+				 const void *_req)
+{
+	const struct kvmi_write_physical *req = _req;
+	int ec;
+
+	if (invalid_page_access(req->gpa, req->size))
+		return -EINVAL;
+
+	ec = kvmi_cmd_write_physical(ikvm->kvm, req->gpa, req->size, req->data);
+
+	return kvmi_msg_reply(ikvm, msg, ec, NULL, 0);
+}
+
+static int handle_get_map_token(struct kvmi *ikvm,
+				const struct kvmi_msg_hdr *msg,
+				const void *_req)
+{
+	struct kvmi_get_map_token_reply rpl;
+	int ec;
+
+	ec = kvmi_cmd_alloc_token(ikvm->kvm, &rpl.token);
+
+	return kvmi_msg_reply(ikvm, msg, ec, &rpl, sizeof(rpl));
+}
+
+static int handle_control_cr(struct kvmi *ikvm,
+			     const struct kvmi_msg_hdr *msg, const void *_req)
+{
+	const struct kvmi_control_cr *req = _req;
+	int ec;
+
+	ec = kvmi_cmd_control_cr(ikvm, req->enable, req->cr);
+
+	return kvmi_msg_reply(ikvm, msg, ec, NULL, 0);
+}
+
+static int handle_control_msr(struct kvmi *ikvm,
+			      const struct kvmi_msg_hdr *msg, const void *_req)
+{
+	const struct kvmi_control_msr *req = _req;
+	int ec;
+
+	ec = kvmi_cmd_control_msr(ikvm->kvm, req->enable, req->msr);
+
+	return kvmi_msg_reply(ikvm, msg, ec, NULL, 0);
+}
+
+/*
+ * These commands are executed on the receiving thread/worker.
+ */
+static int (*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
+			     const void *) = {
+	[KVMI_GET_VERSION]    = handle_get_version,
+	[KVMI_READ_PHYSICAL]  = handle_read_physical,
+	[KVMI_WRITE_PHYSICAL] = handle_write_physical,
+	[KVMI_GET_MAP_TOKEN]  = handle_get_map_token,
+	[KVMI_CONTROL_CR]     = handle_control_cr,
+	[KVMI_CONTROL_MSR]    = handle_control_msr,
+};
+
+static int handle_get_guest_info(struct kvm_vcpu *vcpu,
+				 const struct kvmi_msg_hdr *msg,
+				 const void *req)
+{
+	struct kvmi_get_guest_info_reply rpl;
+
+	memset(&rpl, 0, sizeof(rpl));
+	kvmi_cmd_get_guest_info(vcpu, &rpl.vcpu_count, &rpl.tsc_speed);
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, 0, &rpl, sizeof(rpl));
+}
+
+static int handle_pause_vcpu(struct kvm_vcpu *vcpu,
+			     const struct kvmi_msg_hdr *msg,
+			     const void *req)
+{
+	int ec = kvmi_cmd_pause_vcpu(vcpu);
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
+}
+
+static void *alloc_get_registers_reply(const struct kvmi_msg_hdr *msg,
+				       const struct kvmi_get_registers *req,
+				       size_t *rpl_size)
+{
+	struct kvmi_get_registers_reply *rpl;
+	u16 k, n = req->nmsrs;
+
+	*rpl_size = sizeof(*rpl) + sizeof(rpl->msrs.entries[0]) * n;
+
+	rpl = kzalloc(*rpl_size, GFP_KERNEL);
+
+	if (rpl) {
+		rpl->msrs.nmsrs = n;
+
+		for (k = 0; k < n; k++)
+			rpl->msrs.entries[k].index = req->msrs_idx[k];
+	}
+
+	return rpl;
+}
+
+static int handle_get_registers(struct kvm_vcpu *vcpu,
+				const struct kvmi_msg_hdr *msg, const void *req)
+{
+	struct kvmi_get_registers_reply *rpl;
+	size_t rpl_size = 0;
+	int err, ec;
+
+	rpl = alloc_get_registers_reply(msg, req, &rpl_size);
+
+	if (!rpl)
+		ec = -KVM_ENOMEM;
+	else
+		ec = kvmi_cmd_get_registers(vcpu, &rpl->mode,
+						&rpl->regs, &rpl->sregs,
+						&rpl->msrs);
+
+	err = kvmi_msg_vcpu_reply(vcpu, msg, ec, rpl, rpl_size);
+	kfree(rpl);
+	return err;
+}
+
+static int handle_set_registers(struct kvm_vcpu *vcpu,
+				const struct kvmi_msg_hdr *msg,
+				const void *_req)
+{
+	const struct kvmi_set_registers *req = _req;
+	int ec;
+
+	ec = kvmi_cmd_set_registers(vcpu, &req->regs);
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
+}
+
+static int handle_get_page_access(struct kvm_vcpu *vcpu,
+				  const struct kvmi_msg_hdr *msg,
+				  const void *_req)
+{
+	const struct kvmi_get_page_access *req = _req;
+	struct kvmi_get_page_access_reply *rpl = NULL;
+	size_t rpl_size = 0;
+	u16 k, n = req->count;
+	int err, ec = 0;
+
+	if (req->view != 0 && !kvm_eptp_switching_supported) {
+		ec = -KVM_ENOSYS;
+		goto out;
+	}
+
+	if (req->view != 0) { /* TODO */
+		ec = -KVM_EINVAL;
+		goto out;
+	}
+
+	rpl_size = sizeof(*rpl) + sizeof(rpl->access[0]) * n;
+	rpl = kzalloc(rpl_size, GFP_KERNEL);
+
+	if (!rpl) {
+		ec = -KVM_ENOMEM;
+		goto out;
+	}
+
+	for (k = 0; k < n && ec == 0; k++)
+		ec = kvmi_cmd_get_page_access(vcpu, req->gpa[k],
+						&rpl->access[k]);
+
+out:
+	err = kvmi_msg_vcpu_reply(vcpu, msg, ec, rpl, rpl_size);
+	kfree(rpl);
+	return err;
+}
+
+static int handle_set_page_access(struct kvm_vcpu *vcpu,
+				  const struct kvmi_msg_hdr *msg,
+				  const void *_req)
+{
+	const struct kvmi_set_page_access *req = _req;
+	struct kvm *kvm = vcpu->kvm;
+	u16 k, n = req->count;
+	int ec = 0;
+
+	if (req->view != 0) {
+		if (!kvm_eptp_switching_supported)
+			ec = -KVM_ENOSYS;
+		else
+			ec = -KVM_EINVAL; /* TODO */
+	} else {
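+		/* apply every entry, but report only the first error */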
+		for (k = 0; k < n; k++) {
+			u64 gpa   = req->entries[k].gpa;
+			u8 access = req->entries[k].access;
+			int ec0;
+
+			if (access &  ~(KVMI_PAGE_ACCESS_R |
+					KVMI_PAGE_ACCESS_W |
+					KVMI_PAGE_ACCESS_X))
+				ec0 = -KVM_EINVAL;
+			else
+				ec0 = kvmi_set_mem_access(kvm, gpa, access);
+
+			if (ec0 && !ec)
+				ec = ec0;
+
+			trace_kvmi_set_mem_access(gpa_to_gfn(gpa), access, ec0);
+		}
+	}
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
+}
+
+static int handle_inject_exception(struct kvm_vcpu *vcpu,
+				   const struct kvmi_msg_hdr *msg,
+				   const void *_req)
+{
+	const struct kvmi_inject_exception *req = _req;
+	int ec;
+
+	ec = kvmi_cmd_inject_exception(vcpu, req->nr, req->has_error,
+				       req->error_code, req->address);
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
+}
+
+static int handle_control_events(struct kvm_vcpu *vcpu,
+				 const struct kvmi_msg_hdr *msg,
+				 const void *_req)
+{
+	const struct kvmi_control_events *req = _req;
+	u32 not_allowed = ~IKVM(vcpu->kvm)->event_allow_mask;
+	u32 unknown = ~KVMI_KNOWN_EVENTS;
+	int ec;
+
+	if (req->events & unknown)
+		ec = -KVM_EINVAL;
+	else if (req->events & not_allowed)
+		ec = -KVM_EPERM;
+	else
+		ec = kvmi_cmd_control_events(vcpu, req->events);
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
+}
+
+static int handle_get_cpuid(struct kvm_vcpu *vcpu,
+			    const struct kvmi_msg_hdr *msg,
+			    const void *_req)
+{
+	const struct kvmi_get_cpuid *req = _req;
+	struct kvmi_get_cpuid_reply rpl;
+	int ec;
+
+	memset(&rpl, 0, sizeof(rpl));
+
+	ec = kvmi_cmd_get_cpuid(vcpu, req->function, req->index,
+					&rpl.eax, &rpl.ebx, &rpl.ecx,
+					&rpl.edx);
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, ec, &rpl, sizeof(rpl));
+}
+
+static int handle_get_xsave(struct kvm_vcpu *vcpu,
+			    const struct kvmi_msg_hdr *msg, const void *req)
+{
+	struct kvmi_get_xsave_reply *rpl;
+	size_t rpl_size = sizeof(*rpl) + sizeof(struct kvm_xsave);
+	int ec = 0, err;
+
+	rpl = kzalloc(rpl_size, GFP_KERNEL);
+
+	if (!rpl)
+		ec = -KVM_ENOMEM;
+	else {
+		struct kvm_xsave *area;
+
+		area = (struct kvm_xsave *)&rpl->region[0];
+		kvm_vcpu_ioctl_x86_get_xsave(vcpu, area);
+	}
+
+	err = kvmi_msg_vcpu_reply(vcpu, msg, ec, rpl, rpl_size);
+	kfree(rpl);
+	return err;
+}
+
+/*
+ * These commands are executed on the vCPU thread. The receiving thread
+ * saves the command into kvmi_vcpu.msg_buf[] and signals the vCPU to handle
+ * the command (including sending back the reply).
+ */
+static int (*const msg_vcpu[])(struct kvm_vcpu *,
+			       const struct kvmi_msg_hdr *, const void *) = {
+	[KVMI_GET_GUEST_INFO]   = handle_get_guest_info,
+	[KVMI_PAUSE_VCPU]       = handle_pause_vcpu,
+	[KVMI_GET_REGISTERS]    = handle_get_registers,
+	[KVMI_SET_REGISTERS]    = handle_set_registers,
+	[KVMI_GET_PAGE_ACCESS]  = handle_get_page_access,
+	[KVMI_SET_PAGE_ACCESS]  = handle_set_page_access,
+	[KVMI_INJECT_EXCEPTION] = handle_inject_exception,
+	[KVMI_CONTROL_EVENTS]   = handle_control_events,
+	[KVMI_GET_CPUID]        = handle_get_cpuid,
+	[KVMI_GET_XSAVE]        = handle_get_xsave,
+};
+
+void kvmi_msg_handle_vcpu_cmd(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvmi_msg_hdr *msg = (void *) ivcpu->msg_buf;
+	u8 *req = ivcpu->msg_buf + sizeof(*msg);
+	int err;
+
+	err = msg_vcpu[msg->id](vcpu, msg, req);
+
+	if (err)
+		kvm_err("%s: id:%u (%s) err:%d\n", __func__, msg->id,
+			id2str(msg->id), err);
+
+	/*
+	 * No error code is returned.
+	 *
+	 * The introspector gets its error code from the message handler
+	 * or the socket gets closed (in which case QEMU should reconnect).
+	 */
+}
+
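+/*
+ * Read the known (minimum) part of the message first, then let the
+ * per-message callback compute the full size from it; the variable
+ * tail is read only if the advertised size matches.
+ */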
+static int kvmi_msg_recv_varlen(struct kvmi *ikvm, size_t(*cbk) (const void *),
+				size_t min_n, size_t msg_size)
+{
+	size_t extra_n;
+	u8 *extra_buf;
+	int err;
+
+	if (min_n > msg_size) {
+		kvm_err("%s: got %zu bytes instead of min %zu\n",
+			__func__, msg_size, min_n);
+		return -EINVAL;
+	}
+
+	if (!min_n)
+		return 0;
+
+	err = kvmi_sock_read(ikvm, ikvm->msg_buf, min_n);
+
+	extra_buf = ikvm->msg_buf + min_n;
+	extra_n = msg_size - min_n;
+
+	if (!err && extra_n) {
+		if (cbk(ikvm->msg_buf) == msg_size)
+			err = kvmi_sock_read(ikvm, extra_buf, extra_n);
+		else
+			err = -EINVAL;
+	}
+
+	return err;
+}
+
+static int kvmi_msg_recv_n(struct kvmi *ikvm, size_t n, size_t msg_size)
+{
+	if (n != msg_size) {
+		kvm_err("%s: got %zu bytes instead of %zu\n",
+			__func__, msg_size, n);
+		return -EINVAL;
+	}
+
+	if (!n)
+		return 0;
+
+	return kvmi_sock_read(ikvm, ikvm->msg_buf, n);
+}
+
+static int kvmi_msg_recv(struct kvmi *ikvm, const struct kvmi_msg_hdr *msg)
+{
+	size_t (*cbk)(const void *) = msg_bytes[msg->id].cbk_full_size;
+	size_t expected = msg_bytes[msg->id].size;
+
+	if (cbk)
+		return kvmi_msg_recv_varlen(ikvm, cbk, expected, msg->size);
+	else
+		return kvmi_msg_recv_n(ikvm, expected, msg->size);
+}
+
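+/*
+ * Every vCPU-targeted command starts with this header (placed right
+ * after struct kvmi_msg_hdr), naming the vCPU that must handle it.
+ */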
+struct vcpu_msg_hdr {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+static int kvmi_msg_queue_to_vcpu(struct kvmi *ikvm,
+				  const struct kvmi_msg_hdr *msg)
+{
+	struct vcpu_msg_hdr *vcpu_hdr = (struct vcpu_msg_hdr *)ikvm->msg_buf;
+	struct kvmi_vcpu *ivcpu;
+	struct kvm_vcpu *vcpu;
+
+	if (msg->size < sizeof(*vcpu_hdr)) {
+		kvm_err("%s: invalid vcpu message: %d\n", __func__, msg->size);
+		return -EINVAL; /* ABI error */
+	}
+
+	vcpu = kvmi_get_vcpu(ikvm, vcpu_hdr->vcpu);
+
+	if (!vcpu) {
+		kvm_err("%s: invalid vcpu: %d\n", __func__, vcpu_hdr->vcpu);
+		return kvmi_msg_reply(ikvm, msg, -KVM_EINVAL, NULL, 0);
+	}
+
+	ivcpu = vcpu->kvmi;
+
+	if (!ivcpu) {
+		kvm_err("%s: not introspected vcpu: %d\n",
+			__func__, vcpu_hdr->vcpu);
+		return kvmi_msg_reply(ikvm, msg, -KVM_EAGAIN, NULL, 0);
+	}
+
+	if (test_bit(REQ_CMD, &ivcpu->requests)) {
+		kvm_err("%s: vcpu is busy: %d\n", __func__, vcpu_hdr->vcpu);
+		return kvmi_msg_reply(ikvm, msg, -KVM_EBUSY, NULL, 0);
+	}
+
+	memcpy(ivcpu->msg_buf, msg, sizeof(*msg));
+	memcpy(ivcpu->msg_buf + sizeof(*msg), ikvm->msg_buf, msg->size);
+
+	kvmi_make_request(ivcpu, REQ_CMD);
+	kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
+	kvm_vcpu_kick(vcpu);
+
+	return 0;
+}
+
+static bool kvmi_msg_dispatch_cmd(struct kvmi *ikvm,
+				  const struct kvmi_msg_hdr *msg)
+{
+	int err = kvmi_msg_recv(ikvm, msg);
+
+	if (err)
+		goto out;
+
+	if (!KVMI_ALLOWED_COMMAND(msg->id, ikvm->cmd_allow_mask)) {
+		err = kvmi_msg_reply(ikvm, msg, -KVM_EPERM, NULL, 0);
+		goto out;
+	}
+
+	if (msg_vcpu[msg->id])
+		err = kvmi_msg_queue_to_vcpu(ikvm, msg);
+	else
+		err = msg_vm[msg->id](ikvm, msg, ikvm->msg_buf);
+
+out:
+	if (err)
+		kvm_err("%s: id:%u (%s) err:%d\n", __func__, msg->id,
+			id2str(msg->id), err);
+
+	return (err == 0);
+}
+
+static bool handle_unsupported_msg(struct kvmi *ikvm,
+				   const struct kvmi_msg_hdr *msg)
+{
+	int err;
+
+	kvm_err("%s: %u\n", __func__, msg->id);
+
+	err = consume_sock_bytes(ikvm, msg->size);
+
+	if (!err)
+		err = kvmi_msg_reply(ikvm, msg, -KVM_ENOSYS, NULL, 0);
+
+	return (err == 0);
+}
+
+static bool kvmi_msg_dispatch(struct kvmi *ikvm)
+{
+	struct kvmi_msg_hdr msg;
+	int err;
+
+	err = kvmi_sock_read(ikvm, &msg, sizeof(msg));
+
+	if (err) {
+		kvm_err("%s: can't read\n", __func__);
+		return false;
+	}
+
+	trace_kvmi_msg_dispatch(msg.id, msg.size);
+
+	kvm_debug("%s: id:%u (%s) size:%u\n", __func__, msg.id,
+		  id2str(msg.id), msg.size);
+
+	if (msg.id == KVMI_EVENT_REPLY)
+		return kvmi_msg_dispatch_reply(ikvm, &msg);
+
+	if (msg.id >= ARRAY_SIZE(msg_bytes)
+	    || (!msg_vm[msg.id] && !msg_vcpu[msg.id]))
+		return handle_unsupported_msg(ikvm, &msg);
+
+	return kvmi_msg_dispatch_cmd(ikvm, &msg);
+}
+
+static void kvmi_sock_close(struct kvmi *ikvm)
+{
+	kvm_info("%s\n", __func__);
+
+	write_lock(&ikvm->sock_lock);
+
+	if (ikvm->sock) {
+		kvm_info("Release the socket\n");
+		sockfd_put(ikvm->sock);
+
+		ikvm->sock = NULL;
+	}
+
+	write_unlock(&ikvm->sock_lock);
+}
+
+bool kvmi_msg_process(struct kvmi *ikvm)
+{
+	if (!kvmi_msg_dispatch(ikvm)) {
+		kvmi_sock_close(ikvm);
+		return false;
+	}
+	return true;
+}
+
+static void kvmi_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev,
+			     u32 ev_id)
+{
+	memset(ev, 0, sizeof(*ev));
+	ev->vcpu = vcpu->vcpu_id;
+	ev->event = ev_id;
+	kvm_arch_vcpu_ioctl_get_regs(vcpu, &ev->regs);
+	kvm_arch_vcpu_ioctl_get_sregs(vcpu, &ev->sregs);
+	ev->mode = kvmi_vcpu_mode(vcpu, &ev->sregs);
+	kvmi_get_msrs(vcpu, ev);
+}
+
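+/*
+ * Send an event message (header + common vCPU state + event-specific
+ * data) and wait, via kvmi_handle_request(), for the introspector's
+ * reply. Returns false if no reply was received.
+ */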
+static bool kvmi_send_event(struct kvm_vcpu *vcpu, u32 ev_id,
+			    void *ev,  size_t ev_size,
+			    void *rpl, size_t rpl_size)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvmi_event common;
+	struct kvmi_msg_hdr h;
+	struct kvec vec[3] = {
+		{.iov_base = &h,      .iov_len = sizeof(h)     },
+		{.iov_base = &common, .iov_len = sizeof(common)},
+		{.iov_base = ev,      .iov_len = ev_size       },
+	};
+	size_t msg_size = sizeof(h) + sizeof(common) + ev_size;
+	size_t n = ev_size ? ARRAY_SIZE(vec) : ARRAY_SIZE(vec)-1;
+
+	memset(&h, 0, sizeof(h));
+	h.id = KVMI_EVENT;
+	h.seq = new_seq();
+	h.size = msg_size - sizeof(h);
+
+	kvmi_setup_event(vcpu, &common, ev_id);
+
+	ivcpu->ev_rpl_ptr = rpl;
+	ivcpu->ev_rpl_size = rpl_size;
+	ivcpu->ev_seq = h.seq;
+	ivcpu->ev_rpl_received = -1;
+	WRITE_ONCE(ivcpu->ev_rpl_waiting, true);
+	/* kvmi_vcpu_waiting_for_reply() */
+	smp_wmb();
+
+	trace_kvmi_send_event(ev_id);
+
+	kvm_debug("%s: %-11s(seq:%u) size:%lu vcpu:%d\n",
+		  __func__, event_str(ev_id), h.seq, ev_size, vcpu->vcpu_id);
+
+	if (kvmi_sock_write(IKVM(vcpu->kvm), vec, n, msg_size) == 0)
+		kvmi_handle_request(vcpu);
+
+	kvm_debug("%s: reply for %-11s(seq:%u) size:%lu vcpu:%d\n",
+		  __func__, event_str(ev_id), h.seq, rpl_size, vcpu->vcpu_id);
+
+	return (ivcpu->ev_rpl_received >= 0);
+}
+
+u32 kvmi_msg_send_cr(struct kvm_vcpu *vcpu, u32 cr, u64 old_value,
+		     u64 new_value, u64 *ret_value)
+{
+	struct kvmi_event_cr e;
+	struct kvmi_event_cr_reply r;
+
+	memset(&e, 0, sizeof(e));
+	e.cr = cr;
+	e.old_value = old_value;
+	e.new_value = new_value;
+
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_CR, &e, sizeof(e),
+				&r, sizeof(r))) {
+		*ret_value = new_value;
+		return KVMI_EVENT_ACTION_CONTINUE;
+	}
+
+	*ret_value = r.new_val;
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_msr(struct kvm_vcpu *vcpu, u32 msr, u64 old_value,
+		      u64 new_value, u64 *ret_value)
+{
+	struct kvmi_event_msr e;
+	struct kvmi_event_msr_reply r;
+
+	memset(&e, 0, sizeof(e));
+	e.msr = msr;
+	e.old_value = old_value;
+	e.new_value = new_value;
+
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_MSR, &e, sizeof(e),
+				&r, sizeof(r))) {
+		*ret_value = new_value;
+		return KVMI_EVENT_ACTION_CONTINUE;
+	}
+
+	*ret_value = r.new_val;
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_xsetbv(struct kvm_vcpu *vcpu)
+{
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_XSETBV, NULL, 0, NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_bp(struct kvm_vcpu *vcpu, u64 gpa)
+{
+	struct kvmi_event_breakpoint e;
+
+	memset(&e, 0, sizeof(e));
+	e.gpa = gpa;
+
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_BREAKPOINT,
+				&e, sizeof(e), NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_hypercall(struct kvm_vcpu *vcpu)
+{
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_HYPERCALL, NULL, 0, NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+bool kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u32 mode,
+		      u32 *action, bool *trap_access, u8 *ctx_data,
+		      u32 *ctx_size)
+{
+	u32 max_ctx_size = *ctx_size;
+	struct kvmi_event_page_fault e;
+	struct kvmi_event_page_fault_reply r;
+
+	memset(&e, 0, sizeof(e));
+	e.gpa = gpa;
+	e.gva = gva;
+	e.mode = mode;
+
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_PAGE_FAULT, &e, sizeof(e),
+				&r, sizeof(r)))
+		return false;
+
+	*action = IVCPU(vcpu)->ev_rpl.action;
+	*trap_access = r.trap_access;
+	*ctx_size = 0;
+
+	if (r.ctx_size <= max_ctx_size) {
+		*ctx_size = min_t(u32, r.ctx_size, sizeof(r.ctx_data));
+		if (*ctx_size)
+			memcpy(ctx_data, r.ctx_data, *ctx_size);
+	} else {
+		kvm_err("%s: ctx_size (recv:%u max:%u)\n", __func__,
+			r.ctx_size, *ctx_size);
+		/*
+		 * TODO: This is an ABI error.
+		 * We should shutdown the socket?
+		 */
+	}
+
+	return true;
+}
+
+u32 kvmi_msg_send_trap(struct kvm_vcpu *vcpu, u32 vector, u32 type,
+		       u32 error_code, u64 cr2)
+{
+	struct kvmi_event_trap e;
+
+	memset(&e, 0, sizeof(e));
+	e.vector = vector;
+	e.type = type;
+	e.error_code = error_code;
+	e.cr2 = cr2;
+
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_TRAP, &e, sizeof(e), NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_descriptor(struct kvm_vcpu *vcpu, u32 info,
+			     u64 exit_qualification, u8 descriptor, u8 write)
+{
+	struct kvmi_event_descriptor e;
+
+	memset(&e, 0, sizeof(e));
+	e.descriptor = descriptor;
+	e.write = write;
+
+	if (cpu_has_vmx()) {
+		e.arch.vmx.instr_info = info;
+		e.arch.vmx.exit_qualification = exit_qualification;
+	} else {
+		e.arch.svm.exit_info = info;
+	}
+
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_DESCRIPTOR,
+				&e, sizeof(e), NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu)
+{
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_CREATE_VCPU, NULL, 0, NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_pause_vcpu(struct kvm_vcpu *vcpu)
+{
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_PAUSE_VCPU, NULL, 0, NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
@ 2017-12-18 19:06   ` Adalber Lazăr
  0 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar,
	Nicușor Cîțu, Mircea Cîrjaliu,
	Marian Rotariu

From: Adalbert Lazar <alazar@bitdefender.com>

This subsystem is split into three source files:
 - kvmi_msg.c - ABI and socket related functions
 - kvmi_mem.c - handle map/unmap requests from the introspector
 - kvmi.c - everything else

The new data used by this subsystem is attached to the 'kvm' and
'kvm_vcpu' structures as opaque pointers (to 'kvmi' and 'kvmi_vcpu'
structures).

Besides the KVMI subsystem itself, this patch exports the
kvm_vcpu_ioctl_x86_get_xsave() and mm_find_pmd() functions,
adds a new vCPU request (KVM_REQ_INTROSPECTION) and a new VM ioctl
(KVM_INTROSPECTION) used to pass the connection file descriptor from QEMU.
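
For illustration only, the QEMU side could hook a guest with a sketch
like the one below; the vm_fd/sock_fd variables and the allow-all masks
are assumptions, not part of this patch:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* hypothetical helper: vm_fd is the VM fd, sock_fd the connected
	 * socket to the introspection tool */
	static int hook_introspection(int vm_fd, int sock_fd)
	{
		struct kvm_introspection intro = {
			.fd = sock_fd,
			.commands = -1,	/* placeholder: allow everything */
			.events = -1,	/* placeholder: allow everything */
		};

		/* hands the connection over to KVM (see kvmi_hook()) */
		return ioctl(vm_fd, KVM_INTROSPECTION, &intro);
	}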

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/Makefile           |    1 +
 arch/x86/kvm/x86.c              |    4 +-
 include/linux/kvm_host.h        |    4 +
 include/linux/kvmi.h            |   32 +
 include/linux/mm.h              |    3 +
 include/trace/events/kvmi.h     |  174 +++++
 include/uapi/linux/kvm.h        |    8 +
 mm/internal.h                   |    5 -
 virt/kvm/kvmi.c                 | 1410 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/kvmi_int.h             |  121 ++++
 virt/kvm/kvmi_mem.c             |  730 ++++++++++++++++++++
 virt/kvm/kvmi_msg.c             | 1134 +++++++++++++++++++++++++++++++
 13 files changed, 3620 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/kvmi.h
 create mode 100644 include/trace/events/kvmi.h
 create mode 100644 virt/kvm/kvmi.c
 create mode 100644 virt/kvm/kvmi_int.h
 create mode 100644 virt/kvm/kvmi_mem.c
 create mode 100644 virt/kvm/kvmi_msg.c

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2cf03ed181e6..1e9e49eaee3b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -73,6 +73,7 @@
 #define KVM_REQ_HV_RESET		KVM_ARCH_REQ(20)
 #define KVM_REQ_HV_EXIT			KVM_ARCH_REQ(21)
 #define KVM_REQ_HV_STIMER		KVM_ARCH_REQ(22)
+#define KVM_REQ_INTROSPECTION           KVM_ARCH_REQ(23)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index dc4f2fdf5e57..ab6225563526 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -9,6 +9,7 @@ CFLAGS_vmx.o := -I.
 KVM := ../../../virt/kvm
 
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
+				$(KVM)/kvmi.o $(KVM)/kvmi_msg.o $(KVM)/kvmi_mem.o \
 				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 74839859c0fd..cdfc7200a018 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3346,8 +3346,8 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
 	}
 }
 
-static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
-					 struct kvm_xsave *guest_xsave)
+void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
+				  struct kvm_xsave *guest_xsave)
 {
 	if (boot_cpu_has(X86_FEATURE_XSAVE)) {
 		memset(guest_xsave, 0, sizeof(struct kvm_xsave));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 68e4d756f5c9..eae0598e18a5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -274,6 +274,7 @@ struct kvm_vcpu {
 	bool preempted;
 	struct kvm_vcpu_arch arch;
 	struct dentry *debugfs_dentry;
+	void *kvmi;
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
@@ -446,6 +447,7 @@ struct kvm {
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
+	void *kvmi;
 };
 
 #define kvm_err(fmt, ...) \
@@ -779,6 +781,8 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 					struct kvm_guest_debug *dbg);
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run);
+void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
+				  struct kvm_xsave *guest_xsave);
 
 int kvm_arch_init(void *opaque);
 void kvm_arch_exit(void);
diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
new file mode 100644
index 000000000000..7fac1d23f67c
--- /dev/null
+++ b/include/linux/kvmi.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVMI_H__
+#define __KVMI_H__
+
+#define kvmi_is_present() 1
+
+int kvmi_init(void);
+void kvmi_uninit(void);
+void kvmi_destroy_vm(struct kvm *kvm);
+bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu);
+void kvmi_vcpu_init(struct kvm_vcpu *vcpu);
+void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu);
+bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
+		   unsigned long old_value, unsigned long *new_value);
+bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr);
+void kvmi_xsetbv_event(struct kvm_vcpu *vcpu);
+bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva);
+bool kvmi_is_agent_hypercall(struct kvm_vcpu *vcpu);
+void kvmi_hypercall_event(struct kvm_vcpu *vcpu);
+bool kvmi_lost_exception(struct kvm_vcpu *vcpu);
+void kvmi_trap_event(struct kvm_vcpu *vcpu);
+bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u32 info,
+			   unsigned long exit_qualification,
+			   unsigned char descriptor, unsigned char write);
+void kvmi_flush_mem_access(struct kvm *kvm);
+void kvmi_handle_request(struct kvm_vcpu *vcpu);
+int kvmi_host_mem_map(struct kvm_vcpu *vcpu, gva_t tkn_gva,
+			     gpa_t req_gpa, gpa_t map_gpa);
+int kvmi_host_mem_unmap(struct kvm_vcpu *vcpu, gpa_t map_gpa);
+
+
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ea818ff739cd..b659c7436789 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1115,6 +1115,9 @@ void page_address_init(void);
 #define page_address_init()  do { } while(0)
 #endif
 
+/* rmap.c */
+extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
+
 extern void *page_rmapping(struct page *page);
 extern struct anon_vma *page_anon_vma(struct page *page);
 extern struct address_space *page_mapping(struct page *page);
diff --git a/include/trace/events/kvmi.h b/include/trace/events/kvmi.h
new file mode 100644
index 000000000000..dc36fd3b30dc
--- /dev/null
+++ b/include/trace/events/kvmi.h
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM kvmi
+
+#if !defined(_TRACE_KVMI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_KVMI_H
+
+#include <linux/tracepoint.h>
+
+#ifndef __TRACE_KVMI_STRUCTURES
+#define __TRACE_KVMI_STRUCTURES
+
+#undef EN
+#define EN(x) { x, #x }
+
+static const struct trace_print_flags kvmi_msg_id_symbol[] = {
+	EN(KVMI_GET_VERSION),
+	EN(KVMI_PAUSE_VCPU),
+	EN(KVMI_GET_GUEST_INFO),
+	EN(KVMI_GET_REGISTERS),
+	EN(KVMI_SET_REGISTERS),
+	EN(KVMI_GET_PAGE_ACCESS),
+	EN(KVMI_SET_PAGE_ACCESS),
+	EN(KVMI_INJECT_EXCEPTION),
+	EN(KVMI_READ_PHYSICAL),
+	EN(KVMI_WRITE_PHYSICAL),
+	EN(KVMI_GET_MAP_TOKEN),
+	EN(KVMI_CONTROL_EVENTS),
+	EN(KVMI_CONTROL_CR),
+	EN(KVMI_CONTROL_MSR),
+	EN(KVMI_EVENT),
+	EN(KVMI_EVENT_REPLY),
+	EN(KVMI_GET_CPUID),
+	EN(KVMI_GET_XSAVE),
+	{-1, NULL}
+};
+
+static const struct trace_print_flags kvmi_event_id_symbol[] = {
+	EN(KVMI_EVENT_CR),
+	EN(KVMI_EVENT_MSR),
+	EN(KVMI_EVENT_XSETBV),
+	EN(KVMI_EVENT_BREAKPOINT),
+	EN(KVMI_EVENT_HYPERCALL),
+	EN(KVMI_EVENT_PAGE_FAULT),
+	EN(KVMI_EVENT_TRAP),
+	EN(KVMI_EVENT_DESCRIPTOR),
+	EN(KVMI_EVENT_CREATE_VCPU),
+	EN(KVMI_EVENT_PAUSE_VCPU),
+	{-1, NULL}
+};
+
+static const struct trace_print_flags kvmi_action_symbol[] = {
+	{KVMI_EVENT_ACTION_CONTINUE, "continue"},
+	{KVMI_EVENT_ACTION_RETRY, "retry"},
+	{KVMI_EVENT_ACTION_CRASH, "crash"},
+	{-1, NULL}
+};
+
+#endif /* __TRACE_KVMI_STRUCTURES */
+
+TRACE_EVENT(
+	kvmi_msg_dispatch,
+	TP_PROTO(__u16 id, __u16 size),
+	TP_ARGS(id, size),
+	TP_STRUCT__entry(
+		__field(__u16, id)
+		__field(__u16, size)
+	),
+	TP_fast_assign(
+		__entry->id = id;
+		__entry->size = size;
+	),
+	TP_printk("%s size %u",
+		  trace_print_symbols_seq(p, __entry->id, kvmi_msg_id_symbol),
+		  __entry->size)
+);
+
+TRACE_EVENT(
+	kvmi_send_event,
+	TP_PROTO(__u32 id),
+	TP_ARGS(id),
+	TP_STRUCT__entry(
+		__field(__u32, id)
+	),
+	TP_fast_assign(
+		__entry->id = id;
+	),
+	TP_printk("%s",
+		trace_print_symbols_seq(p, __entry->id, kvmi_event_id_symbol))
+);
+
+#define KVMI_ACCESS_PRINTK() ({                                         \
+	const char *saved_ptr = trace_seq_buffer_ptr(p);		\
+	static const char * const access_str[] = {			\
+		"---", "r--", "-w-", "rw-", "--x", "r-x", "-wx", "rwx"  \
+	};							        \
+	trace_seq_printf(p, "%s", access_str[__entry->access & 7]);	\
+	saved_ptr;							\
+})
+
+TRACE_EVENT(
+	kvmi_set_mem_access,
+	TP_PROTO(__u64 gfn, __u8 access, int err),
+	TP_ARGS(gfn, access, err),
+	TP_STRUCT__entry(
+		__field(__u64, gfn)
+		__field(__u8, access)
+		__field(int, err)
+	),
+	TP_fast_assign(
+		__entry->gfn = gfn;
+		__entry->access = access;
+		__entry->err = err;
+	),
+	TP_printk("gfn %llx %s %s %d",
+		  __entry->gfn, KVMI_ACCESS_PRINTK(),
+		  __entry->err ? "failed" : "succeeded", __entry->err)
+);
+
+TRACE_EVENT(
+	kvmi_apply_mem_access,
+	TP_PROTO(__u64 gfn, __u8 access, int err),
+	TP_ARGS(gfn, access, err),
+	TP_STRUCT__entry(
+		__field(__u64, gfn)
+		__field(__u8, access)
+		__field(int, err)
+	),
+	TP_fast_assign(
+		__entry->gfn = gfn;
+		__entry->access = access;
+		__entry->err = err;
+	),
+	TP_printk("gfn %llx %s flush %s %d",
+		  __entry->gfn, KVMI_ACCESS_PRINTK(),
+		  __entry->err ? "failed" : "succeeded", __entry->err)
+);
+
+TRACE_EVENT(
+	kvmi_event_page_fault,
+	TP_PROTO(__u64 gpa, __u64 gva, __u8 access, __u64 old_rip,
+		 __u32 action, __u64 new_rip, __u32 ctx_size),
+	TP_ARGS(gpa, gva, access, old_rip, action, new_rip, ctx_size),
+	TP_STRUCT__entry(
+		__field(__u64, gpa)
+		__field(__u64, gva)
+		__field(__u8, access)
+		__field(__u64, old_rip)
+		__field(__u32, action)
+		__field(__u64, new_rip)
+		__field(__u32, ctx_size)
+	),
+	TP_fast_assign(
+		__entry->gpa = gpa;
+		__entry->gva = gva;
+		__entry->access = access;
+		__entry->old_rip = old_rip;
+		__entry->action = action;
+		__entry->new_rip = new_rip;
+		__entry->ctx_size = ctx_size;
+	),
+	TP_printk("gpa %llx %s gva %llx rip %llx -> %s rip %llx ctx %u",
+		  __entry->gpa,
+		  KVMI_ACCESS_PRINTK(),
+		  __entry->gva,
+		  __entry->old_rip,
+		  trace_print_symbols_seq(p, __entry->action,
+					  kvmi_action_symbol),
+		  __entry->new_rip, __entry->ctx_size)
+);
+
+#endif /* _TRACE_KVMI_H */
+
+#include <trace/define_trace.h>
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 496e59a2738b..6b7c4469b808 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1359,6 +1359,14 @@ struct kvm_s390_ucas_mapping {
 #define KVM_S390_GET_CMMA_BITS      _IOWR(KVMIO, 0xb8, struct kvm_s390_cmma_log)
 #define KVM_S390_SET_CMMA_BITS      _IOW(KVMIO, 0xb9, struct kvm_s390_cmma_log)
 
+struct kvm_introspection {
+	int fd;
+	__u32 padding;
+	__u32 commands;
+	__u32 events;
+};
+#define KVM_INTROSPECTION      _IOW(KVMIO, 0xff, struct kvm_introspection)
+
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
 #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..9d363c802305 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -92,11 +92,6 @@ extern unsigned long highest_memmap_pfn;
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
 
-/*
- * in mm/rmap.c:
- */
-extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
-
 /*
  * in mm/page_alloc.c
  */
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
new file mode 100644
index 000000000000..c4cdaeddac45
--- /dev/null
+++ b/virt/kvm/kvmi.c
@@ -0,0 +1,1410 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection
+ *
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ */
+#include <linux/mmu_context.h>
+#include <linux/random.h>
+#include <uapi/linux/kvmi.h>
+#include <uapi/asm/kvmi.h>
+#include "../../arch/x86/kvm/x86.h"
+#include "../../arch/x86/kvm/mmu.h"
+#include <asm/vmx.h>
+#include "cpuid.h"
+#include "kvmi_int.h"
+#include <asm/kvm_page_track.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/kvmi.h>
+
+struct kvmi_mem_access {
+	struct list_head link;
+	gfn_t gfn;
+	u8 access;
+	bool active[KVM_PAGE_TRACK_MAX];
+	struct kvm_memory_slot *slot;
+};
+
+static void wakeup_events(struct kvm *kvm);
+static bool kvmi_page_fault_event(struct kvm_vcpu *vcpu, unsigned long gpa,
+			   unsigned long gva, u8 access);
+
+static struct workqueue_struct *wq;
+
+static const u8 full_access = KVMI_PAGE_ACCESS_R |
+			      KVMI_PAGE_ACCESS_W | KVMI_PAGE_ACCESS_X;
+
+static const struct {
+	unsigned int allow_bit;
+	enum kvm_page_track_mode track_mode;
+} track_modes[] = {
+	{ KVMI_PAGE_ACCESS_R, KVM_PAGE_TRACK_PREREAD },
+	{ KVMI_PAGE_ACCESS_W, KVM_PAGE_TRACK_PREWRITE },
+	{ KVMI_PAGE_ACCESS_X, KVM_PAGE_TRACK_PREEXEC },
+};
+
+void kvmi_make_request(struct kvmi_vcpu *ivcpu, int req)
+{
+	set_bit(req, &ivcpu->requests);
+	/* Make sure the bit is set when the worker wakes up */
+	smp_wmb();
+	up(&ivcpu->sem_requests);
+}
+
+void kvmi_clear_request(struct kvmi_vcpu *ivcpu, int req)
+{
+	clear_bit(req, &ivcpu->requests);
+}
+
+int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	/*
+	 * This vcpu is already stopped, executing this command
+	 * as a result of the REQ_CMD bit being set
+	 * (see kvmi_handle_request).
+	 */
+	if (ivcpu->pause)
+		return -KVM_EBUSY;
+
+	ivcpu->pause = true;
+
+	return 0;
+}
+
+static void kvmi_apply_mem_access(struct kvm *kvm,
+				  struct kvm_memory_slot *slot,
+				  struct kvmi_mem_access *m)
+{
+	int idx, k;
+
+	if (!slot) {
+		slot = gfn_to_memslot(kvm, m->gfn);
+		if (!slot)
+			return;
+	}
+
+	idx = srcu_read_lock(&kvm->srcu);
+
+	spin_lock(&kvm->mmu_lock);
+
+	for (k = 0; k < ARRAY_SIZE(track_modes); k++) {
+		unsigned int allow_bit = track_modes[k].allow_bit;
+		enum kvm_page_track_mode mode = track_modes[k].track_mode;
+
+		if (m->access & allow_bit) {
+			if (m->active[mode] && m->slot == slot) {
+				kvm_slot_page_track_remove_page(kvm, slot,
+								m->gfn, mode);
+				m->active[mode] = false;
+				m->slot = NULL;
+			}
+		} else if (!m->active[mode] || m->slot != slot) {
+			kvm_slot_page_track_add_page(kvm, slot, m->gfn, mode);
+			m->active[mode] = true;
+			m->slot = slot;
+		}
+	}
+
+	spin_unlock(&kvm->mmu_lock);
+
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
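+/*
+ * Record the desired access bits for this gfn. The page tracking
+ * changes are applied later, from the vCPU threads, through
+ * kvmi_flush_mem_access()/kvmi_apply_mem_access().
+ */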
+int kvmi_set_mem_access(struct kvm *kvm, u64 gpa, u8 access)
+{
+	struct kvmi_mem_access *m;
+	struct kvmi_mem_access *__m;
+	struct kvmi *ikvm = IKVM(kvm);
+	gfn_t gfn = gpa_to_gfn(gpa);
+
+	if (kvm_is_error_hva(gfn_to_hva_safe(kvm, gfn)))
+		kvm_err("Invalid gpa %llx (or memslot not available yet)", gpa);
+
+	m = kzalloc(sizeof(struct kvmi_mem_access), GFP_KERNEL);
+	if (!m)
+		return -KVM_ENOMEM;
+
+	INIT_LIST_HEAD(&m->link);
+	m->gfn = gfn;
+	m->access = access;
+
+	mutex_lock(&ikvm->access_tree_lock);
+	__m = radix_tree_lookup(&ikvm->access_tree, m->gfn);
+	if (__m) {
+		__m->access = m->access;
+		if (list_empty(&__m->link))
+			list_add_tail(&__m->link, &ikvm->access_list);
+	} else {
+		radix_tree_insert(&ikvm->access_tree, m->gfn, m);
+		list_add_tail(&m->link, &ikvm->access_list);
+		m = NULL;
+	}
+	mutex_unlock(&ikvm->access_tree_lock);
+
+	kfree(m);
+
+	return 0;
+}
+
+static bool kvmi_test_mem_access(struct kvm *kvm, unsigned long gpa,
+				 u8 access)
+{
+	struct kvmi_mem_access *m;
+	struct kvmi *ikvm = IKVM(kvm);
+
+	if (!ikvm)
+		return false;
+
+	mutex_lock(&ikvm->access_tree_lock);
+	m = radix_tree_lookup(&ikvm->access_tree, gpa_to_gfn(gpa));
+	mutex_unlock(&ikvm->access_tree_lock);
+
+	/*
+	 * We want to be notified only for violations involving access
+	 * bits that we've specifically cleared
+	 */
+	if (m && ((~m->access) & access))
+		return true;
+
+	return false;
+}
+
+static struct kvmi_mem_access *
+kvmi_get_mem_access_unlocked(struct kvm *kvm, const gfn_t gfn)
+{
+	return radix_tree_lookup(&IKVM(kvm)->access_tree, gfn);
+}
+
+static bool is_introspected(struct kvmi *ikvm)
+{
+	return (ikvm && ikvm->sock);
+}
+
+void kvmi_flush_mem_access(struct kvm *kvm)
+{
+	struct kvmi *ikvm = IKVM(kvm);
+
+	if (!ikvm)
+		return;
+
+	mutex_lock(&ikvm->access_tree_lock);
+	while (!list_empty(&ikvm->access_list)) {
+		struct kvmi_mem_access *m =
+			list_first_entry(&ikvm->access_list,
+					 struct kvmi_mem_access, link);
+
+		list_del_init(&m->link);
+
+		kvmi_apply_mem_access(kvm, NULL, m);
+
+		if (m->access == full_access) {
+			radix_tree_delete(&ikvm->access_tree, m->gfn);
+			kfree(m);
+		}
+	}
+	mutex_unlock(&ikvm->access_tree_lock);
+}
+
+static void kvmi_free_mem_access(struct kvm *kvm)
+{
+	void **slot;
+	struct radix_tree_iter iter;
+	struct kvmi *ikvm = IKVM(kvm);
+
+	mutex_lock(&ikvm->access_tree_lock);
+	radix_tree_for_each_slot(slot, &ikvm->access_tree, &iter, 0) {
+		struct kvmi_mem_access *m = *slot;
+
+		m->access = full_access;
+		kvmi_apply_mem_access(kvm, NULL, m);
+
+		radix_tree_delete(&ikvm->access_tree, m->gfn);
+		kfree(*slot);
+	}
+	mutex_unlock(&ikvm->access_tree_lock);
+}
+
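+/*
+ * Select the bitmap covering the given MSR: 'low' for the 0x0-0x1fff
+ * range, 'high' for 0xc0000000-0xc0001fff (with the index rebased).
+ * Other MSRs are not monitored.
+ */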
+static unsigned long *msr_mask(struct kvmi *ikvm, unsigned int *msr)
+{
+	switch (*msr) {
+	case 0 ... 0x1fff:
+		return ikvm->msr_mask.low;
+	case 0xc0000000 ... 0xc0001fff:
+		*msr &= 0x1fff;
+		return ikvm->msr_mask.high;
+	}
+	return NULL;
+}
+
+static bool test_msr_mask(struct kvmi *ikvm, unsigned int msr)
+{
+	unsigned long *mask = msr_mask(ikvm, &msr);
+
+	if (!mask)
+		return false;
+	if (!test_bit(msr, mask))
+		return false;
+
+	return true;
+}
+
+static int msr_control(struct kvmi *ikvm, unsigned int msr, bool enable)
+{
+	unsigned long *mask = msr_mask(ikvm, &msr);
+
+	if (!mask)
+		return -KVM_EINVAL;
+	if (enable)
+		set_bit(msr, mask);
+	else
+		clear_bit(msr, mask);
+	return 0;
+}
+
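+/*
+ * Return the vCPU's current operand/address size in bytes: 8 in 64-bit
+ * mode (CS.L set), otherwise 4 or 2 depending on CS.DB.
+ */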
+unsigned int kvmi_vcpu_mode(const struct kvm_vcpu *vcpu,
+				   const struct kvm_sregs *sregs)
+{
+	unsigned int mode = 0;
+
+	if (is_long_mode((struct kvm_vcpu *) vcpu)) {
+		if (sregs->cs.l)
+			mode = 8;
+		else if (!sregs->cs.db)
+			mode = 2;
+		else
+			mode = 4;
+	} else if (sregs->cr0 & X86_CR0_PE) {
+		if (!sregs->cs.db)
+			mode = 2;
+		else
+			mode = 4;
+	} else if (!sregs->cs.db)
+		mode = 2;
+	else
+		mode = 4;
+
+	return mode;
+}
+
+static int maybe_delayed_init(void)
+{
+	if (wq)
+		return 0;
+
+	wq = alloc_workqueue("kvmi", WQ_CPU_INTENSIVE, 0);
+	if (!wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+int kvmi_init(void)
+{
+	return 0;
+}
+
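+/*
+ * The per-VM worker: dispatch messages from the socket until the
+ * connection drops, then clear the event/CR/MSR filters, wake up the
+ * vCPUs and drop any page access restrictions.
+ */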
+static void work_cb(struct work_struct *work)
+{
+	struct kvmi *ikvm = container_of(work, struct kvmi, work);
+	struct kvm   *kvm = ikvm->kvm;
+
+	while (kvmi_msg_process(ikvm))
+		;
+
+	/* We are no longer interested in any kind of events */
+	atomic_set(&ikvm->event_mask, 0);
+
+	/* Clean-up for the next kvmi_hook() call */
+	ikvm->cr_mask = 0;
+	memset(&ikvm->msr_mask, 0, sizeof(ikvm->msr_mask));
+
+	wakeup_events(kvm);
+
+	/* Restore the spte access rights */
+	/* Shouldn't we wait for a reconnection instead? */
+	kvmi_free_mem_access(kvm);
+
+	complete_all(&ikvm->finished);
+}
+
+static void __alloc_vcpu_kvmi(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = kzalloc(sizeof(struct kvmi_vcpu), GFP_KERNEL);
+
+	if (ivcpu) {
+		sema_init(&ivcpu->sem_requests, 0);
+
+		/*
+		 * Make sure the ivcpu is initialized
+		 * before making it visible.
+		 */
+		smp_wmb();
+
+		vcpu->kvmi = ivcpu;
+
+		kvmi_make_request(ivcpu, REQ_INIT);
+		kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
+	}
+}
+
+void kvmi_vcpu_init(struct kvm_vcpu *vcpu)
+{
+	struct kvmi *ikvm = IKVM(vcpu->kvm);
+
+	if (is_introspected(ikvm)) {
+		mutex_lock(&vcpu->kvm->lock);
+		__alloc_vcpu_kvmi(vcpu);
+		mutex_unlock(&vcpu->kvm->lock);
+	}
+}
+
+void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu)
+{
+	kfree(IVCPU(vcpu));
+}
+
+static bool __alloc_kvmi(struct kvm *kvm)
+{
+	struct kvmi *ikvm = kzalloc(sizeof(struct kvmi), GFP_KERNEL);
+
+	if (ikvm) {
+		INIT_LIST_HEAD(&ikvm->access_list);
+		mutex_init(&ikvm->access_tree_lock);
+		INIT_RADIX_TREE(&ikvm->access_tree, GFP_KERNEL);
+		rwlock_init(&ikvm->sock_lock);
+		init_completion(&ikvm->finished);
+		INIT_WORK(&ikvm->work, work_cb);
+
+		kvm->kvmi = ikvm;
+		ikvm->kvm = kvm; /* needed by work_cb() */
+	}
+
+	return (ikvm != NULL);
+}
+
+static bool alloc_kvmi(struct kvm *kvm)
+{
+	bool done;
+
+	mutex_lock(&kvm->lock);
+	done = (
+		maybe_delayed_init() == 0    &&
+		IKVM(kvm)            == NULL &&
+		__alloc_kvmi(kvm)    == true
+	);
+	mutex_unlock(&kvm->lock);
+
+	return done;
+}
+
+static void alloc_all_kvmi_vcpu(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		if (!IKVM(vcpu))
+			__alloc_vcpu_kvmi(vcpu);
+	mutex_unlock(&kvm->lock);
+}
+
+static bool setup_socket(struct kvm *kvm, struct kvm_introspection *qemu)
+{
+	struct kvmi *ikvm = IKVM(kvm);
+
+	if (is_introspected(ikvm)) {
+		kvm_err("Guest already introspected\n");
+		return false;
+	}
+
+	if (!kvmi_msg_init(ikvm, qemu->fd))
+		return false;
+
+	ikvm->cmd_allow_mask = -1; /* TODO: qemu->commands; */
+	ikvm->event_allow_mask = -1; /* TODO: qemu->events; */
+
+	alloc_all_kvmi_vcpu(kvm);
+	queue_work(wq, &ikvm->work);
+
+	return true;
+}
+
+/*
+ * When called from outside a page fault handler, this call should
+ * return ~0ull
+ */
+static u64 kvmi_mmu_fault_gla(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+	u64 gla;
+	u64 gla_val;
+	u64 v;
+
+	if (!vcpu->arch.gpa_available)
+		return ~0ull;
+
+	gla = kvm_mmu_fault_gla(vcpu);
+	if (gla == ~0ull)
+		return gla;
+	gla_val = gla;
+
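+	/*
+	 * Example: with gpa_val == 0x1010 and gpa == 0x1000, the gla
+	 * reported by the MMU is adjusted down by 0x10.
+	 */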
+	/* Handle the potential overflow by returning ~0ull */
+	if (vcpu->arch.gpa_val > gpa) {
+		v = vcpu->arch.gpa_val - gpa;
+		if (v > gla)
+			gla = ~0ull;
+		else
+			gla -= v;
+	} else {
+		v = gpa - vcpu->arch.gpa_val;
+		if (v > (U64_MAX - gla))
+			gla = ~0ull;
+		else
+			gla += v;
+	}
+
+	return gla;
+}
+
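+/*
+ * Pre-read page tracking hook: report the access to the introspector
+ * and, if it replied with custom data, feed those bytes to the emulator
+ * instead of the guest memory content.
+ */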
+static bool kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       u8 *new,
+			       int bytes,
+			       struct kvm_page_track_notifier_node *node,
+			       bool *data_ready)
+{
+	u64 gla;
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	bool ret = true;
+
+	if (kvm_mmu_nested_guest_page_fault(vcpu))
+		return ret;
+	gla = kvmi_mmu_fault_gla(vcpu, gpa);
+	ret = kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_R);
+	if (ivcpu && ivcpu->ctx_size > 0) {
+		int s = min_t(int, bytes, ivcpu->ctx_size);
+
+		memcpy(new, ivcpu->ctx_data, s);
+		ivcpu->ctx_size = 0;
+
+		if (*data_ready)
+			kvm_err("Override custom data");
+
+		*data_ready = true;
+	}
+
+	return ret;
+}
+
+static bool kvmi_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa,
+				const u8 *new,
+				int bytes,
+				struct kvm_page_track_notifier_node *node)
+{
+	u64 gla;
+
+	if (kvm_mmu_nested_guest_page_fault(vcpu))
+		return true;
+	gla = kvmi_mmu_fault_gla(vcpu, gpa);
+	return kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_W);
+}
+
+static bool kvmi_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa,
+				struct kvm_page_track_notifier_node *node)
+{
+	u64 gla;
+
+	if (kvm_mmu_nested_guest_page_fault(vcpu))
+		return true;
+	gla = kvmi_mmu_fault_gla(vcpu, gpa);
+
+	return kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_X);
+}
+
+static void kvmi_track_create_slot(struct kvm *kvm,
+				   struct kvm_memory_slot *slot,
+				   unsigned long npages,
+				   struct kvm_page_track_notifier_node *node)
+{
+	struct kvmi *ikvm = IKVM(kvm);
+	gfn_t start = slot->base_gfn;
+	const gfn_t end = start + npages;
+
+	if (!ikvm)
+		return;
+
+	mutex_lock(&ikvm->access_tree_lock);
+
+	while (start < end) {
+		struct kvmi_mem_access *m;
+
+		m = kvmi_get_mem_access_unlocked(kvm, start);
+		if (m)
+			kvmi_apply_mem_access(kvm, slot, m);
+		start++;
+	}
+
+	mutex_unlock(&ikvm->access_tree_lock);
+}
+
+static void kvmi_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
+				  struct kvm_page_track_notifier_node *node)
+{
+	struct kvmi *ikvm = IKVM(kvm);
+	gfn_t start = slot->base_gfn;
+	const gfn_t end = start + slot->npages;
+
+	if (!ikvm)
+		return;
+
+	mutex_lock(&ikvm->access_tree_lock);
+
+	while (start < end) {
+		struct kvmi_mem_access *m;
+
+		m = kvmi_get_mem_access_unlocked(kvm, start);
+		if (m) {
+			u8 prev_access = m->access;
+
+			m->access = full_access;
+			kvmi_apply_mem_access(kvm, slot, m);
+			m->access = prev_access;
+		}
+		start++;
+	}
+
+	mutex_unlock(&ikvm->access_tree_lock);
+}
+
+static struct kvm_page_track_notifier_node kptn_node = {
+	.track_preread = kvmi_track_preread,
+	.track_prewrite = kvmi_track_prewrite,
+	.track_preexec = kvmi_track_preexec,
+	.track_create_slot = kvmi_track_create_slot,
+	.track_flush_slot = kvmi_track_flush_slot
+};
+
+bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu)
+{
+	kvm_info("Hooking vm with fd: %d\n", qemu->fd);
+
+	kvm_page_track_register_notifier(kvm, &kptn_node);
+
+	return (alloc_kvmi(kvm) && setup_socket(kvm, qemu));
+}
+
+void kvmi_destroy_vm(struct kvm *kvm)
+{
+	struct kvmi *ikvm = IKVM(kvm);
+
+	if (ikvm) {
+		kvmi_msg_uninit(ikvm);
+
+		mutex_destroy(&ikvm->access_tree_lock);
+		kfree(ikvm);
+	}
+
+	kvmi_mem_destroy_vm(kvm);
+}
+
+void kvmi_uninit(void)
+{
+	if (wq) {
+		destroy_workqueue(wq);
+		wq = NULL;
+	}
+}
+
+void kvmi_get_msrs(struct kvm_vcpu *vcpu, struct kvmi_event *event)
+{
+	struct msr_data msr;
+
+	msr.host_initiated = true;
+
+	msr.index = MSR_IA32_SYSENTER_CS;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.sysenter_cs = msr.data;
+
+	msr.index = MSR_IA32_SYSENTER_ESP;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.sysenter_esp = msr.data;
+
+	msr.index = MSR_IA32_SYSENTER_EIP;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.sysenter_eip = msr.data;
+
+	msr.index = MSR_EFER;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.efer = msr.data;
+
+	msr.index = MSR_STAR;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.star = msr.data;
+
+	msr.index = MSR_LSTAR;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.lstar = msr.data;
+
+	msr.index = MSR_CSTAR;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.cstar = msr.data;
+
+	msr.index = MSR_IA32_CR_PAT;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.pat = msr.data;
+}
+
+static bool is_event_enabled(struct kvm *kvm, int event_bit)
+{
+	struct kvmi *ikvm = IKVM(kvm);
+
+	return (ikvm && (atomic_read(&ikvm->event_mask) & event_bit));
+}
+
+static int kvmi_vcpu_kill(int sig, struct kvm_vcpu *vcpu)
+{
+	int err = -ESRCH;
+	struct pid *pid;
+	struct siginfo siginfo[1] = { };
+
+	rcu_read_lock();
+	pid = rcu_dereference(vcpu->pid);
+	if (pid)
+		err = kill_pid_info(sig, siginfo, pid);
+	rcu_read_unlock();
+
+	return err;
+}
+
+static void kvmi_vm_shutdown(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		kvmi_vcpu_kill(SIGTERM, vcpu);
+	}
+	mutex_unlock(&kvm->lock);
+}
+
+/* TODO: Do we need a return code ? */
+static void handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action)
+{
+	switch (action) {
+	case KVMI_EVENT_ACTION_CRASH:
+		kvmi_vm_shutdown(vcpu->kvm);
+		break;
+
+	default:
+		kvm_err("Unsupported event action: %d\n", action);
+	}
+}
+
+bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
+		   unsigned long old_value, unsigned long *new_value)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u64 ret_value;
+	u32 action;
+
+	if (!is_event_enabled(kvm, KVMI_EVENT_CR))
+		return true;
+	if (!test_bit(cr, &IKVM(kvm)->cr_mask))
+		return true;
+	if (old_value == *new_value)
+		return true;
+
+	action = kvmi_msg_send_cr(vcpu, cr, old_value, *new_value, &ret_value);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		*new_value = ret_value;
+		return true;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	return false;
+}
+
+bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u64 ret_value;
+	u32 action;
+	struct msr_data old_msr = { .host_initiated = true,
+				    .index = msr->index };
+
+	if (msr->host_initiated)
+		return true;
+	if (!is_event_enabled(kvm, KVMI_EVENT_MSR))
+		return true;
+	if (!test_msr_mask(IKVM(kvm), msr->index))
+		return true;
+	if (kvm_get_msr(vcpu, &old_msr))
+		return true;
+	if (old_msr.data == msr->data)
+		return true;
+
+	action = kvmi_msg_send_msr(vcpu, msr->index, old_msr.data, msr->data,
+				   &ret_value);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		msr->data = ret_value;
+		return true;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	return false;
+}
+
+void kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_XSETBV))
+		return;
+
+	action = kvmi_msg_send_xsetbv(vcpu);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+}
+
+bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva)
+{
+	u32 action;
+	u64 gpa;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_BREAKPOINT))
+		/* qemu will automatically reinject the breakpoint */
+		return false;
+
+	gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
+
+	if (gpa == UNMAPPED_GVA)
+		kvm_err("%s: invalid gva: %llx", __func__, gva);
+
+	action = kvmi_msg_send_bp(vcpu, gpa);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	case KVMI_EVENT_ACTION_RETRY:
+		/* rip was most likely adjusted past the INT 3 instruction */
+		return true;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	/* qemu will automatically reinject the breakpoint */
+	return false;
+}
+EXPORT_SYMBOL(kvmi_breakpoint_event);
+
+#define KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT 24
+bool kvmi_is_agent_hypercall(struct kvm_vcpu *vcpu)
+{
+	unsigned long subfunc1, subfunc2;
+	bool longmode = is_64_bit_mode(vcpu);
+	unsigned long nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
+
+	if (longmode) {
+		subfunc1 = kvm_register_read(vcpu, VCPU_REGS_RDI);
+		subfunc2 = kvm_register_read(vcpu, VCPU_REGS_RSI);
+	} else {
+		nr &= 0xFFFFFFFF;
+		subfunc1 = kvm_register_read(vcpu, VCPU_REGS_RBX);
+		subfunc1 &= 0xFFFFFFFF;
+		subfunc2 = kvm_register_read(vcpu, VCPU_REGS_RCX);
+		subfunc2 &= 0xFFFFFFFF;
+	}
+
+	return (nr == KVM_HC_XEN_HVM_OP
+		&& subfunc1 == KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT
+		&& subfunc2 == 0);
+}
+
+void kvmi_hypercall_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_HYPERCALL)
+			|| !kvmi_is_agent_hypercall(vcpu))
+		return;
+
+	action = kvmi_msg_send_hypercall(vcpu);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+}
+
+static bool kvmi_page_fault_event(struct kvm_vcpu *vcpu, unsigned long gpa,
+				  unsigned long gva, u8 access)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvmi_vcpu *ivcpu;
+	bool trap_access, ret = true;
+	u32 ctx_size;
+	u64 old_rip;
+	u32 action;
+
+	if (!is_event_enabled(kvm, KVMI_EVENT_PAGE_FAULT))
+		return true;
+
+	/* Have we shown interest in this page? */
+	if (!kvmi_test_mem_access(kvm, gpa, access))
+		return true;
+
+	ivcpu    = IVCPU(vcpu);
+	ctx_size = sizeof(ivcpu->ctx_data);
+	old_rip  = kvm_rip_read(vcpu);
+
+	if (!kvmi_msg_send_pf(vcpu, gpa, gva, access, &action,
+			      &trap_access,
+			      ivcpu->ctx_data, &ctx_size))
+		goto out;
+
+	ivcpu->ctx_size = 0;
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		ivcpu->ctx_size = ctx_size;
+		break;
+	case KVMI_EVENT_ACTION_RETRY:
+		ret = false;
+		break;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	/* TODO: trap_access -> don't REPeat the instruction */
+out:
+	trace_kvmi_event_page_fault(gpa, gva, access, old_rip, action,
+				    kvm_rip_read(vcpu), ctx_size);
+	return ret;
+}
+
+bool kvmi_lost_exception(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	if (!ivcpu || !ivcpu->exception.injected)
+		return false;
+
+	ivcpu->exception.injected = 0;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_TRAP))
+		return false;
+
+	if ((vcpu->arch.exception.injected || vcpu->arch.exception.pending)
+		&& vcpu->arch.exception.nr == ivcpu->exception.nr
+		&& vcpu->arch.exception.error_code
+			== ivcpu->exception.error_code)
+		return false;
+
+	return true;
+}
+
+void kvmi_trap_event(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	u32 vector, type, err;
+	u32 action;
+
+	if (vcpu->arch.exception.pending) {
+		vector = vcpu->arch.exception.nr;
+		err = vcpu->arch.exception.error_code;
+
+		if (kvm_exception_is_soft(vector))
+			type = INTR_TYPE_SOFT_EXCEPTION;
+		else
+			type = INTR_TYPE_HARD_EXCEPTION;
+	} else if (vcpu->arch.interrupt.pending) {
+		vector = vcpu->arch.interrupt.nr;
+		err = 0;
+
+		if (vcpu->arch.interrupt.soft)
+			type = INTR_TYPE_SOFT_INTR;
+		else
+			type = INTR_TYPE_EXT_INTR;
+	} else {
+		vector = 0;
+		type = 0;
+		err = 0;
+	}
+
+	kvm_err("New exception nr %d/%d err %x/%x addr %lx",
+		vector, ivcpu->exception.nr,
+		err, ivcpu->exception.error_code,
+		vcpu->arch.cr2);
+
+	action = kvmi_msg_send_trap(vcpu, vector, type, err, vcpu->arch.cr2);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+}
+
+bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u32 info,
+			   unsigned long exit_qualification,
+			   unsigned char descriptor, unsigned char write)
+{
+	u32 action;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_DESCRIPTOR))
+		return true;
+
+	action = kvmi_msg_send_descriptor(vcpu, info, exit_qualification,
+					  descriptor, write);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		return true;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	return false; /* TODO: double check this */
+}
+EXPORT_SYMBOL(kvmi_descriptor_event);
+
+static bool kvmi_create_vcpu_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_CREATE_VCPU))
+		return false;
+
+	action = kvmi_msg_send_create_vcpu(vcpu);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	return true;
+}
+
+static bool kvmi_pause_vcpu_event(struct kvm_vcpu *vcpu)
+{
+	u32 action;
+
+	IVCPU(vcpu)->pause = false;
+
+	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_PAUSE_VCPU))
+		return false;
+
+	action = kvmi_msg_send_pause_vcpu(vcpu);
+
+	switch (action) {
+	case KVMI_EVENT_ACTION_CONTINUE:
+		break;
+	default:
+		handle_common_event_actions(vcpu, action);
+	}
+
+	return true;
+}
+
+/* TODO: refactor this function to avoid recursive calls and the semaphore. */
+void kvmi_handle_request(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	while (ivcpu->ev_rpl_waiting
+		|| READ_ONCE(ivcpu->requests)) {
+
+		down(&ivcpu->sem_requests);
+
+		if (test_bit(REQ_INIT, &ivcpu->requests)) {
+			/*
+			 * kvmi_create_vcpu_event() may call this function
+			 * again and won't return unless there is no more work
+			 * to be done. The while condition will be evaluated
+			 * to false, but we explicitly exit the loop to avoid
+			 * surprising the reader more than we already did.
+			 */
+			kvmi_clear_request(ivcpu, REQ_INIT);
+			if (kvmi_create_vcpu_event(vcpu))
+				break;
+		} else if (test_bit(REQ_CMD, &ivcpu->requests)) {
+			kvmi_msg_handle_vcpu_cmd(vcpu);
+			/* it will clear the REQ_CMD bit */
+			if (ivcpu->pause && !ivcpu->ev_rpl_waiting) {
+				/* Same warnings as with REQ_INIT. */
+				if (kvmi_pause_vcpu_event(vcpu))
+					break;
+			}
+		} else if (test_bit(REQ_REPLY, &ivcpu->requests)) {
+			kvmi_clear_request(ivcpu, REQ_REPLY);
+			ivcpu->ev_rpl_waiting = false;
+			if (ivcpu->have_delayed_regs) {
+				kvm_arch_vcpu_set_regs(vcpu,
+							&ivcpu->delayed_regs);
+				ivcpu->have_delayed_regs = false;
+			}
+			if (ivcpu->pause) {
+				/* Same warnings as with REQ_INIT. */
+				if (kvmi_pause_vcpu_event(vcpu))
+					break;
+			}
+		} else if (test_bit(REQ_CLOSE, &ivcpu->requests)) {
+			kvmi_clear_request(ivcpu, REQ_CLOSE);
+			break;
+		} else {
+			kvm_err("Unexpected request");
+		}
+	}
+
+	kvmi_flush_mem_access(vcpu->kvm);
+	/* TODO: merge with kvmi_set_mem_access() */
+}
+
+int kvmi_cmd_get_cpuid(struct kvm_vcpu *vcpu, u32 function, u32 index,
+		       u32 *eax, u32 *ebx, u32 *ecx, u32 *edx)
+{
+	struct kvm_cpuid_entry2 *e;
+
+	e = kvm_find_cpuid_entry(vcpu, function, index);
+	if (!e)
+		return -KVM_ENOENT;
+
+	*eax = e->eax;
+	*ebx = e->ebx;
+	*ecx = e->ecx;
+	*edx = e->edx;
+
+	return 0;
+}
+
+int kvmi_cmd_get_guest_info(struct kvm_vcpu *vcpu, u16 *vcpu_cnt, u64 *tsc)
+{
+	/*
+	 * Should we switch vcpu_cnt to unsigned int?
+	 * If not, we should limit this to max u16 - 1
+	 */
+	*vcpu_cnt = atomic_read(&vcpu->kvm->online_vcpus);
+	if (kvm_has_tsc_control)
+		*tsc = 1000ul * vcpu->arch.virtual_tsc_khz;
+	else
+		*tsc = 0;
+
+	return 0;
+}
+
+static int get_first_vcpu(struct kvm *kvm, struct kvm_vcpu **vcpu)
+{
+	struct kvm_vcpu *v;
+
+	if (!atomic_read(&kvm->online_vcpus))
+		return -KVM_EINVAL;
+
+	v = kvm_get_vcpu(kvm, 0);
+
+	if (!v)
+		return -KVM_EINVAL;
+
+	*vcpu = v;
+
+	return 0;
+}
+
+int kvmi_cmd_get_registers(struct kvm_vcpu *vcpu, u32 *mode,
+			   struct kvm_regs *regs,
+			   struct kvm_sregs *sregs, struct kvm_msrs *msrs)
+{
+	struct kvm_msr_entry  *msr = msrs->entries;
+	unsigned int	       n   = msrs->nmsrs;
+
+	kvm_arch_vcpu_ioctl_get_regs(vcpu, regs);
+	kvm_arch_vcpu_ioctl_get_sregs(vcpu, sregs);
+	*mode = kvmi_vcpu_mode(vcpu, sregs);
+
+	for (; n--; msr++) {
+		struct msr_data m   = { .index = msr->index };
+		int		err = kvm_get_msr(vcpu, &m);
+
+		if (err)
+			return -KVM_EINVAL;
+
+		msr->data = m.data;
+	}
+
+	return 0;
+}
+
+int kvmi_cmd_set_registers(struct kvm_vcpu *vcpu, const struct kvm_regs *regs)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+
+	if (ivcpu->ev_rpl_waiting) {
+		memcpy(&ivcpu->delayed_regs, regs, sizeof(ivcpu->delayed_regs));
+		ivcpu->have_delayed_regs = true;
+	} else
+		kvm_err("Drop KVMI_SET_REGISTERS");
+	return 0;
+}
+
+int kvmi_cmd_get_page_access(struct kvm_vcpu *vcpu, u64 gpa, u8 *access)
+{
+	struct kvmi *ikvm = IKVM(vcpu->kvm);
+	struct kvmi_mem_access *m;
+
+	mutex_lock(&ikvm->access_tree_lock);
+	m = kvmi_get_mem_access_unlocked(vcpu->kvm, gpa_to_gfn(gpa));
+	*access = m ? m->access : full_access;
+	mutex_unlock(&ikvm->access_tree_lock);
+
+	return 0;
+}
+
+static bool is_vector_valid(u8 vector)
+{
+	return true;
+}
+
+static bool is_gva_valid(struct kvm_vcpu *vcpu, u64 gva)
+{
+	return true;
+}
+
+int kvmi_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
+			      bool error_code_valid, u16 error_code,
+			      u64 address)
+{
+	struct x86_exception e = {
+		.vector = vector,
+		.error_code_valid = error_code_valid,
+		.error_code = error_code,
+		.address = address,
+	};
+
+	if (!(is_vector_valid(vector) && is_gva_valid(vcpu, address)))
+		return -KVM_EINVAL;
+
+	if (e.vector == PF_VECTOR)
+		kvm_inject_page_fault(vcpu, &e);
+	else if (e.error_code_valid)
+		kvm_queue_exception_e(vcpu, e.vector, e.error_code);
+	else
+		kvm_queue_exception(vcpu, e.vector);
+
+	if (IVCPU(vcpu)->exception.injected)
+		kvm_err("Override exception");
+
+	IVCPU(vcpu)->exception.injected = 1;
+	IVCPU(vcpu)->exception.nr = e.vector;
+	IVCPU(vcpu)->exception.error_code = error_code_valid ? error_code : 0;
+
+	return 0;
+}
+
+unsigned long gfn_to_hva_safe(struct kvm *kvm, gfn_t gfn)
+{
+	unsigned long hva;
+
+	mutex_lock(&kvm->slots_lock);
+	hva = gfn_to_hva(kvm, gfn);
+	mutex_unlock(&kvm->slots_lock);
+
+	return hva;
+}
+
+static long get_user_pages_remote_unlocked(struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long nr_pages,
+					   unsigned int gup_flags,
+					   struct page **pages)
+{
+	long ret;
+	struct task_struct *tsk = NULL;
+	struct vm_area_struct **vmas = NULL;
+	int locked = 1;
+
+	down_read(&mm->mmap_sem);
+	ret = get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
+				    pages, vmas, &locked);
+	if (locked)
+		up_read(&mm->mmap_sem);
+	return ret;
+}
+
+int kvmi_cmd_read_physical(struct kvm *kvm, u64 gpa, u64 size, int (*send)(
+				   struct kvmi *, const struct kvmi_msg_hdr *,
+				   int err, const void *buf, size_t),
+				   const struct kvmi_msg_hdr *ctx)
+{
+	int err, ec;
+	unsigned long hva;
+	struct page *page = NULL;
+	void *ptr_page = NULL, *ptr = NULL;
+	size_t ptr_size = 0;
+	struct kvm_vcpu *vcpu;
+
+	ec = get_first_vcpu(kvm, &vcpu);
+
+	if (ec)
+		goto out;
+
+	hva = gfn_to_hva_safe(kvm, gpa_to_gfn(gpa));
+
+	if (kvm_is_error_hva(hva)) {
+		ec = -KVM_EINVAL;
+		goto out;
+	}
+
+	if (get_user_pages_remote_unlocked(kvm->mm, hva, 1, 0, &page) != 1) {
+		ec = -KVM_EINVAL;
+		goto out;
+	}
+
+	ptr_page = kmap_atomic(page);
+
+	ptr = ptr_page + (gpa & ~PAGE_MASK);
+	ptr_size = size;
+
+out:
+	err = send(IKVM(kvm), ctx, ec, ptr, ptr_size);
+
+	if (ptr_page)
+		kunmap_atomic(ptr_page);
+	if (page)
+		put_page(page);
+	return err;
+}
+
+int kvmi_cmd_write_physical(struct kvm *kvm, u64 gpa, u64 size, const void *buf)
+{
+	int err;
+	unsigned long hva;
+	struct page *page;
+	void *ptr;
+	struct kvm_vcpu *vcpu;
+
+	err = get_first_vcpu(kvm, &vcpu);
+
+	if (err)
+		return err;
+
+	hva = gfn_to_hva_safe(kvm, gpa_to_gfn(gpa));
+
+	if (kvm_is_error_hva(hva))
+		return -KVM_EINVAL;
+
+	if (get_user_pages_remote_unlocked(kvm->mm, hva, 1, FOLL_WRITE,
+			&page) != 1)
+		return -KVM_EINVAL;
+
+	ptr = kmap_atomic(page);
+
+	memcpy(ptr + (gpa & ~PAGE_MASK), buf, size);
+
+	kunmap_atomic(ptr);
+	put_page(page);
+
+	return 0;
+}
+
+int kvmi_cmd_alloc_token(struct kvm *kvm, struct kvmi_map_mem_token *token)
+{
+	int err = 0;
+
+	/* create random token */
+	get_random_bytes(token, sizeof(struct kvmi_map_mem_token));
+
+	/* store token in HOST database */
+	if (kvmi_store_token(kvm, token))
+		err = -KVM_ENOMEM;
+
+	return err;
+}
+
+int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, u32 events)
+{
+	int err = 0;
+
+	if (events & ~KVMI_KNOWN_EVENTS)
+		return -KVM_EINVAL;
+
+	if (events & KVMI_EVENT_BREAKPOINT) {
+		if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_BREAKPOINT)) {
+			struct kvm_guest_debug dbg = { };
+
+			dbg.control =
+			    KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP;
+
+			err = kvm_arch_vcpu_ioctl_set_guest_debug(vcpu, &dbg);
+		}
+	}
+
+	if (!err)
+		atomic_set(&IKVM(vcpu->kvm)->event_mask, events);
+
+	return err;
+}
+
+int kvmi_cmd_control_cr(struct kvmi *ikvm, bool enable, u32 cr)
+{
+	switch (cr) {
+	case 0:
+	case 3:
+	case 4:
+		if (enable)
+			set_bit(cr, &ikvm->cr_mask);
+		else
+			clear_bit(cr, &ikvm->cr_mask);
+		return 0;
+
+	default:
+		return -KVM_EINVAL;
+	}
+}
+
+int kvmi_cmd_control_msr(struct kvm *kvm, bool enable, u32 msr)
+{
+	struct kvm_vcpu *vcpu;
+	int err;
+
+	err = get_first_vcpu(kvm, &vcpu);
+	if (err)
+		return err;
+
+	err = msr_control(IKVM(kvm), msr, enable);
+
+	if (!err)
+		kvm_arch_msr_intercept(vcpu, msr, enable);
+
+	return err;
+}
+
+static void wakeup_events(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvmi_make_request(IVCPU(vcpu), REQ_CLOSE);
+	mutex_unlock(&kvm->lock);
+}
diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
new file mode 100644
index 000000000000..5976b98f11cb
--- /dev/null
+++ b/virt/kvm/kvmi_int.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVMI_INT_H__
+#define __KVMI_INT_H__
+
+#include <linux/types.h>
+#include <linux/kvm_host.h>
+
+#include <uapi/linux/kvmi.h>
+
+#define IVCPU(vcpu) ((struct kvmi_vcpu *)((vcpu)->kvmi))
+
+struct kvmi_vcpu {
+	u8 ctx_data[256];
+	u32 ctx_size;
+	struct semaphore sem_requests;
+	unsigned long requests;
+	/* TODO: get this ~64KB buffer from a cache */
+	u8 msg_buf[KVMI_MAX_MSG_SIZE];
+	struct kvmi_event_reply ev_rpl;
+	void *ev_rpl_ptr;
+	size_t ev_rpl_size;
+	size_t ev_rpl_received;
+	u32 ev_seq;
+	bool ev_rpl_waiting;
+	struct {
+		u16 error_code;
+		u8 nr;
+		bool injected;
+	} exception;
+	struct kvm_regs delayed_regs;
+	bool have_delayed_regs;
+	bool pause;
+};
+
+#define IKVM(kvm) ((struct kvmi *)((kvm)->kvmi))
+
+struct kvmi {
+	atomic_t event_mask;
+	unsigned long cr_mask;
+	struct {
+		unsigned long low[BITS_TO_LONGS(8192)];
+		unsigned long high[BITS_TO_LONGS(8192)];
+	} msr_mask;
+	struct radix_tree_root access_tree;
+	struct mutex access_tree_lock;
+	struct list_head access_list;
+	struct work_struct work;
+	struct socket *sock;
+	rwlock_t sock_lock;
+	struct completion finished;
+	struct kvm *kvm;
+	/* TODO: get this ~64KB buffer from a cache */
+	u8 msg_buf[KVMI_MAX_MSG_SIZE];
+	u32 cmd_allow_mask;
+	u32 event_allow_mask;
+};
+
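+/* bits for kvmi_vcpu.requests (see kvmi_make_request()/kvmi_clear_request()) */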
+#define REQ_INIT   0
+#define REQ_CMD    1
+#define REQ_REPLY  2
+#define REQ_CLOSE  3
+
+/* kvmi_msg.c */
+bool kvmi_msg_init(struct kvmi *ikvm, int fd);
+bool kvmi_msg_process(struct kvmi *ikvm);
+void kvmi_msg_uninit(struct kvmi *ikvm);
+void kvmi_msg_handle_vcpu_cmd(struct kvm_vcpu *vcpu);
+u32 kvmi_msg_send_cr(struct kvm_vcpu *vcpu, u32 cr, u64 old_value,
+		     u64 new_value, u64 *ret_value);
+u32 kvmi_msg_send_msr(struct kvm_vcpu *vcpu, u32 msr, u64 old_value,
+		      u64 new_value, u64 *ret_value);
+u32 kvmi_msg_send_xsetbv(struct kvm_vcpu *vcpu);
+u32 kvmi_msg_send_bp(struct kvm_vcpu *vcpu, u64 gpa);
+u32 kvmi_msg_send_hypercall(struct kvm_vcpu *vcpu);
+bool kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u32 mode,
+		      u32 *action, bool *trap_access, u8 *ctx,
+		      u32 *ctx_size);
+u32 kvmi_msg_send_trap(struct kvm_vcpu *vcpu, u32 vector, u32 type,
+		       u32 error_code, u64 cr2);
+u32 kvmi_msg_send_descriptor(struct kvm_vcpu *vcpu, u32 info,
+			     u64 exit_qualification, u8 descriptor, u8 write);
+u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu);
+u32 kvmi_msg_send_pause_vcpu(struct kvm_vcpu *vcpu);
+
+/* kvmi.c */
+int kvmi_cmd_get_guest_info(struct kvm_vcpu *vcpu, u16 *vcpu_cnt, u64 *tsc);
+int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu);
+int kvmi_cmd_get_registers(struct kvm_vcpu *vcpu, u32 *mode,
+			   struct kvm_regs *regs, struct kvm_sregs *sregs,
+			   struct kvm_msrs *msrs);
+int kvmi_cmd_set_registers(struct kvm_vcpu *vcpu, const struct kvm_regs *regs);
+int kvmi_cmd_get_page_access(struct kvm_vcpu *vcpu, u64 gpa, u8 *access);
+int kvmi_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
+			      bool error_code_valid, u16 error_code,
+			      u64 address);
+int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, u32 events);
+int kvmi_cmd_get_cpuid(struct kvm_vcpu *vcpu, u32 function, u32 index,
+		       u32 *eax, u32 *ebx, u32 *rcx, u32 *edx);
+int kvmi_cmd_read_physical(struct kvm *kvm, u64 gpa, u64 size,
+			   int (*send)(struct kvmi *,
+					const struct kvmi_msg_hdr*,
+					int err, const void *buf, size_t),
+			   const struct kvmi_msg_hdr *ctx);
+int kvmi_cmd_write_physical(struct kvm *kvm, u64 gpa, u64 size,
+			    const void *buf);
+int kvmi_cmd_alloc_token(struct kvm *kvm, struct kvmi_map_mem_token *token);
+int kvmi_cmd_control_cr(struct kvmi *ikvm, bool enable, u32 cr);
+int kvmi_cmd_control_msr(struct kvm *kvm, bool enable, u32 msr);
+int kvmi_set_mem_access(struct kvm *kvm, u64 gpa, u8 access);
+void kvmi_make_request(struct kvmi_vcpu *ivcpu, int req);
+void kvmi_clear_request(struct kvmi_vcpu *ivcpu, int req);
+unsigned int kvmi_vcpu_mode(const struct kvm_vcpu *vcpu,
+			    const struct kvm_sregs *sregs);
+void kvmi_get_msrs(struct kvm_vcpu *vcpu, struct kvmi_event *event);
+unsigned long gfn_to_hva_safe(struct kvm *kvm, gfn_t gfn);
+void kvmi_mem_destroy_vm(struct kvm *kvm);
+
+/* kvmi_mem.c */
+int kvmi_store_token(struct kvm *kvm, struct kvmi_map_mem_token *token);
+
+#endif
diff --git a/virt/kvm/kvmi_mem.c b/virt/kvm/kvmi_mem.c
new file mode 100644
index 000000000000..c766357678e6
--- /dev/null
+++ b/virt/kvm/kvmi_mem.c
@@ -0,0 +1,730 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection memory mapping implementation
+ *
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ * Author:
+ *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/kvm_host.h>
+#include <linux/rmap.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/spinlock.h>
+#include <linux/printk.h>
+#include <linux/kvmi.h>
+#include <linux/huge_mm.h>
+
+#include <uapi/linux/kvmi.h>
+
+#include "kvmi_int.h"
+
+
+static struct list_head mapping_list;
+static spinlock_t mapping_lock;
+
+struct host_map {
+	struct list_head mapping_list;
+	gpa_t map_gpa;
+	struct kvm *machine;
+	gpa_t req_gpa;
+};
+
+
+static struct list_head token_list;
+static spinlock_t token_lock;
+
+struct token_entry {
+	struct list_head token_list;
+	struct kvmi_map_mem_token token;
+	struct kvm *kvm;
+};
+
+
+int kvmi_store_token(struct kvm *kvm, struct kvmi_map_mem_token *token)
+{
+	struct token_entry *tep;
+
+	print_hex_dump_debug("kvmi: new token ", DUMP_PREFIX_NONE,
+			     32, 1, token, sizeof(struct kvmi_map_mem_token),
+			     false);
+
+	tep = kmalloc(sizeof(struct token_entry), GFP_KERNEL);
+	if (tep == NULL)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&tep->token_list);
+	memcpy(&tep->token, token, sizeof(struct kvmi_map_mem_token));
+	tep->kvm = kvm;
+
+	spin_lock(&token_lock);
+	list_add_tail(&tep->token_list, &token_list);
+	spin_unlock(&token_lock);
+
+	return 0;
+}
+
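+/*
+ * Consume the token passed (by pointer) from the introspecting guest and
+ * return the target VM it was registered for (NULL or ERR_PTR on failure).
+ */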
+static struct kvm *find_machine_at(struct kvm_vcpu *vcpu, gva_t tkn_gva)
+{
+	long result;
+	gpa_t tkn_gpa;
+	struct kvmi_map_mem_token token;
+	struct list_head *cur;
+	struct token_entry *tep, *found = NULL;
+	struct kvm *target_kvm = NULL;
+
+	/* machine token is passed as pointer */
+	tkn_gpa = kvm_mmu_gva_to_gpa_system(vcpu, tkn_gva, NULL);
+	if (tkn_gpa == UNMAPPED_GVA)
+		return NULL;
+
+	/* copy token to local address space */
+	result = kvm_read_guest(vcpu->kvm, tkn_gpa, &token, sizeof(token));
+	if (IS_ERR_VALUE(result)) {
+		kvm_err("kvmi: failed copying token from user\n");
+		return ERR_PTR(result);
+	}
+
+	/* consume token & find the VM */
+	spin_lock(&token_lock);
+	list_for_each(cur, &token_list) {
+		tep = list_entry(cur, struct token_entry, token_list);
+
+		if (!memcmp(&token, &tep->token, sizeof(token))) {
+			list_del(&tep->token_list);
+			found = tep;
+			break;
+		}
+	}
+	spin_unlock(&token_lock);
+
+	if (found != NULL) {
+		target_kvm = found->kvm;
+		kfree(found);
+	}
+
+	return target_kvm;
+}
+
+static void remove_vm_token(struct kvm *kvm)
+{
+	struct list_head *cur, *next;
+	struct token_entry *tep;
+
+	spin_lock(&token_lock);
+	list_for_each_safe(cur, next, &token_list) {
+		tep = list_entry(cur, struct token_entry, token_list);
+
+		if (tep->kvm == kvm) {
+			list_del(&tep->token_list);
+			kfree(tep);
+		}
+	}
+	spin_unlock(&token_lock);
+
+}
+
+
+static int add_to_list(gpa_t map_gpa, struct kvm *machine, gpa_t req_gpa)
+{
+	struct host_map *map;
+
+	map = kmalloc(sizeof(struct host_map), GFP_KERNEL);
+	if (map == NULL)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&map->mapping_list);
+	map->map_gpa = map_gpa;
+	map->machine = machine;
+	map->req_gpa = req_gpa;
+
+	spin_lock(&mapping_lock);
+	list_add_tail(&map->mapping_list, &mapping_list);
+	spin_unlock(&mapping_lock);
+
+	return 0;
+}
+
+static struct host_map *extract_from_list(gpa_t map_gpa)
+{
+	struct list_head *cur;
+	struct host_map *map;
+
+	spin_lock(&mapping_lock);
+	list_for_each(cur, &mapping_list) {
+		map = list_entry(cur, struct host_map, mapping_list);
+
+		/* found - extract and return */
+		if (map->map_gpa == map_gpa) {
+			list_del(&map->mapping_list);
+			spin_unlock(&mapping_lock);
+
+			return map;
+		}
+	}
+	spin_unlock(&mapping_lock);
+
+	return NULL;
+}
+
+static void remove_vm_from_list(struct kvm *kvm)
+{
+	struct list_head *cur, *next;
+	struct host_map *map;
+
+	spin_lock(&mapping_lock);
+
+	list_for_each_safe(cur, next, &mapping_list) {
+		map = list_entry(cur, struct host_map, mapping_list);
+
+		if (map->machine == kvm) {
+			list_del(&map->mapping_list);
+			kfree(map);
+		}
+	}
+
+	spin_unlock(&mapping_lock);
+}
+
+static void remove_entry(struct host_map *map)
+{
+	kfree(map);
+}
+
+
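+/*
+ * Split the VMA so that the page at @addr ends up in a single-page VMA,
+ * whose rmap can then be redirected independently of its neighbours.
+ */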
+static struct vm_area_struct *isolate_page_vma(struct vm_area_struct *vma,
+					       unsigned long addr)
+{
+	int result;
+
+	/* corner case */
+	if (vma_pages(vma) == 1)
+		return vma;
+
+	if (addr != vma->vm_start) {
+		/* first split only if address in the middle */
+		result = split_vma(vma->vm_mm, vma, addr, false);
+		if (IS_ERR_VALUE((long)result))
+			return ERR_PTR((long)result);
+
+		vma = find_vma(vma->vm_mm, addr);
+		if (vma == NULL)
+			return ERR_PTR(-ENOENT);
+
+		/* corner case (again) */
+		if (vma_pages(vma) == 1)
+			return vma;
+	}
+
+	result = split_vma(vma->vm_mm, vma, addr + PAGE_SIZE, true);
+	if (IS_ERR_VALUE((long)result))
+		return ERR_PTR((long)result);
+
+	vma = find_vma(vma->vm_mm, addr);
+	if (vma == NULL)
+		return ERR_PTR(-ENOENT);
+
+	BUG_ON(vma_pages(vma) != 1);
+
+	return vma;
+}
+
+static int redirect_rmap(struct vm_area_struct *req_vma, struct page *req_page,
+			 struct vm_area_struct *map_vma)
+{
+	int result;
+
+	unlink_anon_vmas(map_vma);
+
+	result = anon_vma_fork(map_vma, req_vma);
+	if (IS_ERR_VALUE((long)result))
+		goto out;
+
+	page_dup_rmap(req_page, false);
+
+out:
+	return result;
+}
+
+static int host_map_fix_ptes(struct vm_area_struct *map_vma, hva_t map_hva,
+			     struct page *req_page, struct page *map_page)
+{
+	struct mm_struct *map_mm = map_vma->vm_mm;
+
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	pte_t newpte;
+
+	unsigned long mmun_start;
+	unsigned long mmun_end;
+
+	/* classic replace_page() code */
+	pmd = mm_find_pmd(map_mm, map_hva);
+	if (!pmd)
+		return -EFAULT;
+
+	mmun_start = map_hva;
+	mmun_end = map_hva + PAGE_SIZE;
+	mmu_notifier_invalidate_range_start(map_mm, mmun_start, mmun_end);
+
+	ptep = pte_offset_map_lock(map_mm, pmd, map_hva, &ptl);
+
+	/* create new PTE based on requested page */
+	newpte = mk_pte(req_page, map_vma->vm_page_prot);
+	newpte = pte_set_flags(newpte, pte_flags(*ptep));
+
+	flush_cache_page(map_vma, map_hva, pte_pfn(*ptep));
+	ptep_clear_flush_notify(map_vma, map_hva, ptep);
+	set_pte_at_notify(map_mm, map_hva, ptep, newpte);
+
+	pte_unmap_unlock(ptep, ptl);
+
+	mmu_notifier_invalidate_range_end(map_mm, mmun_start, mmun_end);
+
+	return 0;
+}
+
+static void discard_page(struct page *map_page)
+{
+	lock_page(map_page);
+	/* TODO: put_anon_vma() ???? - should be here */
+	page_remove_rmap(map_page, false);
+	if (!page_mapped(map_page))
+		try_to_free_swap(map_page);
+	unlock_page(map_page);
+	put_page(map_page);
+}
+
+static void kvmi_split_huge_pmd(struct vm_area_struct *req_vma,
+				hva_t req_hva, struct page *req_page)
+{
+	bool tail = false;
+
+	/* move reference count from compound head... */
+	if (PageTail(req_page)) {
+		tail = true;
+		put_page(req_page);
+	}
+
+	if (PageCompound(req_page))
+		split_huge_pmd_address(req_vma, req_hva, false, NULL);
+
+	/* ... to the actual page, after splitting */
+	if (tail)
+		get_page(req_page);
+}
+
+static int kvmi_map_action(struct mm_struct *req_mm, hva_t req_hva,
+			   struct mm_struct *map_mm, hva_t map_hva)
+{
+	struct vm_area_struct *req_vma;
+	struct page *req_page = NULL;
+
+	struct vm_area_struct *map_vma;
+	struct page *map_page;
+
+	long nrpages;
+	int result = 0;
+
+	/* VMAs will be modified */
+	down_write(&req_mm->mmap_sem);
+	down_write(&map_mm->mmap_sem);
+
+	/* get host page corresponding to requested address */
+	nrpages = get_user_pages_remote(NULL, req_mm,
+		req_hva, 1, 0,
+		&req_page, &req_vma, NULL);
+	if (nrpages == 0) {
+		kvm_err("kvmi: no page for req_hva %016lx\n", req_hva);
+		result = -ENOENT;
+		goto out_err;
+	} else if (IS_ERR_VALUE(nrpages)) {
+		result = nrpages;
+		kvm_err("kvmi: get_user_pages_remote() failed with result %d\n",
+			result);
+		goto out_err;
+	}
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(req_page, "req_page before remap");
+
+	/* find (not get) local page corresponding to target address */
+	map_vma = find_vma(map_mm, map_hva);
+	if (map_vma == NULL) {
+		kvm_err("kvmi: no local VMA found for remapping\n");
+		result = -ENOENT;
+		goto out_err;
+	}
+
+	map_page = follow_page(map_vma, map_hva, 0);
+	if (IS_ERR_VALUE(map_page)) {
+		result = PTR_ERR(map_page);
+		kvm_debug("kvmi: follow_page() failed with result %d\n",
+			result);
+		goto out_err;
+	} else if (map_page == NULL) {
+		result = -ENOENT;
+		kvm_debug("kvmi: follow_page() returned no page\n");
+		goto out_err;
+	}
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(map_page, "map_page before remap");
+
+	/* split local VMA for rmap redirecting */
+	map_vma = isolate_page_vma(map_vma, map_hva);
+	if (IS_ERR_VALUE(map_vma)) {
+		result = PTR_ERR(map_vma);
+		kvm_debug("kvmi: isolate_page_vma() failed with result %d\n",
+			result);
+		goto out_err;
+	}
+
+	/* split remote huge page */
+	kvmi_split_huge_pmd(req_vma, req_hva, req_page);
+
+	/* re-link VMAs */
+	result = redirect_rmap(req_vma, req_page, map_vma);
+	if (IS_ERR_VALUE((long)result))
+		goto out_err;
+
+	/* also redirect page tables */
+	result = host_map_fix_ptes(map_vma, map_hva, req_page, map_page);
+	if (IS_ERR_VALUE((long)result))
+		goto out_err;
+
+	/* the old page will be discarded */
+	discard_page(map_page);
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(map_page, "map_page after being discarded");
+
+	/* done */
+	goto out_finalize;
+
+out_err:
+	/* get_user_pages_remote() incremented page reference count */
+	if (req_page != NULL)
+		put_page(req_page);
+
+out_finalize:
+	/* release semaphores in reverse order */
+	up_write(&map_mm->mmap_sem);
+	up_write(&req_mm->mmap_sem);
+
+	return result;
+}
+
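+/*
+ * Map a page of the introspected guest (identified by the token) into the
+ * introspecting guest: req_gpa selects the source page, map_gpa the page
+ * that gets remapped on top of it.
+ */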
+int kvmi_host_mem_map(struct kvm_vcpu *vcpu, gva_t tkn_gva,
+	gpa_t req_gpa, gpa_t map_gpa)
+{
+	int result = 0;
+	struct kvm *target_kvm;
+
+	gfn_t req_gfn;
+	hva_t req_hva;
+	struct mm_struct *req_mm;
+
+	gfn_t map_gfn;
+	hva_t map_hva;
+	struct mm_struct *map_mm = vcpu->kvm->mm;
+
+	kvm_debug("kvmi: mapping request req_gpa %016llx, map_gpa %016llx\n",
+		  req_gpa, map_gpa);
+
+	/* get the struct kvm * corresponding to the token */
+	target_kvm = find_machine_at(vcpu, tkn_gva);
+	if (IS_ERR_VALUE(target_kvm))
+		return PTR_ERR(target_kvm);
+	else if (target_kvm == NULL) {
+		kvm_err("kvmi: unable to find target machine\n");
+		return -ENOENT;
+	}
+	kvm_get_kvm(target_kvm);
+	req_mm = target_kvm->mm;
+
+	/* translate source addresses */
+	req_gfn = gpa_to_gfn(req_gpa);
+	req_hva = gfn_to_hva_safe(target_kvm, req_gfn);
+	if (kvm_is_error_hva(req_hva)) {
+		kvm_err("kvmi: invalid req HVA %016lx\n", req_hva);
+		result = -EFAULT;
+		goto out;
+	}
+
+	kvm_debug("kvmi: req_gpa %016llx, req_gfn %016llx, req_hva %016lx\n",
+		  req_gpa, req_gfn, req_hva);
+
+	/* translate destination addresses */
+	map_gfn = gpa_to_gfn(map_gpa);
+	map_hva = gfn_to_hva_safe(vcpu->kvm, map_gfn);
+	if (kvm_is_error_hva(map_hva)) {
+		kvm_err("kvmi: invalid map HVA %016lx\n", map_hva);
+		result = -EFAULT;
+		goto out;
+	}
+
+	kvm_debug("kvmi: map_gpa %016llx, map_gfn %016llx, map_hva %016lx\n",
+		map_gpa, map_gfn, map_hva);
+
+	/* go to step 2 */
+	result = kvmi_map_action(req_mm, req_hva, map_mm, map_hva);
+	if (IS_ERR_VALUE((long)result))
+		goto out;
+
+	/* add mapping to list */
+	result = add_to_list(map_gpa, target_kvm, req_gpa);
+	if (IS_ERR_VALUE((long)result))
+		goto out;
+
+	/* all fine */
+	kvm_debug("kvmi: mapping of req_gpa %016llx successful\n", req_gpa);
+
+out:
+	/* drop the reference taken via kvm_get_kvm() above */
+	kvm_put_kvm(target_kvm);
+
+	return result;
+}
+
+
+static int restore_rmap(struct vm_area_struct *map_vma, hva_t map_hva,
+			struct page *req_page, struct page *new_page)
+{
+	int result;
+
+	/* decouple links to anon_vmas */
+	unlink_anon_vmas(map_vma);
+	map_vma->anon_vma = NULL;
+
+	/* allocate new anon_vma */
+	result = anon_vma_prepare(map_vma);
+	if (IS_ERR_VALUE((long)result))
+		return result;
+
+	lock_page(new_page);
+	page_add_new_anon_rmap(new_page, map_vma, map_hva, false);
+	unlock_page(new_page);
+
+	/* decrease req_page mapcount */
+	atomic_dec(&req_page->_mapcount);
+
+	return 0;
+}
+
+static int host_unmap_fix_ptes(struct vm_area_struct *map_vma, hva_t map_hva,
+			       struct page *new_page)
+{
+	struct mm_struct *map_mm = map_vma->vm_mm;
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	pte_t newpte;
+
+	unsigned long mmun_start;
+	unsigned long mmun_end;
+
+	/* page replacing code */
+	pmd = mm_find_pmd(map_mm, map_hva);
+	if (!pmd)
+		return -EFAULT;
+
+	mmun_start = map_hva;
+	mmun_end = map_hva + PAGE_SIZE;
+	mmu_notifier_invalidate_range_start(map_mm, mmun_start, mmun_end);
+
+	ptep = pte_offset_map_lock(map_mm, pmd, map_hva, &ptl);
+
+	newpte = mk_pte(new_page, map_vma->vm_page_prot);
+	newpte = pte_set_flags(newpte, pte_flags(*ptep));
+
+	/* clear cache & MMU notifier entries */
+	flush_cache_page(map_vma, map_hva, pte_pfn(*ptep));
+	ptep_clear_flush_notify(map_vma, map_hva, ptep);
+	set_pte_at_notify(map_mm, map_hva, ptep, newpte);
+
+	pte_unmap_unlock(ptep, ptl);
+
+	mmu_notifier_invalidate_range_end(map_mm, mmun_start, mmun_end);
+
+	return 0;
+}
+
+static int kvmi_unmap_action(struct mm_struct *req_mm,
+			     struct mm_struct *map_mm, hva_t map_hva)
+{
+	struct vm_area_struct *map_vma;
+	struct page *req_page = NULL;
+	struct page *new_page = NULL;
+
+	int result;
+
+	/* VMAs will be modified */
+	down_write(&req_mm->mmap_sem);
+	down_write(&map_mm->mmap_sem);
+
+	/* find destination VMA for mapping */
+	map_vma = find_vma(map_mm, map_hva);
+	if (map_vma == NULL) {
+		result = -ENOENT;
+		kvm_err("kvmi: no local VMA found for unmapping\n");
+		goto out_err;
+	}
+
+	/* find (not get) page mapped to destination address */
+	req_page = follow_page(map_vma, map_hva, 0);
+	if (IS_ERR_VALUE(req_page)) {
+		result = PTR_ERR(req_page);
+		kvm_err("kvmi: follow_page() failed with result %d\n", result);
+		goto out_err;
+	} else if (req_page == NULL) {
+		result = -ENOENT;
+		kvm_err("kvmi: follow_page() returned no page\n");
+		goto out_err;
+	}
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(req_page, "req_page before decoupling");
+
+	/* Returns NULL when no page can be allocated. */
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, map_vma, map_hva);
+	if (new_page == NULL) {
+		result = -ENOMEM;
+		goto out_err;
+	}
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(new_page, "new_page after allocation");
+
+	/* should fix the rmap tree */
+	result = restore_rmap(map_vma, map_hva, req_page, new_page);
+	if (IS_ERR_VALUE((long)result))
+		goto out_err;
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(req_page, "req_page after decoupling");
+
+	/* page table fixing here */
+	result = host_unmap_fix_ptes(map_vma, map_hva, new_page);
+	if (IS_ERR_VALUE((long)result))
+		goto out_err;
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		dump_page(new_page, "new_page after unmapping");
+
+	goto out_finalize;
+
+out_err:
+	if (new_page != NULL)
+		put_page(new_page);
+
+out_finalize:
+	/*
+	 * Drop the reference taken by get_user_pages_remote() in
+	 * kvmi_map_action().
+	 */
+	if (req_page != NULL) {
+		put_page(req_page);
+
+		if (IS_ENABLED(CONFIG_DEBUG_VM))
+			dump_page(req_page, "req_page after release");
+	}
+
+	/* release semaphores in reverse order */
+	up_write(&map_mm->mmap_sem);
+	up_write(&req_mm->mmap_sem);
+
+	return result;
+}
+
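+/*
+ * Undo a previous mapping: a fresh page is allocated for map_gpa and the
+ * introspecting guest's page tables and rmap are pointed back to it.
+ */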
+int kvmi_host_mem_unmap(struct kvm_vcpu *vcpu, gpa_t map_gpa)
+{
+	struct kvm *target_kvm;
+	struct mm_struct *req_mm;
+
+	struct host_map *map;
+	int result;
+
+	gfn_t map_gfn;
+	hva_t map_hva;
+	struct mm_struct *map_mm = vcpu->kvm->mm;
+
+	kvm_debug("kvmi: unmap request for map_gpa %016llx\n", map_gpa);
+
+	/* get the struct kvm * corresponding to map_gpa */
+	map = extract_from_list(map_gpa);
+	if (map == NULL) {
+		kvm_err("kvmi: map_gpa %016llx not mapped\n", map_gpa);
+		return -ENOENT;
+	}
+	target_kvm = map->machine;
+	kvm_get_kvm(target_kvm);
+	req_mm = target_kvm->mm;
+
+	kvm_debug("kvmi: req_gpa %016llx of machine %016lx mapped in map_gpa %016llx\n",
+		  map->req_gpa, (unsigned long) map->machine, map->map_gpa);
+
+	/* address where we did the remapping */
+	map_gfn = gpa_to_gfn(map_gpa);
+	map_hva = gfn_to_hva_safe(vcpu->kvm, map_gfn);
+	if (kvm_is_error_hva(map_hva)) {
+		result = -EFAULT;
+		kvm_err("kvmi: invalid HVA %016lx\n", map_hva);
+		goto out;
+	}
+
+	kvm_debug("kvmi: map_gpa %016llx, map_gfn %016llx, map_hva %016lx\n",
+		  map_gpa, map_gfn, map_hva);
+
+	/* go to step 2 */
+	result = kvmi_unmap_action(req_mm, map_mm, map_hva);
+	if (IS_ERR_VALUE((long)result))
+		goto out;
+
+	kvm_debug("kvmi: unmap of map_gpa %016llx successful\n", map_gpa);
+
+out:
+	kvm_put_kvm(target_kvm);
+
+	/* remove entry whatever happens above */
+	remove_entry(map);
+
+	return result;
+}
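+	/*
+	 * The receiving thread fills ev_rpl/ev_rpl_received and raises
+	 * REQ_REPLY when the matching KVMI_EVENT_REPLY arrives (see
+	 * kvmi_msg_dispatch_reply()).
+	 */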
+
+void kvmi_mem_destroy_vm(struct kvm *kvm)
+{
+	kvm_debug("kvmi: machine %016lx was torn down\n",
+		(unsigned long) kvm);
+
+	remove_vm_from_list(kvm);
+	remove_vm_token(kvm);
+}
+
+
+int kvm_intro_host_init(void)
+{
+	/* token database */
+	INIT_LIST_HEAD(&token_list);
+	spin_lock_init(&token_lock);
+
+	/* mapping database */
+	INIT_LIST_HEAD(&mapping_list);
+	spin_lock_init(&mapping_lock);
+
+	kvm_info("kvmi: initialized host memory introspection\n");
+
+	return 0;
+}
+
+void kvm_intro_host_exit(void)
+{
+	/* nothing to do yet */
+}
+
+module_init(kvm_intro_host_init)
+module_exit(kvm_intro_host_exit)
diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
new file mode 100644
index 000000000000..b1b20eb6332d
--- /dev/null
+++ b/virt/kvm/kvmi_msg.c
@@ -0,0 +1,1134 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM introspection
+ *
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ */
+#include <linux/file.h>
+#include <linux/net.h>
+#include <linux/kvm_host.h>
+#include <linux/kvmi.h>
+#include <asm/virtext.h>
+
+#include <uapi/linux/kvmi.h>
+#include <uapi/asm/kvmi.h>
+
+#include "kvmi_int.h"
+
+#include <trace/events/kvmi.h>
+
+/*
+ * TODO: break these call paths
+ *   kvmi.c        work_cb
+ *   kvmi_msg.c    kvmi_dispatch_message
+ *   kvmi.c        kvmi_cmd_... / kvmi_make_request
+ *   kvmi_msg.c    kvmi_msg_reply
+ *
+ *   kvmi.c        kvmi_X_event
+ *   kvmi_msg.c    kvmi_send_event
+ *   kvmi.c        kvmi_handle_request
+ */
+
+/* TODO: move some of the code to arch/x86 */
+
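+/*
+ * Every event gets a fresh sequence number; the introspection tool echoes
+ * it in KVMI_EVENT_REPLY so that the reply can be matched with the waiting
+ * vCPU (see kvmi_vcpu_waiting_for_reply()).
+ */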
+static atomic_t seq_ev = ATOMIC_INIT(0);
+
+static u32 new_seq(void)
+{
+	return atomic_inc_return(&seq_ev);
+}
+
+static const char *event_str(unsigned int e)
+{
+	switch (e) {
+	case KVMI_EVENT_CR:
+		return "CR";
+	case KVMI_EVENT_MSR:
+		return "MSR";
+	case KVMI_EVENT_XSETBV:
+		return "XSETBV";
+	case KVMI_EVENT_BREAKPOINT:
+		return "BREAKPOINT";
+	case KVMI_EVENT_HYPERCALL:
+		return "HYPERCALL";
+	case KVMI_EVENT_PAGE_FAULT:
+		return "PAGE_FAULT";
+	case KVMI_EVENT_TRAP:
+		return "TRAP";
+	case KVMI_EVENT_DESCRIPTOR:
+		return "DESCRIPTOR";
+	case KVMI_EVENT_CREATE_VCPU:
+		return "CREATE_VCPU";
+	case KVMI_EVENT_PAUSE_VCPU:
+		return "PAUSE_VCPU";
+	default:
+		return "EVENT?";
+	}
+}
+
+static const char * const msg_IDs[] = {
+	[KVMI_GET_VERSION]      = "KVMI_GET_VERSION",
+	[KVMI_GET_GUEST_INFO]   = "KVMI_GET_GUEST_INFO",
+	[KVMI_PAUSE_VCPU]       = "KVMI_PAUSE_VCPU",
+	[KVMI_GET_REGISTERS]    = "KVMI_GET_REGISTERS",
+	[KVMI_SET_REGISTERS]    = "KVMI_SET_REGISTERS",
+	[KVMI_GET_PAGE_ACCESS]  = "KVMI_GET_PAGE_ACCESS",
+	[KVMI_SET_PAGE_ACCESS]  = "KVMI_SET_PAGE_ACCESS",
+	[KVMI_INJECT_EXCEPTION] = "KVMI_INJECT_EXCEPTION",
+	[KVMI_READ_PHYSICAL]    = "KVMI_READ_PHYSICAL",
+	[KVMI_WRITE_PHYSICAL]   = "KVMI_WRITE_PHYSICAL",
+	[KVMI_GET_MAP_TOKEN]    = "KVMI_GET_MAP_TOKEN",
+	[KVMI_CONTROL_EVENTS]   = "KVMI_CONTROL_EVENTS",
+	[KVMI_CONTROL_CR]       = "KVMI_CONTROL_CR",
+	[KVMI_CONTROL_MSR]      = "KVMI_CONTROL_MSR",
+	[KVMI_EVENT]            = "KVMI_EVENT",
+	[KVMI_EVENT_REPLY]      = "KVMI_EVENT_REPLY",
+	[KVMI_GET_CPUID]        = "KVMI_GET_CPUID",
+	[KVMI_GET_XSAVE]        = "KVMI_GET_XSAVE",
+};
+
+static size_t sizeof_get_registers(const void *r)
+{
+	const struct kvmi_get_registers *req = r;
+
+	return sizeof(*req) + sizeof(req->msrs_idx[0]) * req->nmsrs;
+}
+
+static size_t sizeof_get_page_access(const void *r)
+{
+	const struct kvmi_get_page_access *req = r;
+
+	return sizeof(*req) + sizeof(req->gpa[0]) * req->count;
+}
+
+static size_t sizeof_set_page_access(const void *r)
+{
+	const struct kvmi_set_page_access *req = r;
+
+	return sizeof(*req) + sizeof(req->entries[0]) * req->count;
+}
+
+static size_t sizeof_write_physical(const void *r)
+{
+	const struct kvmi_write_physical *req = r;
+
+	return sizeof(*req) + req->size;
+}
+
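+/*
+ * Expected (minimum) message sizes, indexed by message id.  For variable
+ * length messages, cbk_full_size() returns the full size, computed from
+ * the fixed-size part already received.
+ */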
+static const struct {
+	size_t size;
+	size_t (*cbk_full_size)(const void *msg);
+} msg_bytes[] = {
+	[KVMI_GET_VERSION]      = { 0, NULL },
+	[KVMI_GET_GUEST_INFO]   = { sizeof(struct kvmi_get_guest_info), NULL },
+	[KVMI_PAUSE_VCPU]       = { sizeof(struct kvmi_pause_vcpu), NULL },
+	[KVMI_GET_REGISTERS]    = { sizeof(struct kvmi_get_registers),
+						sizeof_get_registers },
+	[KVMI_SET_REGISTERS]    = { sizeof(struct kvmi_set_registers), NULL },
+	[KVMI_GET_PAGE_ACCESS]  = { sizeof(struct kvmi_get_page_access),
+						sizeof_get_page_access },
+	[KVMI_SET_PAGE_ACCESS]  = { sizeof(struct kvmi_set_page_access),
+						sizeof_set_page_access },
+	[KVMI_INJECT_EXCEPTION] = { sizeof(struct kvmi_inject_exception),
+					NULL },
+	[KVMI_READ_PHYSICAL]    = { sizeof(struct kvmi_read_physical), NULL },
+	[KVMI_WRITE_PHYSICAL]   = { sizeof(struct kvmi_write_physical),
+						sizeof_write_physical },
+	[KVMI_GET_MAP_TOKEN]    = { 0, NULL },
+	[KVMI_CONTROL_EVENTS]   = { sizeof(struct kvmi_control_events), NULL },
+	[KVMI_CONTROL_CR]       = { sizeof(struct kvmi_control_cr), NULL },
+	[KVMI_CONTROL_MSR]      = { sizeof(struct kvmi_control_msr), NULL },
+	[KVMI_GET_CPUID]        = { sizeof(struct kvmi_get_cpuid), NULL },
+	[KVMI_GET_XSAVE]        = { sizeof(struct kvmi_get_xsave), NULL },
+};
+
+static int kvmi_sock_read(struct kvmi *ikvm, void *buf, size_t size)
+{
+	struct kvec i = {
+		.iov_base = buf,
+		.iov_len = size,
+	};
+	struct msghdr m = { };
+	int rc;
+
+	read_lock(&ikvm->sock_lock);
+
+	if (likely(ikvm->sock))
+		rc = kernel_recvmsg(ikvm->sock, &m, &i, 1, size, MSG_WAITALL);
+	else
+		rc = -EPIPE;
+
+	if (rc > 0)
+		print_hex_dump_debug("read: ", DUMP_PREFIX_NONE, 32, 1,
+					buf, rc, false);
+
+	read_unlock(&ikvm->sock_lock);
+
+	if (unlikely(rc != size)) {
+		kvm_err("kernel_recvmsg: %d\n", rc);
+		if (rc >= 0)
+			rc = -EPIPE;
+		return rc;
+	}
+
+	return 0;
+}
+
+static int kvmi_sock_write(struct kvmi *ikvm, struct kvec *i, size_t n,
+			   size_t size)
+{
+	struct msghdr m = { };
+	int rc, k;
+
+	read_lock(&ikvm->sock_lock);
+
+	if (likely(ikvm->sock))
+		rc = kernel_sendmsg(ikvm->sock, &m, i, n, size);
+	else
+		rc = -EPIPE;
+
+	for (k = 0; k < n; k++)
+		print_hex_dump_debug("write: ", DUMP_PREFIX_NONE, 32, 1,
+				     i[k].iov_base, i[k].iov_len, false);
+
+	read_unlock(&ikvm->sock_lock);
+
+	if (unlikely(rc != size)) {
+		kvm_err("kernel_sendmsg: %d\n", rc);
+		if (rc >= 0)
+			rc = -EPIPE;
+		return rc;
+	}
+
+	return 0;
+}
+
+static const char *id2str(int i)
+{
+	return (i < ARRAY_SIZE(msg_IDs) && msg_IDs[i] ? msg_IDs[i] : "unknown");
+}
+
+static struct kvmi_vcpu *kvmi_vcpu_waiting_for_reply(struct kvm *kvm, u32 seq)
+{
+	struct kvmi_vcpu *found = NULL;
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	mutex_lock(&kvm->lock);
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		/* kvmi_send_event */
+		smp_rmb();
+		if (READ_ONCE(IVCPU(vcpu)->ev_rpl_waiting)
+		    && seq == IVCPU(vcpu)->ev_seq) {
+			found = IVCPU(vcpu);
+			break;
+		}
+	}
+
+	mutex_unlock(&kvm->lock);
+
+	return found;
+}
+
+static bool kvmi_msg_dispatch_reply(struct kvmi *ikvm,
+				    const struct kvmi_msg_hdr *msg)
+{
+	struct kvmi_vcpu *ivcpu;
+	int err;
+
+	ivcpu = kvmi_vcpu_waiting_for_reply(ikvm->kvm, msg->seq);
+	if (!ivcpu) {
+		kvm_err("%s: unexpected event reply (seq=%u)\n", __func__,
+			msg->seq);
+		return false;
+	}
+
+	if (msg->size == sizeof(ivcpu->ev_rpl) + ivcpu->ev_rpl_size) {
+		err = kvmi_sock_read(ikvm, &ivcpu->ev_rpl,
+					sizeof(ivcpu->ev_rpl));
+		if (!err && ivcpu->ev_rpl_size)
+			err = kvmi_sock_read(ikvm, ivcpu->ev_rpl_ptr,
+						ivcpu->ev_rpl_size);
+	} else {
+		kvm_err("%s: invalid event reply size (max=%zu, recv=%u, expected=%zu)\n",
+			__func__, ivcpu->ev_rpl_size, msg->size,
+			sizeof(ivcpu->ev_rpl) + ivcpu->ev_rpl_size);
+		err = -1;
+	}
+
+	ivcpu->ev_rpl_received = err ? -1 : ivcpu->ev_rpl_size;
+
+	kvmi_make_request(ivcpu, REQ_REPLY);
+
+	return (err == 0);
+}
+
+static bool consume_sock_bytes(struct kvmi *ikvm, size_t n)
+{
+	while (n) {
+		u8 buf[256];
+		size_t chunk = min(n, sizeof(buf));
+
+		if (kvmi_sock_read(ikvm, buf, chunk) != 0)
+			return false;
+
+		n -= chunk;
+	}
+
+	return true;
+}
+
+static int kvmi_msg_reply(struct kvmi *ikvm,
+			  const struct kvmi_msg_hdr *msg,
+			  int err, const void *rpl, size_t rpl_size)
+{
+	struct kvmi_error_code ec;
+	struct kvmi_msg_hdr h;
+	struct kvec vec[3] = {
+		{.iov_base = &h,           .iov_len = sizeof(h) },
+		{.iov_base = &ec,          .iov_len = sizeof(ec)},
+		{.iov_base = (void *) rpl, .iov_len = rpl_size  },
+	};
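+	/* on error, only the header and the error code are sent back */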
+	size_t size = sizeof(h) + sizeof(ec) + (err ? 0 : rpl_size);
+	size_t n = err ? ARRAY_SIZE(vec)-1 : ARRAY_SIZE(vec);
+
+	memset(&h, 0, sizeof(h));
+	h.id = msg->id;
+	h.seq = msg->seq;
+	h.size = size - sizeof(h);
+
+	memset(&ec, 0, sizeof(ec));
+	ec.err = err;
+
+	return kvmi_sock_write(ikvm, vec, n, size);
+}
+
+static int kvmi_msg_vcpu_reply(struct kvm_vcpu *vcpu,
+				const struct kvmi_msg_hdr *msg,
+				int err, const void *rpl, size_t size)
+{
+	/*
+	 * As soon as we reply to this vCPU command, we can get another one,
+	 * and we must signal that the incoming buffer (ivcpu->msg_buf)
+	 * is ready by clearing this bit/request.
+	 */
+	kvmi_clear_request(IVCPU(vcpu), REQ_CMD);
+
+	return kvmi_msg_reply(IKVM(vcpu->kvm), msg, err, rpl, size);
+}
+
+bool kvmi_msg_init(struct kvmi *ikvm, int fd)
+{
+	struct socket *sock;
+	int r;
+
+	sock = sockfd_lookup(fd, &r);
+
+	if (!sock) {
+		kvm_err("Invalid file handle: %d\n", fd);
+		return false;
+	}
+
+	WRITE_ONCE(ikvm->sock, sock);
+
+	return true;
+}
+
+void kvmi_msg_uninit(struct kvmi *ikvm)
+{
+	kvm_info("Wake up the receiving thread\n");
+
+	read_lock(&ikvm->sock_lock);
+
+	if (ikvm->sock)
+		kernel_sock_shutdown(ikvm->sock, SHUT_RDWR);
+
+	read_unlock(&ikvm->sock_lock);
+
+	kvm_info("Wait for the receiving thread to complete\n");
+	wait_for_completion(&ikvm->finished);
+}
+
+static int handle_get_version(struct kvmi *ikvm,
+			      const struct kvmi_msg_hdr *msg, const void *req)
+{
+	struct kvmi_get_version_reply rpl;
+
+	memset(&rpl, 0, sizeof(rpl));
+	rpl.version = KVMI_VERSION;
+
+	return kvmi_msg_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
+}
+
+static struct kvm_vcpu *kvmi_get_vcpu(struct kvmi *ikvm, int vcpu_id)
+{
+	struct kvm *kvm = ikvm->kvm;
+
+	if (vcpu_id >= atomic_read(&kvm->online_vcpus))
+		return NULL;
+
+	return kvm_get_vcpu(kvm, vcpu_id);
+}
+
+static bool invalid_page_access(u64 gpa, u64 size)
+{
+	u64 off = gpa & ~PAGE_MASK;
+
+	return (size == 0 || size > PAGE_SIZE || off + size > PAGE_SIZE);
+}
+
+static int handle_read_physical(struct kvmi *ikvm,
+				const struct kvmi_msg_hdr *msg,
+				const void *_req)
+{
+	const struct kvmi_read_physical *req = _req;
+
+	if (invalid_page_access(req->gpa, req->size))
+		return -EINVAL;
+
+	return kvmi_cmd_read_physical(ikvm->kvm, req->gpa, req->size,
+				      kvmi_msg_reply, msg);
+}
+
+static int handle_write_physical(struct kvmi *ikvm,
+				 const struct kvmi_msg_hdr *msg,
+				 const void *_req)
+{
+	const struct kvmi_write_physical *req = _req;
+	int ec;
+
+	if (invalid_page_access(req->gpa, req->size))
+		return -EINVAL;
+
+	ec = kvmi_cmd_write_physical(ikvm->kvm, req->gpa, req->size, req->data);
+
+	return kvmi_msg_reply(ikvm, msg, ec, NULL, 0);
+}
+
+static int handle_get_map_token(struct kvmi *ikvm,
+				const struct kvmi_msg_hdr *msg,
+				const void *_req)
+{
+	struct kvmi_get_map_token_reply rpl;
+	int ec;
+
+	ec = kvmi_cmd_alloc_token(ikvm->kvm, &rpl.token);
+
+	return kvmi_msg_reply(ikvm, msg, ec, &rpl, sizeof(rpl));
+}
+
+static int handle_control_cr(struct kvmi *ikvm,
+			     const struct kvmi_msg_hdr *msg, const void *_req)
+{
+	const struct kvmi_control_cr *req = _req;
+	int ec;
+
+	ec = kvmi_cmd_control_cr(ikvm, req->enable, req->cr);
+
+	return kvmi_msg_reply(ikvm, msg, ec, NULL, 0);
+}
+
+static int handle_control_msr(struct kvmi *ikvm,
+			      const struct kvmi_msg_hdr *msg, const void *_req)
+{
+	const struct kvmi_control_msr *req = _req;
+	int ec;
+
+	ec = kvmi_cmd_control_msr(ikvm->kvm, req->enable, req->msr);
+
+	return kvmi_msg_reply(ikvm, msg, ec, NULL, 0);
+}
+
+/*
+ * These commands are executed on the receiving thread/worker.
+ */
+static int (*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
+			     const void *) = {
+	[KVMI_GET_VERSION]    = handle_get_version,
+	[KVMI_READ_PHYSICAL]  = handle_read_physical,
+	[KVMI_WRITE_PHYSICAL] = handle_write_physical,
+	[KVMI_GET_MAP_TOKEN]  = handle_get_map_token,
+	[KVMI_CONTROL_CR]     = handle_control_cr,
+	[KVMI_CONTROL_MSR]    = handle_control_msr,
+};
+
+static int handle_get_guest_info(struct kvm_vcpu *vcpu,
+				 const struct kvmi_msg_hdr *msg,
+				 const void *req)
+{
+	struct kvmi_get_guest_info_reply rpl;
+
+	memset(&rpl, 0, sizeof(rpl));
+	kvmi_cmd_get_guest_info(vcpu, &rpl.vcpu_count, &rpl.tsc_speed);
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, 0, &rpl, sizeof(rpl));
+}
+
+static int handle_pause_vcpu(struct kvm_vcpu *vcpu,
+			     const struct kvmi_msg_hdr *msg,
+			     const void *req)
+{
+	int ec = kvmi_cmd_pause_vcpu(vcpu);
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
+}
+
+static void *alloc_get_registers_reply(const struct kvmi_msg_hdr *msg,
+				       const struct kvmi_get_registers *req,
+				       size_t *rpl_size)
+{
+	struct kvmi_get_registers_reply *rpl;
+	u16 k, n = req->nmsrs;
+
+	*rpl_size = sizeof(*rpl) + sizeof(rpl->msrs.entries[0]) * n;
+
+	rpl = kzalloc(*rpl_size, GFP_KERNEL);
+
+	if (rpl) {
+		rpl->msrs.nmsrs = n;
+
+		for (k = 0; k < n; k++)
+			rpl->msrs.entries[k].index = req->msrs_idx[k];
+	}
+
+	return rpl;
+}
+
+static int handle_get_registers(struct kvm_vcpu *vcpu,
+				const struct kvmi_msg_hdr *msg, const void *req)
+{
+	struct kvmi_get_registers_reply *rpl;
+	size_t rpl_size = 0;
+	int err, ec;
+
+	rpl = alloc_get_registers_reply(msg, req, &rpl_size);
+
+	if (!rpl)
+		ec = -KVM_ENOMEM;
+	else
+		ec = kvmi_cmd_get_registers(vcpu, &rpl->mode,
+						&rpl->regs, &rpl->sregs,
+						&rpl->msrs);
+
+	err = kvmi_msg_vcpu_reply(vcpu, msg, ec, rpl, rpl_size);
+	kfree(rpl);
+	return err;
+}
+
+static int handle_set_registers(struct kvm_vcpu *vcpu,
+				const struct kvmi_msg_hdr *msg,
+				const void *_req)
+{
+	const struct kvmi_set_registers *req = _req;
+	int ec;
+
+	ec = kvmi_cmd_set_registers(vcpu, &req->regs);
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
+}
+
+static int handle_get_page_access(struct kvm_vcpu *vcpu,
+				  const struct kvmi_msg_hdr *msg,
+				  const void *_req)
+{
+	const struct kvmi_get_page_access *req = _req;
+	struct kvmi_get_page_access_reply *rpl = NULL;
+	size_t rpl_size = 0;
+	u16 k, n = req->count;
+	int err, ec = 0;
+
+	if (req->view != 0 && !kvm_eptp_switching_supported) {
+		ec = -KVM_ENOSYS;
+		goto out;
+	}
+
+	if (req->view != 0) { /* TODO */
+		ec = -KVM_EINVAL;
+		goto out;
+	}
+
+	rpl_size = sizeof(*rpl) + sizeof(rpl->access[0]) * n;
+	rpl = kzalloc(rpl_size, GFP_KERNEL);
+
+	if (!rpl) {
+		ec = -KVM_ENOMEM;
+		goto out;
+	}
+
+	for (k = 0; k < n && ec == 0; k++)
+		ec = kvmi_cmd_get_page_access(vcpu, req->gpa[k],
+						&rpl->access[k]);
+
+out:
+	err = kvmi_msg_vcpu_reply(vcpu, msg, ec, rpl, rpl_size);
+	kfree(rpl);
+	return err;
+}
+
+static int handle_set_page_access(struct kvm_vcpu *vcpu,
+				  const struct kvmi_msg_hdr *msg,
+				  const void *_req)
+{
+	const struct kvmi_set_page_access *req = _req;
+	struct kvm *kvm = vcpu->kvm;
+	u16 k, n = req->count;
+	int ec = 0;
+
+	if (req->view != 0) {
+		if (!kvm_eptp_switching_supported)
+			ec = -KVM_ENOSYS;
+		else
+			ec = -KVM_EINVAL; /* TODO */
+	} else {
+		for (k = 0; k < n; k++) {
+			u64 gpa   = req->entries[k].gpa;
+			u8 access = req->entries[k].access;
+			int ec0;
+
+			if (access &  ~(KVMI_PAGE_ACCESS_R |
+					KVMI_PAGE_ACCESS_W |
+					KVMI_PAGE_ACCESS_X))
+				ec0 = -KVM_EINVAL;
+			else
+				ec0 = kvmi_set_mem_access(kvm, gpa, access);
+
+			if (ec0 && !ec)
+				ec = ec0;
+
+			trace_kvmi_set_mem_access(gpa_to_gfn(gpa), access, ec0);
+		}
+	}
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
+}
+
+static int handle_inject_exception(struct kvm_vcpu *vcpu,
+				   const struct kvmi_msg_hdr *msg,
+				   const void *_req)
+{
+	const struct kvmi_inject_exception *req = _req;
+	int ec;
+
+	ec = kvmi_cmd_inject_exception(vcpu, req->nr, req->has_error,
+				       req->error_code, req->address);
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
+}
+
+static int handle_control_events(struct kvm_vcpu *vcpu,
+				 const struct kvmi_msg_hdr *msg,
+				 const void *_req)
+{
+	const struct kvmi_control_events *req = _req;
+	u32 not_allowed = ~IKVM(vcpu->kvm)->event_allow_mask;
+	u32 unknown = ~KVMI_KNOWN_EVENTS;
+	int ec;
+
+	if (req->events & unknown)
+		ec = -KVM_EINVAL;
+	else if (req->events & not_allowed)
+		ec = -KVM_EPERM;
+	else
+		ec = kvmi_cmd_control_events(vcpu, req->events);
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
+}
+
+static int handle_get_cpuid(struct kvm_vcpu *vcpu,
+			    const struct kvmi_msg_hdr *msg,
+			    const void *_req)
+{
+	const struct kvmi_get_cpuid *req = _req;
+	struct kvmi_get_cpuid_reply rpl;
+	int ec;
+
+	memset(&rpl, 0, sizeof(rpl));
+
+	ec = kvmi_cmd_get_cpuid(vcpu, req->function, req->index,
+					&rpl.eax, &rpl.ebx, &rpl.ecx,
+					&rpl.edx);
+
+	return kvmi_msg_vcpu_reply(vcpu, msg, ec, &rpl, sizeof(rpl));
+}
+
+static int handle_get_xsave(struct kvm_vcpu *vcpu,
+			    const struct kvmi_msg_hdr *msg, const void *req)
+{
+	struct kvmi_get_xsave_reply *rpl;
+	size_t rpl_size = sizeof(*rpl) + sizeof(struct kvm_xsave);
+	int ec = 0, err;
+
+	rpl = kzalloc(rpl_size, GFP_KERNEL);
+
+	if (!rpl)
+		ec = -KVM_ENOMEM;
+	else {
+		struct kvm_xsave *area;
+
+		area = (struct kvm_xsave *)&rpl->region[0];
+		kvm_vcpu_ioctl_x86_get_xsave(vcpu, area);
+	}
+
+	err = kvmi_msg_vcpu_reply(vcpu, msg, ec, rpl, rpl_size);
+	kfree(rpl);
+	return err;
+}
+
+/*
+ * These commands are executed on the vCPU thread. The receiving thread
+ * saves the command into kvmi_vcpu.msg_buf[] and signals the vCPU to handle
+ * the command (including sending back the reply).
+ */
+static int (*const msg_vcpu[])(struct kvm_vcpu *,
+			       const struct kvmi_msg_hdr *, const void *) = {
+	[KVMI_GET_GUEST_INFO]   = handle_get_guest_info,
+	[KVMI_PAUSE_VCPU]       = handle_pause_vcpu,
+	[KVMI_GET_REGISTERS]    = handle_get_registers,
+	[KVMI_SET_REGISTERS]    = handle_set_registers,
+	[KVMI_GET_PAGE_ACCESS]  = handle_get_page_access,
+	[KVMI_SET_PAGE_ACCESS]  = handle_set_page_access,
+	[KVMI_INJECT_EXCEPTION] = handle_inject_exception,
+	[KVMI_CONTROL_EVENTS]   = handle_control_events,
+	[KVMI_GET_CPUID]        = handle_get_cpuid,
+	[KVMI_GET_XSAVE]        = handle_get_xsave,
+};
+
+void kvmi_msg_handle_vcpu_cmd(struct kvm_vcpu *vcpu)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvmi_msg_hdr *msg = (void *) ivcpu->msg_buf;
+	u8 *req = ivcpu->msg_buf + sizeof(*msg);
+	int err;
+
+	err = msg_vcpu[msg->id](vcpu, msg, req);
+
+	if (err)
+		kvm_err("%s: id:%u (%s) err:%d\n", __func__, msg->id,
+			id2str(msg->id), err);
+
+	/*
+	 * No error code is returned.
+	 *
+	 * The introspection tool gets its error code from the message
+	 * handler; otherwise the socket is closed (and QEMU should
+	 * reconnect).
+	 */
+}
+
+static int kvmi_msg_recv_varlen(struct kvmi *ikvm, size_t(*cbk) (const void *),
+				size_t min_n, size_t msg_size)
+{
+	size_t extra_n;
+	u8 *extra_buf;
+	int err;
+
+	if (min_n > msg_size) {
+		kvm_err("%s: got %zu bytes instead of min %zu\n",
+			__func__, msg_size, min_n);
+		return -EINVAL;
+	}
+
+	if (!min_n)
+		return 0;
+
+	err = kvmi_sock_read(ikvm, ikvm->msg_buf, min_n);
+
+	extra_buf = ikvm->msg_buf + min_n;
+	extra_n = msg_size - min_n;
+
+	if (!err && extra_n) {
+		if (cbk(ikvm->msg_buf) == msg_size)
+			err = kvmi_sock_read(ikvm, extra_buf, extra_n);
+		else
+			err = -EINVAL;
+	}
+
+	return err;
+}
+
+static int kvmi_msg_recv_n(struct kvmi *ikvm, size_t n, size_t msg_size)
+{
+	if (n != msg_size) {
+		kvm_err("%s: got %zu bytes instead of %zu\n",
+			__func__, msg_size, n);
+		return -EINVAL;
+	}
+
+	if (!n)
+		return 0;
+
+	return kvmi_sock_read(ikvm, ikvm->msg_buf, n);
+}
+
+static int kvmi_msg_recv(struct kvmi *ikvm, const struct kvmi_msg_hdr *msg)
+{
+	size_t (*cbk)(const void *) = msg_bytes[msg->id].cbk_full_size;
+	size_t expected = msg_bytes[msg->id].size;
+
+	if (cbk)
+		return kvmi_msg_recv_varlen(ikvm, cbk, expected, msg->size);
+	else
+		return kvmi_msg_recv_n(ikvm, expected, msg->size);
+}
+
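+/* every vCPU-targeted command starts with this header, which selects the vCPU */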
+struct vcpu_msg_hdr {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+static int kvmi_msg_queue_to_vcpu(struct kvmi *ikvm,
+				  const struct kvmi_msg_hdr *msg)
+{
+	struct vcpu_msg_hdr *vcpu_hdr = (struct vcpu_msg_hdr *)ikvm->msg_buf;
+	struct kvmi_vcpu *ivcpu;
+	struct kvm_vcpu *vcpu;
+
+	if (msg->size < sizeof(*vcpu_hdr)) {
+		kvm_err("%s: invalid vcpu message: %d\n", __func__, msg->size);
+		return -EINVAL; /* ABI error */
+	}
+
+	vcpu = kvmi_get_vcpu(ikvm, vcpu_hdr->vcpu);
+
+	if (!vcpu) {
+		kvm_err("%s: invalid vcpu: %d\n", __func__, vcpu_hdr->vcpu);
+		return kvmi_msg_reply(ikvm, msg, -KVM_EINVAL, NULL, 0);
+	}
+
+	ivcpu = vcpu->kvmi;
+
+	if (!ivcpu) {
+		kvm_err("%s: not introspected vcpu: %d\n",
+			__func__, vcpu_hdr->vcpu);
+		return kvmi_msg_reply(ikvm, msg, -KVM_EAGAIN, NULL, 0);
+	}
+
+	if (test_bit(REQ_CMD, &ivcpu->requests)) {
+		kvm_err("%s: vcpu is busy: %d\n", __func__, vcpu_hdr->vcpu);
+		return kvmi_msg_reply(ikvm, msg, -KVM_EBUSY, NULL, 0);
+	}
+
+	memcpy(ivcpu->msg_buf, msg, sizeof(*msg));
+	memcpy(ivcpu->msg_buf + sizeof(*msg), ikvm->msg_buf, msg->size);
+
+	kvmi_make_request(ivcpu, REQ_CMD);
+	kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
+	kvm_vcpu_kick(vcpu);
+
+	return 0;
+}
+
+static bool kvmi_msg_dispatch_cmd(struct kvmi *ikvm,
+				  const struct kvmi_msg_hdr *msg)
+{
+	int err = kvmi_msg_recv(ikvm, msg);
+
+	if (err)
+		goto out;
+
+	if (!KVMI_ALLOWED_COMMAND(msg->id, ikvm->cmd_allow_mask)) {
+		err = kvmi_msg_reply(ikvm, msg, -KVM_EPERM, NULL, 0);
+		goto out;
+	}
+
+	if (msg_vcpu[msg->id])
+		err = kvmi_msg_queue_to_vcpu(ikvm, msg);
+	else
+		err = msg_vm[msg->id](ikvm, msg, ikvm->msg_buf);
+
+out:
+	if (err)
+		kvm_err("%s: id:%u (%s) err:%d\n", __func__, msg->id,
+			id2str(msg->id), err);
+
+	return (err == 0);
+}
+
+static bool handle_unsupported_msg(struct kvmi *ikvm,
+				   const struct kvmi_msg_hdr *msg)
+{
+	int err;
+
+	kvm_err("%s: %u\n", __func__, msg->id);
+
+	err = consume_sock_bytes(ikvm, msg->size);
+
+	if (!err)
+		err = kvmi_msg_reply(ikvm, msg, -KVM_ENOSYS, NULL, 0);
+
+	return (err == 0);
+}
+
+static bool kvmi_msg_dispatch(struct kvmi *ikvm)
+{
+	struct kvmi_msg_hdr msg;
+	int err;
+
+	err = kvmi_sock_read(ikvm, &msg, sizeof(msg));
+
+	if (err) {
+		kvm_err("%s: can't read\n", __func__);
+		return false;
+	}
+
+	trace_kvmi_msg_dispatch(msg.id, msg.size);
+
+	kvm_debug("%s: id:%u (%s) size:%u\n", __func__, msg.id,
+		  id2str(msg.id), msg.size);
+
+	if (msg.id == KVMI_EVENT_REPLY)
+		return kvmi_msg_dispatch_reply(ikvm, &msg);
+
+	if (msg.id >= ARRAY_SIZE(msg_bytes)
+	    || (!msg_vm[msg.id] && !msg_vcpu[msg.id]))
+		return handle_unsupported_msg(ikvm, &msg);
+
+	return kvmi_msg_dispatch_cmd(ikvm, &msg);
+}
+
+static void kvmi_sock_close(struct kvmi *ikvm)
+{
+	kvm_info("%s\n", __func__);
+
+	write_lock(&ikvm->sock_lock);
+
+	if (ikvm->sock) {
+		kvm_info("Release the socket\n");
+		sockfd_put(ikvm->sock);
+
+		ikvm->sock = NULL;
+	}
+
+	write_unlock(&ikvm->sock_lock);
+}
+
+bool kvmi_msg_process(struct kvmi *ikvm)
+{
+	if (!kvmi_msg_dispatch(ikvm)) {
+		kvmi_sock_close(ikvm);
+		return false;
+	}
+	return true;
+}
+
+static void kvmi_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev,
+			     u32 ev_id)
+{
+	memset(ev, 0, sizeof(*ev));
+	ev->vcpu = vcpu->vcpu_id;
+	ev->event = ev_id;
+	kvm_arch_vcpu_ioctl_get_regs(vcpu, &ev->regs);
+	kvm_arch_vcpu_ioctl_get_sregs(vcpu, &ev->sregs);
+	ev->mode = kvmi_vcpu_mode(vcpu, &ev->sregs);
+	kvmi_get_msrs(vcpu, ev);
+}
+
+static bool kvmi_send_event(struct kvm_vcpu *vcpu, u32 ev_id,
+			    void *ev,  size_t ev_size,
+			    void *rpl, size_t rpl_size)
+{
+	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
+	struct kvmi_event common;
+	struct kvmi_msg_hdr h;
+	struct kvec vec[3] = {
+		{.iov_base = &h,      .iov_len = sizeof(h)     },
+		{.iov_base = &common, .iov_len = sizeof(common)},
+		{.iov_base = ev,      .iov_len = ev_size       },
+	};
+	size_t msg_size = sizeof(h) + sizeof(common) + ev_size;
+	size_t n = ev_size ? ARRAY_SIZE(vec) : ARRAY_SIZE(vec)-1;
+
+	memset(&h, 0, sizeof(h));
+	h.id = KVMI_EVENT;
+	h.seq = new_seq();
+	h.size = msg_size - sizeof(h);
+
+	kvmi_setup_event(vcpu, &common, ev_id);
+
+	ivcpu->ev_rpl_ptr = rpl;
+	ivcpu->ev_rpl_size = rpl_size;
+	ivcpu->ev_seq = h.seq;
+	ivcpu->ev_rpl_received = -1;
+	WRITE_ONCE(ivcpu->ev_rpl_waiting, true);
+	/* kvmi_vcpu_waiting_for_reply() */
+	smp_wmb();
+
+	trace_kvmi_send_event(ev_id);
+
+	kvm_debug("%s: %-11s(seq:%u) size:%lu vcpu:%d\n",
+		  __func__, event_str(ev_id), h.seq, ev_size, vcpu->vcpu_id);
+
+	if (kvmi_sock_write(IKVM(vcpu->kvm), vec, n, msg_size) == 0)
+		kvmi_handle_request(vcpu);
+
+	kvm_debug("%s: reply for %-11s(seq:%u) size:%lu vcpu:%d\n",
+		  __func__, event_str(ev_id), h.seq, rpl_size, vcpu->vcpu_id);
+
+	return (ivcpu->ev_rpl_received >= 0);
+}
+
+u32 kvmi_msg_send_cr(struct kvm_vcpu *vcpu, u32 cr, u64 old_value,
+		     u64 new_value, u64 *ret_value)
+{
+	struct kvmi_event_cr e;
+	struct kvmi_event_cr_reply r;
+
+	memset(&e, 0, sizeof(e));
+	e.cr = cr;
+	e.old_value = old_value;
+	e.new_value = new_value;
+
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_CR, &e, sizeof(e),
+				&r, sizeof(r))) {
+		*ret_value = new_value;
+		return KVMI_EVENT_ACTION_CONTINUE;
+	}
+
+	*ret_value = r.new_val;
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_msr(struct kvm_vcpu *vcpu, u32 msr, u64 old_value,
+		      u64 new_value, u64 *ret_value)
+{
+	struct kvmi_event_msr e;
+	struct kvmi_event_msr_reply r;
+
+	memset(&e, 0, sizeof(e));
+	e.msr = msr;
+	e.old_value = old_value;
+	e.new_value = new_value;
+
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_MSR, &e, sizeof(e),
+				&r, sizeof(r))) {
+		*ret_value = new_value;
+		return KVMI_EVENT_ACTION_CONTINUE;
+	}
+
+	*ret_value = r.new_val;
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_xsetbv(struct kvm_vcpu *vcpu)
+{
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_XSETBV, NULL, 0, NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_bp(struct kvm_vcpu *vcpu, u64 gpa)
+{
+	struct kvmi_event_breakpoint e;
+
+	memset(&e, 0, sizeof(e));
+	e.gpa = gpa;
+
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_BREAKPOINT,
+				&e, sizeof(e), NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_hypercall(struct kvm_vcpu *vcpu)
+{
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_HYPERCALL, NULL, 0, NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+bool kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u32 mode,
+		      u32 *action, bool *trap_access, u8 *ctx_data,
+		      u32 *ctx_size)
+{
+	u32 max_ctx_size = *ctx_size;
+	struct kvmi_event_page_fault e;
+	struct kvmi_event_page_fault_reply r;
+
+	memset(&e, 0, sizeof(e));
+	e.gpa = gpa;
+	e.gva = gva;
+	e.mode = mode;
+
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_PAGE_FAULT, &e, sizeof(e),
+				&r, sizeof(r)))
+		return false;
+
+	*action = IVCPU(vcpu)->ev_rpl.action;
+	*trap_access = r.trap_access;
+	*ctx_size = 0;
+
+	if (r.ctx_size <= max_ctx_size) {
+		*ctx_size = min_t(u32, r.ctx_size, sizeof(r.ctx_data));
+		if (*ctx_size)
+			memcpy(ctx_data, r.ctx_data, *ctx_size);
+	} else {
+		kvm_err("%s: ctx_size (recv:%u max:%u)\n", __func__,
+			r.ctx_size, *ctx_size);
+		/*
+		 * TODO: This is an ABI error.
+		 * Should we shut down the socket?
+		 */
+	}
+
+	return true;
+}
+
+u32 kvmi_msg_send_trap(struct kvm_vcpu *vcpu, u32 vector, u32 type,
+		       u32 error_code, u64 cr2)
+{
+	struct kvmi_event_trap e;
+
+	memset(&e, 0, sizeof(e));
+	e.vector = vector;
+	e.type = type;
+	e.error_code = error_code;
+	e.cr2 = cr2;
+
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_TRAP, &e, sizeof(e), NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_descriptor(struct kvm_vcpu *vcpu, u32 info,
+			     u64 exit_qualification, u8 descriptor, u8 write)
+{
+	struct kvmi_event_descriptor e;
+
+	memset(&e, 0, sizeof(e));
+	e.descriptor = descriptor;
+	e.write = write;
+
+	if (cpu_has_vmx()) {
+		e.arch.vmx.instr_info = info;
+		e.arch.vmx.exit_qualification = exit_qualification;
+	} else {
+		e.arch.svm.exit_info = info;
+	}
+
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_DESCRIPTOR,
+				&e, sizeof(e), NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu)
+{
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_CREATE_VCPU, NULL, 0, NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}
+
+u32 kvmi_msg_send_pause_vcpu(struct kvm_vcpu *vcpu)
+{
+	if (!kvmi_send_event(vcpu, KVMI_EVENT_PAUSE_VCPU, NULL, 0, NULL, 0))
+		return KVMI_EVENT_ACTION_CONTINUE;
+
+	return IVCPU(vcpu)->ev_rpl.action;
+}


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 09/18] kvm: hook in the VM introspection subsystem
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

Handle the new KVM_INTROSPECTION ioctl and pass the socket from QEMU to
the KVMI subsystem. Notify KVMI on vCPU create/destroy and VM destroy
events. This patch also disables the EPT A/D bits feature when the
introspection subsystem is present (see the hardware_setup() change).
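
For illustration only, a minimal sketch of the QEMU-side call is shown
below. The layout of struct kvm_introspection is defined elsewhere in this
series; the field names (fd, commands, events) and the helper/variable
names used here are assumptions based on kvmi_msg_init() and the allow
masks kept in struct kvmi, not the actual ABI.

	#include <sys/ioctl.h>
	#include <err.h>
	#include <linux/kvm.h>

	/* Hypothetical sketch, not part of this patch. */
	static void hook_introspection(int vm_fd, int intro_sock_fd)
	{
		struct kvm_introspection intro = {
			.fd = intro_sock_fd,	/* assumed: connected socket */
			.commands = -1,		/* assumed: allowed-commands mask */
			.events = -1,		/* assumed: allowed-events mask */
		};

		if (ioctl(vm_fd, KVM_INTROSPECTION, &intro) < 0)
			err(1, "KVM_INTROSPECTION");
	}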

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/vmx.c  |  3 ++-
 virt/kvm/kvm_main.c | 19 +++++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 093a2e1f7ea6..c03580abf9e8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -34,6 +34,7 @@
 #include <linux/tboot.h>
 #include <linux/hrtimer.h>
 #include <linux/frame.h>
+#include <linux/kvmi.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
 
@@ -6785,7 +6786,7 @@ static __init int hardware_setup(void)
 	    !cpu_has_vmx_invept_global())
 		enable_ept = 0;
 
-	if (!cpu_has_vmx_ept_ad_bits() || !enable_ept)
+	if (!cpu_has_vmx_ept_ad_bits() || !enable_ept || kvmi_is_present())
 		enable_ept_ad_bits = 0;
 
 	if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 210bf820385a..7895d490bd71 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -51,6 +51,7 @@
 #include <linux/slab.h>
 #include <linux/sort.h>
 #include <linux/bsearch.h>
+#include <linux/kvmi.h>
 
 #include <asm/processor.h>
 #include <asm/io.h>
@@ -298,6 +299,9 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	r = kvm_arch_vcpu_init(vcpu);
 	if (r < 0)
 		goto fail_free_run;
+
+	kvmi_vcpu_init(vcpu);
+
 	return 0;
 
 fail_free_run:
@@ -315,6 +319,7 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
 	 * descriptors are already gone.
 	 */
 	put_pid(rcu_dereference_protected(vcpu->pid, 1));
+	kvmi_vcpu_uninit(vcpu);
 	kvm_arch_vcpu_uninit(vcpu);
 	free_page((unsigned long)vcpu->run);
 }
@@ -711,6 +716,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	int i;
 	struct mm_struct *mm = kvm->mm;
 
+	kvmi_destroy_vm(kvm);
 	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
 	kvm_destroy_vm_debugfs(kvm);
 	kvm_arch_sync_events(kvm);
@@ -3118,6 +3124,15 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;
+	case KVM_INTROSPECTION: {
+		struct kvm_introspection i;
+
+		r = -EFAULT;
+		if (copy_from_user(&i, argp, sizeof(i)) || !kvmi_hook(kvm, &i))
+			goto out;
+		r = 0;
+		break;
+	}
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
@@ -4072,6 +4087,9 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	r = kvm_vfio_ops_init();
 	WARN_ON(r);
 
+	r = kvmi_init();
+	WARN_ON(r);
+
 	return 0;
 
 out_undebugfs:
@@ -4100,6 +4118,7 @@ EXPORT_SYMBOL_GPL(kvm_init);
 
 void kvm_exit(void)
 {
+	kvmi_uninit();
 	debugfs_remove_recursive(kvm_debugfs_dir);
 	misc_deregister(&kvm_dev);
 	kmem_cache_destroy(kvm_vcpu_cache);

^ permalink raw reply related	[flat|nested] 79+ messages in thread


* [RFC PATCH v4 10/18] kvm: x86: handle the new vCPU request (KVM_REQ_INTROSPECTION)
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

The thread/worker receiving vCPU commands (from the guest introspection
tool) signals the vCPU by placing the command in a vCPU buffer, setting
this vCPU request bit and kicking the vCPU.

This patch adds the call to the introspection handler that will:
 - execute the command
 - send the results back to the introspection tool
 - pause the vCPU (if requested)
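
The signalling side described above is not part of this hunk; a minimal
sketch of what it might look like, assuming a hypothetical per-vCPU
command buffer (the 'kvmi_cmd_buf' field and the function name below are
illustrative only, not something added by this patch):

  #include <linux/kvm_host.h>
  #include <linux/string.h>

  /* Illustrative only: deliver a command to a vCPU and wake it up.
   * 'kvmi_cmd_buf' is a hypothetical per-vCPU buffer.
   */
  static void kvmi_queue_vcpu_cmd(struct kvm_vcpu *vcpu,
                                  const void *cmd, size_t size)
  {
          memcpy(vcpu->kvmi_cmd_buf, cmd, size);
          kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
          kvm_vcpu_kick(vcpu);    /* force an exit so the request is seen */
  }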

Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/kvm/x86.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cdfc7200a018..9889e96f64e6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -20,6 +20,7 @@
  */
 
 #include <linux/kvm_host.h>
+#include <linux/kvmi.h>
 #include "irq.h"
 #include "mmu.h"
 #include "i8254.h"
@@ -6904,6 +6905,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		 */
 		if (kvm_check_request(KVM_REQ_HV_STIMER, vcpu))
 			kvm_hv_process_stimers(vcpu);
+
+		if (kvm_check_request(KVM_REQ_INTROSPECTION, vcpu))
+			kvmi_handle_request(vcpu);
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 11/18] kvm: x86: hook in the page tracking
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar,
	Nicușor Cîțu

From: Adalbert Lazar <alazar@bitdefender.com>

Inform the guest introspection tool that a read/write/execute access is
happening on a page of interest (configured via a KVMI_SET_PAGE_ACCESS
request).

The introspection tool can respond by allowing the emulation to continue,
with or without custom input (for read accesses), or by requesting a
return to the guest so the instruction is retried (if the tool has changed
the program counter, has emulated the instruction itself, or the page is
no longer of interest).
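
For context, KVM's existing page-track API (which this series extends
with pre-read/pre-write/pre-exec tracking) is consumed roughly as below;
the stock track_write callback only observes a write after the fact,
while the kvm_page_track_preread/prewrite/preexec wrappers used in the
hunks below also return whether the access may proceed. The names
'my_node'/'my_track_write' are illustrative:

  #include <linux/kvm_host.h>
  #include <asm/kvm_page_track.h>

  /* Existing consumer pattern: be notified after a write to a tracked gfn. */
  static void my_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
                             int bytes, struct kvm_page_track_notifier_node *node)
  {
          /* react to the (already performed) write at 'gpa' */
  }

  static struct kvm_page_track_notifier_node my_node = {
          .track_write = my_track_write,
  };

  static void my_register(struct kvm *kvm)
  {
          kvm_page_track_register_notifier(kvm, &my_node);
  }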

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
---
 arch/x86/include/asm/kvm_emulate.h |  1 +
 arch/x86/kvm/emulate.c             |  9 +++++++-
 arch/x86/kvm/mmu.c                 |  3 +++
 arch/x86/kvm/x86.c                 | 46 +++++++++++++++++++++++++++++---------
 4 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index b24b1c8b3979..e257cae3a745 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -438,6 +438,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
 #define EMULATION_OK 0
 #define EMULATION_RESTART 1
 #define EMULATION_INTERCEPTED 2
+#define EMULATION_USER_EXIT 3
 void init_decode_cache(struct x86_emulate_ctxt *ctxt);
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
 int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index abe74f779f9d..94886040f8f1 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -5263,7 +5263,12 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 					ctxt->memopp->addr.mem.ea + ctxt->_eip);
 
 done:
-	return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
+	if (rc == X86EMUL_RETRY_INSTR)
+		return EMULATION_USER_EXIT;
+	else if (rc == X86EMUL_CONTINUE)
+		return EMULATION_OK;
+	else
+		return EMULATION_FAILED;
 }
 
 bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt)
@@ -5633,6 +5638,8 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
 	if (rc == X86EMUL_INTERCEPTED)
 		return EMULATION_INTERCEPTED;
 
+	if (rc == X86EMUL_RETRY_INSTR)
+		return EMULATION_USER_EXIT;
 	if (rc == X86EMUL_CONTINUE)
 		writeback_registers(ctxt);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 19dc17b00db2..18205c710233 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5066,6 +5066,9 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 
 	if (mmio_info_in_cache(vcpu, cr2, direct))
 		emulation_type = 0;
+	if (kvm_page_track_is_active(vcpu, gpa_to_gfn(cr2),
+				     KVM_PAGE_TRACK_PREEXEC))
+		emulation_type = EMULTYPE_NO_REEXECUTE;
 emulate:
 	er = x86_emulate_instruction(vcpu, cr2, emulation_type, insn, insn_len);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9889e96f64e6..caf50b7307a4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4501,6 +4501,9 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
 	if (unlikely(gpa == UNMAPPED_GVA))
 		return X86EMUL_PROPAGATE_FAULT;
 
+	if (!kvm_page_track_preexec(vcpu, gpa))
+		return X86EMUL_RETRY_INSTR;
+
 	offset = addr & (PAGE_SIZE-1);
 	if (WARN_ON(offset + bytes > PAGE_SIZE))
 		bytes = (unsigned)PAGE_SIZE - offset;
@@ -4622,13 +4625,26 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 			const void *val, int bytes)
 {
-	int ret;
-
-	ret = kvm_vcpu_write_guest(vcpu, gpa, val, bytes);
-	if (ret < 0)
-		return 0;
+	if (!kvm_page_track_prewrite(vcpu, gpa, val, bytes))
+		return X86EMUL_RETRY_INSTR;
+	if (kvm_vcpu_write_guest(vcpu, gpa, val, bytes) < 0)
+		return X86EMUL_UNHANDLEABLE;
 	kvm_page_track_write(vcpu, gpa, val, bytes);
-	return 1;
+	return X86EMUL_CONTINUE;
+}
+
+static int emulator_read_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
+			      void *val, int bytes)
+{
+	bool data_ready;
+
+	if (!kvm_page_track_preread(vcpu, gpa, val, bytes, &data_ready))
+		return X86EMUL_RETRY_INSTR;
+	if (data_ready)
+		return X86EMUL_CONTINUE;
+	if (kvm_vcpu_read_guest(vcpu, gpa, val, bytes) < 0)
+		return X86EMUL_UNHANDLEABLE;
+	return X86EMUL_CONTINUE;
 }
 
 struct read_write_emulator_ops {
@@ -4658,7 +4674,7 @@ static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes)
 static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
 			void *val, int bytes)
 {
-	return !kvm_vcpu_read_guest(vcpu, gpa, val, bytes);
+	return emulator_read_phys(vcpu, gpa, val, bytes);
 }
 
 static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -4733,8 +4749,11 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
 			return X86EMUL_PROPAGATE_FAULT;
 	}
 
-	if (!ret && ops->read_write_emulate(vcpu, gpa, val, bytes))
-		return X86EMUL_CONTINUE;
+	if (!ret) {
+		ret = ops->read_write_emulate(vcpu, gpa, val, bytes);
+		if (ret == X86EMUL_CONTINUE || ret == X86EMUL_RETRY_INSTR)
+			return ret;
+	}
 
 	/*
 	 * Is this MMIO handled locally?
@@ -4869,6 +4888,9 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	if (is_error_page(page))
 		goto emul_write;
 
+	if (!kvm_page_track_prewrite(vcpu, gpa, new, bytes))
+		return X86EMUL_RETRY_INSTR;
+
 	kaddr = kmap_atomic(page);
 	kaddr += offset_in_page(gpa);
 	switch (bytes) {
@@ -5721,7 +5743,9 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 
 		trace_kvm_emulate_insn_start(vcpu);
 		++vcpu->stat.insn_emulation;
-		if (r != EMULATION_OK)  {
+		if (r == EMULATION_USER_EXIT)
+			return EMULATE_DONE;
+		if (r != EMULATION_OK) {
 			if (emulation_type & EMULTYPE_TRAP_UD)
 				return EMULATE_FAIL;
 			if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
@@ -5758,6 +5782,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 
 	r = x86_emulate_insn(ctxt);
 
+	if (r == EMULATION_USER_EXIT)
+		return EMULATE_DONE;
 	if (r == EMULATION_INTERCEPTED)
 		return EMULATE_DONE;
 

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 12/18] kvm: x86: hook in kvmi_breakpoint_event()
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

Inform the guest introspection tool that a breakpoint instruction (INT3)
is being executed. These one-byte instructions are placed in the slack
space of various functions and used as notification for when the OS or
an application has reached a certain state or is trying to perform a
certain operation (like creating a process).
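
From the guest's point of view, reaching such a marker is simply the
execution of a one-byte INT3 (0xCC); a hypothetical compiled-in marker
(illustrative only, not code from this patch) would reduce to:

  /* Hitting this raises #BP; with breakpoint events enabled by the
   * introspection tool, KVM intercepts it and forwards it via
   * kvmi_breakpoint_event() as in the hunks below.
   */
  static inline void kvmi_agent_checkpoint(void)
  {
          asm volatile("int3" ::: "memory");
  }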

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
---
 arch/x86/kvm/svm.c |  6 ++++++
 arch/x86/kvm/vmx.c | 15 +++++++++++----
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f41e4d7008d7..8903e0c58609 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -18,6 +18,7 @@
 #define pr_fmt(fmt) "SVM: " fmt
 
 #include <linux/kvm_host.h>
+#include <linux/kvmi.h>
 
 #include "irq.h"
 #include "mmu.h"
@@ -45,6 +46,7 @@
 #include <asm/debugreg.h>
 #include <asm/kvm_para.h>
 #include <asm/irq_remapping.h>
+#include <asm/kvmi.h>
 
 #include <asm/virtext.h>
 #include "trace.h"
@@ -2194,6 +2196,10 @@ static int bp_interception(struct vcpu_svm *svm)
 {
 	struct kvm_run *kvm_run = svm->vcpu.run;
 
+	if (kvmi_breakpoint_event(&svm->vcpu,
+		svm->vmcb->save.cs.base + svm->vmcb->save.rip))
+		return 1;
+
 	kvm_run->exit_reason = KVM_EXIT_DEBUG;
 	kvm_run->debug.arch.pc = svm->vmcb->save.cs.base + svm->vmcb->save.rip;
 	kvm_run->debug.arch.exception = BP_VECTOR;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c03580abf9e8..fbdfa8507d4f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -51,6 +51,7 @@
 #include <asm/apic.h>
 #include <asm/irq_remapping.h>
 #include <asm/mmu_context.h>
+#include <asm/kvmi.h>
 
 #include "trace.h"
 #include "pmu.h"
@@ -5904,7 +5905,7 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct kvm_run *kvm_run = vcpu->run;
 	u32 intr_info, ex_no, error_code;
-	unsigned long cr2, rip, dr6;
+	unsigned long cr2, dr6;
 	u32 vect_info;
 	enum emulation_result er;
 
@@ -5978,7 +5979,13 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 		kvm_run->debug.arch.dr6 = dr6 | DR6_FIXED_1;
 		kvm_run->debug.arch.dr7 = vmcs_readl(GUEST_DR7);
 		/* fall through */
-	case BP_VECTOR:
+	case BP_VECTOR: {
+		unsigned long gva = vmcs_readl(GUEST_CS_BASE) +
+			kvm_rip_read(vcpu);
+
+		if (kvmi_breakpoint_event(vcpu, gva))
+			return 1;
+
 		/*
 		 * Update instruction length as we may reinject #BP from
 		 * user space while in guest debugging mode. Reading it for
@@ -5987,10 +5994,10 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 		vmx->vcpu.arch.event_exit_inst_len =
 			vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
 		kvm_run->exit_reason = KVM_EXIT_DEBUG;
-		rip = kvm_rip_read(vcpu);
-		kvm_run->debug.arch.pc = vmcs_readl(GUEST_CS_BASE) + rip;
+		kvm_run->debug.arch.pc = gva;
 		kvm_run->debug.arch.exception = ex_no;
 		break;
+	}
 	default:
 		kvm_run->exit_reason = KVM_EXIT_EXCEPTION;
 		kvm_run->ex.exception = ex_no;

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 13/18] kvm: x86: hook in kvmi_descriptor_event()
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar,
	Nicușor Cîțu

From: Adalbert Lazar <alazar@bitdefender.com>

Inform the guest introspection tool that a system table pointer register
(GDTR, IDTR, LDTR, TR) has been accessed.
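
For illustration (guest-side view, not code from this patch), the kind
of access now reported is e.g. a SIDT reading the IDT base; with
descriptor-table exiting enabled on VMX (the SVM side uses the matching
intercepts), such an instruction causes a VM exit that lands in the
handlers below:

  #include <asm/desc_defs.h>

  /* Illustrative guest-side read of the IDTR, a classic way for a
   * rootkit to locate the interrupt descriptor table.
   */
  static void read_idtr(struct desc_ptr *dt)
  {
          asm volatile("sidt %0" : "=m" (*dt));
  }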

Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
---
 arch/x86/kvm/svm.c | 41 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c | 27 +++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 8903e0c58609..3b4911205081 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4109,6 +4109,39 @@ static int avic_unaccelerated_access_interception(struct vcpu_svm *svm)
 	return ret;
 }
 
+static int descriptor_access_interception(struct vcpu_svm *svm)
+{
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct vmcb_control_area *c = &svm->vmcb->control;
+
+	switch (c->exit_code) {
+	case SVM_EXIT_IDTR_READ:
+	case SVM_EXIT_IDTR_WRITE:
+		kvmi_descriptor_event(vcpu, c->exit_info_1, 0,
+			KVMI_DESC_IDTR, c->exit_code == SVM_EXIT_IDTR_WRITE);
+		break;
+	case SVM_EXIT_GDTR_READ:
+	case SVM_EXIT_GDTR_WRITE:
+		kvmi_descriptor_event(vcpu, c->exit_info_1, 0,
+			KVMI_DESC_GDTR, c->exit_code == SVM_EXIT_GDTR_WRITE);
+		break;
+	case SVM_EXIT_LDTR_READ:
+	case SVM_EXIT_LDTR_WRITE:
+		kvmi_descriptor_event(vcpu, c->exit_info_1, 0,
+			KVMI_DESC_LDTR, c->exit_code == SVM_EXIT_LDTR_WRITE);
+		break;
+	case SVM_EXIT_TR_READ:
+	case SVM_EXIT_TR_WRITE:
+		kvmi_descriptor_event(vcpu, c->exit_info_1, 0,
+			KVMI_DESC_TR, c->exit_code == SVM_EXIT_TR_WRITE);
+		break;
+	default:
+		break;
+	}
+
+	return 1;
+}
+
 static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_READ_CR0]			= cr_interception,
 	[SVM_EXIT_READ_CR3]			= cr_interception,
@@ -4173,6 +4206,14 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_RSM]                          = emulate_on_interception,
 	[SVM_EXIT_AVIC_INCOMPLETE_IPI]		= avic_incomplete_ipi_interception,
 	[SVM_EXIT_AVIC_UNACCELERATED_ACCESS]	= avic_unaccelerated_access_interception,
+	[SVM_EXIT_IDTR_READ]			= descriptor_access_interception,
+	[SVM_EXIT_GDTR_READ]			= descriptor_access_interception,
+	[SVM_EXIT_LDTR_READ]			= descriptor_access_interception,
+	[SVM_EXIT_TR_READ]			= descriptor_access_interception,
+	[SVM_EXIT_IDTR_WRITE]			= descriptor_access_interception,
+	[SVM_EXIT_GDTR_WRITE]			= descriptor_access_interception,
+	[SVM_EXIT_LDTR_WRITE]			= descriptor_access_interception,
+	[SVM_EXIT_TR_WRITE]			= descriptor_access_interception,
 };
 
 static void dump_vmcb(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index fbdfa8507d4f..ab744f04ae90 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8047,6 +8047,31 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int handle_descriptor_access(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u32 exit_reason = vmx->exit_reason;
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	unsigned char store = (vmx_instruction_info >> 29) & 0x1;
+	unsigned char descriptor = 0;
+
+	if (exit_reason == EXIT_REASON_GDTR_IDTR) {
+		if ((vmx_instruction_info >> 28) & 0x1)
+			descriptor = KVMI_DESC_IDTR;
+		else
+			descriptor = KVMI_DESC_GDTR;
+	} else {
+		if ((vmx_instruction_info >> 28) & 0x1)
+			descriptor = KVMI_DESC_TR;
+		else
+			descriptor = KVMI_DESC_LDTR;
+	}
+
+	return kvmi_descriptor_event(vcpu, vmx_instruction_info,
+				     exit_qualification, descriptor, store);
+}
+
 static bool valid_ept_address(struct kvm_vcpu *vcpu, u64 address)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -8219,6 +8244,8 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_PML_FULL]		      = handle_pml_full,
 	[EXIT_REASON_VMFUNC]                  = handle_vmfunc,
 	[EXIT_REASON_PREEMPTION_TIMER]	      = handle_preemption_timer,
+	[EXIT_REASON_GDTR_IDTR]               = handle_descriptor_access,
+	[EXIT_REASON_LDTR_TR]                 = handle_descriptor_access,
 };
 
 static const int kvm_vmx_max_exit_handlers =

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 14/18] kvm: x86: hook in kvmi_cr_event()
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

Notify the guest introspection tool that cr{0,3,4} is about to be
changed. kvmi_cr_event() passes the new value by reference, so the
write goes through (possibly with a value adjusted by the tool) only
if the tool permits it.
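
A minimal sketch of the resulting call pattern (an illustrative caller
mirroring the kvm_set_cr4() hunk below; 'try_set_cr4' is not a real
function): the new value is passed by reference so the tool may adjust
it, and a zero return drops the write:

  #include <linux/kvm_host.h>
  #include <linux/kvmi.h>

  static int try_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4,
                         unsigned long new_cr4)
  {
          unsigned long cr4 = new_cr4;

          if (!kvmi_cr_event(vcpu, 4, old_cr4, &cr4))
                  return 1;       /* change denied by the introspection tool */

          return kvm_x86_ops->set_cr4(vcpu, cr4);  /* possibly adjusted value */
  }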

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index caf50b7307a4..8f5cc81c8760 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -676,6 +676,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	if (!(cr0 & X86_CR0_PG) && kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))
 		return 1;
 
+	if (!kvmi_cr_event(vcpu, 0, old_cr0, &cr0))
+		return 1;
+
 	kvm_x86_ops->set_cr0(vcpu, cr0);
 
 	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
@@ -816,6 +819,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 			return 1;
 	}
 
+	if (!kvmi_cr_event(vcpu, 4, old_cr4, &cr4))
+		return 1;
+
 	if (kvm_x86_ops->set_cr4(vcpu, cr4))
 		return 1;
 
@@ -832,11 +838,13 @@ EXPORT_SYMBOL_GPL(kvm_set_cr4);
 
 int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 {
+	unsigned long old_cr3 = kvm_read_cr3(vcpu);
+
 #ifdef CONFIG_X86_64
 	cr3 &= ~CR3_PCID_INVD;
 #endif
 
-	if (cr3 == kvm_read_cr3(vcpu) && !pdptrs_changed(vcpu)) {
+	if (cr3 == old_cr3 && !pdptrs_changed(vcpu)) {
 		kvm_mmu_sync_roots(vcpu);
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		return 0;
@@ -849,6 +857,9 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		   !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
 		return 1;
 
+	if (!kvmi_cr_event(vcpu, 3, old_cr3, &cr3))
+		return 1;
+
 	vcpu->arch.cr3 = cr3;
 	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
 	kvm_mmu_new_cr3(vcpu);

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 15/18] kvm: x86: hook in kvmi_xsetbv_event()
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

Notify the guest introspection tool that the extended control register
has been changed.
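
For context (not part of this patch), the guest-side operation being
intercepted is the XSETBV instruction, which loads EDX:EAX into the
extended control register selected by ECX; the resulting VM exit ends
up in kvm_set_xcr(), where the notification below is added:

  #include <linux/types.h>

  /* Illustrative guest-side XCR0 write, using the same opcode-byte
   * encoding the kernel uses for xsetbv.
   */
  static inline void guest_xsetbv(u32 index, u64 value)
  {
          u32 eax = (u32)value, edx = (u32)(value >> 32);

          asm volatile(".byte 0x0f,0x01,0xd1" /* xsetbv */
                       : : "a" (eax), "d" (edx), "c" (index));
  }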

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f5cc81c8760..284bb4c740fa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -765,6 +765,9 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 
 int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 {
+	if (xcr != vcpu->arch.xcr0)
+		kvmi_xsetbv_event(vcpu);
+
 	if (kvm_x86_ops->get_cpl(vcpu) != 0 ||
 	    __kvm_set_xcr(vcpu, index, xcr)) {
 		kvm_inject_gp(vcpu, 0);

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 16/18] kvm: x86: hook in kvmi_msr_event()
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

Inform the guest introspection tool that an MSR is going to be changed.

The kvmi_msr_event() function will check a bitmap of MSR-s of interest
(configured via a KVMI_CONTROL_EVENTS(KVMI_MSR_CONTROL) request) and, if
the new value differs from the previous one, it will generate a
notification. The introspection tool can respond by allowing the guest
to continue with normal execution or by discarding the change.

This is meant to prevent malicious changes to MSR-s such as
MSR_IA32_SYSENTER_EIP.
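
For illustration only (not code from this series), the guest-side write
this is meant to catch is an ordinary WRMSR; assuming KVM intercepts the
watched register, the new value reaches kvm_set_msr() and therefore the
check below before it takes effect:

  #include <asm/msr.h>

  /* Hypothetical rootkit-style action inside the guest: redirect the
   * SYSENTER entry point. With the MSR watched, the introspection tool
   * sees the new value and may discard the change.
   */
  static void redirect_sysenter_entry(unsigned long new_handler)
  {
          wrmsrl(MSR_IA32_SYSENTER_EIP, new_handler);
  }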

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 284bb4c740fa..271028ccbeca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1111,6 +1111,9 @@ EXPORT_SYMBOL_GPL(kvm_enable_efer_bits);
  */
 int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
+	if (!kvmi_msr_event(vcpu, msr))
+		return 1;
+
 	switch (msr->index) {
 	case MSR_FS_BASE:
 	case MSR_GS_BASE:

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 17/18] kvm: x86: handle the introspection hypercalls
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar,
	Mircea Cîrjaliu, Nicușor Cîțu

From: Adalbert Lazar <alazar@bitdefender.com>

Two hypercalls (KVM_HC_MEM_MAP, KVM_HC_MEM_UNMAP) are used by the
introspection tool running in a VM to map/unmap memory from the
introspected VM-s.

The third hypercall (KVM_HC_XEN_HVM_OP) is used by the code residing
inside the introspected guest to call the introspection tool and to report
certain details about its operation. For example, a classic antimalware
remediation tool can report what it has found during a scan.
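
For reference, the guest side issues the first two hypercalls with the
standard kvm_hypercall helpers; patch 02 of this series (quoted further
down in this thread) adds wrappers along these lines ('map_remote_gpa'
is an illustrative name, the real wrapper being kvmi_arch_map_hc()):

  #include <uapi/linux/kvmi.h>
  #include <uapi/linux/kvm_para.h>
  #include <linux/kvm_types.h>
  #include <asm/kvm_para.h>

  /* The hypercall number travels in RAX; the arguments land in the
   * a0/a1/a2 registers (RBX/RCX/RDX) read by kvm_emulate_hypercall()
   * in the hunk below.
   */
  static long map_remote_gpa(struct kvmi_map_mem_token *tknp,
                             gpa_t req_gpa, gpa_t map_gpa)
  {
          return kvm_hypercall3(KVM_HC_MEM_MAP, (unsigned long)tknp,
                                req_gpa, map_gpa);
  }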

Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
---
 arch/x86/kvm/x86.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 271028ccbeca..9a3c315b13e4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6333,7 +6333,8 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 
 	r = kvm_skip_emulated_instruction(vcpu);
 
-	if (kvm_hv_hypercall_enabled(vcpu->kvm))
+	if (kvm_hv_hypercall_enabled(vcpu->kvm)
+			&& !kvmi_is_agent_hypercall(vcpu))
 		return kvm_hv_hypercall(vcpu);
 
 	nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
@@ -6371,6 +6372,16 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		ret = kvm_pv_clock_pairing(vcpu, a0, a1);
 		break;
 #endif
+	case KVM_HC_MEM_MAP:
+		ret = kvmi_host_mem_map(vcpu, (gva_t)a0, (gpa_t)a1, (gpa_t)a2);
+		break;
+	case KVM_HC_MEM_UNMAP:
+		ret = kvmi_host_mem_unmap(vcpu, (gpa_t)a0);
+		break;
+	case KVM_HC_XEN_HVM_OP:
+		kvmi_hypercall_event(vcpu);
+		ret = 0;
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [RFC PATCH v4 18/18] kvm: x86: hook in kvmi_trap_event()
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2017-12-18 19:06   ` Adalber Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalber Lazăr @ 2017-12-18 19:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Adalbert Lazar

From: Adalbert Lazar <alazar@bitdefender.com>

Inform the guest introspection tool that the exception (from a previous
KVMI_INJECT_EXCEPTION command) was not successfully injected.

It can happen that the tool queues a page fault, only to have it
overwritten by an interrupt picked up during guest re-entry.
kvmi_trap_event() is used to inform the tool of all pending traps,
giving it a chance to determine whether it should try again later.

Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a3c315b13e4..b3825658528a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7058,6 +7058,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		goto cancel_injection;
 	}
 
+	if (kvmi_lost_exception(vcpu)) {
+		local_irq_enable();
+		preempt_enable();
+		vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+		r = 1;
+		kvmi_trap_event(vcpu);
+		goto cancel_injection;
+	}
+
 	kvm_load_guest_xcr0(vcpu);
 
 	if (req_immediate_exit) {

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 02/18] add memory map/unmap support for VM introspection on the guest side
  2017-12-18 19:06   ` Adalber Lazăr
@ 2017-12-21 21:17     ` Patrick Colp
  -1 siblings, 0 replies; 79+ messages in thread
From: Patrick Colp @ 2017-12-21 21:17 UTC (permalink / raw)
  To: Adalber Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Mircea Cîrjaliu

On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> From: Adalbert Lazar <alazar@bitdefender.com>
> 
> An introspection tool running in a dedicated VM can use the new device
> (/dev/kvmmem) to map memory from other introspected VM-s.
> 
> Two ioctl operations are supported:
>    - KVM_INTRO_MEM_MAP/struct kvmi_mem_map
>    - KVM_INTRO_MEM_UNMAP/unsigned long
> 
> In order to map an introspected gpa to the local gva, the process using
> this device needs to obtain a token from the host KVMI subsystem (see
> Documentation/virtual/kvm/kvmi.rst - KVMI_GET_MAP_TOKEN).
> 
> Both operations use hypercalls (KVM_HC_MEM_MAP, KVM_HC_MEM_UNMAP)
> to pass the requests to the host kernel/KVMi (see hypercalls.txt).
> 
> Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
> ---
>   arch/x86/Kconfig                  |   9 +
>   arch/x86/include/asm/kvmi_guest.h |  10 +
>   arch/x86/kernel/Makefile          |   1 +
>   arch/x86/kernel/kvmi_mem_guest.c  |  26 +++
>   virt/kvm/kvmi_mem_guest.c         | 379 ++++++++++++++++++++++++++++++++++++++
>   5 files changed, 425 insertions(+)
>   create mode 100644 arch/x86/include/asm/kvmi_guest.h
>   create mode 100644 arch/x86/kernel/kvmi_mem_guest.c
>   create mode 100644 virt/kvm/kvmi_mem_guest.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 8eed3f94bfc7..6e2548f4d44c 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -782,6 +782,15 @@ config KVM_DEBUG_FS
>   	  Statistics are displayed in debugfs filesystem. Enabling this option
>   	  may incur significant overhead.
>   
> +config KVMI_MEM_GUEST
> +	bool "KVM Memory Introspection support on Guest"
> +	depends on KVM_GUEST
> +	default n
> +	---help---
> +	  This option enables functions and hypercalls for security applications
> +	  running in a separate VM to control the execution of other VM-s, query
> +	  the state of the vCPU-s (GPR-s, MSR-s etc.).
> +
>   config PARAVIRT_TIME_ACCOUNTING
>   	bool "Paravirtual steal time accounting"
>   	depends on PARAVIRT
> diff --git a/arch/x86/include/asm/kvmi_guest.h b/arch/x86/include/asm/kvmi_guest.h
> new file mode 100644
> index 000000000000..c7ed53a938e0
> --- /dev/null
> +++ b/arch/x86/include/asm/kvmi_guest.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __KVMI_GUEST_H__
> +#define __KVMI_GUEST_H__
> +
> +long kvmi_arch_map_hc(struct kvmi_map_mem_token *tknp,
> +	gpa_t req_gpa, gpa_t map_gpa);
> +long kvmi_arch_unmap_hc(gpa_t map_gpa);
> +
> +
> +#endif
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 81bb565f4497..fdb54b65e46e 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -111,6 +111,7 @@ obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
>   obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
>   obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
>   obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
> +obj-$(CONFIG_KVMI_MEM_GUEST)	+= kvmi_mem_guest.o ../../../virt/kvm/kvmi_mem_guest.o
>   
>   obj-$(CONFIG_EISA)		+= eisa.o
>   obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
> diff --git a/arch/x86/kernel/kvmi_mem_guest.c b/arch/x86/kernel/kvmi_mem_guest.c
> new file mode 100644
> index 000000000000..c4e2613f90f3
> --- /dev/null
> +++ b/arch/x86/kernel/kvmi_mem_guest.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KVM introspection guest implementation
> + *
> + * Copyright (C) 2017 Bitdefender S.R.L.
> + *
> + * Author:
> + *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> + */
> +
> +#include <uapi/linux/kvmi.h>
> +#include <uapi/linux/kvm_para.h>
> +#include <linux/kvm_types.h>
> +#include <asm/kvm_para.h>
> +
> +long kvmi_arch_map_hc(struct kvmi_map_mem_token *tknp,
> +		       gpa_t req_gpa, gpa_t map_gpa)
> +{
> +	return kvm_hypercall3(KVM_HC_MEM_MAP, (unsigned long)tknp,
> +			      req_gpa, map_gpa);
> +}
> +
> +long kvmi_arch_unmap_hc(gpa_t map_gpa)
> +{
> +	return kvm_hypercall1(KVM_HC_MEM_UNMAP, map_gpa);
> +}
> diff --git a/virt/kvm/kvmi_mem_guest.c b/virt/kvm/kvmi_mem_guest.c
> new file mode 100644
> index 000000000000..118c22ca47c5
> --- /dev/null
> +++ b/virt/kvm/kvmi_mem_guest.c
> @@ -0,0 +1,379 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KVM introspection guest implementation
> + *
> + * Copyright (C) 2017 Bitdefender S.R.L.
> + *
> + * Author:
> + *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/miscdevice.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include <linux/types.h>
> +#include <linux/kvm_host.h>
> +#include <linux/kvm_para.h>
> +#include <linux/uaccess.h>
> +#include <linux/slab.h>
> +#include <linux/rmap.h>
> +#include <linux/sched.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <uapi/linux/kvmi.h>
> +#include <asm/kvmi_guest.h>
> +
> +#define ASSERT(exp) BUG_ON(!(exp))
> +
> +
> +static struct list_head file_list;
> +static spinlock_t file_lock;
> +
> +struct file_map {
> +	struct list_head file_list;
> +	struct file *file;
> +	struct list_head map_list;
> +	struct mutex lock;
> +	int active;	/* for tearing down */
> +};
> +
> +struct page_map {
> +	struct list_head map_list;
> +	__u64 gpa;
> +	unsigned long vaddr;
> +	unsigned long paddr;
> +};
> +
> +
> +static int kvm_dev_open(struct inode *inodep, struct file *filp)
> +{
> +	struct file_map *fmp;
> +
> +	pr_debug("kvmi: file %016lx opened by process %s\n",
> +		 (unsigned long) filp, current->comm);
> +
> +	/* link the file 1:1 with such a structure */
> +	fmp = kmalloc(sizeof(struct file_map), GFP_KERNEL);

I think this is supposed to be "kmalloc(sizeof(*fmp), GFP_KERNEL)".

> +	if (fmp == NULL)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&fmp->file_list);
> +	fmp->file = filp;
> +	filp->private_data = fmp;
> +	INIT_LIST_HEAD(&fmp->map_list);
> +	mutex_init(&fmp->lock);
> +	fmp->active = 1;
> +
> +	/* add the entry to the global list */
> +	spin_lock(&file_lock);
> +	list_add_tail(&fmp->file_list, &file_list);
> +	spin_unlock(&file_lock);
> +
> +	return 0;
> +}
> +
> +/* actually does the mapping of a page */
> +static long _do_mapping(struct kvmi_mem_map *map_req, struct page_map *pmap)

Here you have a "struct page_map" and call it pmap. However, in the rest 
of the code, whenever there's a "struct page_map" it's called pmp. It 
seems that it would be good to stay consistent with the naming, so 
perhaps rename it here to pmp as well?


> +{
> +	unsigned long paddr;
> +	struct vm_area_struct *vma = NULL;
> +	struct page *page;

Out of curiosity, why do you set "*vma = NULL" but not "*page = NULL"?

> +	long result;
> +
> +	pr_debug("kvmi: mapping remote GPA %016llx into %016llx\n",
> +		 map_req->gpa, map_req->gva);
> +
> +	/* check access to memory location */
> +	if (!access_ok(VERIFY_READ, map_req->gva, PAGE_SIZE)) {
> +		pr_err("kvmi: invalid virtual address for mapping\n");
> +		return -EINVAL;
> +	}
> +
> +	down_read(&current->mm->mmap_sem);
> +
> +	/* find the page to be replaced */
> +	vma = find_vma(current->mm, map_req->gva);
> +	if (IS_ERR_OR_NULL(vma)) {
> +		result = PTR_ERR(vma);
> +		pr_err("kvmi: find_vma() failed with result %ld\n", result);
> +		goto out;
> +	}
> +
> +	page = follow_page(vma, map_req->gva, 0);
> +	if (IS_ERR_OR_NULL(page)) {
> +		result = PTR_ERR(page);
> +		pr_err("kvmi: follow_page() failed with result %ld\n", result);
> +		goto out;
> +	}
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(page, "page to map_req into");
> +
> +	WARN(is_zero_pfn(page_to_pfn(page)), "zero-page still mapped");
> +
> +	/* get the physical address and store it in page_map */
> +	paddr = page_to_phys(page);
> +	pr_debug("kvmi: page phys addr %016lx\n", paddr);
> +	pmap->paddr = paddr;
> +
> +	/* last thing to do is host mapping */
> +	result = kvmi_arch_map_hc(&map_req->token, map_req->gpa, paddr);
> +	if (IS_ERR_VALUE(result)) {
> +		pr_err("kvmi: HC failed with result %ld\n", result);
> +		goto out;
> +	}
> +
> +out:
> +	up_read(&current->mm->mmap_sem);
> +
> +	return result;
> +}
> +
> +/* actually does the unmapping of a page */
> +static long _do_unmapping(unsigned long paddr)
> +{
> +	long result;
> +
> +	pr_debug("kvmi: unmapping request for phys addr %016lx\n", paddr);
> +
> +	/* local GPA uniquely identifies the mapping on the host */
> +	result = kvmi_arch_unmap_hc(paddr);
> +	if (IS_ERR_VALUE(result))
> +		pr_warn("kvmi: HC failed with result %ld\n", result);
> +
> +	return result;
> +}
> +
> +static long kvm_dev_ioctl_map(struct file_map *fmp, struct kvmi_mem_map *map)
> +{
> +	struct page_map *pmp;
> +	long result = 0;

Out of curiosity again, why do you set "result = 0" here when it's
always set before it's used (and, e.g., _do_unmapping() doesn't do
"result = 0")?

> +
> +	if (!access_ok(VERIFY_READ, map->gva, PAGE_SIZE))
> +		return -EINVAL;
> +	if (!access_ok(VERIFY_WRITE, map->gva, PAGE_SIZE))
> +		return -EINVAL;
> +
> +	/* prepare list entry */
> +	pmp = kmalloc(sizeof(struct page_map), GFP_KERNEL);

This should also probably be "kmalloc(sizeof(*pmp), GFP_KERNEL)".

> +	if (pmp == NULL)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&pmp->map_list);
> +	pmp->gpa = map->gpa;
> +	pmp->vaddr = map->gva;
> +
> +	/* acquire the file mapping */
> +	mutex_lock(&fmp->lock);
> +
> +	/* check if other thread is closing the file */
> +	if (!fmp->active) {
> +		result = -ENODEV;
> +		pr_warn("kvmi: unable to map, file is being closed\n");
> +		goto out_err;
> +	}
> +
> +	/* do the actual mapping */
> +	result = _do_mapping(map, pmp);
> +	if (IS_ERR_VALUE(result))
> +		goto out_err;
> +
> +	/* link to list */
> +	list_add_tail(&pmp->map_list, &fmp->map_list);
> +
> +	/* all fine */
> +	result = 0;
> +	goto out_finalize;
> +
> +out_err:
> +	kfree(pmp);
> +
> +out_finalize:
> +	mutex_unlock(&fmp->lock);
> +
> +	return result;
> +}
> +
> +static long kvm_dev_ioctl_unmap(struct file_map *fmp, unsigned long vaddr)
> +{
> +	struct list_head *cur;
> +	struct page_map *pmp;
> +	bool found = false;
> +
> +	/* acquire the file */
> +	mutex_lock(&fmp->lock);
> +
> +	/* check if other thread is closing the file */
> +	if (!fmp->active) {
> +		mutex_unlock(&fmp->lock);

Wouldn't this be better replaced with a "goto out_err" like in 
kvm_dev_ioctl_map()?

> +		pr_warn("kvmi: unable to unmap, file is being closed\n");
> +		return -ENODEV;
> +	}
> +
> +	/* check that this address belongs to us */
> +	list_for_each(cur, &fmp->map_list) {
> +		pmp = list_entry(cur, struct page_map, map_list);
> +
> +		/* found */
> +		if (pmp->vaddr == vaddr) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	/* not found ? */
> +	if (!found) {
> +		mutex_unlock(&fmp->lock);

Here too: "goto out_err".

> +		pr_err("kvmi: address %016lx not mapped\n", vaddr);
> +		return -ENOENT;
> +	}
> +
> +	/* decouple guest mapping */
> +	list_del(&pmp->map_list);
> +	mutex_unlock(&fmp->lock);

In kvm_dev_ioctl_map(), the fmp mutex is held across the _do_mapping() 
call. Is there any particular reason why here the mutex doesn't need to 
be held across the _do_unmapping() call? Or was that more an artifact of 
having a common "out_err" exit in kvm_dev_ioctl_map()?
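
If the goal is a single exit path, a rough sketch of what I had in mind
(untested, just to illustrate the idea; it also keeps the mutex held
across the _do_unmapping() call) would be:

	static long kvm_dev_ioctl_unmap(struct file_map *fmp, unsigned long vaddr)
	{
		struct page_map *pmp;
		bool found = false;
		long result = 0;

		/* acquire the file */
		mutex_lock(&fmp->lock);

		/* check if another thread is closing the file */
		if (!fmp->active) {
			pr_warn("kvmi: unable to unmap, file is being closed\n");
			result = -ENODEV;
			goto out_unlock;
		}

		/* check that this address belongs to us */
		list_for_each_entry(pmp, &fmp->map_list, map_list) {
			if (pmp->vaddr == vaddr) {
				found = true;
				break;
			}
		}

		if (!found) {
			pr_err("kvmi: address %016lx not mapped\n", vaddr);
			result = -ENOENT;
			goto out_unlock;
		}

		/* decouple guest mapping, unmap (ignoring the result) and free it */
		list_del(&pmp->map_list);
		_do_unmapping(pmp->paddr);
		kfree(pmp);

	out_unlock:
		mutex_unlock(&fmp->lock);

		return result;
	}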

> +
> +	/* unmap & ignore result */
> +	_do_unmapping(pmp->paddr);
> +
> +	/* free guest mapping */
> +	kfree(pmp);
> +
> +	return 0;
> +}
> +
> +static long kvm_dev_ioctl(struct file *filp,
> +			  unsigned int ioctl, unsigned long arg)
> +{
> +	void __user *argp = (void __user *) arg;
> +	struct file_map *fmp;
> +	long result;
> +
> +	/* minor check */
> +	fmp = filp->private_data;
> +	ASSERT(fmp->file == filp);
> +
> +	switch (ioctl) {
> +	case KVM_INTRO_MEM_MAP: {
> +		struct kvmi_mem_map map;
> +
> +		result = -EFAULT;
> +		if (copy_from_user(&map, argp, sizeof(map)))
> +			break;
> +
> +		result = kvm_dev_ioctl_map(fmp, &map);
> +		if (IS_ERR_VALUE(result))
> +			break;
> +
> +		result = 0;
> +		break;
> +	}

Since kvm_dev_ioctl_map() either returns an error or 0, couldn't this 
just be reduced to:
		result = kvm_dev_ioctl_map(fmp, &map);
		break;
	}

> +	case KVM_INTRO_MEM_UNMAP: {
> +		unsigned long vaddr = (unsigned long) arg;
> +
> +		result = kvm_dev_ioctl_unmap(fmp, vaddr);
> +		if (IS_ERR_VALUE(result))
> +			break;
> +
> +		result = 0;
> +		break;
> +	}

Ditto here.
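
That is, this case could presumably just become:

		result = kvm_dev_ioctl_unmap(fmp, vaddr);
		break;
	}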

> +	default:
> +		pr_err("kvmi: ioctl %d not implemented\n", ioctl);
> +		result = -ENOTTY;
> +	}
> +
> +	return result;
> +}
> +
> +static int kvm_dev_release(struct inode *inodep, struct file *filp)
> +{
> +	int result = 0;

You set "result = 0" here, but result isn't used until the end, and then
only to return it.

> +	struct file_map *fmp;
> +	struct list_head *cur, *next;
> +	struct page_map *pmp;
> +
> +	pr_debug("kvmi: file %016lx closed by process %s\n",
> +		 (unsigned long) filp, current->comm);
> +
> +	/* acquire the file */
> +	fmp = filp->private_data;
> +	mutex_lock(&fmp->lock);
> +
> +	/* mark for teardown */
> +	fmp->active = 0;
> +
> +	/* release mappings taken on this instance of the file */
> +	list_for_each_safe(cur, next, &fmp->map_list) {
> +		pmp = list_entry(cur, struct page_map, map_list);
> +
> +		/* unmap address */
> +		_do_unmapping(pmp->paddr);
> +
> +		/* decouple & free guest mapping */
> +		list_del(&pmp->map_list);
> +		kfree(pmp);
> +	}
> +
> +	/* done processing this file mapping */
> +	mutex_unlock(&fmp->lock);
> +
> +	/* decouple file mapping */
> +	spin_lock(&file_lock);
> +	list_del(&fmp->file_list);
> +	spin_unlock(&file_lock);
> +
> +	/* free it */
> +	kfree(fmp);
> +
> +	return result;

This is the first time result is used. Couldn't this just be replaced 
with "return 0"?

> +}
> +
> +
> +static const struct file_operations kvmmem_ops = {
> +	.open		= kvm_dev_open,
> +	.unlocked_ioctl = kvm_dev_ioctl,
> +	.compat_ioctl   = kvm_dev_ioctl,
> +	.release	= kvm_dev_release,
> +};

Here you have all the rvals aligned...

> +
> +static struct miscdevice kvm_mem_dev = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "kvmmem",
> +	.fops = &kvmmem_ops,
> +};

...but here you don't. I'm not sure what the "proper" style is, but I 
think it should at least just be consistent.
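
For example, if the aligned form is the preferred one here, something like:

	static struct miscdevice kvm_mem_dev = {
		.minor		= MISC_DYNAMIC_MINOR,
		.name		= "kvmmem",
		.fops		= &kvmmem_ops,
	};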

> +
> +int __init kvm_intro_guest_init(void)
> +{
> +	int result;
> +
> +	if (!kvm_para_available()) {
> +		pr_err("kvmi: paravirt not available\n");
> +		return -EPERM;
> +	}
> +
> +	result = misc_register(&kvm_mem_dev);
> +	if (result) {
> +		pr_err("kvmi: misc device register failed: %d\n", result);
> +		return result;
> +	}
> +
> +	INIT_LIST_HEAD(&file_list);
> +	spin_lock_init(&file_lock);
> +
> +	pr_info("kvmi: guest introspection device created\n");
> +
> +	return 0;
> +}
> +
> +void kvm_intro_guest_exit(void)
> +{
> +	misc_deregister(&kvm_mem_dev);
> +}
> +
> +module_init(kvm_intro_guest_init)
> +module_exit(kvm_intro_guest_exit)
> 


Patrick

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 04/18] kvm: x86: add kvm_mmu_nested_guest_page_fault() and kvmi_mmu_fault_gla()
  2017-12-18 19:06   ` Adalber Lazăr
@ 2017-12-21 21:29     ` Patrick Colp
  -1 siblings, 0 replies; 79+ messages in thread
From: Patrick Colp @ 2017-12-21 21:29 UTC (permalink / raw)
  To: Adalber Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu

On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> From: Adalbert Lazar <alazar@bitdefender.com>
> 
> These are helper functions used by the VM introspection subsystem on the
> PF call path.
> 
> Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  7 +++++++
>   arch/x86/include/asm/vmx.h      |  2 ++
>   arch/x86/kvm/mmu.c              | 10 ++++++++++
>   arch/x86/kvm/svm.c              |  8 ++++++++
>   arch/x86/kvm/vmx.c              |  9 +++++++++
>   5 files changed, 36 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8842d8e1e4ee..239eb628f8fb 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -692,6 +692,9 @@ struct kvm_vcpu_arch {
>   	/* set at EPT violation at this point */
>   	unsigned long exit_qualification;
>   
> +	/* #PF translated error code from EPT/NPT exit reason */
> +	u64 error_code;
> +
>   	/* pv related host specific info */
>   	struct {
>   		bool pv_unhalted;
> @@ -1081,6 +1084,7 @@ struct kvm_x86_ops {
>   	int (*enable_smi_window)(struct kvm_vcpu *vcpu);
>   
>   	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr, bool enable);
> +	u64 (*fault_gla)(struct kvm_vcpu *vcpu);
>   };
>   
>   struct kvm_arch_async_pf {
> @@ -1455,4 +1459,7 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   
>   void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
>   				bool enable);
> +u64 kvm_mmu_fault_gla(struct kvm_vcpu *vcpu);
> +bool kvm_mmu_nested_guest_page_fault(struct kvm_vcpu *vcpu);
> +
>   #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 8b6780751132..7036125349dd 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -530,6 +530,7 @@ struct vmx_msr_entry {
>   #define EPT_VIOLATION_READABLE_BIT	3
>   #define EPT_VIOLATION_WRITABLE_BIT	4
>   #define EPT_VIOLATION_EXECUTABLE_BIT	5
> +#define EPT_VIOLATION_GLA_VALID_BIT	7
>   #define EPT_VIOLATION_GVA_TRANSLATED_BIT 8
>   #define EPT_VIOLATION_ACC_READ		(1 << EPT_VIOLATION_ACC_READ_BIT)
>   #define EPT_VIOLATION_ACC_WRITE		(1 << EPT_VIOLATION_ACC_WRITE_BIT)
> @@ -537,6 +538,7 @@ struct vmx_msr_entry {
>   #define EPT_VIOLATION_READABLE		(1 << EPT_VIOLATION_READABLE_BIT)
>   #define EPT_VIOLATION_WRITABLE		(1 << EPT_VIOLATION_WRITABLE_BIT)
>   #define EPT_VIOLATION_EXECUTABLE	(1 << EPT_VIOLATION_EXECUTABLE_BIT)
> +#define EPT_VIOLATION_GLA_VALID		(1 << EPT_VIOLATION_GLA_VALID_BIT)
>   #define EPT_VIOLATION_GVA_TRANSLATED	(1 << EPT_VIOLATION_GVA_TRANSLATED_BIT)
>   
>   /*
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index c4deb1f34faa..55fcb0292724 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5530,3 +5530,13 @@ void kvm_mmu_module_exit(void)
>   	unregister_shrinker(&mmu_shrinker);
>   	mmu_audit_disable();
>   }
> +
> +u64 kvm_mmu_fault_gla(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_x86_ops->fault_gla(vcpu);
> +}
> +
> +bool kvm_mmu_nested_guest_page_fault(struct kvm_vcpu *vcpu)
> +{
> +	return !!(vcpu->arch.error_code & PFERR_GUEST_PAGE_MASK);
> +}
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 5f7482851223..f41e4d7008d7 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2145,6 +2145,8 @@ static int pf_interception(struct vcpu_svm *svm)
>   	u64 fault_address = svm->vmcb->control.exit_info_2;
>   	u64 error_code = svm->vmcb->control.exit_info_1;
>   
> +	svm->vcpu.arch.error_code = error_code;
> +
>   	return kvm_handle_page_fault(&svm->vcpu, error_code, fault_address,
>   			svm->vmcb->control.insn_bytes,
>   			svm->vmcb->control.insn_len);
> @@ -5514,6 +5516,11 @@ static void svm_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
>   	set_msr_interception(msrpm, msr, enable, enable);
>   }
>   
> +static u64 svm_fault_gla(struct kvm_vcpu *vcpu)
> +{
> +	return ~0ull;
> +}
> +
>   static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
>   	.cpu_has_kvm_support = has_svm,
>   	.disabled_by_bios = is_disabled,
> @@ -5631,6 +5638,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
>   	.enable_smi_window = enable_smi_window,
>   
>   	.msr_intercept = svm_msr_intercept,
> +	.fault_gla = svm_fault_gla

Minor nit, it seems like this line should probably end with a "," so 
that future additions don't need to modify this line.

>   };
>   
>   static int __init svm_init(void)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 9c984bbe263e..5487e0242030 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6541,6 +6541,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>   	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
>   
>   	vcpu->arch.exit_qualification = exit_qualification;
> +	vcpu->arch.error_code = error_code;
>   	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>   }
>   
> @@ -12120,6 +12121,13 @@ static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
>   	}
>   }
>   
> +static u64 vmx_fault_gla(struct kvm_vcpu *vcpu)
> +{
> +	if (vcpu->arch.exit_qualification & EPT_VIOLATION_GLA_VALID)
> +		return vmcs_readl(GUEST_LINEAR_ADDRESS);
> +	return ~0ul;

Should this not be "return ~0ull" (like in svm_fault_gla())?

> +}
> +
>   static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>   	.cpu_has_kvm_support = cpu_has_kvm_support,
>   	.disabled_by_bios = vmx_disabled_by_bios,
> @@ -12252,6 +12260,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>   	.enable_smi_window = enable_smi_window,
>   
>   	.msr_intercept = vmx_msr_intercept,
> +	.fault_gla = vmx_fault_gla

Same deal here with the trailing ","

>   };
>   
>   static int __init vmx_init(void)
> 


Patrick

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 05/18] kvm: x86: add kvm_arch_vcpu_set_regs()
  2017-12-18 19:06   ` Adalber Lazăr
@ 2017-12-21 21:39     ` Patrick Colp
  -1 siblings, 0 replies; 79+ messages in thread
From: Patrick Colp @ 2017-12-21 21:39 UTC (permalink / raw)
  To: Adalber Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu

On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> From: Adalbert Lazar <alazar@bitdefender.com>
> 
> This is a version of kvm_arch_vcpu_ioctl_set_regs() which does not touch
> the exceptions vector.
> 
> Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> ---
>   arch/x86/kvm/x86.c       | 34 ++++++++++++++++++++++++++++++++++
>   include/linux/kvm_host.h |  1 +
>   2 files changed, 35 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e1a3c2c6ec08..4b0c3692386d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7389,6 +7389,40 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
>   	return 0;
>   }
>   
> +/*
> + * Similar to kvm_arch_vcpu_ioctl_set_regs() but it does not reset
> + * the exceptions
> + */
> +void kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
> +{
> +	vcpu->arch.emulate_regs_need_sync_from_vcpu = true;
> +	vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
> +
> +	kvm_register_write(vcpu, VCPU_REGS_RAX, regs->rax);
> +	kvm_register_write(vcpu, VCPU_REGS_RBX, regs->rbx);
> +	kvm_register_write(vcpu, VCPU_REGS_RCX, regs->rcx);
> +	kvm_register_write(vcpu, VCPU_REGS_RDX, regs->rdx);
> +	kvm_register_write(vcpu, VCPU_REGS_RSI, regs->rsi);
> +	kvm_register_write(vcpu, VCPU_REGS_RDI, regs->rdi);
> +	kvm_register_write(vcpu, VCPU_REGS_RSP, regs->rsp);
> +	kvm_register_write(vcpu, VCPU_REGS_RBP, regs->rbp);
> +#ifdef CONFIG_X86_64
> +	kvm_register_write(vcpu, VCPU_REGS_R8, regs->r8);
> +	kvm_register_write(vcpu, VCPU_REGS_R9, regs->r9);
> +	kvm_register_write(vcpu, VCPU_REGS_R10, regs->r10);
> +	kvm_register_write(vcpu, VCPU_REGS_R11, regs->r11);
> +	kvm_register_write(vcpu, VCPU_REGS_R12, regs->r12);
> +	kvm_register_write(vcpu, VCPU_REGS_R13, regs->r13);
> +	kvm_register_write(vcpu, VCPU_REGS_R14, regs->r14);
> +	kvm_register_write(vcpu, VCPU_REGS_R15, regs->r15);
> +#endif
> +
> +	kvm_rip_write(vcpu, regs->rip);
> +	kvm_set_rflags(vcpu, regs->rflags);
> +
> +	kvm_make_request(KVM_REQ_EVENT, vcpu);
> +}
> +

kvm_arch_vcpu_ioctl_set_regs() returns an int (so that, e.g., on ARM
it can return an error to indicate that the function is not
supported/implemented). Is there a reason this function shouldn't do the
same (is it only ever going to be implemented for x86)?
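
A minimal sketch of what that could look like (untested, x86 side only,
with the register writes unchanged):

	int kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
	{
		/* ... same register writes, kvm_rip_write() and kvm_set_rflags() as above ... */

		kvm_make_request(KVM_REQ_EVENT, vcpu);

		return 0;
	}

with other architectures then free to return an error (e.g. -EINVAL)
from their implementation.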

>   void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
>   {
>   	struct kvm_segment cs;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 6bdd4b9f6611..68e4d756f5c9 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -767,6 +767,7 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
>   
>   int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
>   int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
> +void kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
>   int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
>   				  struct kvm_sregs *sregs);
>   int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
> 


Patrick

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 07/18] kvm: page track: add support for preread, prewrite and preexec
  2017-12-18 19:06   ` Adalber Lazăr
@ 2017-12-21 22:01     ` Patrick Colp
  -1 siblings, 0 replies; 79+ messages in thread
From: Patrick Colp @ 2017-12-21 22:01 UTC (permalink / raw)
  To: Adalber Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu

On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> From: Adalbert Lazar <alazar@bitdefender.com>
> 
> These callbacks return a boolean value. If false, the emulation should
> stop and the instruction should be reexecuted in guest. The preread
> callback can return the bytes needed by the read operation.
> 
> The kvm_page_track_create_memslot() was extended in order to track gfn-s
> as soon as the memory slots are created.
> 
> Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> ---
>   arch/x86/include/asm/kvm_page_track.h |  24 +++++-
>   arch/x86/kvm/mmu.c                    | 143 ++++++++++++++++++++++++++++++----
>   arch/x86/kvm/mmu.h                    |   4 +
>   arch/x86/kvm/page_track.c             | 129 ++++++++++++++++++++++++++++--
>   arch/x86/kvm/x86.c                    |   2 +-
>   5 files changed, 281 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
> index 172f9749dbb2..77adc7f43754 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -3,8 +3,11 @@
>   #define _ASM_X86_KVM_PAGE_TRACK_H
>   
>   enum kvm_page_track_mode {
> +	KVM_PAGE_TRACK_PREREAD,
> +	KVM_PAGE_TRACK_PREWRITE,
>   	KVM_PAGE_TRACK_WRITE,
> -	KVM_PAGE_TRACK_MAX,
> +	KVM_PAGE_TRACK_PREEXEC,
> +	KVM_PAGE_TRACK_MAX
>   };

The comma at the end of KVM_PAGE_TRACK_MAX should probably just be left 
in. This will tighten up this diff and prevent another commit 
potentially needing to add it back in later.


>   
>   /*
> @@ -22,6 +25,13 @@ struct kvm_page_track_notifier_head {
>   struct kvm_page_track_notifier_node {
>   	struct hlist_node node;
>   
> +	bool (*track_preread)(struct kvm_vcpu *vcpu, gpa_t gpa, u8 *new,
> +			      int bytes,
> +			      struct kvm_page_track_notifier_node *node,
> +			      bool *data_ready);
> +	bool (*track_prewrite)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> +			       int bytes,
> +			       struct kvm_page_track_notifier_node *node);
>   	/*
>   	 * It is called when guest is writing the write-tracked page
>   	 * and write emulation is finished at that time.
> @@ -34,6 +44,11 @@ struct kvm_page_track_notifier_node {
>   	 */
>   	void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
>   			    int bytes, struct kvm_page_track_notifier_node *node);
> +	bool (*track_preexec)(struct kvm_vcpu *vcpu, gpa_t gpa,
> +			      struct kvm_page_track_notifier_node *node);
> +	void (*track_create_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
> +				  unsigned long npages,
> +				  struct kvm_page_track_notifier_node *node);
>   	/*
>   	 * It is called when memory slot is being moved or removed
>   	 * users can drop write-protection for the pages in that memory slot
> @@ -51,7 +66,7 @@ void kvm_page_track_cleanup(struct kvm *kvm);
>   
>   void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
>   				 struct kvm_memory_slot *dont);
> -int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
> +int kvm_page_track_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
>   				  unsigned long npages);
>   
>   void kvm_slot_page_track_add_page(struct kvm *kvm,
> @@ -69,7 +84,12 @@ kvm_page_track_register_notifier(struct kvm *kvm,
>   void
>   kvm_page_track_unregister_notifier(struct kvm *kvm,
>   				   struct kvm_page_track_notifier_node *n);
> +bool kvm_page_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, u8 *new,
> +			    int bytes, bool *data_ready);
> +bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> +			     int bytes);
>   void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
>   			  int bytes);
> +bool kvm_page_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa);
>   void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
>   #endif
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 55fcb0292724..19dc17b00db2 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1014,9 +1014,13 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>   	slot = __gfn_to_memslot(slots, gfn);
>   
>   	/* the non-leaf shadow pages are keeping readonly. */
> -	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
> -		return kvm_slot_page_track_add_page(kvm, slot, gfn,
> -						    KVM_PAGE_TRACK_WRITE);
> +	if (sp->role.level > PT_PAGE_TABLE_LEVEL) {
> +		kvm_slot_page_track_add_page(kvm, slot, gfn,
> +					     KVM_PAGE_TRACK_PREWRITE);
> +		kvm_slot_page_track_add_page(kvm, slot, gfn,
> +					     KVM_PAGE_TRACK_WRITE);
> +		return;
> +	}
>   
>   	kvm_mmu_gfn_disallow_lpage(slot, gfn);
>   }
> @@ -1031,9 +1035,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>   	gfn = sp->gfn;
>   	slots = kvm_memslots_for_spte_role(kvm, sp->role);
>   	slot = __gfn_to_memslot(slots, gfn);
> -	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
> -		return kvm_slot_page_track_remove_page(kvm, slot, gfn,
> -						       KVM_PAGE_TRACK_WRITE);
> +	if (sp->role.level > PT_PAGE_TABLE_LEVEL) {
> +		kvm_slot_page_track_remove_page(kvm, slot, gfn,
> +						KVM_PAGE_TRACK_PREWRITE);
> +		kvm_slot_page_track_remove_page(kvm, slot, gfn,
> +						KVM_PAGE_TRACK_WRITE);
> +		return;
> +	}
>   
>   	kvm_mmu_gfn_allow_lpage(slot, gfn);
>   }
> @@ -1416,6 +1424,29 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
>   	return mmu_spte_update(sptep, spte);
>   }
>   
> +static bool spte_read_protect(u64 *sptep)
> +{
> +	u64 spte = *sptep;
> +
> +	rmap_printk("rmap_read_protect: spte %p %llx\n", sptep, *sptep);
> +
> +	/* TODO: verify if the CPU supports EPT-execute-only */
> +	spte = spte & ~(PT_WRITABLE_MASK | PT_PRESENT_MASK);
> +
> +	return mmu_spte_update(sptep, spte);
> +}
> +
> +static bool spte_exec_protect(u64 *sptep, bool pt_protect)
> +{
> +	u64 spte = *sptep;
> +
> +	rmap_printk("rmap_exec_protect: spte %p %llx\n", sptep, *sptep);
> +
> +	spte = spte & ~PT_USER_MASK;
> +
> +	return mmu_spte_update(sptep, spte);
> +}
> +
>   static bool __rmap_write_protect(struct kvm *kvm,
>   				 struct kvm_rmap_head *rmap_head,
>   				 bool pt_protect)
> @@ -1430,6 +1461,34 @@ static bool __rmap_write_protect(struct kvm *kvm,
>   	return flush;
>   }
>   
> +static bool __rmap_read_protect(struct kvm *kvm,
> +				struct kvm_rmap_head *rmap_head,
> +				bool pt_protect)
> +{
> +	u64 *sptep;
> +	struct rmap_iterator iter;
> +	bool flush = false;
> +
> +	for_each_rmap_spte(rmap_head, &iter, sptep)
> +		flush |= spte_read_protect(sptep);
> +
> +	return flush;
> +}
> +
> +static bool __rmap_exec_protect(struct kvm *kvm,
> +				struct kvm_rmap_head *rmap_head,
> +				bool pt_protect)
> +{
> +	u64 *sptep;
> +	struct rmap_iterator iter;
> +	bool flush = false;
> +
> +	for_each_rmap_spte(rmap_head, &iter, sptep)
> +		flush |= spte_exec_protect(sptep, pt_protect);
> +
> +	return flush;
> +}
> +
>   static bool spte_clear_dirty(u64 *sptep)
>   {
>   	u64 spte = *sptep;
> @@ -1600,6 +1659,36 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>   	return write_protected;
>   }
>   
> +bool kvm_mmu_slot_gfn_read_protect(struct kvm *kvm,
> +				   struct kvm_memory_slot *slot, u64 gfn)
> +{
> +	struct kvm_rmap_head *rmap_head;
> +	int i;
> +	bool read_protected = false;
> +
> +	for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
> +		rmap_head = __gfn_to_rmap(gfn, i, slot);
> +		read_protected |= __rmap_read_protect(kvm, rmap_head, true);
> +	}
> +
> +	return read_protected;
> +}
> +
> +bool kvm_mmu_slot_gfn_exec_protect(struct kvm *kvm,
> +				   struct kvm_memory_slot *slot, u64 gfn)
> +{
> +	struct kvm_rmap_head *rmap_head;
> +	int i;
> +	bool exec_protected = false;
> +
> +	for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
> +		rmap_head = __gfn_to_rmap(gfn, i, slot);
> +		exec_protected |= __rmap_exec_protect(kvm, rmap_head, true);
> +	}
> +
> +	return exec_protected;
> +}
> +
>   static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
>   {
>   	struct kvm_memory_slot *slot;
> @@ -2688,7 +2777,8 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
>   {
>   	struct kvm_mmu_page *sp;
>   
> -	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
> +	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) ||
> +	    kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
>   		return true;
>   
>   	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
> @@ -2953,6 +3043,21 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
>   	__direct_pte_prefetch(vcpu, sp, sptep);
>   }
>   
> +static unsigned int kvm_mmu_page_track_acc(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	unsigned int acc = ACC_ALL;
> +
> +	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREREAD))
> +		acc &= ~ACC_USER_MASK;
> +	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) ||
> +	    kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
> +		acc &= ~ACC_WRITE_MASK;
> +	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREEXEC))
> +		acc &= ~ACC_EXEC_MASK;
> +
> +	return acc;
> +}
> +
>   static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
>   			int level, gfn_t gfn, kvm_pfn_t pfn, bool prefault)
>   {
> @@ -2966,7 +3071,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
>   
>   	for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
>   		if (iterator.level == level) {
> -			emulate = mmu_set_spte(vcpu, iterator.sptep, ACC_ALL,
> +			unsigned int acc = kvm_mmu_page_track_acc(vcpu, gfn);
> +
> +			emulate = mmu_set_spte(vcpu, iterator.sptep, acc,
>   					       write, level, gfn, pfn, prefault,
>   					       map_writable);
>   			direct_pte_prefetch(vcpu, iterator.sptep);
> @@ -3713,15 +3820,21 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
>   	if (unlikely(error_code & PFERR_RSVD_MASK))
>   		return false;
>   
> -	if (!(error_code & PFERR_PRESENT_MASK) ||
> -	      !(error_code & PFERR_WRITE_MASK))
> +	if (!(error_code & PFERR_PRESENT_MASK))
>   		return false;
>   
>   	/*
> -	 * guest is writing the page which is write tracked which can
> +	 * guest is reading/writing/fetching the page which is
> +	 * read/write/execute tracked which can
>   	 * not be fixed by page fault handler.
>   	 */
> -	if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
> +	if (((error_code & PFERR_USER_MASK)
> +		&& kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREREAD))
> +	    || ((error_code & PFERR_WRITE_MASK)
> +		&& (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE)
> +		 || kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE)))
> +	    || ((error_code & PFERR_FETCH_MASK)
> +		&& kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREEXEC)))
>   		return true;
>   
>   	return false;
> @@ -4942,7 +5055,11 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
>   	 * and resume the guest.
>   	 */
>   	if (vcpu->arch.mmu.direct_map &&
> -	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> +	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
> +	    !kvm_page_track_is_active(vcpu, gpa_to_gfn(cr2), KVM_PAGE_TRACK_PREREAD) &&
> +	    !kvm_page_track_is_active(vcpu, gpa_to_gfn(cr2), KVM_PAGE_TRACK_PREWRITE) &&
> +	    !kvm_page_track_is_active(vcpu, gpa_to_gfn(cr2), KVM_PAGE_TRACK_WRITE) &&
> +	    !kvm_page_track_is_active(vcpu, gpa_to_gfn(cr2), KVM_PAGE_TRACK_PREEXEC)) {
>   		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
>   		return 1;
>   	}
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 5b408c0ad612..57c947752490 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -193,5 +193,9 @@ void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
>   void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
>   bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>   				    struct kvm_memory_slot *slot, u64 gfn);
> +bool kvm_mmu_slot_gfn_read_protect(struct kvm *kvm,
> +				   struct kvm_memory_slot *slot, u64 gfn);
> +bool kvm_mmu_slot_gfn_exec_protect(struct kvm *kvm,
> +				   struct kvm_memory_slot *slot, u64 gfn);
>   int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
>   #endif
> diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
> index 01c1371f39f8..8bf6581d25d5 100644
> --- a/arch/x86/kvm/page_track.c
> +++ b/arch/x86/kvm/page_track.c
> @@ -34,10 +34,13 @@ void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
>   		}
>   }
>   
> -int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
> +int kvm_page_track_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
>   				  unsigned long npages)
>   {
> -	int  i;
> +	struct kvm_page_track_notifier_head *head;
> +	struct kvm_page_track_notifier_node *n;
> +	int idx;
> +	int i;
>   
>   	for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
>   		slot->arch.gfn_track[i] = kvzalloc(npages *
> @@ -46,6 +49,17 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
>   			goto track_free;
>   	}
>   
> +	head = &kvm->arch.track_notifier_head;
> +
> +	if (hlist_empty(&head->track_notifier_list))
> +		return 0;
> +
> +	idx = srcu_read_lock(&head->track_srcu);
> +	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
> +		if (n->track_create_slot)
> +			n->track_create_slot(kvm, slot, npages, n);
> +	srcu_read_unlock(&head->track_srcu, idx);
> +
>   	return 0;
>   
>   track_free:
> @@ -86,7 +100,7 @@ static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
>    * @kvm: the guest instance we are interested in.
>    * @slot: the @gfn belongs to.
>    * @gfn: the guest page.
> - * @mode: tracking mode, currently only write track is supported.
> + * @mode: tracking mode.
>    */
>   void kvm_slot_page_track_add_page(struct kvm *kvm,
>   				  struct kvm_memory_slot *slot, gfn_t gfn,
> @@ -104,9 +118,16 @@ void kvm_slot_page_track_add_page(struct kvm *kvm,
>   	 */
>   	kvm_mmu_gfn_disallow_lpage(slot, gfn);
>   
> -	if (mode == KVM_PAGE_TRACK_WRITE)
> +	if (mode == KVM_PAGE_TRACK_PREWRITE || mode == KVM_PAGE_TRACK_WRITE) {
>   		if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn))
>   			kvm_flush_remote_tlbs(kvm);
> +	} else if (mode == KVM_PAGE_TRACK_PREREAD) {
> +		if (kvm_mmu_slot_gfn_read_protect(kvm, slot, gfn))
> +			kvm_flush_remote_tlbs(kvm);
> +	} else if (mode == KVM_PAGE_TRACK_PREEXEC) {
> +		if (kvm_mmu_slot_gfn_exec_protect(kvm, slot, gfn))
> +			kvm_flush_remote_tlbs(kvm);
> +	}
>   }
>   EXPORT_SYMBOL_GPL(kvm_slot_page_track_add_page);
>   
> @@ -121,7 +142,7 @@ EXPORT_SYMBOL_GPL(kvm_slot_page_track_add_page);
>    * @kvm: the guest instance we are interested in.
>    * @slot: the @gfn belongs to.
>    * @gfn: the guest page.
> - * @mode: tracking mode, currently only write track is supported.
> + * @mode: tracking mode.
>    */
>   void kvm_slot_page_track_remove_page(struct kvm *kvm,
>   				     struct kvm_memory_slot *slot, gfn_t gfn,
> @@ -214,6 +235,75 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
>   }
>   EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier);
>   
> +/*
> + * Notify the node that a read access is about to happen. Returning false
> + * doesn't stop the other nodes from being called, but it will stop
> + * the emulation.
> + *
> + * The node should figure out if the written page is the one that node is

s/that node/that the node/

> + * interested in by itself.
> + *
> + * The nodes will always be in conflict if they track the same page:
> + * - accepting a read won't guarantee that the next node will not override
> + *   the data (filling new/bytes and setting data_ready)
> + * - filling new/bytes with custom data won't guarantee that the next node
> + *   will not override that
> + */
> +bool kvm_page_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, u8 *new,
> +			    int bytes, bool *data_ready)
> +{
> +	struct kvm_page_track_notifier_head *head;
> +	struct kvm_page_track_notifier_node *n;
> +	int idx;
> +	bool ret = true;
> +
> +	*data_ready = false;
> +
> +	head = &vcpu->kvm->arch.track_notifier_head;
> +
> +	if (hlist_empty(&head->track_notifier_list))
> +		return ret;
> +
> +	idx = srcu_read_lock(&head->track_srcu);
> +	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
> +		if (n->track_preread)
> +			if (!n->track_preread(vcpu, gpa, new, bytes, n,
> +					       data_ready))
> +				ret = false;
> +	srcu_read_unlock(&head->track_srcu, idx);
> +	return ret;
> +}
> +
> +/*
> + * Notify the node that an write access is about to happen. Returning false

s/an write/a write/

> + * doesn't stop the other nodes from being called, but it will stop
> + * the emulation.
> + *
> + * The node should figure out if the written page is the one that node is

s/that node/that the node/


> + * interested in by itself.
> + */
> +bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> +			     int bytes)
> +{
> +	struct kvm_page_track_notifier_head *head;
> +	struct kvm_page_track_notifier_node *n;
> +	int idx;
> +	bool ret = true;
> +
> +	head = &vcpu->kvm->arch.track_notifier_head;
> +
> +	if (hlist_empty(&head->track_notifier_list))
> +		return ret;
> +
> +	idx = srcu_read_lock(&head->track_srcu);
> +	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
> +		if (n->track_prewrite)
> +			if (!n->track_prewrite(vcpu, gpa, new, bytes, n))
> +				ret = false;
> +	srcu_read_unlock(&head->track_srcu, idx);
> +	return ret;
> +}
> +
>   /*
>    * Notify the node that write access is intercepted and write emulation is
>    * finished at this time.
> @@ -240,6 +330,35 @@ void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
>   	srcu_read_unlock(&head->track_srcu, idx);
>   }
>   
> +/*
> + * Notify the node that an instruction is about to be executed.
> + * Returning false doesn't stop the other nodes from being called,
> + * but it will stop the emulation with ?!.

With what?

> + *
> + * The node should figure out if the written page is the one that node is

s/that node/that the node/

> + * interested in by itself.
> + */
> +bool kvm_page_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa)
> +{
> +	struct kvm_page_track_notifier_head *head;
> +	struct kvm_page_track_notifier_node *n;
> +	int idx;
> +	bool ret = true;
> +
> +	head = &vcpu->kvm->arch.track_notifier_head;
> +
> +	if (hlist_empty(&head->track_notifier_list))
> +		return ret;
> +
> +	idx = srcu_read_lock(&head->track_srcu);
> +	hlist_for_each_entry_rcu(n, &head->track_notifier_list, node)
> +		if (n->track_preexec)
> +			if (!n->track_preexec(vcpu, gpa, n))
> +				ret = false;
> +	srcu_read_unlock(&head->track_srcu, idx);
> +	return ret;
> +}
> +
>   /*
>    * Notify the node that memory slot is being removed or moved so that it can
>    * drop write-protection for the pages in the memory slot.
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e7db70ac1f82..74839859c0fd 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8421,7 +8421,7 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
>   		}
>   	}
>   
> -	if (kvm_page_track_create_memslot(slot, npages))
> +	if (kvm_page_track_create_memslot(kvm, slot, npages))
>   		goto out_free;
>   
>   	return 0;
> 


Patrick

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
  2017-12-18 19:06   ` Adalber Lazăr
@ 2017-12-22  7:34     ` Patrick Colp
  -1 siblings, 0 replies; 79+ messages in thread
From: Patrick Colp @ 2017-12-22  7:34 UTC (permalink / raw)
  To: Adalber Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Nicușor Cîțu,
	Mircea Cîrjaliu, Marian Rotariu

On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> From: Adalbert Lazar <alazar@bitdefender.com>
> 
> This subsystem is split into three source files:
>   - kvmi_msg.c - ABI and socket related functions
>   - kvmi_mem.c - handle map/unmap requests from the introspector
>   - kvmi.c - all the other
> 
> The new data used by this subsystem is attached to the 'kvm' and
> 'kvm_vcpu' structures as opaque pointers (to 'kvmi' and 'kvmi_vcpu'
> structures).
> 
> Besides the KVMI system, this patch exports the
> kvm_vcpu_ioctl_x86_get_xsave() and the mm_find_pmd() functions,
> adds a new vCPU request (KVM_REQ_INTROSPECTION) and a new VM ioctl
> (KVM_INTROSPECTION) used to pass the connection file handle from QEMU.
> 
> Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
> Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
> Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com>
> ---
>   arch/x86/include/asm/kvm_host.h |    1 +
>   arch/x86/kvm/Makefile           |    1 +
>   arch/x86/kvm/x86.c              |    4 +-
>   include/linux/kvm_host.h        |    4 +
>   include/linux/kvmi.h            |   32 +
>   include/linux/mm.h              |    3 +
>   include/trace/events/kvmi.h     |  174 +++++
>   include/uapi/linux/kvm.h        |    8 +
>   mm/internal.h                   |    5 -
>   virt/kvm/kvmi.c                 | 1410 +++++++++++++++++++++++++++++++++++++++
>   virt/kvm/kvmi_int.h             |  121 ++++
>   virt/kvm/kvmi_mem.c             |  730 ++++++++++++++++++++
>   virt/kvm/kvmi_msg.c             | 1134 +++++++++++++++++++++++++++++++
>   13 files changed, 3620 insertions(+), 7 deletions(-)
>   create mode 100644 include/linux/kvmi.h
>   create mode 100644 include/trace/events/kvmi.h
>   create mode 100644 virt/kvm/kvmi.c
>   create mode 100644 virt/kvm/kvmi_int.h
>   create mode 100644 virt/kvm/kvmi_mem.c
>   create mode 100644 virt/kvm/kvmi_msg.c
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2cf03ed181e6..1e9e49eaee3b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -73,6 +73,7 @@
>   #define KVM_REQ_HV_RESET		KVM_ARCH_REQ(20)
>   #define KVM_REQ_HV_EXIT			KVM_ARCH_REQ(21)
>   #define KVM_REQ_HV_STIMER		KVM_ARCH_REQ(22)
> +#define KVM_REQ_INTROSPECTION           KVM_ARCH_REQ(23)
>   
>   #define CR0_RESERVED_BITS                                               \
>   	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index dc4f2fdf5e57..ab6225563526 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -9,6 +9,7 @@ CFLAGS_vmx.o := -I.
>   KVM := ../../../virt/kvm
>   
>   kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> +				$(KVM)/kvmi.o $(KVM)/kvmi_msg.o $(KVM)/kvmi_mem.o \
>   				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
>   kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 74839859c0fd..cdfc7200a018 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3346,8 +3346,8 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
>   	}
>   }
>   
> -static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> -					 struct kvm_xsave *guest_xsave)
> +void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> +				  struct kvm_xsave *guest_xsave)
>   {
>   	if (boot_cpu_has(X86_FEATURE_XSAVE)) {
>   		memset(guest_xsave, 0, sizeof(struct kvm_xsave));
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 68e4d756f5c9..eae0598e18a5 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -274,6 +274,7 @@ struct kvm_vcpu {
>   	bool preempted;
>   	struct kvm_vcpu_arch arch;
>   	struct dentry *debugfs_dentry;
> +	void *kvmi;
>   };
>   
>   static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> @@ -446,6 +447,7 @@ struct kvm {
>   	struct srcu_struct srcu;
>   	struct srcu_struct irq_srcu;
>   	pid_t userspace_pid;
> +	void *kvmi;
>   };
>   
>   #define kvm_err(fmt, ...) \
> @@ -779,6 +781,8 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
>   int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
>   					struct kvm_guest_debug *dbg);
>   int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run);
> +void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> +				  struct kvm_xsave *guest_xsave);
>   
>   int kvm_arch_init(void *opaque);
>   void kvm_arch_exit(void);
> diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
> new file mode 100644
> index 000000000000..7fac1d23f67c
> --- /dev/null
> +++ b/include/linux/kvmi.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __KVMI_H__
> +#define __KVMI_H__
> +
> +#define kvmi_is_present() 1
> +
> +int kvmi_init(void);
> +void kvmi_uninit(void);
> +void kvmi_destroy_vm(struct kvm *kvm);
> +bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu);
> +void kvmi_vcpu_init(struct kvm_vcpu *vcpu);
> +void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu);
> +bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
> +		   unsigned long old_value, unsigned long *new_value);
> +bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr);
> +void kvmi_xsetbv_event(struct kvm_vcpu *vcpu);
> +bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva);
> +bool kvmi_is_agent_hypercall(struct kvm_vcpu *vcpu);
> +void kvmi_hypercall_event(struct kvm_vcpu *vcpu);
> +bool kvmi_lost_exception(struct kvm_vcpu *vcpu);
> +void kvmi_trap_event(struct kvm_vcpu *vcpu);
> +bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u32 info,
> +			   unsigned long exit_qualification,
> +			   unsigned char descriptor, unsigned char write);
> +void kvmi_flush_mem_access(struct kvm *kvm);
> +void kvmi_handle_request(struct kvm_vcpu *vcpu);
> +int kvmi_host_mem_map(struct kvm_vcpu *vcpu, gva_t tkn_gva,
> +			     gpa_t req_gpa, gpa_t map_gpa);
> +int kvmi_host_mem_unmap(struct kvm_vcpu *vcpu, gpa_t map_gpa);
> +
> +
> +#endif
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ea818ff739cd..b659c7436789 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1115,6 +1115,9 @@ void page_address_init(void);
>   #define page_address_init()  do { } while(0)
>   #endif
>   
> +/* rmap.c */
> +extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> +
>   extern void *page_rmapping(struct page *page);
>   extern struct anon_vma *page_anon_vma(struct page *page);
>   extern struct address_space *page_mapping(struct page *page);
> diff --git a/include/trace/events/kvmi.h b/include/trace/events/kvmi.h
> new file mode 100644
> index 000000000000..dc36fd3b30dc
> --- /dev/null
> +++ b/include/trace/events/kvmi.h
> @@ -0,0 +1,174 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM kvmi
> +
> +#if !defined(_TRACE_KVMI_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_KVMI_H
> +
> +#include <linux/tracepoint.h>
> +
> +#ifndef __TRACE_KVMI_STRUCTURES
> +#define __TRACE_KVMI_STRUCTURES
> +
> +#undef EN
> +#define EN(x) { x, #x }
> +
> +static const struct trace_print_flags kvmi_msg_id_symbol[] = {
> +	EN(KVMI_GET_VERSION),
> +	EN(KVMI_PAUSE_VCPU),
> +	EN(KVMI_GET_GUEST_INFO),
> +	EN(KVMI_GET_REGISTERS),
> +	EN(KVMI_SET_REGISTERS),
> +	EN(KVMI_GET_PAGE_ACCESS),
> +	EN(KVMI_SET_PAGE_ACCESS),
> +	EN(KVMI_INJECT_EXCEPTION),
> +	EN(KVMI_READ_PHYSICAL),
> +	EN(KVMI_WRITE_PHYSICAL),
> +	EN(KVMI_GET_MAP_TOKEN),
> +	EN(KVMI_CONTROL_EVENTS),
> +	EN(KVMI_CONTROL_CR),
> +	EN(KVMI_CONTROL_MSR),
> +	EN(KVMI_EVENT),
> +	EN(KVMI_EVENT_REPLY),
> +	EN(KVMI_GET_CPUID),
> +	EN(KVMI_GET_XSAVE),
> +	{-1, NULL}
> +};
> +
> +static const struct trace_print_flags kvmi_event_id_symbol[] = {
> +	EN(KVMI_EVENT_CR),
> +	EN(KVMI_EVENT_MSR),
> +	EN(KVMI_EVENT_XSETBV),
> +	EN(KVMI_EVENT_BREAKPOINT),
> +	EN(KVMI_EVENT_HYPERCALL),
> +	EN(KVMI_EVENT_PAGE_FAULT),
> +	EN(KVMI_EVENT_TRAP),
> +	EN(KVMI_EVENT_DESCRIPTOR),
> +	EN(KVMI_EVENT_CREATE_VCPU),
> +	EN(KVMI_EVENT_PAUSE_VCPU),
> +	{-1, NULL}
> +};
> +
> +static const struct trace_print_flags kvmi_action_symbol[] = {
> +	{KVMI_EVENT_ACTION_CONTINUE, "continue"},
> +	{KVMI_EVENT_ACTION_RETRY, "retry"},
> +	{KVMI_EVENT_ACTION_CRASH, "crash"},
> +	{-1, NULL}
> +};
> +
> +#endif /* __TRACE_KVMI_STRUCTURES */
> +
> +TRACE_EVENT(
> +	kvmi_msg_dispatch,
> +	TP_PROTO(__u16 id, __u16 size),
> +	TP_ARGS(id, size),
> +	TP_STRUCT__entry(
> +		__field(__u16, id)
> +		__field(__u16, size)
> +	),
> +	TP_fast_assign(
> +		__entry->id = id;
> +		__entry->size = size;
> +	),
> +	TP_printk("%s size %u",
> +		  trace_print_symbols_seq(p, __entry->id, kvmi_msg_id_symbol),
> +		  __entry->size)
> +);
> +
> +TRACE_EVENT(
> +	kvmi_send_event,
> +	TP_PROTO(__u32 id),
> +	TP_ARGS(id),
> +	TP_STRUCT__entry(
> +		__field(__u32, id)
> +	),
> +	TP_fast_assign(
> +		__entry->id = id;
> +	),
> +	TP_printk("%s",
> +		trace_print_symbols_seq(p, __entry->id, kvmi_event_id_symbol))
> +);
> +
> +#define KVMI_ACCESS_PRINTK() ({                                         \
> +	const char *saved_ptr = trace_seq_buffer_ptr(p);		\
> +	static const char * const access_str[] = {			\
> +		"---", "r--", "-w-", "rw-", "--x", "r-x", "-wx", "rwx"  \
> +	};							        \
> +	trace_seq_printf(p, "%s", access_str[__entry->access & 7]);	\
> +	saved_ptr;							\
> +})
> +
> +TRACE_EVENT(
> +	kvmi_set_mem_access,
> +	TP_PROTO(__u64 gfn, __u8 access, int err),
> +	TP_ARGS(gfn, access, err),
> +	TP_STRUCT__entry(
> +		__field(__u64, gfn)
> +		__field(__u8, access)
> +		__field(int, err)
> +	),
> +	TP_fast_assign(
> +		__entry->gfn = gfn;
> +		__entry->access = access;
> +		__entry->err = err;
> +	),
> +	TP_printk("gfn %llx %s %s %d",
> +		  __entry->gfn, KVMI_ACCESS_PRINTK(),
> +		  __entry->err ? "failed" : "succeeded", __entry->err)
> +);
> +
> +TRACE_EVENT(
> +	kvmi_apply_mem_access,
> +	TP_PROTO(__u64 gfn, __u8 access, int err),
> +	TP_ARGS(gfn, access, err),
> +	TP_STRUCT__entry(
> +		__field(__u64, gfn)
> +		__field(__u8, access)
> +		__field(int, err)
> +	),
> +	TP_fast_assign(
> +		__entry->gfn = gfn;
> +		__entry->access = access;
> +		__entry->err = err;
> +	),
> +	TP_printk("gfn %llx %s flush %s %d",
> +		  __entry->gfn, KVMI_ACCESS_PRINTK(),
> +		  __entry->err ? "failed" : "succeeded", __entry->err)
> +);
> +
> +TRACE_EVENT(
> +	kvmi_event_page_fault,
> +	TP_PROTO(__u64 gpa, __u64 gva, __u8 access, __u64 old_rip,
> +		 __u32 action, __u64 new_rip, __u32 ctx_size),
> +	TP_ARGS(gpa, gva, access, old_rip, action, new_rip, ctx_size),
> +	TP_STRUCT__entry(
> +		__field(__u64, gpa)
> +		__field(__u64, gva)
> +		__field(__u8, access)
> +		__field(__u64, old_rip)
> +		__field(__u32, action)
> +		__field(__u64, new_rip)
> +		__field(__u32, ctx_size)
> +	),
> +	TP_fast_assign(
> +		__entry->gpa = gpa;
> +		__entry->gva = gva;
> +		__entry->access = access;
> +		__entry->old_rip = old_rip;
> +		__entry->action = action;
> +		__entry->new_rip = new_rip;
> +		__entry->ctx_size = ctx_size;
> +	),
> +	TP_printk("gpa %llx %s gva %llx rip %llx -> %s rip %llx ctx %u",
> +		  __entry->gpa,
> +		  KVMI_ACCESS_PRINTK(),
> +		  __entry->gva,
> +		  __entry->old_rip,
> +		  trace_print_symbols_seq(p, __entry->action,
> +					  kvmi_action_symbol),
> +		  __entry->new_rip, __entry->ctx_size)
> +);
> +
> +#endif /* _TRACE_KVMI_H */
> +
> +#include <trace/define_trace.h>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 496e59a2738b..6b7c4469b808 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1359,6 +1359,14 @@ struct kvm_s390_ucas_mapping {
>   #define KVM_S390_GET_CMMA_BITS      _IOWR(KVMIO, 0xb8, struct kvm_s390_cmma_log)
>   #define KVM_S390_SET_CMMA_BITS      _IOW(KVMIO, 0xb9, struct kvm_s390_cmma_log)
>   
> +struct kvm_introspection {
> +	int fd;
> +	__u32 padding;
> +	__u32 commands;
> +	__u32 events;
> +};
> +#define KVM_INTROSPECTION      _IOW(KVMIO, 0xff, struct kvm_introspection)
> +
>   #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
>   #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
>   #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
> diff --git a/mm/internal.h b/mm/internal.h
> index e6bd35182dae..9d363c802305 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -92,11 +92,6 @@ extern unsigned long highest_memmap_pfn;
>   extern int isolate_lru_page(struct page *page);
>   extern void putback_lru_page(struct page *page);
>   
> -/*
> - * in mm/rmap.c:
> - */
> -extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> -
>   /*
>    * in mm/page_alloc.c
>    */
> diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
> new file mode 100644
> index 000000000000..c4cdaeddac45
> --- /dev/null
> +++ b/virt/kvm/kvmi.c
> @@ -0,0 +1,1410 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KVM introspection
> + *
> + * Copyright (C) 2017 Bitdefender S.R.L.
> + *
> + */
> +#include <linux/mmu_context.h>
> +#include <linux/random.h>
> +#include <uapi/linux/kvmi.h>
> +#include <uapi/asm/kvmi.h>
> +#include "../../arch/x86/kvm/x86.h"
> +#include "../../arch/x86/kvm/mmu.h"
> +#include <asm/vmx.h>
> +#include "cpuid.h"
> +#include "kvmi_int.h"
> +#include <asm/kvm_page_track.h>
> +
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/kvmi.h>
> +
> +struct kvmi_mem_access {
> +	struct list_head link;
> +	gfn_t gfn;
> +	u8 access;
> +	bool active[KVM_PAGE_TRACK_MAX];
> +	struct kvm_memory_slot *slot;
> +};
> +
> +static void wakeup_events(struct kvm *kvm);
> +static bool kvmi_page_fault_event(struct kvm_vcpu *vcpu, unsigned long gpa,
> +			   unsigned long gva, u8 access);
> +
> +static struct workqueue_struct *wq;
> +
> +static const u8 full_access = KVMI_PAGE_ACCESS_R |
> +			      KVMI_PAGE_ACCESS_W | KVMI_PAGE_ACCESS_X;
> +
> +static const struct {
> +	unsigned int allow_bit;
> +	enum kvm_page_track_mode track_mode;
> +} track_modes[] = {
> +	{ KVMI_PAGE_ACCESS_R, KVM_PAGE_TRACK_PREREAD },
> +	{ KVMI_PAGE_ACCESS_W, KVM_PAGE_TRACK_PREWRITE },
> +	{ KVMI_PAGE_ACCESS_X, KVM_PAGE_TRACK_PREEXEC },
> +};
> +
> +void kvmi_make_request(struct kvmi_vcpu *ivcpu, int req)
> +{
> +	set_bit(req, &ivcpu->requests);
> +	/* Make sure the bit is set when the worker wakes up */
> +	smp_wmb();
> +	up(&ivcpu->sem_requests);
> +}
> +
> +void kvmi_clear_request(struct kvmi_vcpu *ivcpu, int req)
> +{
> +	clear_bit(req, &ivcpu->requests);
> +}
> +
> +int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +
> +	/*
> +	 * This vcpu is already stopped, executing this command
> +	 * as a result of the REQ_CMD bit being set
> +	 * (see kvmi_handle_request).
> +	 */
> +	if (ivcpu->pause)
> +		return -KVM_EBUSY;
> +
> +	ivcpu->pause = true;
> +
> +	return 0;
> +}
> +
> +static void kvmi_apply_mem_access(struct kvm *kvm,
> +				  struct kvm_memory_slot *slot,
> +				  struct kvmi_mem_access *m)
> +{
> +	int idx, k;

This should probably be i instead of k. I'm guessing you chose k to 
avoid confusion of i with idx. However, there's precedent already set 
for using i as a loop counter even in this case (e.g., look at 
kvm_scan_ioapic_routes() in arch/x86/kvm/irq_comm.c and 
init_rmode_identity_map() in arch/x86/kvm/vmx.c)

> +
> +	if (!slot) {
> +		slot = gfn_to_memslot(kvm, m->gfn);
> +		if (!slot)
> +			return;
> +	}
> +
> +	idx = srcu_read_lock(&kvm->srcu);
> +
> +	spin_lock(&kvm->mmu_lock);
> +
> +	for (k = 0; k < ARRAY_SIZE(track_modes); k++) {
> +		unsigned int allow_bit = track_modes[k].allow_bit;
> +		enum kvm_page_track_mode mode = track_modes[k].track_mode;
> +
> +		if (m->access & allow_bit) {
> +			if (m->active[mode] && m->slot == slot) {
> +				kvm_slot_page_track_remove_page(kvm, slot,
> +								m->gfn, mode);
> +				m->active[mode] = false;
> +				m->slot = NULL;
> +			}
> +		} else if (!m->active[mode] || m->slot != slot) {
> +			kvm_slot_page_track_add_page(kvm, slot, m->gfn, mode);
> +			m->active[mode] = true;
> +			m->slot = slot;
> +		}
> +	}
> +
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	srcu_read_unlock(&kvm->srcu, idx);
> +}
> +
> +int kvmi_set_mem_access(struct kvm *kvm, u64 gpa, u8 access)
> +{
> +	struct kvmi_mem_access *m;
> +	struct kvmi_mem_access *__m;
> +	struct kvmi *ikvm = IKVM(kvm);
> +	gfn_t gfn = gpa_to_gfn(gpa);
> +
> +	if (kvm_is_error_hva(gfn_to_hva_safe(kvm, gfn)))
> +		kvm_err("Invalid gpa %llx (or memslot not available yet)", gpa);

If there's an error, shouldn't this return (or otherwise bail out)
instead of continuing as if nothing is wrong?
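
Something along these lines, perhaps (just a sketch; I picked
-KVM_EINVAL because that's what msr_control() below returns for a bad
input, but whatever error code fits best):

	if (kvm_is_error_hva(gfn_to_hva_safe(kvm, gfn))) {
		kvm_err("Invalid gpa %llx (or memslot not available yet)", gpa);
		return -KVM_EINVAL;	/* bail out instead of carrying on */
	}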

> +
> +	m = kzalloc(sizeof(struct kvmi_mem_access), GFP_KERNEL);

This should be "m = kzalloc(sizeof(*m), GFP_KERNEL);".

> +	if (!m)
> +		return -KVM_ENOMEM;
> +
> +	INIT_LIST_HEAD(&m->link);
> +	m->gfn = gfn;
> +	m->access = access;
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +	__m = radix_tree_lookup(&ikvm->access_tree, m->gfn);
> +	if (__m) {
> +		__m->access = m->access;
> +		if (list_empty(&__m->link))
> +			list_add_tail(&__m->link, &ikvm->access_list);
> +	} else {
> +		radix_tree_insert(&ikvm->access_tree, m->gfn, m);
> +		list_add_tail(&m->link, &ikvm->access_list);
> +		m = NULL;
> +	}
> +	mutex_unlock(&ikvm->access_tree_lock);
> +
> +	kfree(m);
> +
> +	return 0;
> +}
> +
> +static bool kvmi_test_mem_access(struct kvm *kvm, unsigned long gpa,
> +				 u8 access)
> +{
> +	struct kvmi_mem_access *m;
> +	struct kvmi *ikvm = IKVM(kvm);
> +
> +	if (!ikvm)
> +		return false;
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +	m = radix_tree_lookup(&ikvm->access_tree, gpa_to_gfn(gpa));
> +	mutex_unlock(&ikvm->access_tree_lock);
> +
> +	/*
> +	 * We want to be notified only for violations involving access
> +	 * bits that we've specifically cleared
> +	 */
> +	if (m && ((~m->access) & access))
> +		return true;
> +
> +	return false;
> +}
> +
> +static struct kvmi_mem_access *
> +kvmi_get_mem_access_unlocked(struct kvm *kvm, const gfn_t gfn)
> +{
> +	return radix_tree_lookup(&IKVM(kvm)->access_tree, gfn);
> +}
> +
> +static bool is_introspected(struct kvmi *ikvm)
> +{
> +	return (ikvm && ikvm->sock);
> +}
> +
> +void kvmi_flush_mem_access(struct kvm *kvm)
> +{
> +	struct kvmi *ikvm = IKVM(kvm);
> +
> +	if (!ikvm)
> +		return;
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +	while (!list_empty(&ikvm->access_list)) {
> +		struct kvmi_mem_access *m =
> +			list_first_entry(&ikvm->access_list,
> +					 struct kvmi_mem_access, link);
> +
> +		list_del_init(&m->link);
> +
> +		kvmi_apply_mem_access(kvm, NULL, m);
> +
> +		if (m->access == full_access) {
> +			radix_tree_delete(&ikvm->access_tree, m->gfn);
> +			kfree(m);
> +		}
> +	}
> +	mutex_unlock(&ikvm->access_tree_lock);
> +}
> +
> +static void kvmi_free_mem_access(struct kvm *kvm)
> +{
> +	void **slot;
> +	struct radix_tree_iter iter;
> +	struct kvmi *ikvm = IKVM(kvm);
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +	radix_tree_for_each_slot(slot, &ikvm->access_tree, &iter, 0) {
> +		struct kvmi_mem_access *m = *slot;
> +
> +		m->access = full_access;
> +		kvmi_apply_mem_access(kvm, NULL, m);
> +
> +		radix_tree_delete(&ikvm->access_tree, m->gfn);
> +		kfree(*slot);
> +	}
> +	mutex_unlock(&ikvm->access_tree_lock);
> +}
> +
> +static unsigned long *msr_mask(struct kvmi *ikvm, unsigned int *msr)
> +{
> +	switch (*msr) {
> +	case 0 ... 0x1fff:
> +		return ikvm->msr_mask.low;
> +	case 0xc0000000 ... 0xc0001fff:
> +		*msr &= 0x1fff;
> +		return ikvm->msr_mask.high;
> +	}
> +	return NULL;
> +}
> +
> +static bool test_msr_mask(struct kvmi *ikvm, unsigned int msr)
> +{
> +	unsigned long *mask = msr_mask(ikvm, &msr);
> +
> +	if (!mask)
> +		return false;
> +	if (!test_bit(msr, mask))
> +		return false;
> +
> +	return true;
> +}
> +
> +static int msr_control(struct kvmi *ikvm, unsigned int msr, bool enable)
> +{
> +	unsigned long *mask = msr_mask(ikvm, &msr);
> +
> +	if (!mask)
> +		return -KVM_EINVAL;
> +	if (enable)
> +		set_bit(msr, mask);
> +	else
> +		clear_bit(msr, mask);
> +	return 0;
> +}
> +
> +unsigned int kvmi_vcpu_mode(const struct kvm_vcpu *vcpu,
> +				   const struct kvm_sregs *sregs)
> +{
> +	unsigned int mode = 0;
> +
> +	if (is_long_mode((struct kvm_vcpu *) vcpu)) {
> +		if (sregs->cs.l)
> +			mode = 8;
> +		else if (!sregs->cs.db)
> +			mode = 2;
> +		else
> +			mode = 4;
> +	} else if (sregs->cr0 & X86_CR0_PE) {
> +		if (!sregs->cs.db)
> +			mode = 2;
> +		else
> +			mode = 4;
> +	} else if (!sregs->cs.db)
> +		mode = 2;
> +	else
> +		mode = 4;

If one branch of a conditional uses braces, then all branches should,
regardless of whether they contain only a single statement. The final
"else if" and "else" blocks here should both be wrapped in braces.

> +
> +	return mode;
> +}
> +
> +static int maybe_delayed_init(void)
> +{
> +	if (wq)
> +		return 0;
> +
> +	wq = alloc_workqueue("kvmi", WQ_CPU_INTENSIVE, 0);
> +	if (!wq)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +int kvmi_init(void)
> +{
> +	return 0;
> +}
> +
> +static void work_cb(struct work_struct *work)
> +{
> +	struct kvmi *ikvm = container_of(work, struct kvmi, work);
> +	struct kvm   *kvm = ikvm->kvm;

None of your other initial variable assignments are aligned like this. 
Any particular reason why this one is?

> +
> +	while (kvmi_msg_process(ikvm))
> +		;

Typically if you're going to have an empty while block, you stick the 
semi-colon at the end of the while line. So this would be:
	while (kvmi_msg_process(ikvm));

> +
> +	/* We are no longer interested in any kind of events */
> +	atomic_set(&ikvm->event_mask, 0);
> +
> +	/* Clean-up for the next kvmi_hook() call */
> +	ikvm->cr_mask = 0;
> +	memset(&ikvm->msr_mask, 0, sizeof(ikvm->msr_mask));
> +
> +	wakeup_events(kvm);
> +
> +	/* Restore the spte access rights */
> +	/* Shouldn't wait for reconnection? */
> +	kvmi_free_mem_access(kvm);
> +
> +	complete_all(&ikvm->finished);
> +}
> +
> +static void __alloc_vcpu_kvmi(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = kzalloc(sizeof(struct kvmi_vcpu), GFP_KERNEL);
> +
> +	if (ivcpu) {
> +		sema_init(&ivcpu->sem_requests, 0);
> +
> +		/*
> +		 * Make sure the ivcpu is initialized
> +		 * before making it visible.
> +		 */
> +		smp_wmb();
> +
> +		vcpu->kvmi = ivcpu;
> +
> +		kvmi_make_request(ivcpu, REQ_INIT);
> +		kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
> +	}
> +}
> +
> +void kvmi_vcpu_init(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi *ikvm = IKVM(vcpu->kvm);
> +
> +	if (is_introspected(ikvm)) {
> +		mutex_lock(&vcpu->kvm->lock);
> +		__alloc_vcpu_kvmi(vcpu);
> +		mutex_unlock(&vcpu->kvm->lock);
> +	}
> +}
> +
> +void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu)
> +{
> +	kfree(IVCPU(vcpu));
> +}
> +
> +static bool __alloc_kvmi(struct kvm *kvm)
> +{
> +	struct kvmi *ikvm = kzalloc(sizeof(struct kvmi), GFP_KERNEL);
> +
> +	if (ikvm) {
> +		INIT_LIST_HEAD(&ikvm->access_list);
> +		mutex_init(&ikvm->access_tree_lock);
> +		INIT_RADIX_TREE(&ikvm->access_tree, GFP_KERNEL);
> +		rwlock_init(&ikvm->sock_lock);
> +		init_completion(&ikvm->finished);
> +		INIT_WORK(&ikvm->work, work_cb);
> +
> +		kvm->kvmi = ikvm;
> +		ikvm->kvm = kvm; /* work_cb */
> +	}
> +
> +	return (ikvm != NULL);
> +}

Would it maybe be better to just put a check for ikvm at the top and
return false, otherwise do all the work from the if body and then
return true?

Like this:

static bool __alloc_kvmi(struct kvm *kvm)
{
	struct kvmi *ikvm = kzalloc(sizeof(struct kvmi), GFP_KERNEL);

	if (!ikvm)
		return false;

	INIT_LIST_HEAD(&ikvm->access_list);
	mutex_init(&ikvm->access_tree_lock);
	INIT_RADIX_TREE(&ikvm->access_tree, GFP_KERNEL);
	rwlock_init(&ikvm->sock_lock);
	init_completion(&ikvm->finished);
	INIT_WORK(&ikvm->work, work_cb);

	kvm->kvmi = ikvm;
	ikvm->kvm = kvm; /* work_cb */

	return true;
}

> +
> +static bool alloc_kvmi(struct kvm *kvm)
> +{
> +	bool done;
> +
> +	mutex_lock(&kvm->lock);
> +	done = (
> +		maybe_delayed_init() == 0    &&
> +		IKVM(kvm)            == NULL &&
> +		__alloc_kvmi(kvm)    == true
> +	);
> +	mutex_unlock(&kvm->lock);
> +
> +	return done;
> +}
> +
> +static void alloc_all_kvmi_vcpu(struct kvm *kvm)
> +{
> +	struct kvm_vcpu *vcpu;
> +	int i;
> +
> +	mutex_lock(&kvm->lock);
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		if (!IKVM(vcpu))
> +			__alloc_vcpu_kvmi(vcpu);
> +	mutex_unlock(&kvm->lock);
> +}
> +
> +static bool setup_socket(struct kvm *kvm, struct kvm_introspection *qemu)
> +{
> +	struct kvmi *ikvm = IKVM(kvm);
> +
> +	if (is_introspected(ikvm)) {
> +		kvm_err("Guest already introspected\n");
> +		return false;
> +	}
> +
> +	if (!kvmi_msg_init(ikvm, qemu->fd))
> +		return false;

kvmi_msg_init() assumes that ikvm is not NULL -- it makes no check and 
then does "WRITE_ONCE(ikvm->sock, sock)". is_introspected() does check 
whether ikvm is NULL, but when it is, it simply returns false, so 
execution still ends up here. There should be a check that ikvm is not 
NULL before this if statement.
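
For example (sketch only), before the is_introspected() check:

	struct kvmi *ikvm = IKVM(kvm);

	if (!ikvm)
		return false;

	if (is_introspected(ikvm)) {
		kvm_err("Guest already introspected\n");
		return false;
	}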

> +
> +	ikvm->cmd_allow_mask = -1; /* TODO: qemu->commands; */
> +	ikvm->event_allow_mask = -1; /* TODO: qemu->events; */
> +
> +	alloc_all_kvmi_vcpu(kvm);
> +	queue_work(wq, &ikvm->work);
> +
> +	return true;
> +}
> +
> +/*
> + * When called from outside a page fault handler, this call should
> + * return ~0ull
> + */
> +static u64 kvmi_mmu_fault_gla(struct kvm_vcpu *vcpu, gpa_t gpa)
> +{
> +	u64 gla;
> +	u64 gla_val;
> +	u64 v;
> +
> +	if (!vcpu->arch.gpa_available)
> +		return ~0ull;
> +
> +	gla = kvm_mmu_fault_gla(vcpu);
> +	if (gla == ~0ull)
> +		return gla;
> +	gla_val = gla;
> +
> +	/* Handle the potential overflow by returning ~0ull */
> +	if (vcpu->arch.gpa_val > gpa) {
> +		v = vcpu->arch.gpa_val - gpa;
> +		if (v > gla)
> +			gla = ~0ull;
> +		else
> +			gla -= v;
> +	} else {
> +		v = gpa - vcpu->arch.gpa_val;
> +		if (v > (U64_MAX - gla))
> +			gla = ~0ull;
> +		else
> +			gla += v;
> +	}
> +
> +	return gla;
> +}
> +
> +static bool kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa,
> +			       u8 *new,
> +			       int bytes,
> +			       struct kvm_page_track_notifier_node *node,
> +			       bool *data_ready)
> +{
> +	u64 gla;
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +	bool ret = true;
> +
> +	if (kvm_mmu_nested_guest_page_fault(vcpu))
> +		return ret;
> +	gla = kvmi_mmu_fault_gla(vcpu, gpa);
> +	ret = kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_R);

Should you not check the value of ret here before proceeding?
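
E.g. (untested, assuming the custom data copy should be skipped when
the event does not ask to continue):

	ret = kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_R);
	if (!ret)
		return ret;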

> +	if (ivcpu && ivcpu->ctx_size > 0) {
> +		int s = min_t(int, bytes, ivcpu->ctx_size);
> +
> +		memcpy(new, ivcpu->ctx_data, s);
> +		ivcpu->ctx_size = 0;
> +
> +		if (*data_ready)
> +			kvm_err("Override custom data");
> +
> +		*data_ready = true;
> +	}
> +
> +	return ret;
> +}
> +
> +static bool kvmi_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa,
> +				const u8 *new,
> +				int bytes,
> +				struct kvm_page_track_notifier_node *node)
> +{
> +	u64 gla;
> +
> +	if (kvm_mmu_nested_guest_page_fault(vcpu))
> +		return true;
> +	gla = kvmi_mmu_fault_gla(vcpu, gpa);
> +	return kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_W);
> +}
> +
> +static bool kvmi_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa,
> +				struct kvm_page_track_notifier_node *node)
> +{
> +	u64 gla;
> +
> +	if (kvm_mmu_nested_guest_page_fault(vcpu))
> +		return true;
> +	gla = kvmi_mmu_fault_gla(vcpu, gpa);
> +
> +	return kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_X);
> +}
> +
> +static void kvmi_track_create_slot(struct kvm *kvm,
> +				   struct kvm_memory_slot *slot,
> +				   unsigned long npages,
> +				   struct kvm_page_track_notifier_node *node)
> +{
> +	struct kvmi *ikvm = IKVM(kvm);
> +	gfn_t start = slot->base_gfn;
> +	const gfn_t end = start + npages;
> +
> +	if (!ikvm)
> +		return;
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +
> +	while (start < end) {
> +		struct kvmi_mem_access *m;
> +
> +		m = kvmi_get_mem_access_unlocked(kvm, start);
> +		if (m)
> +			kvmi_apply_mem_access(kvm, slot, m);
> +		start++;
> +	}
> +
> +	mutex_unlock(&ikvm->access_tree_lock);
> +}
> +
> +static void kvmi_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
> +				  struct kvm_page_track_notifier_node *node)
> +{
> +	struct kvmi *ikvm = IKVM(kvm);
> +	gfn_t start = slot->base_gfn;
> +	const gfn_t end = start + slot->npages;
> +
> +	if (!ikvm)
> +		return;
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +
> +	while (start < end) {
> +		struct kvmi_mem_access *m;
> +
> +		m = kvmi_get_mem_access_unlocked(kvm, start);
> +		if (m) {
> +			u8 prev_access = m->access;
> +
> +			m->access = full_access;
> +			kvmi_apply_mem_access(kvm, slot, m);
> +			m->access = prev_access;
> +		}
> +		start++;
> +	}
> +
> +	mutex_unlock(&ikvm->access_tree_lock);
> +}
> +
> +static struct kvm_page_track_notifier_node kptn_node = {
> +	.track_preread = kvmi_track_preread,
> +	.track_prewrite = kvmi_track_prewrite,
> +	.track_preexec = kvmi_track_preexec,
> +	.track_create_slot = kvmi_track_create_slot,
> +	.track_flush_slot = kvmi_track_flush_slot
> +};
> +
> +bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu)
> +{
> +	kvm_info("Hooking vm with fd: %d\n", qemu->fd);
> +
> +	kvm_page_track_register_notifier(kvm, &kptn_node);
> +
> +	return (alloc_kvmi(kvm) && setup_socket(kvm, qemu));

Is this safe? It returns false both when the allocation fails (in which
case the caller has nothing to clean up) and when setting up the socket
fails (in which case the caller would need to free the kvmi that was
just allocated), and the caller has no way to tell the two apart.
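
Maybe something along these lines (untested; free_kvmi() is a
hypothetical helper that undoes alloc_kvmi(), it doesn't exist in this
patch):

bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu)
{
	kvm_info("Hooking vm with fd: %d\n", qemu->fd);

	kvm_page_track_register_notifier(kvm, &kptn_node);

	if (!alloc_kvmi(kvm))
		return false;

	if (!setup_socket(kvm, qemu)) {
		/* hypothetical helper, would undo alloc_kvmi() */
		free_kvmi(kvm);
		return false;
	}

	return true;
}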

> +}
> +
> +void kvmi_destroy_vm(struct kvm *kvm)
> +{
> +	struct kvmi *ikvm = IKVM(kvm);
> +
> +	if (ikvm) {
> +		kvmi_msg_uninit(ikvm);
> +
> +		mutex_destroy(&ikvm->access_tree_lock);
> +		kfree(ikvm);
> +	}
> +
> +	kvmi_mem_destroy_vm(kvm);
> +}
> +
> +void kvmi_uninit(void)
> +{
> +	if (wq) {
> +		destroy_workqueue(wq);
> +		wq = NULL;
> +	}
> +}
> +
> +void kvmi_get_msrs(struct kvm_vcpu *vcpu, struct kvmi_event *event)
> +{
> +	struct msr_data msr;
> +
> +	msr.host_initiated = true;
> +
> +	msr.index = MSR_IA32_SYSENTER_CS;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.sysenter_cs = msr.data;
> +
> +	msr.index = MSR_IA32_SYSENTER_ESP;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.sysenter_esp = msr.data;
> +
> +	msr.index = MSR_IA32_SYSENTER_EIP;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.sysenter_eip = msr.data;
> +
> +	msr.index = MSR_EFER;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.efer = msr.data;
> +
> +	msr.index = MSR_STAR;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.star = msr.data;
> +
> +	msr.index = MSR_LSTAR;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.lstar = msr.data;
> +
> +	msr.index = MSR_CSTAR;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.cstar = msr.data;
> +
> +	msr.index = MSR_IA32_CR_PAT;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.pat = msr.data;
> +}
> +
> +static bool is_event_enabled(struct kvm *kvm, int event_bit)
> +{
> +	struct kvmi *ikvm = IKVM(kvm);
> +
> +	return (ikvm && (atomic_read(&ikvm->event_mask) & event_bit));
> +}
> +
> +static int kvmi_vcpu_kill(int sig, struct kvm_vcpu *vcpu)
> +{
> +	int err = -ESRCH;
> +	struct pid *pid;
> +	struct siginfo siginfo[1] = { };
> +
> +	rcu_read_lock();
> +	pid = rcu_dereference(vcpu->pid);
> +	if (pid)
> +		err = kill_pid_info(sig, siginfo, pid);
> +	rcu_read_unlock();
> +
> +	return err;
> +}
> +
> +static void kvmi_vm_shutdown(struct kvm *kvm)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +
> +	mutex_lock(&kvm->lock);
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		kvmi_vcpu_kill(SIGTERM, vcpu);
> +	}
> +	mutex_unlock(&kvm->lock);
> +}
> +
> +/* TODO: Do we need a return code ? */
> +static void handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action)
> +{
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CRASH:
> +		kvmi_vm_shutdown(vcpu->kvm);
> +		break;
> +
> +	default:
> +		kvm_err("Unsupported event action: %d\n", action);
> +	}
> +}
> +
> +bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
> +		   unsigned long old_value, unsigned long *new_value)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	u64 ret_value;
> +	u32 action;
> +
> +	if (!is_event_enabled(kvm, KVMI_EVENT_CR))
> +		return true;
> +	if (!test_bit(cr, &IKVM(kvm)->cr_mask))
> +		return true;
> +	if (old_value == *new_value)
> +		return true;
> +
> +	action = kvmi_msg_send_cr(vcpu, cr, old_value, *new_value, &ret_value);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		*new_value = ret_value;
> +		return true;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	return false;
> +}
> +
> +bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	u64 ret_value;
> +	u32 action;
> +	struct msr_data old_msr = { .host_initiated = true,
> +				    .index = msr->index };
> +
> +	if (msr->host_initiated)
> +		return true;
> +	if (!is_event_enabled(kvm, KVMI_EVENT_MSR))
> +		return true;
> +	if (!test_msr_mask(IKVM(kvm), msr->index))
> +		return true;
> +	if (kvm_get_msr(vcpu, &old_msr))
> +		return true;
> +	if (old_msr.data == msr->data)
> +		return true;
> +
> +	action = kvmi_msg_send_msr(vcpu, msr->index, old_msr.data, msr->data,
> +				   &ret_value);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		msr->data = ret_value;
> +		return true;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	return false;
> +}
> +
> +void kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
> +{
> +	u32 action;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_XSETBV))
> +		return;
> +
> +	action = kvmi_msg_send_xsetbv(vcpu);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		break;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +}
> +
> +bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva)
> +{
> +	u32 action;
> +	u64 gpa;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_BREAKPOINT))
> +		/* qemu will automatically reinject the breakpoint */
> +		return false;
> +
> +	gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
> +
> +	if (gpa == UNMAPPED_GVA)
> +		kvm_err("%s: invalid gva: %llx", __func__, gva);

If the gpa is unmapped, shouldn't it return false rather than proceeding?
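
I.e. (untested):

	gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);

	if (gpa == UNMAPPED_GVA) {
		kvm_err("%s: invalid gva: %llx", __func__, gva);
		/* let qemu reinject the breakpoint */
		return false;
	}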

> +
> +	action = kvmi_msg_send_bp(vcpu, gpa);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		break;
> +	case KVMI_EVENT_ACTION_RETRY:
> +		/* rip was most likely adjusted past the INT 3 instruction */
> +		return true;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	/* qemu will automatically reinject the breakpoint */
> +	return false;
> +}
> +EXPORT_SYMBOL(kvmi_breakpoint_event);
> +
> +#define KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT 24
> +bool kvmi_is_agent_hypercall(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long subfunc1, subfunc2;
> +	bool longmode = is_64_bit_mode(vcpu);
> +	unsigned long nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
> +
> +	if (longmode) {
> +		subfunc1 = kvm_register_read(vcpu, VCPU_REGS_RDI);
> +		subfunc2 = kvm_register_read(vcpu, VCPU_REGS_RSI);
> +	} else {
> +		nr &= 0xFFFFFFFF;
> +		subfunc1 = kvm_register_read(vcpu, VCPU_REGS_RBX);
> +		subfunc1 &= 0xFFFFFFFF;
> +		subfunc2 = kvm_register_read(vcpu, VCPU_REGS_RCX);
> +		subfunc2 &= 0xFFFFFFFF;
> +	}
> +
> +	return (nr == KVM_HC_XEN_HVM_OP
> +		&& subfunc1 == KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT
> +		&& subfunc2 == 0);
> +}
> +
> +void kvmi_hypercall_event(struct kvm_vcpu *vcpu)
> +{
> +	u32 action;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_HYPERCALL)
> +			|| !kvmi_is_agent_hypercall(vcpu))
> +		return;
> +
> +	action = kvmi_msg_send_hypercall(vcpu);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		break;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +}
> +
> +bool kvmi_page_fault_event(struct kvm_vcpu *vcpu, unsigned long gpa,
> +			   unsigned long gva, u8 access)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	struct kvmi_vcpu *ivcpu;
> +	bool trap_access, ret = true;
> +	u32 ctx_size;
> +	u64 old_rip;
> +	u32 action;
> +
> +	if (!is_event_enabled(kvm, KVMI_EVENT_PAGE_FAULT))
> +		return true;
> +
> +	/* Have we shown interest in this page? */
> +	if (!kvmi_test_mem_access(kvm, gpa, access))
> +		return true;
> +
> +	ivcpu    = IVCPU(vcpu);
> +	ctx_size = sizeof(ivcpu->ctx_data);
> +	old_rip  = kvm_rip_read(vcpu);

Why are these assignments aligned liket this?

> +
> +	if (!kvmi_msg_send_pf(vcpu, gpa, gva, access, &action,
> +			      &trap_access,
> +			      ivcpu->ctx_data, &ctx_size))
> +		goto out;
> +
> +	ivcpu->ctx_size = 0;
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		ivcpu->ctx_size = ctx_size;
> +		break;
> +	case KVMI_EVENT_ACTION_RETRY:
> +		ret = false;
> +		break;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	/* TODO: trap_access -> don't REPeat the instruction */
> +out:
> +	trace_kvmi_event_page_fault(gpa, gva, access, old_rip, action,
> +				    kvm_rip_read(vcpu), ctx_size);
> +	return ret;
> +}
> +
> +bool kvmi_lost_exception(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +
> +	if (!ivcpu || !ivcpu->exception.injected)
> +		return false;
> +
> +	ivcpu->exception.injected = 0;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_TRAP))
> +		return false;
> +
> +	if ((vcpu->arch.exception.injected || vcpu->arch.exception.pending)
> +		&& vcpu->arch.exception.nr == ivcpu->exception.nr
> +		&& vcpu->arch.exception.error_code
> +			== ivcpu->exception.error_code)
> +		return false;
> +
> +	return true;
> +}
> +
> +void kvmi_trap_event(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +	u32 vector, type, err;
> +	u32 action;
> +
> +	if (vcpu->arch.exception.pending) {
> +		vector = vcpu->arch.exception.nr;
> +		err = vcpu->arch.exception.error_code;
> +
> +		if (kvm_exception_is_soft(vector))
> +			type = INTR_TYPE_SOFT_EXCEPTION;
> +		else
> +			type = INTR_TYPE_HARD_EXCEPTION;
> +	} else if (vcpu->arch.interrupt.pending) {
> +		vector = vcpu->arch.interrupt.nr;
> +		err = 0;
> +
> +		if (vcpu->arch.interrupt.soft)
> +			type = INTR_TYPE_SOFT_INTR;
> +		else
> +			type = INTR_TYPE_EXT_INTR;
> +	} else {
> +		vector = 0;
> +		type = 0;
> +		err = 0;
> +	}
> +
> +	kvm_err("New exception nr %d/%d err %x/%x addr %lx",
> +		vector, ivcpu->exception.nr,
> +		err, ivcpu->exception.error_code,
> +		vcpu->arch.cr2);
> +
> +	action = kvmi_msg_send_trap(vcpu, vector, type, err, vcpu->arch.cr2);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		break;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +}
> +
> +bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u32 info,
> +			   unsigned long exit_qualification,
> +			   unsigned char descriptor, unsigned char write)
> +{
> +	u32 action;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_DESCRIPTOR))
> +		return true;

How come it returns true here? The events below all return false from a 
similar condition check.

> +
> +	action = kvmi_msg_send_descriptor(vcpu, info, exit_qualification,
> +					  descriptor, write);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		return true;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	return false; /* TODO: double check this */
> +}
> +EXPORT_SYMBOL(kvmi_descriptor_event);
> +
> +static bool kvmi_create_vcpu_event(struct kvm_vcpu *vcpu)
> +{
> +	u32 action;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_CREATE_VCPU))
> +		return false;
> +
> +	action = kvmi_msg_send_create_vcpu(vcpu);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		break;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	return true;
> +}
> +
> +static bool kvmi_pause_vcpu_event(struct kvm_vcpu *vcpu)
> +{
> +	u32 action;
> +
> +	IVCPU(vcpu)->pause = false;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_PAUSE_VCPU))
> +		return false;
> +
> +	action = kvmi_msg_send_pause_vcpu(vcpu);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		break;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	return true;
> +}
> +
> +/* TODO: refactor this function uto avoid recursive calls and the semaphore. */
> +void kvmi_handle_request(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +
> +	while (ivcpu->ev_rpl_waiting
> +		|| READ_ONCE(ivcpu->requests)) {
> +
> +		down(&ivcpu->sem_requests);
> +
> +		if (test_bit(REQ_INIT, &ivcpu->requests)) {
> +			/*
> +			 * kvmi_create_vcpu_event() may call this function
> +			 * again and won't return unless there is no more work
> +			 * to be done. The while condition will be evaluated
> +			 * to false, but we explicitly exit the loop to avoid
> +			 * surprizing the reader more than we already did.
> +			 */
> +			kvmi_clear_request(ivcpu, REQ_INIT);
> +			if (kvmi_create_vcpu_event(vcpu))
> +				break;
> +		} else if (test_bit(REQ_CMD, &ivcpu->requests)) {
> +			kvmi_msg_handle_vcpu_cmd(vcpu);
> +			/* it will clear the REQ_CMD bit */
> +			if (ivcpu->pause && !ivcpu->ev_rpl_waiting) {
> +				/* Same warnings as with REQ_INIT. */
> +				if (kvmi_pause_vcpu_event(vcpu))
> +					break;
> +			}
> +		} else if (test_bit(REQ_REPLY, &ivcpu->requests)) {
> +			kvmi_clear_request(ivcpu, REQ_REPLY);
> +			ivcpu->ev_rpl_waiting = false;
> +			if (ivcpu->have_delayed_regs) {
> +				kvm_arch_vcpu_set_regs(vcpu,
> +							&ivcpu->delayed_regs);
> +				ivcpu->have_delayed_regs = false;
> +			}
> +			if (ivcpu->pause) {
> +				/* Same warnings as with REQ_INIT. */
> +				if (kvmi_pause_vcpu_event(vcpu))
> +					break;
> +			}
> +		} else if (test_bit(REQ_CLOSE, &ivcpu->requests)) {
> +			kvmi_clear_request(ivcpu, REQ_CLOSE);
> +			break;
> +		} else {
> +			kvm_err("Unexpected request");
> +		}
> +	}
> +
> +	kvmi_flush_mem_access(vcpu->kvm);
> +	/* TODO: merge with kvmi_set_mem_access() */
> +}
> +
> +int kvmi_cmd_get_cpuid(struct kvm_vcpu *vcpu, u32 function, u32 index,
> +		       u32 *eax, u32 *ebx, u32 *ecx, u32 *edx)
> +{
> +	struct kvm_cpuid_entry2 *e;
> +
> +	e = kvm_find_cpuid_entry(vcpu, function, index);
> +	if (!e)
> +		return -KVM_ENOENT;
> +
> +	*eax = e->eax;
> +	*ebx = e->ebx;
> +	*ecx = e->ecx;
> +	*edx = e->edx;
> +
> +	return 0;
> +}
> +
> +int kvmi_cmd_get_guest_info(struct kvm_vcpu *vcpu, u16 *vcpu_cnt, u64 *tsc)
> +{
> +	/*
> +	 * Should we switch vcpu_cnt to unsigned int?
> +	 * If not, we should limit this to max u16 - 1
> +	 */
> +	*vcpu_cnt = atomic_read(&vcpu->kvm->online_vcpus);
> +	if (kvm_has_tsc_control)
> +		*tsc = 1000ul * vcpu->arch.virtual_tsc_khz;
> +	else
> +		*tsc = 0;
> +
> +	return 0;
> +}
> +
> +static int get_first_vcpu(struct kvm *kvm, struct kvm_vcpu **vcpu)
> +{
> +	struct kvm_vcpu *v;
> +
> +	if (!atomic_read(&kvm->online_vcpus))
> +		return -KVM_EINVAL;
> +
> +	v = kvm_get_vcpu(kvm, 0);
> +
> +	if (!v)
> +		return -KVM_EINVAL;
> +
> +	*vcpu = v;
> +
> +	return 0;
> +}
> +
> +int kvmi_cmd_get_registers(struct kvm_vcpu *vcpu, u32 *mode,
> +			   struct kvm_regs *regs,
> +			   struct kvm_sregs *sregs, struct kvm_msrs *msrs)
> +{
> +	struct kvm_msr_entry  *msr = msrs->entries;
> +	unsigned int	       n   = msrs->nmsrs;

Again with randomly aligning variables...

> +
> +	kvm_arch_vcpu_ioctl_get_regs(vcpu, regs);
> +	kvm_arch_vcpu_ioctl_get_sregs(vcpu, sregs);
> +	*mode = kvmi_vcpu_mode(vcpu, sregs);
> +
> +	for (; n--; msr++) {

The conditional portion of this for loop appears to not be a 
conditional? Either way, this is a pretty ugly way to write this.
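
Maybe something like this instead (untested, same behaviour as far as
I can tell):

	unsigned int i;

	for (i = 0; i < msrs->nmsrs; i++) {
		struct msr_data m = { .index = msrs->entries[i].index };

		if (kvm_get_msr(vcpu, &m))
			return -KVM_EINVAL;

		msrs->entries[i].data = m.data;
	}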

> +		struct msr_data m   = { .index = msr->index };
> +		int		err = kvm_get_msr(vcpu, &m);

And again with the alignment...

> +
> +		if (err)
> +			return -KVM_EINVAL;
> +
> +		msr->data = m.data;
> +	}
> +
> +	return 0;
> +}
> +
> +int kvmi_cmd_set_registers(struct kvm_vcpu *vcpu, const struct kvm_regs *regs)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +
> +	if (ivcpu->ev_rpl_waiting) {
> +		memcpy(&ivcpu->delayed_regs, regs, sizeof(ivcpu->delayed_regs));
> +		ivcpu->have_delayed_regs = true;
> +	} else
> +		kvm_err("Drop KVMI_SET_REGISTERS");

Since the if has braces, the else should too.
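
I.e.:

	} else {
		kvm_err("Drop KVMI_SET_REGISTERS");
	}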

> +	return 0;
> +}
> +
> +int kvmi_cmd_get_page_access(struct kvm_vcpu *vcpu, u64 gpa, u8 *access)
> +{
> +	struct kvmi *ikvm = IKVM(vcpu->kvm);
> +	struct kvmi_mem_access *m;
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +	m = kvmi_get_mem_access_unlocked(vcpu->kvm, gpa_to_gfn(gpa));
> +	*access = m ? m->access : full_access;
> +	mutex_unlock(&ikvm->access_tree_lock);
> +
> +	return 0;
> +}
> +
> +static bool is_vector_valid(u8 vector)
> +{
> +	return true;
> +}
> +
> +static bool is_gva_valid(struct kvm_vcpu *vcpu, u64 gva)
> +{
> +	return true;
> +}
> +
> +int kvmi_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
> +			      bool error_code_valid, u16 error_code,
> +			      u64 address)
> +{
> +	struct x86_exception e = {
> +		.vector = vector,
> +		.error_code_valid = error_code_valid,
> +		.error_code = error_code,
> +		.address = address,
> +	};
> +
> +	if (!(is_vector_valid(vector) && is_gva_valid(vcpu, address)))
> +		return -KVM_EINVAL;
> +
> +	if (e.vector == PF_VECTOR)
> +		kvm_inject_page_fault(vcpu, &e);
> +	else if (e.error_code_valid)
> +		kvm_queue_exception_e(vcpu, e.vector, e.error_code);
> +	else
> +		kvm_queue_exception(vcpu, e.vector);
> +
> +	if (IVCPU(vcpu)->exception.injected)
> +		kvm_err("Override exception");
> +
> +	IVCPU(vcpu)->exception.injected = 1;
> +	IVCPU(vcpu)->exception.nr = e.vector;
> +	IVCPU(vcpu)->exception.error_code = error_code_valid ? error_code : 0;
> +
> +	return 0;
> +}
> +
> +unsigned long gfn_to_hva_safe(struct kvm *kvm, gfn_t gfn)
> +{
> +	unsigned long hva;
> +
> +	mutex_lock(&kvm->slots_lock);
> +	hva = gfn_to_hva(kvm, gfn);
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	return hva;
> +}
> +
> +static long get_user_pages_remote_unlocked(struct mm_struct *mm,
> +					   unsigned long start,
> +					   unsigned long nr_pages,
> +					   unsigned int gup_flags,
> +					   struct page **pages)
> +{
> +	long ret;
> +	struct task_struct *tsk = NULL;
> +	struct vm_area_struct **vmas = NULL;
> +	int locked = 1;
> +
> +	down_read(&mm->mmap_sem);
> +	ret =
> +	    get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
> +				  vmas, &locked);

Couldn't this be "ret = get_user_pages_remote(..." on one line, with
the argument list broken at a different point instead?
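
E.g.:

	ret = get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
				    pages, vmas, &locked);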

> +	if (locked)
> +		up_read(&mm->mmap_sem);
> +	return ret;
> +}
> +
> +int kvmi_cmd_read_physical(struct kvm *kvm, u64 gpa, u64 size, int (*send)(
> +				   struct kvmi *, const struct kvmi_msg_hdr *,
> +				   int err, const void *buf, size_t),
> +				   const struct kvmi_msg_hdr *ctx)
> +{
> +	int err, ec;
> +	unsigned long hva;
> +	struct page *page = NULL;
> +	void *ptr_page = NULL, *ptr = NULL;
> +	size_t ptr_size = 0;
> +	struct kvm_vcpu *vcpu;
> +
> +	ec = get_first_vcpu(kvm, &vcpu);
> +
> +	if (ec)
> +		goto out;
> +
> +	hva = gfn_to_hva_safe(kvm, gpa_to_gfn(gpa));
> +
> +	if (kvm_is_error_hva(hva)) {
> +		ec = -KVM_EINVAL;
> +		goto out;
> +	}
> +
> +	if (get_user_pages_remote_unlocked(kvm->mm, hva, 1, 0, &page) != 1) {
> +		ec = -KVM_EINVAL;
> +		goto out;
> +	}
> +
> +	ptr_page = kmap_atomic(page);
> +
> +	ptr = ptr_page + (gpa & ~PAGE_MASK);
> +	ptr_size = size;
> +
> +out:
> +	err = send(IKVM(kvm), ctx, ec, ptr, ptr_size);
> +
> +	if (ptr_page)
> +		kunmap_atomic(ptr_page);
> +	if (page)
> +		put_page(page);
> +	return err;
> +}
> +
> +int kvmi_cmd_write_physical(struct kvm *kvm, u64 gpa, u64 size, const void *buf)
> +{
> +	int err;
> +	unsigned long hva;
> +	struct page *page;
> +	void *ptr;
> +	struct kvm_vcpu *vcpu;
> +
> +	err = get_first_vcpu(kvm, &vcpu);
> +
> +	if (err)
> +		return err;
> +
> +	hva = gfn_to_hva_safe(kvm, gpa_to_gfn(gpa));
> +
> +	if (kvm_is_error_hva(hva))
> +		return -KVM_EINVAL;
> +
> +	if (get_user_pages_remote_unlocked(kvm->mm, hva, 1, FOLL_WRITE,
> +			&page) != 1)
> +		return -KVM_EINVAL;
> +
> +	ptr = kmap_atomic(page);
> +
> +	memcpy(ptr + (gpa & ~PAGE_MASK), buf, size);
> +
> +	kunmap_atomic(ptr);
> +	put_page(page);
> +
> +	return 0;
> +}
> +
> +int kvmi_cmd_alloc_token(struct kvm *kvm, struct kvmi_map_mem_token *token)
> +{
> +	int err = 0;
> +
> +	/* create random token */
> +	get_random_bytes(token, sizeof(struct kvmi_map_mem_token));
> +
> +	/* store token in HOST database */
> +	if (kvmi_store_token(kvm, token))
> +		err = -KVM_ENOMEM;
> +
> +	return err;
> +}

It seems like you could get rid of err altogether and just return 
-KVM_ENOMEM directly from the if body and 0 at the end.
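
E.g. (untested, also using sizeof(*token) while at it):

int kvmi_cmd_alloc_token(struct kvm *kvm, struct kvmi_map_mem_token *token)
{
	/* create random token */
	get_random_bytes(token, sizeof(*token));

	/* store token in HOST database */
	if (kvmi_store_token(kvm, token))
		return -KVM_ENOMEM;

	return 0;
}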

> +
> +int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, u32 events)
> +{
> +	int err = 0;
> +
> +	if (events & ~KVMI_KNOWN_EVENTS)
> +		return -KVM_EINVAL;
> +
> +	if (events & KVMI_EVENT_BREAKPOINT) {
> +		if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_BREAKPOINT)) {
> +			struct kvm_guest_debug dbg = { };
> +
> +			dbg.control =
> +			    KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP;
> +
> +			err = kvm_arch_vcpu_ioctl_set_guest_debug(vcpu, &dbg);
> +		}
> +	}
> +
> +	if (!err)
> +		atomic_set(&IKVM(vcpu->kvm)->event_mask, events);
> +
> +	return err;
> +}
> +
> +int kvmi_cmd_control_cr(struct kvmi *ikvm, bool enable, u32 cr)
> +{
> +	switch (cr) {
> +	case 0:
> +	case 3:
> +	case 4:
> +		if (enable)
> +			set_bit(cr, &ikvm->cr_mask);
> +		else
> +			clear_bit(cr, &ikvm->cr_mask);
> +		return 0;
> +
> +	default:
> +		return -KVM_EINVAL;
> +	}
> +}
> +
> +int kvmi_cmd_control_msr(struct kvm *kvm, bool enable, u32 msr)
> +{
> +	struct kvm_vcpu *vcpu;
> +	int err;
> +
> +	err = get_first_vcpu(kvm, &vcpu);
> +	if (err)
> +		return err;
> +
> +	err = msr_control(IKVM(kvm), msr, enable);
> +
> +	if (!err)
> +		kvm_arch_msr_intercept(vcpu, msr, enable);
> +
> +	return err;
> +}
> +
> +void wakeup_events(struct kvm *kvm)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +
> +	mutex_lock(&kvm->lock);
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		kvmi_make_request(IVCPU(vcpu), REQ_CLOSE);
> +	mutex_unlock(&kvm->lock);
> +}
> diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
> new file mode 100644
> index 000000000000..5976b98f11cb
> --- /dev/null
> +++ b/virt/kvm/kvmi_int.h
> @@ -0,0 +1,121 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __KVMI_INT_H__
> +#define __KVMI_INT_H__
> +
> +#include <linux/types.h>
> +#include <linux/kvm_host.h>
> +
> +#include <uapi/linux/kvmi.h>
> +
> +#define IVCPU(vcpu) ((struct kvmi_vcpu *)((vcpu)->kvmi))
> +
> +struct kvmi_vcpu {
> +	u8 ctx_data[256];
> +	u32 ctx_size;
> +	struct semaphore sem_requests;
> +	unsigned long requests;
> +	/* TODO: get this ~64KB buffer from a cache */
> +	u8 msg_buf[KVMI_MAX_MSG_SIZE];
> +	struct kvmi_event_reply ev_rpl;
> +	void *ev_rpl_ptr;
> +	size_t ev_rpl_size;
> +	size_t ev_rpl_received;
> +	u32 ev_seq;
> +	bool ev_rpl_waiting;
> +	struct {
> +		u16 error_code;
> +		u8 nr;
> +		bool injected;
> +	} exception;
> +	struct kvm_regs delayed_regs;
> +	bool have_delayed_regs;
> +	bool pause;
> +};
> +
> +#define IKVM(kvm) ((struct kvmi *)((kvm)->kvmi))
> +
> +struct kvmi {
> +	atomic_t event_mask;
> +	unsigned long cr_mask;
> +	struct {
> +		unsigned long low[BITS_TO_LONGS(8192)];
> +		unsigned long high[BITS_TO_LONGS(8192)];
> +	} msr_mask;
> +	struct radix_tree_root access_tree;
> +	struct mutex access_tree_lock;
> +	struct list_head access_list;
> +	struct work_struct work;
> +	struct socket *sock;
> +	rwlock_t sock_lock;
> +	struct completion finished;
> +	struct kvm *kvm;
> +	/* TODO: get this ~64KB buffer from a cache */
> +	u8 msg_buf[KVMI_MAX_MSG_SIZE];
> +	u32 cmd_allow_mask;
> +	u32 event_allow_mask;
> +};
> +
> +#define REQ_INIT   0
> +#define REQ_CMD    1
> +#define REQ_REPLY  2
> +#define REQ_CLOSE  3

Would these be better off being an enum?
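
E.g. (the values stay 0..3, so they still work as bit numbers for
set_bit()/test_bit()):

enum {
	REQ_INIT,
	REQ_CMD,
	REQ_REPLY,
	REQ_CLOSE,
};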

> +
> +/* kvmi_msg.c */
> +bool kvmi_msg_init(struct kvmi *ikvm, int fd);
> +bool kvmi_msg_process(struct kvmi *ikvm);
> +void kvmi_msg_uninit(struct kvmi *ikvm);
> +void kvmi_msg_handle_vcpu_cmd(struct kvm_vcpu *vcpu);
> +u32 kvmi_msg_send_cr(struct kvm_vcpu *vcpu, u32 cr, u64 old_value,
> +		     u64 new_value, u64 *ret_value);
> +u32 kvmi_msg_send_msr(struct kvm_vcpu *vcpu, u32 msr, u64 old_value,
> +		      u64 new_value, u64 *ret_value);
> +u32 kvmi_msg_send_xsetbv(struct kvm_vcpu *vcpu);
> +u32 kvmi_msg_send_bp(struct kvm_vcpu *vcpu, u64 gpa);
> +u32 kvmi_msg_send_hypercall(struct kvm_vcpu *vcpu);
> +bool kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u32 mode,
> +		      u32 *action, bool *trap_access, u8 *ctx,
> +		      u32 *ctx_size);
> +u32 kvmi_msg_send_trap(struct kvm_vcpu *vcpu, u32 vector, u32 type,
> +		       u32 error_code, u64 cr2);
> +u32 kvmi_msg_send_descriptor(struct kvm_vcpu *vcpu, u32 info,
> +			     u64 exit_qualification, u8 descriptor, u8 write);
> +u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu);
> +u32 kvmi_msg_send_pause_vcpu(struct kvm_vcpu *vcpu);
> +
> +/* kvmi.c */
> +int kvmi_cmd_get_guest_info(struct kvm_vcpu *vcpu, u16 *vcpu_cnt, u64 *tsc);
> +int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu);
> +int kvmi_cmd_get_registers(struct kvm_vcpu *vcpu, u32 *mode,
> +			   struct kvm_regs *regs, struct kvm_sregs *sregs,
> +			   struct kvm_msrs *msrs);
> +int kvmi_cmd_set_registers(struct kvm_vcpu *vcpu, const struct kvm_regs *regs);
> +int kvmi_cmd_get_page_access(struct kvm_vcpu *vcpu, u64 gpa, u8 *access);
> +int kvmi_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
> +			      bool error_code_valid, u16 error_code,
> +			      u64 address);
> +int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, u32 events);
> +int kvmi_cmd_get_cpuid(struct kvm_vcpu *vcpu, u32 function, u32 index,
> +		       u32 *eax, u32 *ebx, u32 *rcx, u32 *edx);
> +int kvmi_cmd_read_physical(struct kvm *kvm, u64 gpa, u64 size,
> +			   int (*send)(struct kvmi *,
> +					const struct kvmi_msg_hdr*,
> +					int err, const void *buf, size_t),
> +			   const struct kvmi_msg_hdr *ctx);
> +int kvmi_cmd_write_physical(struct kvm *kvm, u64 gpa, u64 size,
> +			    const void *buf);
> +int kvmi_cmd_alloc_token(struct kvm *kvm, struct kvmi_map_mem_token *token);
> +int kvmi_cmd_control_cr(struct kvmi *ikvm, bool enable, u32 cr);
> +int kvmi_cmd_control_msr(struct kvm *kvm, bool enable, u32 msr);
> +int kvmi_set_mem_access(struct kvm *kvm, u64 gpa, u8 access);
> +void kvmi_make_request(struct kvmi_vcpu *ivcpu, int req);
> +void kvmi_clear_request(struct kvmi_vcpu *ivcpu, int req);
> +unsigned int kvmi_vcpu_mode(const struct kvm_vcpu *vcpu,
> +			    const struct kvm_sregs *sregs);
> +void kvmi_get_msrs(struct kvm_vcpu *vcpu, struct kvmi_event *event);
> +unsigned long gfn_to_hva_safe(struct kvm *kvm, gfn_t gfn);
> +void kvmi_mem_destroy_vm(struct kvm *kvm);
> +
> +/* kvmi_mem.c */
> +int kvmi_store_token(struct kvm *kvm, struct kvmi_map_mem_token *token);
> +
> +#endif
> diff --git a/virt/kvm/kvmi_mem.c b/virt/kvm/kvmi_mem.c
> new file mode 100644
> index 000000000000..c766357678e6
> --- /dev/null
> +++ b/virt/kvm/kvmi_mem.c
> @@ -0,0 +1,730 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KVM introspection memory mapping implementation
> + *
> + * Copyright (C) 2017 Bitdefender S.R.L.
> + *
> + * Author:
> + *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/kvm_host.h>
> +#include <linux/rmap.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/pagemap.h>
> +#include <linux/swap.h>
> +#include <linux/spinlock.h>
> +#include <linux/printk.h>
> +#include <linux/kvmi.h>
> +#include <linux/huge_mm.h>
> +
> +#include <uapi/linux/kvmi.h>
> +
> +#include "kvmi_int.h"
> +
> +
> +static struct list_head mapping_list;
> +static spinlock_t mapping_lock;
> +
> +struct host_map {
> +	struct list_head mapping_list;
> +	gpa_t map_gpa;
> +	struct kvm *machine;
> +	gpa_t req_gpa;
> +};
> +
> +
> +static struct list_head token_list;
> +static spinlock_t token_lock;
> +
> +struct token_entry {
> +	struct list_head token_list;
> +	struct kvmi_map_mem_token token;
> +	struct kvm *kvm;
> +};
> +
> +
> +int kvmi_store_token(struct kvm *kvm, struct kvmi_map_mem_token *token)
> +{
> +	struct token_entry *tep;
> +
> +	print_hex_dump_debug("kvmi: new token ", DUMP_PREFIX_NONE,
> +			     32, 1, token, sizeof(struct kvmi_map_mem_token),
> +			     false);
> +
> +	tep = kmalloc(sizeof(struct token_entry), GFP_KERNEL);

	tep = kmalloc(sizeof(*tep), GFP_KERNEL);

> +	if (tep == NULL)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&tep->token_list);
> +	memcpy(&tep->token, token, sizeof(struct kvmi_map_mem_token));

Here too it might be better to do "sizeof(*token)"

> +	tep->kvm = kvm;
> +
> +	spin_lock(&token_lock);
> +	list_add_tail(&tep->token_list, &token_list);
> +	spin_unlock(&token_lock);
> +
> +	return 0;
> +}
> +
> +static struct kvm *find_machine_at(struct kvm_vcpu *vcpu, gva_t tkn_gva)
> +{
> +	long result;
> +	gpa_t tkn_gpa;
> +	struct kvmi_map_mem_token token;
> +	struct list_head *cur;
> +	struct token_entry *tep, *found = NULL;
> +	struct kvm *target_kvm = NULL;
> +
> +	/* machine token is passed as pointer */
> +	tkn_gpa = kvm_mmu_gva_to_gpa_system(vcpu, tkn_gva, NULL);
> +	if (tkn_gpa == UNMAPPED_GVA)
> +		return NULL;
> +
> +	/* copy token to local address space */
> +	result = kvm_read_guest(vcpu->kvm, tkn_gpa, &token, sizeof(token));
> +	if (IS_ERR_VALUE(result)) {
> +		kvm_err("kvmi: failed copying token from user\n");
> +		return ERR_PTR(result);
> +	}
> +
> +	/* consume token & find the VM */
> +	spin_lock(&token_lock);
> +	list_for_each(cur, &token_list) {
> +		tep = list_entry(cur, struct token_entry, token_list);
> +
> +		if (!memcmp(&token, &tep->token, sizeof(token))) {
> +			list_del(&tep->token_list);
> +			found = tep;
> +			break;
> +		}
> +	}
> +	spin_unlock(&token_lock);
> +
> +	if (found != NULL) {
> +		target_kvm = found->kvm;
> +		kfree(found);
> +	}
> +
> +	return target_kvm;
> +}
> +
> +static void remove_vm_token(struct kvm *kvm)
> +{
> +	struct list_head *cur, *next;
> +	struct token_entry *tep;
> +
> +	spin_lock(&token_lock);
> +	list_for_each_safe(cur, next, &token_list) {
> +		tep = list_entry(cur, struct token_entry, token_list);
> +
> +		if (tep->kvm == kvm) {
> +			list_del(&tep->token_list);
> +			kfree(tep);
> +		}
> +	}
> +	spin_unlock(&token_lock);
> +
> +}

There's an extra blank line at the end of this function (before the brace).

> +
> +
> +static int add_to_list(gpa_t map_gpa, struct kvm *machine, gpa_t req_gpa)
> +{
> +	struct host_map *map;
> +
> +	map = kmalloc(sizeof(struct host_map), GFP_KERNEL);

	map = kmalloc(sizeof(*map), GFP_KERNEL);

> +	if (map == NULL)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&map->mapping_list);
> +	map->map_gpa = map_gpa;
> +	map->machine = machine;
> +	map->req_gpa = req_gpa;
> +
> +	spin_lock(&mapping_lock);
> +	list_add_tail(&map->mapping_list, &mapping_list);
> +	spin_unlock(&mapping_lock);
> +
> +	return 0;
> +}
> +
> +static struct host_map *extract_from_list(gpa_t map_gpa)
> +{
> +	struct list_head *cur;
> +	struct host_map *map;
> +
> +	spin_lock(&mapping_lock);
> +	list_for_each(cur, &mapping_list) {
> +		map = list_entry(cur, struct host_map, mapping_list);
> +
> +		/* found - extract and return */
> +		if (map->map_gpa == map_gpa) {
> +			list_del(&map->mapping_list);
> +			spin_unlock(&mapping_lock);
> +
> +			return map;
> +		}
> +	}
> +	spin_unlock(&mapping_lock);
> +
> +	return NULL;
> +}
> +
> +static void remove_vm_from_list(struct kvm *kvm)
> +{
> +	struct list_head *cur, *next;
> +	struct host_map *map;
> +
> +	spin_lock(&mapping_lock);
> +
> +	list_for_each_safe(cur, next, &mapping_list) {
> +		map = list_entry(cur, struct host_map, mapping_list);
> +
> +		if (map->machine == kvm) {
> +			list_del(&map->mapping_list);
> +			kfree(map);
> +		}
> +	}
> +
> +	spin_unlock(&mapping_lock);
> +}
> +
> +static void remove_entry(struct host_map *map)
> +{
> +	kfree(map);
> +}
> +
> +
> +static struct vm_area_struct *isolate_page_vma(struct vm_area_struct *vma,
> +					       unsigned long addr)
> +{
> +	int result;
> +
> +	/* corner case */
> +	if (vma_pages(vma) == 1)
> +		return vma;
> +
> +	if (addr != vma->vm_start) {
> +		/* first split only if address in the middle */
> +		result = split_vma(vma->vm_mm, vma, addr, false);
> +		if (IS_ERR_VALUE((long)result))
> +			return ERR_PTR((long)result);
> +
> +		vma = find_vma(vma->vm_mm, addr);
> +		if (vma == NULL)
> +			return ERR_PTR(-ENOENT);
> +
> +		/* corner case (again) */
> +		if (vma_pages(vma) == 1)
> +			return vma;
> +	}
> +
> +	result = split_vma(vma->vm_mm, vma, addr + PAGE_SIZE, true);
> +	if (IS_ERR_VALUE((long)result))
> +		return ERR_PTR((long)result);
> +
> +	vma = find_vma(vma->vm_mm, addr);
> +	if (vma == NULL)
> +		return ERR_PTR(-ENOENT);
> +
> +	BUG_ON(vma_pages(vma) != 1);
> +
> +	return vma;
> +}
> +
> +static int redirect_rmap(struct vm_area_struct *req_vma, struct page *req_page,
> +			 struct vm_area_struct *map_vma)
> +{
> +	int result;
> +
> +	unlink_anon_vmas(map_vma);
> +
> +	result = anon_vma_fork(map_vma, req_vma);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out;

Why not just return result here?
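
I.e. (untested, which also makes the out label unnecessary):

	result = anon_vma_fork(map_vma, req_vma);
	if (IS_ERR_VALUE((long)result))
		return result;

	page_dup_rmap(req_page, false);

	return 0;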

> +
> +	page_dup_rmap(req_page, false);
> +
> +out:
> +	return result;
> +}
> +
> +static int host_map_fix_ptes(struct vm_area_struct *map_vma, hva_t map_hva,
> +			     struct page *req_page, struct page *map_page)
> +{
> +	struct mm_struct *map_mm = map_vma->vm_mm;
> +
> +	pmd_t *pmd;
> +	pte_t *ptep;
> +	spinlock_t *ptl;
> +	pte_t newpte;
> +
> +	unsigned long mmun_start;
> +	unsigned long mmun_end;
> +
> +	/* classic replace_page() code */
> +	pmd = mm_find_pmd(map_mm, map_hva);
> +	if (!pmd)
> +		return -EFAULT;
> +
> +	mmun_start = map_hva;
> +	mmun_end = map_hva + PAGE_SIZE;
> +	mmu_notifier_invalidate_range_start(map_mm, mmun_start, mmun_end);
> +
> +	ptep = pte_offset_map_lock(map_mm, pmd, map_hva, &ptl);
> +
> +	/* create new PTE based on requested page */
> +	newpte = mk_pte(req_page, map_vma->vm_page_prot);
> +	newpte = pte_set_flags(newpte, pte_flags(*ptep));
> +
> +	flush_cache_page(map_vma, map_hva, pte_pfn(*ptep));
> +	ptep_clear_flush_notify(map_vma, map_hva, ptep);
> +	set_pte_at_notify(map_mm, map_hva, ptep, newpte);
> +
> +	pte_unmap_unlock(ptep, ptl);
> +
> +	mmu_notifier_invalidate_range_end(map_mm, mmun_start, mmun_end);
> +
> +	return 0;
> +}
> +
> +static void discard_page(struct page *map_page)
> +{
> +	lock_page(map_page);
> +	// TODO: put_anon_vma() ???? - should be here
> +	page_remove_rmap(map_page, false);
> +	if (!page_mapped(map_page))
> +		try_to_free_swap(map_page);
> +	unlock_page(map_page);
> +	put_page(map_page);
> +}
> +
> +static void kvmi_split_huge_pmd(struct vm_area_struct *req_vma,
> +				hva_t req_hva, struct page *req_page)
> +{
> +	bool tail = false;
> +
> +	/* move reference count from compound head... */
> +	if (PageTail(req_page)) {
> +		tail = true;
> +		put_page(req_page);
> +	}
> +
> +	if (PageCompound(req_page))
> +		split_huge_pmd_address(req_vma, req_hva, false, NULL);
> +
> +	/* ... to the actual page, after splitting */
> +	if (tail)
> +		get_page(req_page);
> +}
> +
> +static int kvmi_map_action(struct mm_struct *req_mm, hva_t req_hva,
> +			   struct mm_struct *map_mm, hva_t map_hva)
> +{
> +	struct vm_area_struct *req_vma;
> +	struct page *req_page = NULL;
> +
> +	struct vm_area_struct *map_vma;
> +	struct page *map_page;
> +
> +	long nrpages;
> +	int result = 0;
> +
> +	/* VMAs will be modified */
> +	down_write(&req_mm->mmap_sem);
> +	down_write(&map_mm->mmap_sem);
> +
> +	/* get host page corresponding to requested address */
> +	nrpages = get_user_pages_remote(NULL, req_mm,
> +		req_hva, 1, 0,
> +		&req_page, &req_vma, NULL);
> +	if (nrpages == 0) {
> +		kvm_err("kvmi: no page for req_hva %016lx\n", req_hva);
> +		result = -ENOENT;
> +		goto out_err;
> +	} else if (IS_ERR_VALUE(nrpages)) {
> +		result = nrpages;
> +		kvm_err("kvmi: get_user_pages_remote() failed with result %d\n",
> +			result);
> +		goto out_err;
> +	}
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(req_page, "req_page before remap");
> +
> +	/* find (not get) local page corresponding to target address */
> +	map_vma = find_vma(map_mm, map_hva);
> +	if (map_vma == NULL) {
> +		kvm_err("kvmi: no local VMA found for remapping\n");
> +		result = -ENOENT;
> +		goto out_err;
> +	}
> +
> +	map_page = follow_page(map_vma, map_hva, 0);
> +	if (IS_ERR_VALUE(map_page)) {
> +		result = PTR_ERR(map_page);
> +		kvm_debug("kvmi: follow_page() failed with result %d\n",
> +			result);
> +		goto out_err;
> +	} else if (map_page == NULL) {
> +		result = -ENOENT;
> +		kvm_debug("kvmi: follow_page() returned no page\n");
> +		goto out_err;
> +	}
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(map_page, "map_page before remap");
> +
> +	/* split local VMA for rmap redirecting */
> +	map_vma = isolate_page_vma(map_vma, map_hva);
> +	if (IS_ERR_VALUE(map_vma)) {
> +		result = PTR_ERR(map_vma);
> +		kvm_debug("kvmi: isolate_page_vma() failed with result %d\n",
> +			result);
> +		goto out_err;
> +	}
> +
> +	/* split remote huge page */
> +	kvmi_split_huge_pmd(req_vma, req_hva, req_page);
> +
> +	/* re-link VMAs */
> +	result = redirect_rmap(req_vma, req_page, map_vma);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out_err;
> +
> +	/* also redirect page tables */
> +	result = host_map_fix_ptes(map_vma, map_hva, req_page, map_page);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out_err;
> +
> +	/* the old page will be discarded */
> +	discard_page(map_page);
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(map_page, "map_page after being discarded");
> +
> +	/* done */
> +	goto out_finalize;
> +
> +out_err:
> +	/* get_user_pages_remote() incremented page reference count */
> +	if (req_page != NULL)
> +		put_page(req_page);
> +
> +out_finalize:
> +	/* release semaphores in reverse order */
> +	up_write(&map_mm->mmap_sem);
> +	up_write(&req_mm->mmap_sem);
> +
> +	return result;
> +}
> +
> +int kvmi_host_mem_map(struct kvm_vcpu *vcpu, gva_t tkn_gva,
> +	gpa_t req_gpa, gpa_t map_gpa)
> +{
> +	int result = 0;
> +	struct kvm *target_kvm;
> +
> +	gfn_t req_gfn;
> +	hva_t req_hva;
> +	struct mm_struct *req_mm;
> +
> +	gfn_t map_gfn;
> +	hva_t map_hva;
> +	struct mm_struct *map_mm = vcpu->kvm->mm;
> +
> +	kvm_debug("kvmi: mapping request req_gpa %016llx, map_gpa %016llx\n",
> +		  req_gpa, map_gpa);
> +
> +	/* get the struct kvm * corresponding to the token */
> +	target_kvm = find_machine_at(vcpu, tkn_gva);
> +	if (IS_ERR_VALUE(target_kvm))
> +		return PTR_ERR(target_kvm);

Since the else if block below has braces, this if block should have 
braces too.
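
I.e.:

	if (IS_ERR_VALUE(target_kvm)) {
		return PTR_ERR(target_kvm);
	} else if (target_kvm == NULL) {
		kvm_err("kvmi: unable to find target machine\n");
		return -ENOENT;
	}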

> +	else if (target_kvm == NULL) {
> +		kvm_err("kvmi: unable to find target machine\n");
> +		return -ENOENT;
> +	}
> +	kvm_get_kvm(target_kvm);
> +	req_mm = target_kvm->mm;
> +
> +	/* translate source addresses */
> +	req_gfn = gpa_to_gfn(req_gpa);
> +	req_hva = gfn_to_hva_safe(target_kvm, req_gfn);
> +	if (kvm_is_error_hva(req_hva)) {
> +		kvm_err("kvmi: invalid req HVA %016lx\n", req_hva);
> +		result = -EFAULT;
> +		goto out;
> +	}
> +
> +	kvm_debug("kvmi: req_gpa %016llx, req_gfn %016llx, req_hva %016lx\n",
> +		  req_gpa, req_gfn, req_hva);
> +
> +	/* translate destination addresses */
> +	map_gfn = gpa_to_gfn(map_gpa);
> +	map_hva = gfn_to_hva_safe(vcpu->kvm, map_gfn);
> +	if (kvm_is_error_hva(map_hva)) {
> +		kvm_err("kvmi: invalid map HVA %016lx\n", map_hva);
> +		result = -EFAULT;
> +		goto out;
> +	}
> +
> +	kvm_debug("kvmi: map_gpa %016llx, map_gfn %016llx, map_hva %016lx\n",
> +		map_gpa, map_gfn, map_hva);
> +
> +	/* go to step 2 */
> +	result = kvmi_map_action(req_mm, req_hva, map_mm, map_hva);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out;
> +
> +	/* add mapping to list */
> +	result = add_to_list(map_gpa, target_kvm, req_gpa);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out;
> +
> +	/* all fine */
> +	kvm_debug("kvmi: mapping of req_gpa %016llx successful\n", req_gpa);
> +
> +out:
> +	/* mandatory dec refernce count */
> +	kvm_put_kvm(target_kvm);
> +
> +	return result;
> +}
> +
> +
> +static int restore_rmap(struct vm_area_struct *map_vma, hva_t map_hva,
> +			struct page *req_page, struct page *new_page)
> +{
> +	int result;
> +
> +	/* decouple links to anon_vmas */
> +	unlink_anon_vmas(map_vma);
> +	map_vma->anon_vma = NULL;
> +
> +	/* allocate new anon_vma */
> +	result = anon_vma_prepare(map_vma);
> +	if (IS_ERR_VALUE((long)result))
> +		return result;
> +
> +	lock_page(new_page);
> +	page_add_new_anon_rmap(new_page, map_vma, map_hva, false);
> +	unlock_page(new_page);
> +
> +	/* decrease req_page mapcount */
> +	atomic_dec(&req_page->_mapcount);
> +
> +	return 0;
> +}
> +
> +static int host_unmap_fix_ptes(struct vm_area_struct *map_vma, hva_t map_hva,
> +			       struct page *new_page)
> +{
> +	struct mm_struct *map_mm = map_vma->vm_mm;
> +	pmd_t *pmd;
> +	pte_t *ptep;
> +	spinlock_t *ptl;
> +	pte_t newpte;
> +
> +	unsigned long mmun_start;
> +	unsigned long mmun_end;
> +
> +	/* page replacing code */
> +	pmd = mm_find_pmd(map_mm, map_hva);
> +	if (!pmd)
> +		return -EFAULT;
> +
> +	mmun_start = map_hva;
> +	mmun_end = map_hva + PAGE_SIZE;
> +	mmu_notifier_invalidate_range_start(map_mm, mmun_start, mmun_end);
> +
> +	ptep = pte_offset_map_lock(map_mm, pmd, map_hva, &ptl);
> +
> +	newpte = mk_pte(new_page, map_vma->vm_page_prot);
> +	newpte = pte_set_flags(newpte, pte_flags(*ptep));
> +
> +	/* clear cache & MMU notifier entries */
> +	flush_cache_page(map_vma, map_hva, pte_pfn(*ptep));
> +	ptep_clear_flush_notify(map_vma, map_hva, ptep);
> +	set_pte_at_notify(map_mm, map_hva, ptep, newpte);
> +
> +	pte_unmap_unlock(ptep, ptl);
> +
> +	mmu_notifier_invalidate_range_end(map_mm, mmun_start, mmun_end);
> +
> +	return 0;
> +}
> +
> +static int kvmi_unmap_action(struct mm_struct *req_mm,
> +			     struct mm_struct *map_mm, hva_t map_hva)
> +{
> +	struct vm_area_struct *map_vma;
> +	struct page *req_page = NULL;
> +	struct page *new_page = NULL;
> +
> +	int result;
> +
> +	/* VMAs will be modified */
> +	down_write(&req_mm->mmap_sem);
> +	down_write(&map_mm->mmap_sem);
> +
> +	/* find destination VMA for mapping */
> +	map_vma = find_vma(map_mm, map_hva);
> +	if (map_vma == NULL) {
> +		result = -ENOENT;
> +		kvm_err("kvmi: no local VMA found for unmapping\n");
> +		goto out_err;
> +	}
> +
> +	/* find (not get) page mapped to destination address */
> +	req_page = follow_page(map_vma, map_hva, 0);
> +	if (IS_ERR_VALUE(req_page)) {
> +		result = PTR_ERR(req_page);
> +		kvm_err("kvmi: follow_page() failed with result %d\n", result);
> +		goto out_err;
> +	} else if (req_page == NULL) {
> +		result = -ENOENT;
> +		kvm_err("kvmi: follow_page() returned no page\n");
> +		goto out_err;
> +	}
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(req_page, "req_page before decoupling");
> +
> +	/* Returns NULL when no page can be allocated. */
> +	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, map_vma, map_hva);
> +	if (new_page == NULL) {
> +		result = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(new_page, "new_page after allocation");
> +
> +	/* should fix the rmap tree */
> +	result = restore_rmap(map_vma, map_hva, req_page, new_page);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out_err;
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(req_page, "req_page after decoupling");
> +
> +	/* page table fixing here */
> +	result = host_unmap_fix_ptes(map_vma, map_hva, new_page);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out_err;
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(new_page, "new_page after unmapping");
> +
> +	goto out_finalize;
> +
> +out_err:
> +	if (new_page != NULL)
> +		put_page(new_page);
> +
> +out_finalize:
> +	/* reference count was inc during get_user_pages_remote() */
> +	if (req_page != NULL) {
> +		put_page(req_page);
> +
> +		if (IS_ENABLED(CONFIG_DEBUG_VM))
> +			dump_page(req_page, "req_page after release");
> +	}
> +
> +	/* release semaphores in reverse order */
> +	up_write(&map_mm->mmap_sem);
> +	up_write(&req_mm->mmap_sem);
> +
> +	return result;
> +}
> +
> +int kvmi_host_mem_unmap(struct kvm_vcpu *vcpu, gpa_t map_gpa)
> +{
> +	struct kvm *target_kvm;
> +	struct mm_struct *req_mm;
> +
> +	struct host_map *map;
> +	int result;
> +
> +	gfn_t map_gfn;
> +	hva_t map_hva;
> +	struct mm_struct *map_mm = vcpu->kvm->mm;
> +
> +	kvm_debug("kvmi: unmap request for map_gpa %016llx\n", map_gpa);
> +
> +	/* get the struct kvm * corresponding to map_gpa */
> +	map = extract_from_list(map_gpa);
> +	if (map == NULL) {
> +		kvm_err("kvmi: map_gpa %016llx not mapped\n", map_gpa);
> +		return -ENOENT;
> +	}
> +	target_kvm = map->machine;
> +	kvm_get_kvm(target_kvm);
> +	req_mm = target_kvm->mm;
> +
> +	kvm_debug("kvmi: req_gpa %016llx of machine %016lx mapped in map_gpa %016llx\n",
> +		  map->req_gpa, (unsigned long) map->machine, map->map_gpa);
> +
> +	/* address where we did the remapping */
> +	map_gfn = gpa_to_gfn(map_gpa);
> +	map_hva = gfn_to_hva_safe(vcpu->kvm, map_gfn);
> +	if (kvm_is_error_hva(map_hva)) {
> +		result = -EFAULT;
> +		kvm_err("kvmi: invalid HVA %016lx\n", map_hva);
> +		goto out;
> +	}
> +
> +	kvm_debug("kvmi: map_gpa %016llx, map_gfn %016llx, map_hva %016lx\n",
> +		  map_gpa, map_gfn, map_hva);
> +
> +	/* go to step 2 */
> +	result = kvmi_unmap_action(req_mm, map_mm, map_hva);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out;
> +
> +	kvm_debug("kvmi: unmap of map_gpa %016llx successful\n", map_gpa);
> +
> +out:
> +	kvm_put_kvm(target_kvm);
> +
> +	/* remove entry whatever happens above */
> +	remove_entry(map);
> +
> +	return result;
> +}
> +
> +void kvmi_mem_destroy_vm(struct kvm *kvm)
> +{
> +	kvm_debug("kvmi: machine %016lx was torn down\n",
> +		(unsigned long) kvm);
> +
> +	remove_vm_from_list(kvm);
> +	remove_vm_token(kvm);
> +}
> +
> +
> +int kvm_intro_host_init(void)
> +{
> +	/* token database */
> +	INIT_LIST_HEAD(&token_list);
> +	spin_lock_init(&token_lock);
> +
> +	/* mapping database */
> +	INIT_LIST_HEAD(&mapping_list);
> +	spin_lock_init(&mapping_lock);
> +
> +	kvm_info("kvmi: initialized host memory introspection\n");
> +
> +	return 0;
> +}
> +
> +void kvm_intro_host_exit(void)
> +{
> +	// ...
> +}
> +
> +module_init(kvm_intro_host_init)
> +module_exit(kvm_intro_host_exit)
> diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
> new file mode 100644
> index 000000000000..b1b20eb6332d
> --- /dev/null
> +++ b/virt/kvm/kvmi_msg.c
> @@ -0,0 +1,1134 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KVM introspection
> + *
> + * Copyright (C) 2017 Bitdefender S.R.L.
> + *
> + */
> +#include <linux/file.h>
> +#include <linux/net.h>
> +#include <linux/kvm_host.h>
> +#include <linux/kvmi.h>
> +#include <asm/virtext.h>
> +
> +#include <uapi/linux/kvmi.h>
> +#include <uapi/asm/kvmi.h>
> +
> +#include "kvmi_int.h"
> +
> +#include <trace/events/kvmi.h>
> +
> +/*
> + * TODO: break these call paths
> + *   kvmi.c        work_cb
> + *   kvmi_msg.c    kvmi_dispatch_message
> + *   kvmi.c        kvmi_cmd_... / kvmi_make_request
> + *   kvmi_msg.c    kvmi_msg_reply
> + *
> + *   kvmi.c        kvmi_X_event
> + *   kvmi_msg.c    kvmi_send_event
> + *   kvmi.c        kvmi_handle_request
> + */
> +
> +/* TODO: move some of the code to arch/x86 */
> +
> +static atomic_t seq_ev = ATOMIC_INIT(0);
> +
> +static u32 new_seq(void)
> +{
> +	return atomic_inc_return(&seq_ev);
> +}
> +
> +static const char *event_str(unsigned int e)
> +{
> +	switch (e) {
> +	case KVMI_EVENT_CR:
> +		return "CR";
> +	case KVMI_EVENT_MSR:
> +		return "MSR";
> +	case KVMI_EVENT_XSETBV:
> +		return "XSETBV";
> +	case KVMI_EVENT_BREAKPOINT:
> +		return "BREAKPOINT";
> +	case KVMI_EVENT_HYPERCALL:
> +		return "HYPERCALL";
> +	case KVMI_EVENT_PAGE_FAULT:
> +		return "PAGE_FAULT";
> +	case KVMI_EVENT_TRAP:
> +		return "TRAP";
> +	case KVMI_EVENT_DESCRIPTOR:
> +		return "DESCRIPTOR";
> +	case KVMI_EVENT_CREATE_VCPU:
> +		return "CREATE_VCPU";
> +	case KVMI_EVENT_PAUSE_VCPU:
> +		return "PAUSE_VCPU";
> +	default:
> +		return "EVENT?";
> +	}
> +}
> +
> +static const char * const msg_IDs[] = {
> +	[KVMI_GET_VERSION]      = "KVMI_GET_VERSION",
> +	[KVMI_GET_GUEST_INFO]   = "KVMI_GET_GUEST_INFO",
> +	[KVMI_PAUSE_VCPU]       = "KVMI_PAUSE_VCPU",
> +	[KVMI_GET_REGISTERS]    = "KVMI_GET_REGISTERS",
> +	[KVMI_SET_REGISTERS]    = "KVMI_SET_REGISTERS",
> +	[KVMI_GET_PAGE_ACCESS]  = "KVMI_GET_PAGE_ACCESS",
> +	[KVMI_SET_PAGE_ACCESS]  = "KVMI_SET_PAGE_ACCESS",
> +	[KVMI_INJECT_EXCEPTION] = "KVMI_INJECT_EXCEPTION",
> +	[KVMI_READ_PHYSICAL]    = "KVMI_READ_PHYSICAL",
> +	[KVMI_WRITE_PHYSICAL]   = "KVMI_WRITE_PHYSICAL",
> +	[KVMI_GET_MAP_TOKEN]    = "KVMI_GET_MAP_TOKEN",
> +	[KVMI_CONTROL_EVENTS]   = "KVMI_CONTROL_EVENTS",
> +	[KVMI_CONTROL_CR]       = "KVMI_CONTROL_CR",
> +	[KVMI_CONTROL_MSR]      = "KVMI_CONTROL_MSR",
> +	[KVMI_EVENT]            = "KVMI_EVENT",
> +	[KVMI_EVENT_REPLY]      = "KVMI_EVENT_REPLY",
> +	[KVMI_GET_CPUID]        = "KVMI_GET_CPUID",
> +	[KVMI_GET_XSAVE]        = "KVMI_GET_XSAVE",
> +};
> +
> +static size_t sizeof_get_registers(const void *r)
> +{
> +	const struct kvmi_get_registers *req = r;
> +
> +	return sizeof(*req) + sizeof(req->msrs_idx[0]) * req->nmsrs;
> +}
> +
> +static size_t sizeof_get_page_access(const void *r)
> +{
> +	const struct kvmi_get_page_access *req = r;
> +
> +	return sizeof(*req) + sizeof(req->gpa[0]) * req->count;
> +}
> +
> +static size_t sizeof_set_page_access(const void *r)
> +{
> +	const struct kvmi_set_page_access *req = r;
> +
> +	return sizeof(*req) + sizeof(req->entries[0]) * req->count;
> +}
> +
> +static size_t sizeof_write_physical(const void *r)
> +{
> +	const struct kvmi_write_physical *req = r;
> +
> +	return sizeof(*req) + req->size;
> +}
> +
> +static const struct {
> +	size_t size;
> +	size_t (*cbk_full_size)(const void *msg);
> +} msg_bytes[] = {
> +	[KVMI_GET_VERSION]      = { 0, NULL },
> +	[KVMI_GET_GUEST_INFO]   = { sizeof(struct kvmi_get_guest_info), NULL },
> +	[KVMI_PAUSE_VCPU]       = { sizeof(struct kvmi_pause_vcpu), NULL },
> +	[KVMI_GET_REGISTERS]    = { sizeof(struct kvmi_get_registers),
> +						sizeof_get_registers },
> +	[KVMI_SET_REGISTERS]    = { sizeof(struct kvmi_set_registers), NULL },
> +	[KVMI_GET_PAGE_ACCESS]  = { sizeof(struct kvmi_get_page_access),
> +						sizeof_get_page_access },
> +	[KVMI_SET_PAGE_ACCESS]  = { sizeof(struct kvmi_set_page_access),
> +						sizeof_set_page_access },
> +	[KVMI_INJECT_EXCEPTION] = { sizeof(struct kvmi_inject_exception),
> +					NULL },
> +	[KVMI_READ_PHYSICAL]    = { sizeof(struct kvmi_read_physical), NULL },
> +	[KVMI_WRITE_PHYSICAL]   = { sizeof(struct kvmi_write_physical),
> +						sizeof_write_physical },
> +	[KVMI_GET_MAP_TOKEN]    = { 0, NULL },
> +	[KVMI_CONTROL_EVENTS]   = { sizeof(struct kvmi_control_events), NULL },
> +	[KVMI_CONTROL_CR]       = { sizeof(struct kvmi_control_cr), NULL },
> +	[KVMI_CONTROL_MSR]      = { sizeof(struct kvmi_control_msr), NULL },
> +	[KVMI_GET_CPUID]        = { sizeof(struct kvmi_get_cpuid), NULL },
> +	[KVMI_GET_XSAVE]        = { sizeof(struct kvmi_get_xsave), NULL },
> +};
> +
> +static int kvmi_sock_read(struct kvmi *ikvm, void *buf, size_t size)
> +{
> +	struct kvec i = {
> +		.iov_base = buf,
> +		.iov_len = size,
> +	};
> +	struct msghdr m = { };
> +	int rc;
> +
> +	read_lock(&ikvm->sock_lock);
> +
> +	if (likely(ikvm->sock))
> +		rc = kernel_recvmsg(ikvm->sock, &m, &i, 1, size, MSG_WAITALL);
> +	else
> +		rc = -EPIPE;
> +
> +	if (rc > 0)
> +		print_hex_dump_debug("read: ", DUMP_PREFIX_NONE, 32, 1,
> +					buf, rc, false);
> +
> +	read_unlock(&ikvm->sock_lock);
> +
> +	if (unlikely(rc != size)) {
> +		kvm_err("kernel_recvmsg: %d\n", rc);
> +		if (rc >= 0)
> +			rc = -EPIPE;
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +
> +static int kvmi_sock_write(struct kvmi *ikvm, struct kvec *i, size_t n,
> +			   size_t size)
> +{
> +	struct msghdr m = { };
> +	int rc, k;
> +
> +	read_lock(&ikvm->sock_lock);
> +
> +	if (likely(ikvm->sock))
> +		rc = kernel_sendmsg(ikvm->sock, &m, i, n, size);
> +	else
> +		rc = -EPIPE;
> +
> +	for (k = 0; k < n; k++)
> +		print_hex_dump_debug("write: ", DUMP_PREFIX_NONE, 32, 1,
> +				     i[k].iov_base, i[k].iov_len, false);
> +
> +	read_unlock(&ikvm->sock_lock);
> +
> +	if (unlikely(rc != size)) {
> +		kvm_err("kernel_sendmsg: %d\n", rc);
> +		if (rc >= 0)
> +			rc = -EPIPE;
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +
> +static const char *id2str(int i)
> +{
> +	return (i < ARRAY_SIZE(msg_IDs) && msg_IDs[i] ? msg_IDs[i] : "unknown");
> +}
> +
> +static struct kvmi_vcpu *kvmi_vcpu_waiting_for_reply(struct kvm *kvm, u32 seq)
> +{
> +	struct kvmi_vcpu *found = NULL;
> +	struct kvm_vcpu *vcpu;
> +	int i;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		/* kvmi_send_event */
> +		smp_rmb();
> +		if (READ_ONCE(IVCPU(vcpu)->ev_rpl_waiting)
> +		    && seq == IVCPU(vcpu)->ev_seq) {
> +			found = IVCPU(vcpu);
> +			break;
> +		}
> +	}
> +
> +	mutex_unlock(&kvm->lock);
> +
> +	return found;
> +}
> +
> +static bool kvmi_msg_dispatch_reply(struct kvmi *ikvm,
> +				    const struct kvmi_msg_hdr *msg)
> +{
> +	struct kvmi_vcpu *ivcpu;
> +	int err;
> +
> +	ivcpu = kvmi_vcpu_waiting_for_reply(ikvm->kvm, msg->seq);
> +	if (!ivcpu) {
> +		kvm_err("%s: unexpected event reply (seq=%u)\n", __func__,
> +			msg->seq);
> +		return false;
> +	}
> +
> +	if (msg->size == sizeof(ivcpu->ev_rpl) + ivcpu->ev_rpl_size) {
> +		err = kvmi_sock_read(ikvm, &ivcpu->ev_rpl,
> +					sizeof(ivcpu->ev_rpl));
> +		if (!err && ivcpu->ev_rpl_size)
> +			err = kvmi_sock_read(ikvm, ivcpu->ev_rpl_ptr,
> +						ivcpu->ev_rpl_size);
> +	} else {
> +		kvm_err("%s: invalid event reply size (max=%zu, recv=%u, expected=%zu)\n",
> +			__func__, ivcpu->ev_rpl_size, msg->size,
> +			sizeof(ivcpu->ev_rpl) + ivcpu->ev_rpl_size);
> +		err = -1;
> +	}
> +
> +	ivcpu->ev_rpl_received = err ? -1 : ivcpu->ev_rpl_size;
> +
> +	kvmi_make_request(ivcpu, REQ_REPLY);
> +
> +	return (err == 0);
> +}
> +
> +static bool consume_sock_bytes(struct kvmi *ikvm, size_t n)
> +{
> +	while (n) {
> +		u8 buf[256];
> +		size_t chunk = min(n, sizeof(buf));
> +
> +		if (kvmi_sock_read(ikvm, buf, chunk) != 0)
> +			return false;
> +
> +		n -= chunk;
> +	}
> +
> +	return true;
> +}
> +
> +static int kvmi_msg_reply(struct kvmi *ikvm,
> +			  const struct kvmi_msg_hdr *msg,
> +			  int err, const void *rpl, size_t rpl_size)
> +{
> +	struct kvmi_error_code ec;
> +	struct kvmi_msg_hdr h;
> +	struct kvec vec[3] = {
> +		{.iov_base = &h,           .iov_len = sizeof(h) },
> +		{.iov_base = &ec,          .iov_len = sizeof(ec)},
> +		{.iov_base = (void *) rpl, .iov_len = rpl_size  },
> +	};
> +	size_t size = sizeof(h) + sizeof(ec) + (err ? 0 : rpl_size);
> +	size_t n = err ? ARRAY_SIZE(vec)-1 : ARRAY_SIZE(vec);
> +
> +	memset(&h, 0, sizeof(h));
> +	h.id = msg->id;
> +	h.seq = msg->seq;
> +	h.size = size - sizeof(h);
> +
> +	memset(&ec, 0, sizeof(ec));
> +	ec.err = err;
> +
> +	return kvmi_sock_write(ikvm, vec, n, size);
> +}
> +
> +static int kvmi_msg_vcpu_reply(struct kvm_vcpu *vcpu,
> +				const struct kvmi_msg_hdr *msg,
> +				int err, const void *rpl, size_t size)
> +{
> +	/*
> +	 * As soon as we reply to this vCPU command, we can get another one,
> +	 * and we must signal that the incoming buffer (ivcpu->msg_buf)
> +	 * is ready by clearing this bit/request.
> +	 */
> +	kvmi_clear_request(IVCPU(vcpu), REQ_CMD);
> +
> +	return kvmi_msg_reply(IKVM(vcpu->kvm), msg, err, rpl, size);
> +}
> +
> +bool kvmi_msg_init(struct kvmi *ikvm, int fd)
> +{
> +	struct socket *sock;
> +	int r;
> +
> +	sock = sockfd_lookup(fd, &r);
> +
> +	if (!sock) {
> +		kvm_err("Invalid file handle: %d\n", fd);
> +		return false;
> +	}
> +
> +	WRITE_ONCE(ikvm->sock, sock);
> +
> +	return true;
> +}
> +
> +void kvmi_msg_uninit(struct kvmi *ikvm)
> +{
> +	kvm_info("Wake up the receiving thread\n");
> +
> +	read_lock(&ikvm->sock_lock);
> +
> +	if (ikvm->sock)
> +		kernel_sock_shutdown(ikvm->sock, SHUT_RDWR);
> +
> +	read_unlock(&ikvm->sock_lock);
> +
> +	kvm_info("Wait for the receiving thread to complete\n");
> +	wait_for_completion(&ikvm->finished);
> +}
> +
> +static int handle_get_version(struct kvmi *ikvm,
> +			      const struct kvmi_msg_hdr *msg, const void *req)
> +{
> +	struct kvmi_get_version_reply rpl;
> +
> +	memset(&rpl, 0, sizeof(rpl));
> +	rpl.version = KVMI_VERSION;
> +
> +	return kvmi_msg_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
> +}
> +
> +static struct kvm_vcpu *kvmi_get_vcpu(struct kvmi *ikvm, int vcpu_id)
> +{
> +	struct kvm *kvm = ikvm->kvm;
> +
> +	if (vcpu_id >= atomic_read(&kvm->online_vcpus))
> +		return NULL;
> +
> +	return kvm_get_vcpu(kvm, vcpu_id);
> +}
> +
> +static bool invalid_page_access(u64 gpa, u64 size)
> +{
> +	u64 off = gpa & ~PAGE_MASK;
> +
> +	return (size == 0 || size > PAGE_SIZE || off + size > PAGE_SIZE);
> +}
> +
> +static int handle_read_physical(struct kvmi *ikvm,
> +				const struct kvmi_msg_hdr *msg,
> +				const void *_req)
> +{
> +	const struct kvmi_read_physical *req = _req;
> +
> +	if (invalid_page_access(req->gpa, req->size))
> +		return -EINVAL;
> +
> +	return kvmi_cmd_read_physical(ikvm->kvm, req->gpa, req->size,
> +				      kvmi_msg_reply, msg);
> +}
> +
> +static int handle_write_physical(struct kvmi *ikvm,
> +				 const struct kvmi_msg_hdr *msg,
> +				 const void *_req)
> +{
> +	const struct kvmi_write_physical *req = _req;
> +	int ec;
> +
> +	if (invalid_page_access(req->gpa, req->size))
> +		return -EINVAL;
> +
> +	ec = kvmi_cmd_write_physical(ikvm->kvm, req->gpa, req->size, req->data);
> +
> +	return kvmi_msg_reply(ikvm, msg, ec, NULL, 0);
> +}
> +
> +static int handle_get_map_token(struct kvmi *ikvm,
> +				const struct kvmi_msg_hdr *msg,
> +				const void *_req)
> +{
> +	struct kvmi_get_map_token_reply rpl;
> +	int ec;
> +
> +	ec = kvmi_cmd_alloc_token(ikvm->kvm, &rpl.token);
> +
> +	return kvmi_msg_reply(ikvm, msg, ec, &rpl, sizeof(rpl));
> +}
> +
> +static int handle_control_cr(struct kvmi *ikvm,
> +			     const struct kvmi_msg_hdr *msg, const void *_req)
> +{
> +	const struct kvmi_control_cr *req = _req;
> +	int ec;
> +
> +	ec = kvmi_cmd_control_cr(ikvm, req->enable, req->cr);
> +
> +	return kvmi_msg_reply(ikvm, msg, ec, NULL, 0);
> +}
> +
> +static int handle_control_msr(struct kvmi *ikvm,
> +			      const struct kvmi_msg_hdr *msg, const void *_req)
> +{
> +	const struct kvmi_control_msr *req = _req;
> +	int ec;
> +
> +	ec = kvmi_cmd_control_msr(ikvm->kvm, req->enable, req->msr);
> +
> +	return kvmi_msg_reply(ikvm, msg, ec, NULL, 0);
> +}
> +
> +/*
> + * These commands are executed on the receiving thread/worker.
> + */
> +static int (*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
> +			     const void *) = {
> +	[KVMI_GET_VERSION]    = handle_get_version,
> +	[KVMI_READ_PHYSICAL]  = handle_read_physical,
> +	[KVMI_WRITE_PHYSICAL] = handle_write_physical,
> +	[KVMI_GET_MAP_TOKEN]  = handle_get_map_token,
> +	[KVMI_CONTROL_CR]     = handle_control_cr,
> +	[KVMI_CONTROL_MSR]    = handle_control_msr,
> +};
> +
> +static int handle_get_guest_info(struct kvm_vcpu *vcpu,
> +				 const struct kvmi_msg_hdr *msg,
> +				 const void *req)
> +{
> +	struct kvmi_get_guest_info_reply rpl;
> +
> +	memset(&rpl, 0, sizeof(rpl));
> +	kvmi_cmd_get_guest_info(vcpu, &rpl.vcpu_count, &rpl.tsc_speed);
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, 0, &rpl, sizeof(rpl));
> +}
> +
> +static int handle_pause_vcpu(struct kvm_vcpu *vcpu,
> +			     const struct kvmi_msg_hdr *msg,
> +			     const void *req)
> +{
> +	int ec = kvmi_cmd_pause_vcpu(vcpu);
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
> +}
> +
> +static void *alloc_get_registers_reply(const struct kvmi_msg_hdr *msg,
> +				       const struct kvmi_get_registers *req,
> +				       size_t *rpl_size)
> +{
> +	struct kvmi_get_registers_reply *rpl;
> +	u16 k, n = req->nmsrs;
> +
> +	*rpl_size = sizeof(*rpl) + sizeof(rpl->msrs.entries[0]) * n;
> +
> +	rpl = kzalloc(*rpl_size, GFP_KERNEL);
> +
> +	if (rpl) {
> +		rpl->msrs.nmsrs = n;
> +
> +		for (k = 0; k < n; k++)
> +			rpl->msrs.entries[k].index = req->msrs_idx[k];
> +	}
> +
> +	return rpl;
> +}
> +
> +static int handle_get_registers(struct kvm_vcpu *vcpu,
> +				const struct kvmi_msg_hdr *msg, const void *req)
> +{
> +	struct kvmi_get_registers_reply *rpl;
> +	size_t rpl_size = 0;
> +	int err, ec;
> +
> +	rpl = alloc_get_registers_reply(msg, req, &rpl_size);
> +
> +	if (!rpl)
> +		ec = -KVM_ENOMEM;
> +	else
> +		ec = kvmi_cmd_get_registers(vcpu, &rpl->mode,
> +						&rpl->regs, &rpl->sregs,
> +						&rpl->msrs);
> +
> +	err = kvmi_msg_vcpu_reply(vcpu, msg, ec, rpl, rpl_size);
> +	kfree(rpl);
> +	return err;
> +}
> +
> +static int handle_set_registers(struct kvm_vcpu *vcpu,
> +				const struct kvmi_msg_hdr *msg,
> +				const void *_req)
> +{
> +	const struct kvmi_set_registers *req = _req;
> +	int ec;
> +
> +	ec = kvmi_cmd_set_registers(vcpu, &req->regs);
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
> +}
> +
> +static int handle_get_page_access(struct kvm_vcpu *vcpu,
> +				  const struct kvmi_msg_hdr *msg,
> +				  const void *_req)
> +{
> +	const struct kvmi_get_page_access *req = _req;
> +	struct kvmi_get_page_access_reply *rpl = NULL;
> +	size_t rpl_size = 0;
> +	u16 k, n = req->count;
> +	int err, ec = 0;
> +
> +	if (req->view != 0 && !kvm_eptp_switching_supported) {
> +		ec = -KVM_ENOSYS;
> +		goto out;
> +	}
> +
> +	if (req->view != 0) { /* TODO */
> +		ec = -KVM_EINVAL;
> +		goto out;
> +	}
> +
> +	rpl_size = sizeof(*rpl) + sizeof(rpl->access[0]) * n;
> +	rpl = kzalloc(rpl_size, GFP_KERNEL);
> +
> +	if (!rpl) {
> +		ec = -KVM_ENOMEM;
> +		goto out;
> +	}
> +
> +	for (k = 0; k < n && ec == 0; k++)
> +		ec = kvmi_cmd_get_page_access(vcpu, req->gpa[k],
> +						&rpl->access[k]);
> +
> +out:
> +	err = kvmi_msg_vcpu_reply(vcpu, msg, ec, rpl, rpl_size);
> +	kfree(rpl);
> +	return err;
> +}
> +
> +static int handle_set_page_access(struct kvm_vcpu *vcpu,
> +				  const struct kvmi_msg_hdr *msg,
> +				  const void *_req)
> +{
> +	const struct kvmi_set_page_access *req = _req;
> +	struct kvm *kvm = vcpu->kvm;
> +	u16 k, n = req->count;
> +	int ec = 0;
> +
> +	if (req->view != 0) {
> +		if (!kvm_eptp_switching_supported)
> +			ec = -KVM_ENOSYS;
> +		else
> +			ec = -KVM_EINVAL; /* TODO */
> +	} else {
> +		for (k = 0; k < n; k++) {
> +			u64 gpa   = req->entries[k].gpa;
> +			u8 access = req->entries[k].access;
> +			int ec0;
> +
> +			if (access &  ~(KVMI_PAGE_ACCESS_R |
> +					KVMI_PAGE_ACCESS_W |
> +					KVMI_PAGE_ACCESS_X))
> +				ec0 = -KVM_EINVAL;
> +			else
> +				ec0 = kvmi_set_mem_access(kvm, gpa, access);
> +
> +			if (ec0 && !ec)
> +				ec = ec0;
> +
> +			trace_kvmi_set_mem_access(gpa_to_gfn(gpa), access, ec0);
> +		}
> +	}
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
> +}
> +
> +static int handle_inject_exception(struct kvm_vcpu *vcpu,
> +				   const struct kvmi_msg_hdr *msg,
> +				   const void *_req)
> +{
> +	const struct kvmi_inject_exception *req = _req;
> +	int ec;
> +
> +	ec = kvmi_cmd_inject_exception(vcpu, req->nr, req->has_error,
> +				       req->error_code, req->address);
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
> +}
> +
> +static int handle_control_events(struct kvm_vcpu *vcpu,
> +				 const struct kvmi_msg_hdr *msg,
> +				 const void *_req)
> +{
> +	const struct kvmi_control_events *req = _req;
> +	u32 not_allowed = ~IKVM(vcpu->kvm)->event_allow_mask;
> +	u32 unknown = ~KVMI_KNOWN_EVENTS;
> +	int ec;
> +
> +	if (req->events & unknown)
> +		ec = -KVM_EINVAL;
> +	else if (req->events & not_allowed)
> +		ec = -KVM_EPERM;
> +	else
> +		ec = kvmi_cmd_control_events(vcpu, req->events);
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
> +}
> +
> +static int handle_get_cpuid(struct kvm_vcpu *vcpu,
> +			    const struct kvmi_msg_hdr *msg,
> +			    const void *_req)
> +{
> +	const struct kvmi_get_cpuid *req = _req;
> +	struct kvmi_get_cpuid_reply rpl;
> +	int ec;
> +
> +	memset(&rpl, 0, sizeof(rpl));
> +
> +	ec = kvmi_cmd_get_cpuid(vcpu, req->function, req->index,
> +					&rpl.eax, &rpl.ebx, &rpl.ecx,
> +					&rpl.edx);
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, ec, &rpl, sizeof(rpl));
> +}
> +
> +static int handle_get_xsave(struct kvm_vcpu *vcpu,
> +			    const struct kvmi_msg_hdr *msg, const void *req)
> +{
> +	struct kvmi_get_xsave_reply *rpl;
> +	size_t rpl_size = sizeof(*rpl) + sizeof(struct kvm_xsave);
> +	int ec = 0, err;
> +
> +	rpl = kzalloc(rpl_size, GFP_KERNEL);
> +
> +	if (!rpl)
> +		ec = -KVM_ENOMEM;

Again, because the else block has braces, the if should too.
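
Something like this, perhaps (just a sketch with the same logic, braced):

	if (!rpl) {
		ec = -KVM_ENOMEM;
	} else {
		struct kvm_xsave *area;

		area = (struct kvm_xsave *)&rpl->region[0];
		kvm_vcpu_ioctl_x86_get_xsave(vcpu, area);
	}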

> +	else {
> +		struct kvm_xsave *area;
> +
> +		area = (struct kvm_xsave *)&rpl->region[0];
> +		kvm_vcpu_ioctl_x86_get_xsave(vcpu, area);
> +	}
> +
> +	err = kvmi_msg_vcpu_reply(vcpu, msg, ec, rpl, rpl_size);
> +	kfree(rpl);
> +	return err;
> +}
> +
> +/*
> + * These commands are executed on the vCPU thread. The receiving thread
> + * saves the command into kvmi_vcpu.msg_buf[] and signals the vCPU to handle
> + * the command (including sending back the reply).
> + */
> +static int (*const msg_vcpu[])(struct kvm_vcpu *,
> +			       const struct kvmi_msg_hdr *, const void *) = {
> +	[KVMI_GET_GUEST_INFO]   = handle_get_guest_info,
> +	[KVMI_PAUSE_VCPU]       = handle_pause_vcpu,
> +	[KVMI_GET_REGISTERS]    = handle_get_registers,
> +	[KVMI_SET_REGISTERS]    = handle_set_registers,
> +	[KVMI_GET_PAGE_ACCESS]  = handle_get_page_access,
> +	[KVMI_SET_PAGE_ACCESS]  = handle_set_page_access,
> +	[KVMI_INJECT_EXCEPTION] = handle_inject_exception,
> +	[KVMI_CONTROL_EVENTS]   = handle_control_events,
> +	[KVMI_GET_CPUID]        = handle_get_cpuid,
> +	[KVMI_GET_XSAVE]        = handle_get_xsave,
> +};
> +
> +void kvmi_msg_handle_vcpu_cmd(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +	struct kvmi_msg_hdr *msg = (void *) ivcpu->msg_buf;
> +	u8 *req = ivcpu->msg_buf + sizeof(*msg);
> +	int err;
> +
> +	err = msg_vcpu[msg->id](vcpu, msg, req);
> +
> +	if (err)
> +		kvm_err("%s: id:%u (%s) err:%d\n", __func__, msg->id,
> +			id2str(msg->id), err);
> +
> +	/*
> +	 * No error code is returned.
> +	 *
> +	 * The introspector gets its error code from the message handler
> +	 * or the socket is closed (and QEMU should reconnect).
> +	 */
> +}
> +
> +static int kvmi_msg_recv_varlen(struct kvmi *ikvm, size_t(*cbk) (const void *),
> +				size_t min_n, size_t msg_size)
> +{
> +	size_t extra_n;
> +	u8 *extra_buf;
> +	int err;
> +
> +	if (min_n > msg_size) {
> +		kvm_err("%s: got %zu bytes instead of min %zu\n",
> +			__func__, msg_size, min_n);
> +		return -EINVAL;
> +	}
> +
> +	if (!min_n)
> +		return 0;
> +
> +	err = kvmi_sock_read(ikvm, ikvm->msg_buf, min_n);
> +
> +	extra_buf = ikvm->msg_buf + min_n;
> +	extra_n = msg_size - min_n;
> +
> +	if (!err && extra_n) {
> +		if (cbk(ikvm->msg_buf) == msg_size)
> +			err = kvmi_sock_read(ikvm, extra_buf, extra_n);
> +		else
> +			err = -EINVAL;
> +	}
> +
> +	return err;
> +}
> +
> +static int kvmi_msg_recv_n(struct kvmi *ikvm, size_t n, size_t msg_size)
> +{
> +	if (n != msg_size) {
> +		kvm_err("%s: got %zu bytes instead of %zu\n",
> +			__func__, msg_size, n);
> +		return -EINVAL;
> +	}
> +
> +	if (!n)
> +		return 0;
> +
> +	return kvmi_sock_read(ikvm, ikvm->msg_buf, n);
> +}
> +
> +static int kvmi_msg_recv(struct kvmi *ikvm, const struct kvmi_msg_hdr *msg)
> +{
> +	size_t (*cbk)(const void *) = msg_bytes[msg->id].cbk_full_size;
> +	size_t expected = msg_bytes[msg->id].size;
> +
> +	if (cbk)
> +		return kvmi_msg_recv_varlen(ikvm, cbk, expected, msg->size);
> +	else
> +		return kvmi_msg_recv_n(ikvm, expected, msg->size);
> +}
> +
> +struct vcpu_msg_hdr {
> +	__u16 vcpu;
> +	__u16 padding[3];
> +};
> +
> +static int kvmi_msg_queue_to_vcpu(struct kvmi *ikvm,
> +				  const struct kvmi_msg_hdr *msg)
> +{
> +	struct vcpu_msg_hdr *vcpu_hdr = (struct vcpu_msg_hdr *)ikvm->msg_buf;
> +	struct kvmi_vcpu *ivcpu;
> +	struct kvm_vcpu *vcpu;
> +
> +	if (msg->size < sizeof(*vcpu_hdr)) {
> +		kvm_err("%s: invalid vcpu message: %d\n", __func__, msg->size);
> +		return -EINVAL; /* ABI error */
> +	}
> +
> +	vcpu = kvmi_get_vcpu(ikvm, vcpu_hdr->vcpu);
> +
> +	if (!vcpu) {
> +		kvm_err("%s: invalid vcpu: %d\n", __func__, vcpu_hdr->vcpu);
> +		return kvmi_msg_reply(ikvm, msg, -KVM_EINVAL, NULL, 0);
> +	}
> +
> +	ivcpu = vcpu->kvmi;
> +
> +	if (!ivcpu) {
> +		kvm_err("%s: not introspected vcpu: %d\n",
> +			__func__, vcpu_hdr->vcpu);
> +		return kvmi_msg_reply(ikvm, msg, -KVM_EAGAIN, NULL, 0);
> +	}
> +
> +	if (test_bit(REQ_CMD, &ivcpu->requests)) {
> +		kvm_err("%s: vcpu is busy: %d\n", __func__, vcpu_hdr->vcpu);
> +		return kvmi_msg_reply(ikvm, msg, -KVM_EBUSY, NULL, 0);
> +	}
> +
> +	memcpy(ivcpu->msg_buf, msg, sizeof(*msg));
> +	memcpy(ivcpu->msg_buf + sizeof(*msg), ikvm->msg_buf, msg->size);
> +
> +	kvmi_make_request(ivcpu, REQ_CMD);
> +	kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
> +	kvm_vcpu_kick(vcpu);
> +
> +	return 0;
> +}
> +
> +static bool kvmi_msg_dispatch_cmd(struct kvmi *ikvm,
> +				  const struct kvmi_msg_hdr *msg)
> +{
> +	int err = kvmi_msg_recv(ikvm, msg);
> +
> +	if (err)
> +		goto out;
> +
> +	if (!KVMI_ALLOWED_COMMAND(msg->id, ikvm->cmd_allow_mask)) {
> +		err = kvmi_msg_reply(ikvm, msg, -KVM_EPERM, NULL, 0);
> +		goto out;
> +	}
> +
> +	if (msg_vcpu[msg->id])
> +		err = kvmi_msg_queue_to_vcpu(ikvm, msg);
> +	else
> +		err = msg_vm[msg->id](ikvm, msg, ikvm->msg_buf);
> +
> +out:
> +	if (err)
> +		kvm_err("%s: id:%u (%s) err:%d\n", __func__, msg->id,
> +			id2str(msg->id), err);
> +
> +	return (err == 0);
> +}
> +
> +static bool handle_unsupported_msg(struct kvmi *ikvm,
> +				   const struct kvmi_msg_hdr *msg)
> +{
> +	int err;
> +
> +	kvm_err("%s: %u\n", __func__, msg->id);
> +
> +	err = consume_sock_bytes(ikvm, msg->size);
> +
> +	if (!err)
> +		err = kvmi_msg_reply(ikvm, msg, -KVM_ENOSYS, NULL, 0);
> +
> +	return (err == 0);
> +}
> +
> +static bool kvmi_msg_dispatch(struct kvmi *ikvm)
> +{
> +	struct kvmi_msg_hdr msg;
> +	int err;
> +
> +	err = kvmi_sock_read(ikvm, &msg, sizeof(msg));
> +
> +	if (err) {
> +		kvm_err("%s: can't read\n", __func__);
> +		return false;
> +	}
> +
> +	trace_kvmi_msg_dispatch(msg.id, msg.size);
> +
> +	kvm_debug("%s: id:%u (%s) size:%u\n", __func__, msg.id,
> +		  id2str(msg.id), msg.size);
> +
> +	if (msg.id == KVMI_EVENT_REPLY)
> +		return kvmi_msg_dispatch_reply(ikvm, &msg);
> +
> +	if (msg.id >= ARRAY_SIZE(msg_bytes)
> +	    || (!msg_vm[msg.id] && !msg_vcpu[msg.id]))
> +		return handle_unsupported_msg(ikvm, &msg);
> +
> +	return kvmi_msg_dispatch_cmd(ikvm, &msg);
> +}
> +
> +static void kvmi_sock_close(struct kvmi *ikvm)
> +{
> +	kvm_info("%s\n", __func__);
> +
> +	write_lock(&ikvm->sock_lock);
> +
> +	if (ikvm->sock) {
> +		kvm_info("Release the socket\n");
> +		sockfd_put(ikvm->sock);
> +
> +		ikvm->sock = NULL;
> +	}
> +
> +	write_unlock(&ikvm->sock_lock);
> +}
> +
> +bool kvmi_msg_process(struct kvmi *ikvm)
> +{
> +	if (!kvmi_msg_dispatch(ikvm)) {
> +		kvmi_sock_close(ikvm);
> +		return false;
> +	}
> +	return true;
> +}
> +
> +static void kvmi_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev,
> +			     u32 ev_id)
> +{
> +	memset(ev, 0, sizeof(*ev));
> +	ev->vcpu = vcpu->vcpu_id;
> +	ev->event = ev_id;
> +	kvm_arch_vcpu_ioctl_get_regs(vcpu, &ev->regs);
> +	kvm_arch_vcpu_ioctl_get_sregs(vcpu, &ev->sregs);
> +	ev->mode = kvmi_vcpu_mode(vcpu, &ev->sregs);
> +	kvmi_get_msrs(vcpu, ev);
> +}
> +
> +static bool kvmi_send_event(struct kvm_vcpu *vcpu, u32 ev_id,
> +			    void *ev,  size_t ev_size,
> +			    void *rpl, size_t rpl_size)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +	struct kvmi_event common;
> +	struct kvmi_msg_hdr h;
> +	struct kvec vec[3] = {
> +		{.iov_base = &h,      .iov_len = sizeof(h)     },
> +		{.iov_base = &common, .iov_len = sizeof(common)},
> +		{.iov_base = ev,      .iov_len = ev_size       },
> +	};
> +	size_t msg_size = sizeof(h) + sizeof(common) + ev_size;
> +	size_t n = ev_size ? ARRAY_SIZE(vec) : ARRAY_SIZE(vec)-1;
> +
> +	memset(&h, 0, sizeof(h));
> +	h.id = KVMI_EVENT;
> +	h.seq = new_seq();
> +	h.size = msg_size - sizeof(h);
> +
> +	kvmi_setup_event(vcpu, &common, ev_id);
> +
> +	ivcpu->ev_rpl_ptr = rpl;
> +	ivcpu->ev_rpl_size = rpl_size;
> +	ivcpu->ev_seq = h.seq;
> +	ivcpu->ev_rpl_received = -1;
> +	WRITE_ONCE(ivcpu->ev_rpl_waiting, true);
> +	/* kvmi_vcpu_waiting_for_reply() */
> +	smp_wmb();
> +
> +	trace_kvmi_send_event(ev_id);
> +
> +	kvm_debug("%s: %-11s(seq:%u) size:%lu vcpu:%d\n",
> +		  __func__, event_str(ev_id), h.seq, ev_size, vcpu->vcpu_id);
> +
> +	if (kvmi_sock_write(IKVM(vcpu->kvm), vec, n, msg_size) == 0)
> +		kvmi_handle_request(vcpu);
> +
> +	kvm_debug("%s: reply for %-11s(seq:%u) size:%lu vcpu:%d\n",
> +		  __func__, event_str(ev_id), h.seq, rpl_size, vcpu->vcpu_id);
> +
> +	return (ivcpu->ev_rpl_received >= 0);
> +}
> +
> +u32 kvmi_msg_send_cr(struct kvm_vcpu *vcpu, u32 cr, u64 old_value,
> +		     u64 new_value, u64 *ret_value)
> +{
> +	struct kvmi_event_cr e;
> +	struct kvmi_event_cr_reply r;
> +
> +	memset(&e, 0, sizeof(e));
> +	e.cr = cr;
> +	e.old_value = old_value;
> +	e.new_value = new_value;
> +
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_CR, &e, sizeof(e),
> +				&r, sizeof(r))) {
> +		*ret_value = new_value;
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +	}
> +
> +	*ret_value = r.new_val;
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_msr(struct kvm_vcpu *vcpu, u32 msr, u64 old_value,
> +		      u64 new_value, u64 *ret_value)
> +{
> +	struct kvmi_event_msr e;
> +	struct kvmi_event_msr_reply r;
> +
> +	memset(&e, 0, sizeof(e));
> +	e.msr = msr;
> +	e.old_value = old_value;
> +	e.new_value = new_value;
> +
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_MSR, &e, sizeof(e),
> +				&r, sizeof(r))) {
> +		*ret_value = new_value;
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +	}
> +
> +	*ret_value = r.new_val;
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_xsetbv(struct kvm_vcpu *vcpu)
> +{
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_XSETBV, NULL, 0, NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_bp(struct kvm_vcpu *vcpu, u64 gpa)
> +{
> +	struct kvmi_event_breakpoint e;
> +
> +	memset(&e, 0, sizeof(e));
> +	e.gpa = gpa;
> +
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_BREAKPOINT,
> +				&e, sizeof(e), NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_hypercall(struct kvm_vcpu *vcpu)
> +{
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_HYPERCALL, NULL, 0, NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +bool kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u32 mode,
> +		      u32 *action, bool *trap_access, u8 *ctx_data,
> +		      u32 *ctx_size)
> +{
> +	u32 max_ctx_size = *ctx_size;
> +	struct kvmi_event_page_fault e;
> +	struct kvmi_event_page_fault_reply r;
> +
> +	memset(&e, 0, sizeof(e));
> +	e.gpa = gpa;
> +	e.gva = gva;
> +	e.mode = mode;
> +
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_PAGE_FAULT, &e, sizeof(e),
> +				&r, sizeof(r)))
> +		return false;
> +
> +	*action = IVCPU(vcpu)->ev_rpl.action;
> +	*trap_access = r.trap_access;
> +	*ctx_size = 0;
> +
> +	if (r.ctx_size <= max_ctx_size) {
> +		*ctx_size = min_t(u32, r.ctx_size, sizeof(r.ctx_data));
> +		if (*ctx_size)
> +			memcpy(ctx_data, r.ctx_data, *ctx_size);
> +	} else {
> +		kvm_err("%s: ctx_size (recv:%u max:%u)\n", __func__,
> +			r.ctx_size, *ctx_size);
> +		/*
> +		 * TODO: This is an ABI error.
> +		 * We should shutdown the socket?
> +		 */
> +	}
> +
> +	return true;
> +}
> +
> +u32 kvmi_msg_send_trap(struct kvm_vcpu *vcpu, u32 vector, u32 type,
> +		       u32 error_code, u64 cr2)
> +{
> +	struct kvmi_event_trap e;
> +
> +	memset(&e, 0, sizeof(e));
> +	e.vector = vector;
> +	e.type = type;
> +	e.error_code = error_code;
> +	e.cr2 = cr2;
> +
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_TRAP, &e, sizeof(e), NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_descriptor(struct kvm_vcpu *vcpu, u32 info,
> +			     u64 exit_qualification, u8 descriptor, u8 write)
> +{
> +	struct kvmi_event_descriptor e;
> +
> +	memset(&e, 0, sizeof(e));
> +	e.descriptor = descriptor;
> +	e.write = write;
> +
> +	if (cpu_has_vmx()) {
> +		e.arch.vmx.instr_info = info;
> +		e.arch.vmx.exit_qualification = exit_qualification;
> +	} else {
> +		e.arch.svm.exit_info = info;
> +	}
> +
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_DESCRIPTOR,
> +				&e, sizeof(e), NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu)
> +{
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_CREATE_VCPU, NULL, 0, NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_pause_vcpu(struct kvm_vcpu *vcpu)
> +{
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_PAUSE_VCPU, NULL, 0, NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> 


Patrick


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
@ 2017-12-22  7:34     ` Patrick Colp
  0 siblings, 0 replies; 79+ messages in thread
From: Patrick Colp @ 2017-12-22  7:34 UTC (permalink / raw)
  To: Adalber Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Nicușor Cîțu,
	Mircea Cîrjaliu, Marian Rotariu

On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> From: Adalbert Lazar <alazar@bitdefender.com>
> 
> This subsystem is split into three source files:
>   - kvmi_msg.c - ABI and socket related functions
>   - kvmi_mem.c - handle map/unmap requests from the introspector
>   - kvmi.c - all the other
> 
> The new data used by this subsystem is attached to the 'kvm' and
> 'kvm_vcpu' structures as opaque pointers (to 'kvmi' and 'kvmi_vcpu'
> structures).
> 
> Besides the KVMI system, this patch exports the
> kvm_vcpu_ioctl_x86_get_xsave() and the mm_find_pmd() functions,
> adds a new vCPU request (KVM_REQ_INTROSPECTION) and a new VM ioctl
> (KVM_INTROSPECTION) used to pass the connection file handle from QEMU.
> 
> Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
> Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
> Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com>
> ---
>   arch/x86/include/asm/kvm_host.h |    1 +
>   arch/x86/kvm/Makefile           |    1 +
>   arch/x86/kvm/x86.c              |    4 +-
>   include/linux/kvm_host.h        |    4 +
>   include/linux/kvmi.h            |   32 +
>   include/linux/mm.h              |    3 +
>   include/trace/events/kvmi.h     |  174 +++++
>   include/uapi/linux/kvm.h        |    8 +
>   mm/internal.h                   |    5 -
>   virt/kvm/kvmi.c                 | 1410 +++++++++++++++++++++++++++++++++++++++
>   virt/kvm/kvmi_int.h             |  121 ++++
>   virt/kvm/kvmi_mem.c             |  730 ++++++++++++++++++++
>   virt/kvm/kvmi_msg.c             | 1134 +++++++++++++++++++++++++++++++
>   13 files changed, 3620 insertions(+), 7 deletions(-)
>   create mode 100644 include/linux/kvmi.h
>   create mode 100644 include/trace/events/kvmi.h
>   create mode 100644 virt/kvm/kvmi.c
>   create mode 100644 virt/kvm/kvmi_int.h
>   create mode 100644 virt/kvm/kvmi_mem.c
>   create mode 100644 virt/kvm/kvmi_msg.c
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2cf03ed181e6..1e9e49eaee3b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -73,6 +73,7 @@
>   #define KVM_REQ_HV_RESET		KVM_ARCH_REQ(20)
>   #define KVM_REQ_HV_EXIT			KVM_ARCH_REQ(21)
>   #define KVM_REQ_HV_STIMER		KVM_ARCH_REQ(22)
> +#define KVM_REQ_INTROSPECTION           KVM_ARCH_REQ(23)
>   
>   #define CR0_RESERVED_BITS                                               \
>   	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index dc4f2fdf5e57..ab6225563526 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -9,6 +9,7 @@ CFLAGS_vmx.o := -I.
>   KVM := ../../../virt/kvm
>   
>   kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> +				$(KVM)/kvmi.o $(KVM)/kvmi_msg.o $(KVM)/kvmi_mem.o \
>   				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
>   kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 74839859c0fd..cdfc7200a018 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3346,8 +3346,8 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
>   	}
>   }
>   
> -static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> -					 struct kvm_xsave *guest_xsave)
> +void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> +				  struct kvm_xsave *guest_xsave)
>   {
>   	if (boot_cpu_has(X86_FEATURE_XSAVE)) {
>   		memset(guest_xsave, 0, sizeof(struct kvm_xsave));
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 68e4d756f5c9..eae0598e18a5 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -274,6 +274,7 @@ struct kvm_vcpu {
>   	bool preempted;
>   	struct kvm_vcpu_arch arch;
>   	struct dentry *debugfs_dentry;
> +	void *kvmi;
>   };
>   
>   static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> @@ -446,6 +447,7 @@ struct kvm {
>   	struct srcu_struct srcu;
>   	struct srcu_struct irq_srcu;
>   	pid_t userspace_pid;
> +	void *kvmi;
>   };
>   
>   #define kvm_err(fmt, ...) \
> @@ -779,6 +781,8 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
>   int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
>   					struct kvm_guest_debug *dbg);
>   int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run);
> +void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> +				  struct kvm_xsave *guest_xsave);
>   
>   int kvm_arch_init(void *opaque);
>   void kvm_arch_exit(void);
> diff --git a/include/linux/kvmi.h b/include/linux/kvmi.h
> new file mode 100644
> index 000000000000..7fac1d23f67c
> --- /dev/null
> +++ b/include/linux/kvmi.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __KVMI_H__
> +#define __KVMI_H__
> +
> +#define kvmi_is_present() 1
> +
> +int kvmi_init(void);
> +void kvmi_uninit(void);
> +void kvmi_destroy_vm(struct kvm *kvm);
> +bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu);
> +void kvmi_vcpu_init(struct kvm_vcpu *vcpu);
> +void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu);
> +bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
> +		   unsigned long old_value, unsigned long *new_value);
> +bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr);
> +void kvmi_xsetbv_event(struct kvm_vcpu *vcpu);
> +bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva);
> +bool kvmi_is_agent_hypercall(struct kvm_vcpu *vcpu);
> +void kvmi_hypercall_event(struct kvm_vcpu *vcpu);
> +bool kvmi_lost_exception(struct kvm_vcpu *vcpu);
> +void kvmi_trap_event(struct kvm_vcpu *vcpu);
> +bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u32 info,
> +			   unsigned long exit_qualification,
> +			   unsigned char descriptor, unsigned char write);
> +void kvmi_flush_mem_access(struct kvm *kvm);
> +void kvmi_handle_request(struct kvm_vcpu *vcpu);
> +int kvmi_host_mem_map(struct kvm_vcpu *vcpu, gva_t tkn_gva,
> +			     gpa_t req_gpa, gpa_t map_gpa);
> +int kvmi_host_mem_unmap(struct kvm_vcpu *vcpu, gpa_t map_gpa);
> +
> +
> +#endif
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ea818ff739cd..b659c7436789 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1115,6 +1115,9 @@ void page_address_init(void);
>   #define page_address_init()  do { } while(0)
>   #endif
>   
> +/* rmap.c */
> +extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> +
>   extern void *page_rmapping(struct page *page);
>   extern struct anon_vma *page_anon_vma(struct page *page);
>   extern struct address_space *page_mapping(struct page *page);
> diff --git a/include/trace/events/kvmi.h b/include/trace/events/kvmi.h
> new file mode 100644
> index 000000000000..dc36fd3b30dc
> --- /dev/null
> +++ b/include/trace/events/kvmi.h
> @@ -0,0 +1,174 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM kvmi
> +
> +#if !defined(_TRACE_KVMI_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_KVMI_H
> +
> +#include <linux/tracepoint.h>
> +
> +#ifndef __TRACE_KVMI_STRUCTURES
> +#define __TRACE_KVMI_STRUCTURES
> +
> +#undef EN
> +#define EN(x) { x, #x }
> +
> +static const struct trace_print_flags kvmi_msg_id_symbol[] = {
> +	EN(KVMI_GET_VERSION),
> +	EN(KVMI_PAUSE_VCPU),
> +	EN(KVMI_GET_GUEST_INFO),
> +	EN(KVMI_GET_REGISTERS),
> +	EN(KVMI_SET_REGISTERS),
> +	EN(KVMI_GET_PAGE_ACCESS),
> +	EN(KVMI_SET_PAGE_ACCESS),
> +	EN(KVMI_INJECT_EXCEPTION),
> +	EN(KVMI_READ_PHYSICAL),
> +	EN(KVMI_WRITE_PHYSICAL),
> +	EN(KVMI_GET_MAP_TOKEN),
> +	EN(KVMI_CONTROL_EVENTS),
> +	EN(KVMI_CONTROL_CR),
> +	EN(KVMI_CONTROL_MSR),
> +	EN(KVMI_EVENT),
> +	EN(KVMI_EVENT_REPLY),
> +	EN(KVMI_GET_CPUID),
> +	EN(KVMI_GET_XSAVE),
> +	{-1, NULL}
> +};
> +
> +static const struct trace_print_flags kvmi_event_id_symbol[] = {
> +	EN(KVMI_EVENT_CR),
> +	EN(KVMI_EVENT_MSR),
> +	EN(KVMI_EVENT_XSETBV),
> +	EN(KVMI_EVENT_BREAKPOINT),
> +	EN(KVMI_EVENT_HYPERCALL),
> +	EN(KVMI_EVENT_PAGE_FAULT),
> +	EN(KVMI_EVENT_TRAP),
> +	EN(KVMI_EVENT_DESCRIPTOR),
> +	EN(KVMI_EVENT_CREATE_VCPU),
> +	EN(KVMI_EVENT_PAUSE_VCPU),
> +	{-1, NULL}
> +};
> +
> +static const struct trace_print_flags kvmi_action_symbol[] = {
> +	{KVMI_EVENT_ACTION_CONTINUE, "continue"},
> +	{KVMI_EVENT_ACTION_RETRY, "retry"},
> +	{KVMI_EVENT_ACTION_CRASH, "crash"},
> +	{-1, NULL}
> +};
> +
> +#endif /* __TRACE_KVMI_STRUCTURES */
> +
> +TRACE_EVENT(
> +	kvmi_msg_dispatch,
> +	TP_PROTO(__u16 id, __u16 size),
> +	TP_ARGS(id, size),
> +	TP_STRUCT__entry(
> +		__field(__u16, id)
> +		__field(__u16, size)
> +	),
> +	TP_fast_assign(
> +		__entry->id = id;
> +		__entry->size = size;
> +	),
> +	TP_printk("%s size %u",
> +		  trace_print_symbols_seq(p, __entry->id, kvmi_msg_id_symbol),
> +		  __entry->size)
> +);
> +
> +TRACE_EVENT(
> +	kvmi_send_event,
> +	TP_PROTO(__u32 id),
> +	TP_ARGS(id),
> +	TP_STRUCT__entry(
> +		__field(__u32, id)
> +	),
> +	TP_fast_assign(
> +		__entry->id = id;
> +	),
> +	TP_printk("%s",
> +		trace_print_symbols_seq(p, __entry->id, kvmi_event_id_symbol))
> +);
> +
> +#define KVMI_ACCESS_PRINTK() ({                                         \
> +	const char *saved_ptr = trace_seq_buffer_ptr(p);		\
> +	static const char * const access_str[] = {			\
> +		"---", "r--", "-w-", "rw-", "--x", "r-x", "-wx", "rwx"  \
> +	};							        \
> +	trace_seq_printf(p, "%s", access_str[__entry->access & 7]);	\
> +	saved_ptr;							\
> +})
> +
> +TRACE_EVENT(
> +	kvmi_set_mem_access,
> +	TP_PROTO(__u64 gfn, __u8 access, int err),
> +	TP_ARGS(gfn, access, err),
> +	TP_STRUCT__entry(
> +		__field(__u64, gfn)
> +		__field(__u8, access)
> +		__field(int, err)
> +	),
> +	TP_fast_assign(
> +		__entry->gfn = gfn;
> +		__entry->access = access;
> +		__entry->err = err;
> +	),
> +	TP_printk("gfn %llx %s %s %d",
> +		  __entry->gfn, KVMI_ACCESS_PRINTK(),
> +		  __entry->err ? "failed" : "succeeded", __entry->err)
> +);
> +
> +TRACE_EVENT(
> +	kvmi_apply_mem_access,
> +	TP_PROTO(__u64 gfn, __u8 access, int err),
> +	TP_ARGS(gfn, access, err),
> +	TP_STRUCT__entry(
> +		__field(__u64, gfn)
> +		__field(__u8, access)
> +		__field(int, err)
> +	),
> +	TP_fast_assign(
> +		__entry->gfn = gfn;
> +		__entry->access = access;
> +		__entry->err = err;
> +	),
> +	TP_printk("gfn %llx %s flush %s %d",
> +		  __entry->gfn, KVMI_ACCESS_PRINTK(),
> +		  __entry->err ? "failed" : "succeeded", __entry->err)
> +);
> +
> +TRACE_EVENT(
> +	kvmi_event_page_fault,
> +	TP_PROTO(__u64 gpa, __u64 gva, __u8 access, __u64 old_rip,
> +		 __u32 action, __u64 new_rip, __u32 ctx_size),
> +	TP_ARGS(gpa, gva, access, old_rip, action, new_rip, ctx_size),
> +	TP_STRUCT__entry(
> +		__field(__u64, gpa)
> +		__field(__u64, gva)
> +		__field(__u8, access)
> +		__field(__u64, old_rip)
> +		__field(__u32, action)
> +		__field(__u64, new_rip)
> +		__field(__u32, ctx_size)
> +	),
> +	TP_fast_assign(
> +		__entry->gpa = gpa;
> +		__entry->gva = gva;
> +		__entry->access = access;
> +		__entry->old_rip = old_rip;
> +		__entry->action = action;
> +		__entry->new_rip = new_rip;
> +		__entry->ctx_size = ctx_size;
> +	),
> +	TP_printk("gpa %llx %s gva %llx rip %llx -> %s rip %llx ctx %u",
> +		  __entry->gpa,
> +		  KVMI_ACCESS_PRINTK(),
> +		  __entry->gva,
> +		  __entry->old_rip,
> +		  trace_print_symbols_seq(p, __entry->action,
> +					  kvmi_action_symbol),
> +		  __entry->new_rip, __entry->ctx_size)
> +);
> +
> +#endif /* _TRACE_KVMI_H */
> +
> +#include <trace/define_trace.h>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 496e59a2738b..6b7c4469b808 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1359,6 +1359,14 @@ struct kvm_s390_ucas_mapping {
>   #define KVM_S390_GET_CMMA_BITS      _IOWR(KVMIO, 0xb8, struct kvm_s390_cmma_log)
>   #define KVM_S390_SET_CMMA_BITS      _IOW(KVMIO, 0xb9, struct kvm_s390_cmma_log)
>   
> +struct kvm_introspection {
> +	int fd;
> +	__u32 padding;
> +	__u32 commands;
> +	__u32 events;
> +};
> +#define KVM_INTROSPECTION      _IOW(KVMIO, 0xff, struct kvm_introspection)
> +
>   #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
>   #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
>   #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
> diff --git a/mm/internal.h b/mm/internal.h
> index e6bd35182dae..9d363c802305 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -92,11 +92,6 @@ extern unsigned long highest_memmap_pfn;
>   extern int isolate_lru_page(struct page *page);
>   extern void putback_lru_page(struct page *page);
>   
> -/*
> - * in mm/rmap.c:
> - */
> -extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> -
>   /*
>    * in mm/page_alloc.c
>    */
> diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
> new file mode 100644
> index 000000000000..c4cdaeddac45
> --- /dev/null
> +++ b/virt/kvm/kvmi.c
> @@ -0,0 +1,1410 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KVM introspection
> + *
> + * Copyright (C) 2017 Bitdefender S.R.L.
> + *
> + */
> +#include <linux/mmu_context.h>
> +#include <linux/random.h>
> +#include <uapi/linux/kvmi.h>
> +#include <uapi/asm/kvmi.h>
> +#include "../../arch/x86/kvm/x86.h"
> +#include "../../arch/x86/kvm/mmu.h"
> +#include <asm/vmx.h>
> +#include "cpuid.h"
> +#include "kvmi_int.h"
> +#include <asm/kvm_page_track.h>
> +
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/kvmi.h>
> +
> +struct kvmi_mem_access {
> +	struct list_head link;
> +	gfn_t gfn;
> +	u8 access;
> +	bool active[KVM_PAGE_TRACK_MAX];
> +	struct kvm_memory_slot *slot;
> +};
> +
> +static void wakeup_events(struct kvm *kvm);
> +static bool kvmi_page_fault_event(struct kvm_vcpu *vcpu, unsigned long gpa,
> +			   unsigned long gva, u8 access);
> +
> +static struct workqueue_struct *wq;
> +
> +static const u8 full_access = KVMI_PAGE_ACCESS_R |
> +			      KVMI_PAGE_ACCESS_W | KVMI_PAGE_ACCESS_X;
> +
> +static const struct {
> +	unsigned int allow_bit;
> +	enum kvm_page_track_mode track_mode;
> +} track_modes[] = {
> +	{ KVMI_PAGE_ACCESS_R, KVM_PAGE_TRACK_PREREAD },
> +	{ KVMI_PAGE_ACCESS_W, KVM_PAGE_TRACK_PREWRITE },
> +	{ KVMI_PAGE_ACCESS_X, KVM_PAGE_TRACK_PREEXEC },
> +};
> +
> +void kvmi_make_request(struct kvmi_vcpu *ivcpu, int req)
> +{
> +	set_bit(req, &ivcpu->requests);
> +	/* Make sure the bit is set when the worker wakes up */
> +	smp_wmb();
> +	up(&ivcpu->sem_requests);
> +}
> +
> +void kvmi_clear_request(struct kvmi_vcpu *ivcpu, int req)
> +{
> +	clear_bit(req, &ivcpu->requests);
> +}
> +
> +int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +
> +	/*
> +	 * This vcpu is already stopped, executing this command
> +	 * as a result of the REQ_CMD bit being set
> +	 * (see kvmi_handle_request).
> +	 */
> +	if (ivcpu->pause)
> +		return -KVM_EBUSY;
> +
> +	ivcpu->pause = true;
> +
> +	return 0;
> +}
> +
> +static void kvmi_apply_mem_access(struct kvm *kvm,
> +				  struct kvm_memory_slot *slot,
> +				  struct kvmi_mem_access *m)
> +{
> +	int idx, k;

This should probably be i instead of k. I'm guessing you chose k to 
avoid confusion of i with idx. However, there's already precedent for 
using i as a loop counter even in this case (e.g., see 
kvm_scan_ioapic_routes() in arch/x86/kvm/irq_comm.c and 
init_rmode_identity_map() in arch/x86/kvm/vmx.c).

> +
> +	if (!slot) {
> +		slot = gfn_to_memslot(kvm, m->gfn);
> +		if (!slot)
> +			return;
> +	}
> +
> +	idx = srcu_read_lock(&kvm->srcu);
> +
> +	spin_lock(&kvm->mmu_lock);
> +
> +	for (k = 0; k < ARRAY_SIZE(track_modes); k++) {
> +		unsigned int allow_bit = track_modes[k].allow_bit;
> +		enum kvm_page_track_mode mode = track_modes[k].track_mode;
> +
> +		if (m->access & allow_bit) {
> +			if (m->active[mode] && m->slot == slot) {
> +				kvm_slot_page_track_remove_page(kvm, slot,
> +								m->gfn, mode);
> +				m->active[mode] = false;
> +				m->slot = NULL;
> +			}
> +		} else if (!m->active[mode] || m->slot != slot) {
> +			kvm_slot_page_track_add_page(kvm, slot, m->gfn, mode);
> +			m->active[mode] = true;
> +			m->slot = slot;
> +		}
> +	}
> +
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	srcu_read_unlock(&kvm->srcu, idx);
> +}
> +
> +int kvmi_set_mem_access(struct kvm *kvm, u64 gpa, u8 access)
> +{
> +	struct kvmi_mem_access *m;
> +	struct kvmi_mem_access *__m;
> +	struct kvmi *ikvm = IKVM(kvm);
> +	gfn_t gfn = gpa_to_gfn(gpa);
> +
> +	if (kvm_is_error_hva(gfn_to_hva_safe(kvm, gfn)))
> +		kvm_err("Invalid gpa %llx (or memslot not available yet)", gpa);

If there's an error, shouldn't this return (or otherwise bail out) 
instead of continuing as if nothing is wrong?
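
For example (just a sketch; whether -KVM_EINVAL is the right error code 
here is your call):

	if (kvm_is_error_hva(gfn_to_hva_safe(kvm, gfn))) {
		kvm_err("Invalid gpa %llx (or memslot not available yet)", gpa);
		return -KVM_EINVAL;
	}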

> +
> +	m = kzalloc(sizeof(struct kvmi_mem_access), GFP_KERNEL);

This should be "m = kzalloc(sizeof(*m), GFP_KERNEL);".

> +	if (!m)
> +		return -KVM_ENOMEM;
> +
> +	INIT_LIST_HEAD(&m->link);
> +	m->gfn = gfn;
> +	m->access = access;
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +	__m = radix_tree_lookup(&ikvm->access_tree, m->gfn);
> +	if (__m) {
> +		__m->access = m->access;
> +		if (list_empty(&__m->link))
> +			list_add_tail(&__m->link, &ikvm->access_list);
> +	} else {
> +		radix_tree_insert(&ikvm->access_tree, m->gfn, m);
> +		list_add_tail(&m->link, &ikvm->access_list);
> +		m = NULL;
> +	}
> +	mutex_unlock(&ikvm->access_tree_lock);
> +
> +	kfree(m);
> +
> +	return 0;
> +}
> +
> +static bool kvmi_test_mem_access(struct kvm *kvm, unsigned long gpa,
> +				 u8 access)
> +{
> +	struct kvmi_mem_access *m;
> +	struct kvmi *ikvm = IKVM(kvm);
> +
> +	if (!ikvm)
> +		return false;
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +	m = radix_tree_lookup(&ikvm->access_tree, gpa_to_gfn(gpa));
> +	mutex_unlock(&ikvm->access_tree_lock);
> +
> +	/*
> +	 * We want to be notified only for violations involving access
> +	 * bits that we've specifically cleared
> +	 */
> +	if (m && ((~m->access) & access))
> +		return true;
> +
> +	return false;
> +}
> +
> +static struct kvmi_mem_access *
> +kvmi_get_mem_access_unlocked(struct kvm *kvm, const gfn_t gfn)
> +{
> +	return radix_tree_lookup(&IKVM(kvm)->access_tree, gfn);
> +}
> +
> +static bool is_introspected(struct kvmi *ikvm)
> +{
> +	return (ikvm && ikvm->sock);
> +}
> +
> +void kvmi_flush_mem_access(struct kvm *kvm)
> +{
> +	struct kvmi *ikvm = IKVM(kvm);
> +
> +	if (!ikvm)
> +		return;
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +	while (!list_empty(&ikvm->access_list)) {
> +		struct kvmi_mem_access *m =
> +			list_first_entry(&ikvm->access_list,
> +					 struct kvmi_mem_access, link);
> +
> +		list_del_init(&m->link);
> +
> +		kvmi_apply_mem_access(kvm, NULL, m);
> +
> +		if (m->access == full_access) {
> +			radix_tree_delete(&ikvm->access_tree, m->gfn);
> +			kfree(m);
> +		}
> +	}
> +	mutex_unlock(&ikvm->access_tree_lock);
> +}
> +
> +static void kvmi_free_mem_access(struct kvm *kvm)
> +{
> +	void **slot;
> +	struct radix_tree_iter iter;
> +	struct kvmi *ikvm = IKVM(kvm);
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +	radix_tree_for_each_slot(slot, &ikvm->access_tree, &iter, 0) {
> +		struct kvmi_mem_access *m = *slot;
> +
> +		m->access = full_access;
> +		kvmi_apply_mem_access(kvm, NULL, m);
> +
> +		radix_tree_delete(&ikvm->access_tree, m->gfn);
> +		kfree(*slot);
> +	}
> +	mutex_unlock(&ikvm->access_tree_lock);
> +}
> +
> +static unsigned long *msr_mask(struct kvmi *ikvm, unsigned int *msr)
> +{
> +	switch (*msr) {
> +	case 0 ... 0x1fff:
> +		return ikvm->msr_mask.low;
> +	case 0xc0000000 ... 0xc0001fff:
> +		*msr &= 0x1fff;
> +		return ikvm->msr_mask.high;
> +	}
> +	return NULL;
> +}
> +
> +static bool test_msr_mask(struct kvmi *ikvm, unsigned int msr)
> +{
> +	unsigned long *mask = msr_mask(ikvm, &msr);
> +
> +	if (!mask)
> +		return false;
> +	if (!test_bit(msr, mask))
> +		return false;
> +
> +	return true;
> +}
> +
> +static int msr_control(struct kvmi *ikvm, unsigned int msr, bool enable)
> +{
> +	unsigned long *mask = msr_mask(ikvm, &msr);
> +
> +	if (!mask)
> +		return -KVM_EINVAL;
> +	if (enable)
> +		set_bit(msr, mask);
> +	else
> +		clear_bit(msr, mask);
> +	return 0;
> +}
> +
> +unsigned int kvmi_vcpu_mode(const struct kvm_vcpu *vcpu,
> +				   const struct kvm_sregs *sregs)
> +{
> +	unsigned int mode = 0;
> +
> +	if (is_long_mode((struct kvm_vcpu *) vcpu)) {
> +		if (sregs->cs.l)
> +			mode = 8;
> +		else if (!sregs->cs.db)
> +			mode = 2;
> +		else
> +			mode = 4;
> +	} else if (sregs->cr0 & X86_CR0_PE) {
> +		if (!sregs->cs.db)
> +			mode = 2;
> +		else
> +			mode = 4;
> +	} else if (!sregs->cs.db)
> +		mode = 2;
> +	else
> +		mode = 4;

If one branch of a conditional uses braces, then all branches should 
(regardless of whether they are only single statements). The final "else 
if" and "else" blocks here should both be wrapped in braces.
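
For example (sketch, same logic):

	} else if (!sregs->cs.db) {
		mode = 2;
	} else {
		mode = 4;
	}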

> +
> +	return mode;
> +}
> +
> +static int maybe_delayed_init(void)
> +{
> +	if (wq)
> +		return 0;
> +
> +	wq = alloc_workqueue("kvmi", WQ_CPU_INTENSIVE, 0);
> +	if (!wq)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +int kvmi_init(void)
> +{
> +	return 0;
> +}
> +
> +static void work_cb(struct work_struct *work)
> +{
> +	struct kvmi *ikvm = container_of(work, struct kvmi, work);
> +	struct kvm   *kvm = ikvm->kvm;

None of your other initial variable assignments are aligned like this. 
Any particular reason why this one is?

> +
> +	while (kvmi_msg_process(ikvm))
> +		;

Typically if you're going to have an empty while block, you stick the 
semi-colon at the end of the while line. So this would be:
	while (kvmi_msg_process(ikvm));

> +
> +	/* We are no longer interested in any kind of events */
> +	atomic_set(&ikvm->event_mask, 0);
> +
> +	/* Clean-up for the next kvmi_hook() call */
> +	ikvm->cr_mask = 0;
> +	memset(&ikvm->msr_mask, 0, sizeof(ikvm->msr_mask));
> +
> +	wakeup_events(kvm);
> +
> +	/* Restore the spte access rights */
> +	/* Shouldn't wait for reconnection? */
> +	kvmi_free_mem_access(kvm);
> +
> +	complete_all(&ikvm->finished);
> +}
> +
> +static void __alloc_vcpu_kvmi(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = kzalloc(sizeof(struct kvmi_vcpu), GFP_KERNEL);
> +
> +	if (ivcpu) {
> +		sema_init(&ivcpu->sem_requests, 0);
> +
> +		/*
> +		 * Make sure the ivcpu is initialized
> +		 * before making it visible.
> +		 */
> +		smp_wmb();
> +
> +		vcpu->kvmi = ivcpu;
> +
> +		kvmi_make_request(ivcpu, REQ_INIT);
> +		kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
> +	}
> +}
> +
> +void kvmi_vcpu_init(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi *ikvm = IKVM(vcpu->kvm);
> +
> +	if (is_introspected(ikvm)) {
> +		mutex_lock(&vcpu->kvm->lock);
> +		__alloc_vcpu_kvmi(vcpu);
> +		mutex_unlock(&vcpu->kvm->lock);
> +	}
> +}
> +
> +void kvmi_vcpu_uninit(struct kvm_vcpu *vcpu)
> +{
> +	kfree(IVCPU(vcpu));
> +}
> +
> +static bool __alloc_kvmi(struct kvm *kvm)
> +{
> +	struct kvmi *ikvm = kzalloc(sizeof(struct kvmi), GFP_KERNEL);
> +
> +	if (ikvm) {
> +		INIT_LIST_HEAD(&ikvm->access_list);
> +		mutex_init(&ikvm->access_tree_lock);
> +		INIT_RADIX_TREE(&ikvm->access_tree, GFP_KERNEL);
> +		rwlock_init(&ikvm->sock_lock);
> +		init_completion(&ikvm->finished);
> +		INIT_WORK(&ikvm->work, work_cb);
> +
> +		kvm->kvmi = ikvm;
> +		ikvm->kvm = kvm; /* work_cb */
> +	}
> +
> +	return (ikvm != NULL);
> +}

Would it maybe be better to just put a check for ikvm at the top and 
return false, otherwise do all the work currently in the if body and 
then return true?

Like this:

static bool __alloc_kvmi(struct kvm *kvm)
{
	struct kvmi *ikvm = kzalloc(sizeof(struct kvmi), GFP_KERNEL);

	if (!ikvm)
		return false;

	INIT_LIST_HEAD(&ikvm->access_list);
	mutex_init(&ikvm->access_tree_lock);
	INIT_RADIX_TREE(&ikvm->access_tree, GFP_KERNEL);
	rwlock_init(&ikvm->sock_lock);
	init_completion(&ikvm->finished);
	INIT_WORK(&ikvm->work, work_cb);

	kvm->kvmi = ikvm;
	ikvm->kvm = kvm; /* work_cb */

	return true;
}

> +
> +static bool alloc_kvmi(struct kvm *kvm)
> +{
> +	bool done;
> +
> +	mutex_lock(&kvm->lock);
> +	done = (
> +		maybe_delayed_init() == 0    &&
> +		IKVM(kvm)            == NULL &&
> +		__alloc_kvmi(kvm)    == true
> +	);
> +	mutex_unlock(&kvm->lock);
> +
> +	return done;
> +}
> +
> +static void alloc_all_kvmi_vcpu(struct kvm *kvm)
> +{
> +	struct kvm_vcpu *vcpu;
> +	int i;
> +
> +	mutex_lock(&kvm->lock);
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		if (!IKVM(vcpu))
> +			__alloc_vcpu_kvmi(vcpu);
> +	mutex_unlock(&kvm->lock);
> +}
> +
> +static bool setup_socket(struct kvm *kvm, struct kvm_introspection *qemu)
> +{
> +	struct kvmi *ikvm = IKVM(kvm);
> +
> +	if (is_introspected(ikvm)) {
> +		kvm_err("Guest already introspected\n");
> +		return false;
> +	}
> +
> +	if (!kvmi_msg_init(ikvm, qemu->fd))
> +		return false;

kvmi_msg_init() assumes that ikvm is not NULL -- it makes no check and 
then does "WRITE_ONCE(ikvm->sock, sock)". is_introspected() does check 
whether ikvm is NULL, but if it is, it returns false and we still end 
up here. There should be a check that ikvm is not NULL before this if 
statement.
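
Something along these lines (sketch):

	if (!ikvm)
		return false;

	if (!kvmi_msg_init(ikvm, qemu->fd))
		return false;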

> +
> +	ikvm->cmd_allow_mask = -1; /* TODO: qemu->commands; */
> +	ikvm->event_allow_mask = -1; /* TODO: qemu->events; */
> +
> +	alloc_all_kvmi_vcpu(kvm);
> +	queue_work(wq, &ikvm->work);
> +
> +	return true;
> +}
> +
> +/*
> + * When called from outside a page fault handler, this call should
> + * return ~0ull
> + */
> +static u64 kvmi_mmu_fault_gla(struct kvm_vcpu *vcpu, gpa_t gpa)
> +{
> +	u64 gla;
> +	u64 gla_val;
> +	u64 v;
> +
> +	if (!vcpu->arch.gpa_available)
> +		return ~0ull;
> +
> +	gla = kvm_mmu_fault_gla(vcpu);
> +	if (gla == ~0ull)
> +		return gla;
> +	gla_val = gla;
> +
> +	/* Handle the potential overflow by returning ~0ull */
> +	if (vcpu->arch.gpa_val > gpa) {
> +		v = vcpu->arch.gpa_val - gpa;
> +		if (v > gla)
> +			gla = ~0ull;
> +		else
> +			gla -= v;
> +	} else {
> +		v = gpa - vcpu->arch.gpa_val;
> +		if (v > (U64_MAX - gla))
> +			gla = ~0ull;
> +		else
> +			gla += v;
> +	}
> +
> +	return gla;
> +}
> +
> +static bool kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa,
> +			       u8 *new,
> +			       int bytes,
> +			       struct kvm_page_track_notifier_node *node,
> +			       bool *data_ready)
> +{
> +	u64 gla;
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +	bool ret = true;
> +
> +	if (kvm_mmu_nested_guest_page_fault(vcpu))
> +		return ret;
> +	gla = kvmi_mmu_fault_gla(vcpu, gpa);
> +	ret = kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_R);

Should you not check the value of ret here before proceeding?
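
For example (sketch, assuming the intent is to bail out early when the 
event handler asks for a retry):

	ret = kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_R);
	if (!ret)
		return ret;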

> +	if (ivcpu && ivcpu->ctx_size > 0) {
> +		int s = min_t(int, bytes, ivcpu->ctx_size);
> +
> +		memcpy(new, ivcpu->ctx_data, s);
> +		ivcpu->ctx_size = 0;
> +
> +		if (*data_ready)
> +			kvm_err("Override custom data");
> +
> +		*data_ready = true;
> +	}
> +
> +	return ret;
> +}
> +
> +static bool kvmi_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa,
> +				const u8 *new,
> +				int bytes,
> +				struct kvm_page_track_notifier_node *node)
> +{
> +	u64 gla;
> +
> +	if (kvm_mmu_nested_guest_page_fault(vcpu))
> +		return true;
> +	gla = kvmi_mmu_fault_gla(vcpu, gpa);
> +	return kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_W);
> +}
> +
> +static bool kvmi_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa,
> +				struct kvm_page_track_notifier_node *node)
> +{
> +	u64 gla;
> +
> +	if (kvm_mmu_nested_guest_page_fault(vcpu))
> +		return true;
> +	gla = kvmi_mmu_fault_gla(vcpu, gpa);
> +
> +	return kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_X);
> +}
> +
> +static void kvmi_track_create_slot(struct kvm *kvm,
> +				   struct kvm_memory_slot *slot,
> +				   unsigned long npages,
> +				   struct kvm_page_track_notifier_node *node)
> +{
> +	struct kvmi *ikvm = IKVM(kvm);
> +	gfn_t start = slot->base_gfn;
> +	const gfn_t end = start + npages;
> +
> +	if (!ikvm)
> +		return;
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +
> +	while (start < end) {
> +		struct kvmi_mem_access *m;
> +
> +		m = kvmi_get_mem_access_unlocked(kvm, start);
> +		if (m)
> +			kvmi_apply_mem_access(kvm, slot, m);
> +		start++;
> +	}
> +
> +	mutex_unlock(&ikvm->access_tree_lock);
> +}
> +
> +static void kvmi_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
> +				  struct kvm_page_track_notifier_node *node)
> +{
> +	struct kvmi *ikvm = IKVM(kvm);
> +	gfn_t start = slot->base_gfn;
> +	const gfn_t end = start + slot->npages;
> +
> +	if (!ikvm)
> +		return;
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +
> +	while (start < end) {
> +		struct kvmi_mem_access *m;
> +
> +		m = kvmi_get_mem_access_unlocked(kvm, start);
> +		if (m) {
> +			u8 prev_access = m->access;
> +
> +			m->access = full_access;
> +			kvmi_apply_mem_access(kvm, slot, m);
> +			m->access = prev_access;
> +		}
> +		start++;
> +	}
> +
> +	mutex_unlock(&ikvm->access_tree_lock);
> +}
> +
> +static struct kvm_page_track_notifier_node kptn_node = {
> +	.track_preread = kvmi_track_preread,
> +	.track_prewrite = kvmi_track_prewrite,
> +	.track_preexec = kvmi_track_preexec,
> +	.track_create_slot = kvmi_track_create_slot,
> +	.track_flush_slot = kvmi_track_flush_slot
> +};
> +
> +bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu)
> +{
> +	kvm_info("Hooking vm with fd: %d\n", qemu->fd);
> +
> +	kvm_page_track_register_notifier(kvm, &kptn_node);
> +
> +	return (alloc_kvmi(kvm) && setup_socket(kvm, qemu));

Is this safe? It could return false if the alloc fails (in which case 
the caller has nothing to clean up) or if setting up the socket fails 
(in which case the caller needs to free the allocated kvmi).
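
A rough sketch of what I mean (the exact cleanup path is up to you):

	if (!alloc_kvmi(kvm))
		return false;

	if (!setup_socket(kvm, qemu)) {
		/* free the kvmi allocated above */
		return false;
	}

	return true;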

> +}
> +
> +void kvmi_destroy_vm(struct kvm *kvm)
> +{
> +	struct kvmi *ikvm = IKVM(kvm);
> +
> +	if (ikvm) {
> +		kvmi_msg_uninit(ikvm);
> +
> +		mutex_destroy(&ikvm->access_tree_lock);
> +		kfree(ikvm);
> +	}
> +
> +	kvmi_mem_destroy_vm(kvm);
> +}
> +
> +void kvmi_uninit(void)
> +{
> +	if (wq) {
> +		destroy_workqueue(wq);
> +		wq = NULL;
> +	}
> +}
> +
> +void kvmi_get_msrs(struct kvm_vcpu *vcpu, struct kvmi_event *event)
> +{
> +	struct msr_data msr;
> +
> +	msr.host_initiated = true;
> +
> +	msr.index = MSR_IA32_SYSENTER_CS;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.sysenter_cs = msr.data;
> +
> +	msr.index = MSR_IA32_SYSENTER_ESP;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.sysenter_esp = msr.data;
> +
> +	msr.index = MSR_IA32_SYSENTER_EIP;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.sysenter_eip = msr.data;
> +
> +	msr.index = MSR_EFER;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.efer = msr.data;
> +
> +	msr.index = MSR_STAR;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.star = msr.data;
> +
> +	msr.index = MSR_LSTAR;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.lstar = msr.data;
> +
> +	msr.index = MSR_CSTAR;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.cstar = msr.data;
> +
> +	msr.index = MSR_IA32_CR_PAT;
> +	kvm_get_msr(vcpu, &msr);
> +	event->msrs.pat = msr.data;
> +}
> +
> +static bool is_event_enabled(struct kvm *kvm, int event_bit)
> +{
> +	struct kvmi *ikvm = IKVM(kvm);
> +
> +	return (ikvm && (atomic_read(&ikvm->event_mask) & event_bit));
> +}
> +
> +static int kvmi_vcpu_kill(int sig, struct kvm_vcpu *vcpu)
> +{
> +	int err = -ESRCH;
> +	struct pid *pid;
> +	struct siginfo siginfo[1] = { };
> +
> +	rcu_read_lock();
> +	pid = rcu_dereference(vcpu->pid);
> +	if (pid)
> +		err = kill_pid_info(sig, siginfo, pid);
> +	rcu_read_unlock();
> +
> +	return err;
> +}
> +
> +static void kvmi_vm_shutdown(struct kvm *kvm)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +
> +	mutex_lock(&kvm->lock);
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		kvmi_vcpu_kill(SIGTERM, vcpu);
> +	}
> +	mutex_unlock(&kvm->lock);
> +}
> +
> +/* TODO: Do we need a return code ? */
> +static void handle_common_event_actions(struct kvm_vcpu *vcpu, u32 action)
> +{
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CRASH:
> +		kvmi_vm_shutdown(vcpu->kvm);
> +		break;
> +
> +	default:
> +		kvm_err("Unsupported event action: %d\n", action);
> +	}
> +}
> +
> +bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
> +		   unsigned long old_value, unsigned long *new_value)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	u64 ret_value;
> +	u32 action;
> +
> +	if (!is_event_enabled(kvm, KVMI_EVENT_CR))
> +		return true;
> +	if (!test_bit(cr, &IKVM(kvm)->cr_mask))
> +		return true;
> +	if (old_value == *new_value)
> +		return true;
> +
> +	action = kvmi_msg_send_cr(vcpu, cr, old_value, *new_value, &ret_value);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		*new_value = ret_value;
> +		return true;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	return false;
> +}
> +
> +bool kvmi_msr_event(struct kvm_vcpu *vcpu, struct msr_data *msr)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	u64 ret_value;
> +	u32 action;
> +	struct msr_data old_msr = { .host_initiated = true,
> +				    .index = msr->index };
> +
> +	if (msr->host_initiated)
> +		return true;
> +	if (!is_event_enabled(kvm, KVMI_EVENT_MSR))
> +		return true;
> +	if (!test_msr_mask(IKVM(kvm), msr->index))
> +		return true;
> +	if (kvm_get_msr(vcpu, &old_msr))
> +		return true;
> +	if (old_msr.data == msr->data)
> +		return true;
> +
> +	action = kvmi_msg_send_msr(vcpu, msr->index, old_msr.data, msr->data,
> +				   &ret_value);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		msr->data = ret_value;
> +		return true;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	return false;
> +}
> +
> +void kvmi_xsetbv_event(struct kvm_vcpu *vcpu)
> +{
> +	u32 action;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_XSETBV))
> +		return;
> +
> +	action = kvmi_msg_send_xsetbv(vcpu);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		break;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +}
> +
> +bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva)
> +{
> +	u32 action;
> +	u64 gpa;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_BREAKPOINT))
> +		/* qemu will automatically reinject the breakpoint */
> +		return false;
> +
> +	gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
> +
> +	if (gpa == UNMAPPED_GVA)
> +		kvm_err("%s: invalid gva: %llx", __func__, gva);

If the gpa is unmapped, shouldn't it return false rather than proceeding?
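
That is, something like (sketch):

	if (gpa == UNMAPPED_GVA) {
		kvm_err("%s: invalid gva: %llx", __func__, gva);
		return false;
	}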

> +
> +	action = kvmi_msg_send_bp(vcpu, gpa);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		break;
> +	case KVMI_EVENT_ACTION_RETRY:
> +		/* rip was most likely adjusted past the INT 3 instruction */
> +		return true;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	/* qemu will automatically reinject the breakpoint */
> +	return false;
> +}
> +EXPORT_SYMBOL(kvmi_breakpoint_event);
> +
> +#define KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT 24
> +bool kvmi_is_agent_hypercall(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long subfunc1, subfunc2;
> +	bool longmode = is_64_bit_mode(vcpu);
> +	unsigned long nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
> +
> +	if (longmode) {
> +		subfunc1 = kvm_register_read(vcpu, VCPU_REGS_RDI);
> +		subfunc2 = kvm_register_read(vcpu, VCPU_REGS_RSI);
> +	} else {
> +		nr &= 0xFFFFFFFF;
> +		subfunc1 = kvm_register_read(vcpu, VCPU_REGS_RBX);
> +		subfunc1 &= 0xFFFFFFFF;
> +		subfunc2 = kvm_register_read(vcpu, VCPU_REGS_RCX);
> +		subfunc2 &= 0xFFFFFFFF;
> +	}
> +
> +	return (nr == KVM_HC_XEN_HVM_OP
> +		&& subfunc1 == KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT
> +		&& subfunc2 == 0);
> +}
> +
> +void kvmi_hypercall_event(struct kvm_vcpu *vcpu)
> +{
> +	u32 action;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_HYPERCALL)
> +			|| !kvmi_is_agent_hypercall(vcpu))
> +		return;
> +
> +	action = kvmi_msg_send_hypercall(vcpu);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		break;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +}
> +
> +bool kvmi_page_fault_event(struct kvm_vcpu *vcpu, unsigned long gpa,
> +			   unsigned long gva, u8 access)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	struct kvmi_vcpu *ivcpu;
> +	bool trap_access, ret = true;
> +	u32 ctx_size;
> +	u64 old_rip;
> +	u32 action;
> +
> +	if (!is_event_enabled(kvm, KVMI_EVENT_PAGE_FAULT))
> +		return true;
> +
> +	/* Have we shown interest in this page? */
> +	if (!kvmi_test_mem_access(kvm, gpa, access))
> +		return true;
> +
> +	ivcpu    = IVCPU(vcpu);
> +	ctx_size = sizeof(ivcpu->ctx_data);
> +	old_rip  = kvm_rip_read(vcpu);

Why are these assignments aligned like this?

> +
> +	if (!kvmi_msg_send_pf(vcpu, gpa, gva, access, &action,
> +			      &trap_access,
> +			      ivcpu->ctx_data, &ctx_size))
> +		goto out;
> +
> +	ivcpu->ctx_size = 0;
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		ivcpu->ctx_size = ctx_size;
> +		break;
> +	case KVMI_EVENT_ACTION_RETRY:
> +		ret = false;
> +		break;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	/* TODO: trap_access -> don't REPeat the instruction */
> +out:
> +	trace_kvmi_event_page_fault(gpa, gva, access, old_rip, action,
> +				    kvm_rip_read(vcpu), ctx_size);
> +	return ret;
> +}
> +
> +bool kvmi_lost_exception(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +
> +	if (!ivcpu || !ivcpu->exception.injected)
> +		return false;
> +
> +	ivcpu->exception.injected = 0;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_TRAP))
> +		return false;
> +
> +	if ((vcpu->arch.exception.injected || vcpu->arch.exception.pending)
> +		&& vcpu->arch.exception.nr == ivcpu->exception.nr
> +		&& vcpu->arch.exception.error_code
> +			== ivcpu->exception.error_code)
> +		return false;
> +
> +	return true;
> +}
> +
> +void kvmi_trap_event(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +	u32 vector, type, err;
> +	u32 action;
> +
> +	if (vcpu->arch.exception.pending) {
> +		vector = vcpu->arch.exception.nr;
> +		err = vcpu->arch.exception.error_code;
> +
> +		if (kvm_exception_is_soft(vector))
> +			type = INTR_TYPE_SOFT_EXCEPTION;
> +		else
> +			type = INTR_TYPE_HARD_EXCEPTION;
> +	} else if (vcpu->arch.interrupt.pending) {
> +		vector = vcpu->arch.interrupt.nr;
> +		err = 0;
> +
> +		if (vcpu->arch.interrupt.soft)
> +			type = INTR_TYPE_SOFT_INTR;
> +		else
> +			type = INTR_TYPE_EXT_INTR;
> +	} else {
> +		vector = 0;
> +		type = 0;
> +		err = 0;
> +	}
> +
> +	kvm_err("New exception nr %d/%d err %x/%x addr %lx",
> +		vector, ivcpu->exception.nr,
> +		err, ivcpu->exception.error_code,
> +		vcpu->arch.cr2);
> +
> +	action = kvmi_msg_send_trap(vcpu, vector, type, err, vcpu->arch.cr2);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		break;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +}
> +
> +bool kvmi_descriptor_event(struct kvm_vcpu *vcpu, u32 info,
> +			   unsigned long exit_qualification,
> +			   unsigned char descriptor, unsigned char write)
> +{
> +	u32 action;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_DESCRIPTOR))
> +		return true;

How come it returns true here? The events below all return false from a 
similar condition check.

> +
> +	action = kvmi_msg_send_descriptor(vcpu, info, exit_qualification,
> +					  descriptor, write);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		return true;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	return false; /* TODO: double check this */
> +}
> +EXPORT_SYMBOL(kvmi_descriptor_event);
> +
> +static bool kvmi_create_vcpu_event(struct kvm_vcpu *vcpu)
> +{
> +	u32 action;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_CREATE_VCPU))
> +		return false;
> +
> +	action = kvmi_msg_send_create_vcpu(vcpu);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		break;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	return true;
> +}
> +
> +static bool kvmi_pause_vcpu_event(struct kvm_vcpu *vcpu)
> +{
> +	u32 action;
> +
> +	IVCPU(vcpu)->pause = false;
> +
> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_PAUSE_VCPU))
> +		return false;
> +
> +	action = kvmi_msg_send_pause_vcpu(vcpu);
> +
> +	switch (action) {
> +	case KVMI_EVENT_ACTION_CONTINUE:
> +		break;
> +	default:
> +		handle_common_event_actions(vcpu, action);
> +	}
> +
> +	return true;
> +}
> +
> +/* TODO: refactor this function to avoid recursive calls and the semaphore. */
> +void kvmi_handle_request(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +
> +	while (ivcpu->ev_rpl_waiting
> +		|| READ_ONCE(ivcpu->requests)) {
> +
> +		down(&ivcpu->sem_requests);
> +
> +		if (test_bit(REQ_INIT, &ivcpu->requests)) {
> +			/*
> +			 * kvmi_create_vcpu_event() may call this function
> +			 * again and won't return unless there is no more work
> +			 * to be done. The while condition will be evaluated
> +			 * to false, but we explicitly exit the loop to avoid
> +			 * surprising the reader more than we already did.
> +			 */
> +			kvmi_clear_request(ivcpu, REQ_INIT);
> +			if (kvmi_create_vcpu_event(vcpu))
> +				break;
> +		} else if (test_bit(REQ_CMD, &ivcpu->requests)) {
> +			kvmi_msg_handle_vcpu_cmd(vcpu);
> +			/* it will clear the REQ_CMD bit */
> +			if (ivcpu->pause && !ivcpu->ev_rpl_waiting) {
> +				/* Same warnings as with REQ_INIT. */
> +				if (kvmi_pause_vcpu_event(vcpu))
> +					break;
> +			}
> +		} else if (test_bit(REQ_REPLY, &ivcpu->requests)) {
> +			kvmi_clear_request(ivcpu, REQ_REPLY);
> +			ivcpu->ev_rpl_waiting = false;
> +			if (ivcpu->have_delayed_regs) {
> +				kvm_arch_vcpu_set_regs(vcpu,
> +							&ivcpu->delayed_regs);
> +				ivcpu->have_delayed_regs = false;
> +			}
> +			if (ivcpu->pause) {
> +				/* Same warnings as with REQ_INIT. */
> +				if (kvmi_pause_vcpu_event(vcpu))
> +					break;
> +			}
> +		} else if (test_bit(REQ_CLOSE, &ivcpu->requests)) {
> +			kvmi_clear_request(ivcpu, REQ_CLOSE);
> +			break;
> +		} else {
> +			kvm_err("Unexpected request");
> +		}
> +	}
> +
> +	kvmi_flush_mem_access(vcpu->kvm);
> +	/* TODO: merge with kvmi_set_mem_access() */
> +}
> +
> +int kvmi_cmd_get_cpuid(struct kvm_vcpu *vcpu, u32 function, u32 index,
> +		       u32 *eax, u32 *ebx, u32 *ecx, u32 *edx)
> +{
> +	struct kvm_cpuid_entry2 *e;
> +
> +	e = kvm_find_cpuid_entry(vcpu, function, index);
> +	if (!e)
> +		return -KVM_ENOENT;
> +
> +	*eax = e->eax;
> +	*ebx = e->ebx;
> +	*ecx = e->ecx;
> +	*edx = e->edx;
> +
> +	return 0;
> +}
> +
> +int kvmi_cmd_get_guest_info(struct kvm_vcpu *vcpu, u16 *vcpu_cnt, u64 *tsc)
> +{
> +	/*
> +	 * Should we switch vcpu_cnt to unsigned int?
> +	 * If not, we should limit this to max u16 - 1
> +	 */
> +	*vcpu_cnt = atomic_read(&vcpu->kvm->online_vcpus);
> +	if (kvm_has_tsc_control)
> +		*tsc = 1000ul * vcpu->arch.virtual_tsc_khz;
> +	else
> +		*tsc = 0;
> +
> +	return 0;
> +}
> +
> +static int get_first_vcpu(struct kvm *kvm, struct kvm_vcpu **vcpu)
> +{
> +	struct kvm_vcpu *v;
> +
> +	if (!atomic_read(&kvm->online_vcpus))
> +		return -KVM_EINVAL;
> +
> +	v = kvm_get_vcpu(kvm, 0);
> +
> +	if (!v)
> +		return -KVM_EINVAL;
> +
> +	*vcpu = v;
> +
> +	return 0;
> +}
> +
> +int kvmi_cmd_get_registers(struct kvm_vcpu *vcpu, u32 *mode,
> +			   struct kvm_regs *regs,
> +			   struct kvm_sregs *sregs, struct kvm_msrs *msrs)
> +{
> +	struct kvm_msr_entry  *msr = msrs->entries;
> +	unsigned int	       n   = msrs->nmsrs;

Again with randomly aligning variables...

> +
> +	kvm_arch_vcpu_ioctl_get_regs(vcpu, regs);
> +	kvm_arch_vcpu_ioctl_get_sregs(vcpu, sregs);
> +	*mode = kvmi_vcpu_mode(vcpu, sregs);
> +
> +	for (; n--; msr++) {

The condition of this for loop is a decrement with a side effect rather 
than a plain test, which makes it hard to follow. Either way, this is a 
pretty ugly way to write it.
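
An explicit counter would be easier to read, e.g. (untested):

	unsigned int i;

	for (i = 0; i < msrs->nmsrs; i++) {
		struct msr_data m = { .index = msrs->entries[i].index };

		if (kvm_get_msr(vcpu, &m))
			return -KVM_EINVAL;

		msrs->entries[i].data = m.data;
	}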

> +		struct msr_data m   = { .index = msr->index };
> +		int		err = kvm_get_msr(vcpu, &m);

And again with the alignment...

> +
> +		if (err)
> +			return -KVM_EINVAL;
> +
> +		msr->data = m.data;
> +	}
> +
> +	return 0;
> +}
> +
> +int kvmi_cmd_set_registers(struct kvm_vcpu *vcpu, const struct kvm_regs *regs)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +
> +	if (ivcpu->ev_rpl_waiting) {
> +		memcpy(&ivcpu->delayed_regs, regs, sizeof(ivcpu->delayed_regs));
> +		ivcpu->have_delayed_regs = true;
> +	} else
> +		kvm_err("Drop KVMI_SET_REGISTERS");

Since the if has braces, the else should too.

> +	return 0;
> +}
> +
> +int kvmi_cmd_get_page_access(struct kvm_vcpu *vcpu, u64 gpa, u8 *access)
> +{
> +	struct kvmi *ikvm = IKVM(vcpu->kvm);
> +	struct kvmi_mem_access *m;
> +
> +	mutex_lock(&ikvm->access_tree_lock);
> +	m = kvmi_get_mem_access_unlocked(vcpu->kvm, gpa_to_gfn(gpa));
> +	*access = m ? m->access : full_access;
> +	mutex_unlock(&ikvm->access_tree_lock);
> +
> +	return 0;
> +}
> +
> +static bool is_vector_valid(u8 vector)
> +{
> +	return true;
> +}
> +
> +static bool is_gva_valid(struct kvm_vcpu *vcpu, u64 gva)
> +{
> +	return true;
> +}
> +
> +int kvmi_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
> +			      bool error_code_valid, u16 error_code,
> +			      u64 address)
> +{
> +	struct x86_exception e = {
> +		.vector = vector,
> +		.error_code_valid = error_code_valid,
> +		.error_code = error_code,
> +		.address = address,
> +	};
> +
> +	if (!(is_vector_valid(vector) && is_gva_valid(vcpu, address)))
> +		return -KVM_EINVAL;
> +
> +	if (e.vector == PF_VECTOR)
> +		kvm_inject_page_fault(vcpu, &e);
> +	else if (e.error_code_valid)
> +		kvm_queue_exception_e(vcpu, e.vector, e.error_code);
> +	else
> +		kvm_queue_exception(vcpu, e.vector);
> +
> +	if (IVCPU(vcpu)->exception.injected)
> +		kvm_err("Override exception");
> +
> +	IVCPU(vcpu)->exception.injected = 1;
> +	IVCPU(vcpu)->exception.nr = e.vector;
> +	IVCPU(vcpu)->exception.error_code = error_code_valid ? error_code : 0;
> +
> +	return 0;
> +}
> +
> +unsigned long gfn_to_hva_safe(struct kvm *kvm, gfn_t gfn)
> +{
> +	unsigned long hva;
> +
> +	mutex_lock(&kvm->slots_lock);
> +	hva = gfn_to_hva(kvm, gfn);
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	return hva;
> +}
> +
> +static long get_user_pages_remote_unlocked(struct mm_struct *mm,
> +					   unsigned long start,
> +					   unsigned long nr_pages,
> +					   unsigned int gup_flags,
> +					   struct page **pages)
> +{
> +	long ret;
> +	struct task_struct *tsk = NULL;
> +	struct vm_area_struct **vmas = NULL;
> +	int locked = 1;
> +
> +	down_read(&mm->mmap_sem);
> +	ret =
> +	    get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
> +				  vmas, &locked);

Couldn't this just be "ret = get_user_pages_remote(..." with the argument 
list wrapped at a later point?
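
i.e.:

	ret = get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
				    pages, vmas, &locked);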

> +	if (locked)
> +		up_read(&mm->mmap_sem);
> +	return ret;
> +}
> +
> +int kvmi_cmd_read_physical(struct kvm *kvm, u64 gpa, u64 size, int (*send)(
> +				   struct kvmi *, const struct kvmi_msg_hdr *,
> +				   int err, const void *buf, size_t),
> +				   const struct kvmi_msg_hdr *ctx)
> +{
> +	int err, ec;
> +	unsigned long hva;
> +	struct page *page = NULL;
> +	void *ptr_page = NULL, *ptr = NULL;
> +	size_t ptr_size = 0;
> +	struct kvm_vcpu *vcpu;
> +
> +	ec = get_first_vcpu(kvm, &vcpu);
> +
> +	if (ec)
> +		goto out;
> +
> +	hva = gfn_to_hva_safe(kvm, gpa_to_gfn(gpa));
> +
> +	if (kvm_is_error_hva(hva)) {
> +		ec = -KVM_EINVAL;
> +		goto out;
> +	}
> +
> +	if (get_user_pages_remote_unlocked(kvm->mm, hva, 1, 0, &page) != 1) {
> +		ec = -KVM_EINVAL;
> +		goto out;
> +	}
> +
> +	ptr_page = kmap_atomic(page);
> +
> +	ptr = ptr_page + (gpa & ~PAGE_MASK);
> +	ptr_size = size;
> +
> +out:
> +	err = send(IKVM(kvm), ctx, ec, ptr, ptr_size);
> +
> +	if (ptr_page)
> +		kunmap_atomic(ptr_page);
> +	if (page)
> +		put_page(page);
> +	return err;
> +}
> +
> +int kvmi_cmd_write_physical(struct kvm *kvm, u64 gpa, u64 size, const void *buf)
> +{
> +	int err;
> +	unsigned long hva;
> +	struct page *page;
> +	void *ptr;
> +	struct kvm_vcpu *vcpu;
> +
> +	err = get_first_vcpu(kvm, &vcpu);
> +
> +	if (err)
> +		return err;
> +
> +	hva = gfn_to_hva_safe(kvm, gpa_to_gfn(gpa));
> +
> +	if (kvm_is_error_hva(hva))
> +		return -KVM_EINVAL;
> +
> +	if (get_user_pages_remote_unlocked(kvm->mm, hva, 1, FOLL_WRITE,
> +			&page) != 1)
> +		return -KVM_EINVAL;
> +
> +	ptr = kmap_atomic(page);
> +
> +	memcpy(ptr + (gpa & ~PAGE_MASK), buf, size);
> +
> +	kunmap_atomic(ptr);
> +	put_page(page);
> +
> +	return 0;
> +}
> +
> +int kvmi_cmd_alloc_token(struct kvm *kvm, struct kvmi_map_mem_token *token)
> +{
> +	int err = 0;
> +
> +	/* create random token */
> +	get_random_bytes(token, sizeof(struct kvmi_map_mem_token));
> +
> +	/* store token in HOST database */
> +	if (kvmi_store_token(kvm, token))
> +		err = -KVM_ENOMEM;
> +
> +	return err;
> +}

It seems like you could get rid of err altogether and just return 
-KVM_ENOMEM directly from the if body and 0 at the end.
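
e.g. (untested):

	/* create random token */
	get_random_bytes(token, sizeof(*token));

	/* store it in the host database */
	if (kvmi_store_token(kvm, token))
		return -KVM_ENOMEM;

	return 0;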

> +
> +int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, u32 events)
> +{
> +	int err = 0;
> +
> +	if (events & ~KVMI_KNOWN_EVENTS)
> +		return -KVM_EINVAL;
> +
> +	if (events & KVMI_EVENT_BREAKPOINT) {
> +		if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_BREAKPOINT)) {
> +			struct kvm_guest_debug dbg = { };
> +
> +			dbg.control =
> +			    KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP;
> +
> +			err = kvm_arch_vcpu_ioctl_set_guest_debug(vcpu, &dbg);
> +		}
> +	}
> +
> +	if (!err)
> +		atomic_set(&IKVM(vcpu->kvm)->event_mask, events);
> +
> +	return err;
> +}
> +
> +int kvmi_cmd_control_cr(struct kvmi *ikvm, bool enable, u32 cr)
> +{
> +	switch (cr) {
> +	case 0:
> +	case 3:
> +	case 4:
> +		if (enable)
> +			set_bit(cr, &ikvm->cr_mask);
> +		else
> +			clear_bit(cr, &ikvm->cr_mask);
> +		return 0;
> +
> +	default:
> +		return -KVM_EINVAL;
> +	}
> +}
> +
> +int kvmi_cmd_control_msr(struct kvm *kvm, bool enable, u32 msr)
> +{
> +	struct kvm_vcpu *vcpu;
> +	int err;
> +
> +	err = get_first_vcpu(kvm, &vcpu);
> +	if (err)
> +		return err;
> +
> +	err = msr_control(IKVM(kvm), msr, enable);
> +
> +	if (!err)
> +		kvm_arch_msr_intercept(vcpu, msr, enable);
> +
> +	return err;
> +}
> +
> +void wakeup_events(struct kvm *kvm)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +
> +	mutex_lock(&kvm->lock);
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		kvmi_make_request(IVCPU(vcpu), REQ_CLOSE);
> +	mutex_unlock(&kvm->lock);
> +}
> diff --git a/virt/kvm/kvmi_int.h b/virt/kvm/kvmi_int.h
> new file mode 100644
> index 000000000000..5976b98f11cb
> --- /dev/null
> +++ b/virt/kvm/kvmi_int.h
> @@ -0,0 +1,121 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __KVMI_INT_H__
> +#define __KVMI_INT_H__
> +
> +#include <linux/types.h>
> +#include <linux/kvm_host.h>
> +
> +#include <uapi/linux/kvmi.h>
> +
> +#define IVCPU(vcpu) ((struct kvmi_vcpu *)((vcpu)->kvmi))
> +
> +struct kvmi_vcpu {
> +	u8 ctx_data[256];
> +	u32 ctx_size;
> +	struct semaphore sem_requests;
> +	unsigned long requests;
> +	/* TODO: get this ~64KB buffer from a cache */
> +	u8 msg_buf[KVMI_MAX_MSG_SIZE];
> +	struct kvmi_event_reply ev_rpl;
> +	void *ev_rpl_ptr;
> +	size_t ev_rpl_size;
> +	size_t ev_rpl_received;
> +	u32 ev_seq;
> +	bool ev_rpl_waiting;
> +	struct {
> +		u16 error_code;
> +		u8 nr;
> +		bool injected;
> +	} exception;
> +	struct kvm_regs delayed_regs;
> +	bool have_delayed_regs;
> +	bool pause;
> +};
> +
> +#define IKVM(kvm) ((struct kvmi *)((kvm)->kvmi))
> +
> +struct kvmi {
> +	atomic_t event_mask;
> +	unsigned long cr_mask;
> +	struct {
> +		unsigned long low[BITS_TO_LONGS(8192)];
> +		unsigned long high[BITS_TO_LONGS(8192)];
> +	} msr_mask;
> +	struct radix_tree_root access_tree;
> +	struct mutex access_tree_lock;
> +	struct list_head access_list;
> +	struct work_struct work;
> +	struct socket *sock;
> +	rwlock_t sock_lock;
> +	struct completion finished;
> +	struct kvm *kvm;
> +	/* TODO: get this ~64KB buffer from a cache */
> +	u8 msg_buf[KVMI_MAX_MSG_SIZE];
> +	u32 cmd_allow_mask;
> +	u32 event_allow_mask;
> +};
> +
> +#define REQ_INIT   0
> +#define REQ_CMD    1
> +#define REQ_REPLY  2
> +#define REQ_CLOSE  3

Would these be better off being an enum?
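
For example (sketch only):

	enum kvmi_vcpu_request {
		REQ_INIT,
		REQ_CMD,
		REQ_REPLY,
		REQ_CLOSE,
	};

The values stay 0..3, so the test_bit()/clear_bit() calls on
ivcpu->requests keep working unchanged.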

> +
> +/* kvmi_msg.c */
> +bool kvmi_msg_init(struct kvmi *ikvm, int fd);
> +bool kvmi_msg_process(struct kvmi *ikvm);
> +void kvmi_msg_uninit(struct kvmi *ikvm);
> +void kvmi_msg_handle_vcpu_cmd(struct kvm_vcpu *vcpu);
> +u32 kvmi_msg_send_cr(struct kvm_vcpu *vcpu, u32 cr, u64 old_value,
> +		     u64 new_value, u64 *ret_value);
> +u32 kvmi_msg_send_msr(struct kvm_vcpu *vcpu, u32 msr, u64 old_value,
> +		      u64 new_value, u64 *ret_value);
> +u32 kvmi_msg_send_xsetbv(struct kvm_vcpu *vcpu);
> +u32 kvmi_msg_send_bp(struct kvm_vcpu *vcpu, u64 gpa);
> +u32 kvmi_msg_send_hypercall(struct kvm_vcpu *vcpu);
> +bool kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u32 mode,
> +		      u32 *action, bool *trap_access, u8 *ctx,
> +		      u32 *ctx_size);
> +u32 kvmi_msg_send_trap(struct kvm_vcpu *vcpu, u32 vector, u32 type,
> +		       u32 error_code, u64 cr2);
> +u32 kvmi_msg_send_descriptor(struct kvm_vcpu *vcpu, u32 info,
> +			     u64 exit_qualification, u8 descriptor, u8 write);
> +u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu);
> +u32 kvmi_msg_send_pause_vcpu(struct kvm_vcpu *vcpu);
> +
> +/* kvmi.c */
> +int kvmi_cmd_get_guest_info(struct kvm_vcpu *vcpu, u16 *vcpu_cnt, u64 *tsc);
> +int kvmi_cmd_pause_vcpu(struct kvm_vcpu *vcpu);
> +int kvmi_cmd_get_registers(struct kvm_vcpu *vcpu, u32 *mode,
> +			   struct kvm_regs *regs, struct kvm_sregs *sregs,
> +			   struct kvm_msrs *msrs);
> +int kvmi_cmd_set_registers(struct kvm_vcpu *vcpu, const struct kvm_regs *regs);
> +int kvmi_cmd_get_page_access(struct kvm_vcpu *vcpu, u64 gpa, u8 *access);
> +int kvmi_cmd_inject_exception(struct kvm_vcpu *vcpu, u8 vector,
> +			      bool error_code_valid, u16 error_code,
> +			      u64 address);
> +int kvmi_cmd_control_events(struct kvm_vcpu *vcpu, u32 events);
> +int kvmi_cmd_get_cpuid(struct kvm_vcpu *vcpu, u32 function, u32 index,
> +		       u32 *eax, u32 *ebx, u32 *rcx, u32 *edx);
> +int kvmi_cmd_read_physical(struct kvm *kvm, u64 gpa, u64 size,
> +			   int (*send)(struct kvmi *,
> +					const struct kvmi_msg_hdr*,
> +					int err, const void *buf, size_t),
> +			   const struct kvmi_msg_hdr *ctx);
> +int kvmi_cmd_write_physical(struct kvm *kvm, u64 gpa, u64 size,
> +			    const void *buf);
> +int kvmi_cmd_alloc_token(struct kvm *kvm, struct kvmi_map_mem_token *token);
> +int kvmi_cmd_control_cr(struct kvmi *ikvm, bool enable, u32 cr);
> +int kvmi_cmd_control_msr(struct kvm *kvm, bool enable, u32 msr);
> +int kvmi_set_mem_access(struct kvm *kvm, u64 gpa, u8 access);
> +void kvmi_make_request(struct kvmi_vcpu *ivcpu, int req);
> +void kvmi_clear_request(struct kvmi_vcpu *ivcpu, int req);
> +unsigned int kvmi_vcpu_mode(const struct kvm_vcpu *vcpu,
> +			    const struct kvm_sregs *sregs);
> +void kvmi_get_msrs(struct kvm_vcpu *vcpu, struct kvmi_event *event);
> +unsigned long gfn_to_hva_safe(struct kvm *kvm, gfn_t gfn);
> +void kvmi_mem_destroy_vm(struct kvm *kvm);
> +
> +/* kvmi_mem.c */
> +int kvmi_store_token(struct kvm *kvm, struct kvmi_map_mem_token *token);
> +
> +#endif
> diff --git a/virt/kvm/kvmi_mem.c b/virt/kvm/kvmi_mem.c
> new file mode 100644
> index 000000000000..c766357678e6
> --- /dev/null
> +++ b/virt/kvm/kvmi_mem.c
> @@ -0,0 +1,730 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KVM introspection memory mapping implementation
> + *
> + * Copyright (C) 2017 Bitdefender S.R.L.
> + *
> + * Author:
> + *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/kvm_host.h>
> +#include <linux/rmap.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/pagemap.h>
> +#include <linux/swap.h>
> +#include <linux/spinlock.h>
> +#include <linux/printk.h>
> +#include <linux/kvmi.h>
> +#include <linux/huge_mm.h>
> +
> +#include <uapi/linux/kvmi.h>
> +
> +#include "kvmi_int.h"
> +
> +
> +static struct list_head mapping_list;
> +static spinlock_t mapping_lock;
> +
> +struct host_map {
> +	struct list_head mapping_list;
> +	gpa_t map_gpa;
> +	struct kvm *machine;
> +	gpa_t req_gpa;
> +};
> +
> +
> +static struct list_head token_list;
> +static spinlock_t token_lock;
> +
> +struct token_entry {
> +	struct list_head token_list;
> +	struct kvmi_map_mem_token token;
> +	struct kvm *kvm;
> +};
> +
> +
> +int kvmi_store_token(struct kvm *kvm, struct kvmi_map_mem_token *token)
> +{
> +	struct token_entry *tep;
> +
> +	print_hex_dump_debug("kvmi: new token ", DUMP_PREFIX_NONE,
> +			     32, 1, token, sizeof(struct kvmi_map_mem_token),
> +			     false);
> +
> +	tep = kmalloc(sizeof(struct token_entry), GFP_KERNEL);

	tep = kmalloc(sizeof(*tep), GFP_KERNEL);

> +	if (tep == NULL)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&tep->token_list);
> +	memcpy(&tep->token, token, sizeof(struct kvmi_map_mem_token));

Here too it might be better to do "sizeof(*token)"

> +	tep->kvm = kvm;
> +
> +	spin_lock(&token_lock);
> +	list_add_tail(&tep->token_list, &token_list);
> +	spin_unlock(&token_lock);
> +
> +	return 0;
> +}
> +
> +static struct kvm *find_machine_at(struct kvm_vcpu *vcpu, gva_t tkn_gva)
> +{
> +	long result;
> +	gpa_t tkn_gpa;
> +	struct kvmi_map_mem_token token;
> +	struct list_head *cur;
> +	struct token_entry *tep, *found = NULL;
> +	struct kvm *target_kvm = NULL;
> +
> +	/* machine token is passed as pointer */
> +	tkn_gpa = kvm_mmu_gva_to_gpa_system(vcpu, tkn_gva, NULL);
> +	if (tkn_gpa == UNMAPPED_GVA)
> +		return NULL;
> +
> +	/* copy token to local address space */
> +	result = kvm_read_guest(vcpu->kvm, tkn_gpa, &token, sizeof(token));
> +	if (IS_ERR_VALUE(result)) {
> +		kvm_err("kvmi: failed copying token from user\n");
> +		return ERR_PTR(result);
> +	}
> +
> +	/* consume token & find the VM */
> +	spin_lock(&token_lock);
> +	list_for_each(cur, &token_list) {
> +		tep = list_entry(cur, struct token_entry, token_list);
> +
> +		if (!memcmp(&token, &tep->token, sizeof(token))) {
> +			list_del(&tep->token_list);
> +			found = tep;
> +			break;
> +		}
> +	}
> +	spin_unlock(&token_lock);
> +
> +	if (found != NULL) {
> +		target_kvm = found->kvm;
> +		kfree(found);
> +	}
> +
> +	return target_kvm;
> +}
> +
> +static void remove_vm_token(struct kvm *kvm)
> +{
> +	struct list_head *cur, *next;
> +	struct token_entry *tep;
> +
> +	spin_lock(&token_lock);
> +	list_for_each_safe(cur, next, &token_list) {
> +		tep = list_entry(cur, struct token_entry, token_list);
> +
> +		if (tep->kvm == kvm) {
> +			list_del(&tep->token_list);
> +			kfree(tep);
> +		}
> +	}
> +	spin_unlock(&token_lock);
> +
> +}

There's an extra blank line at the end of this function (before the brace).

> +
> +
> +static int add_to_list(gpa_t map_gpa, struct kvm *machine, gpa_t req_gpa)
> +{
> +	struct host_map *map;
> +
> +	map = kmalloc(sizeof(struct host_map), GFP_KERNEL);

	map = kmalloc(sizeof(*map), GFP_KERNEL);

> +	if (map == NULL)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&map->mapping_list);
> +	map->map_gpa = map_gpa;
> +	map->machine = machine;
> +	map->req_gpa = req_gpa;
> +
> +	spin_lock(&mapping_lock);
> +	list_add_tail(&map->mapping_list, &mapping_list);
> +	spin_unlock(&mapping_lock);
> +
> +	return 0;
> +}
> +
> +static struct host_map *extract_from_list(gpa_t map_gpa)
> +{
> +	struct list_head *cur;
> +	struct host_map *map;
> +
> +	spin_lock(&mapping_lock);
> +	list_for_each(cur, &mapping_list) {
> +		map = list_entry(cur, struct host_map, mapping_list);
> +
> +		/* found - extract and return */
> +		if (map->map_gpa == map_gpa) {
> +			list_del(&map->mapping_list);
> +			spin_unlock(&mapping_lock);
> +
> +			return map;
> +		}
> +	}
> +	spin_unlock(&mapping_lock);
> +
> +	return NULL;
> +}
> +
> +static void remove_vm_from_list(struct kvm *kvm)
> +{
> +	struct list_head *cur, *next;
> +	struct host_map *map;
> +
> +	spin_lock(&mapping_lock);
> +
> +	list_for_each_safe(cur, next, &mapping_list) {
> +		map = list_entry(cur, struct host_map, mapping_list);
> +
> +		if (map->machine == kvm) {
> +			list_del(&map->mapping_list);
> +			kfree(map);
> +		}
> +	}
> +
> +	spin_unlock(&mapping_lock);
> +}
> +
> +static void remove_entry(struct host_map *map)
> +{
> +	kfree(map);
> +}
> +
> +
> +static struct vm_area_struct *isolate_page_vma(struct vm_area_struct *vma,
> +					       unsigned long addr)
> +{
> +	int result;
> +
> +	/* corner case */
> +	if (vma_pages(vma) == 1)
> +		return vma;
> +
> +	if (addr != vma->vm_start) {
> +		/* first split only if address in the middle */
> +		result = split_vma(vma->vm_mm, vma, addr, false);
> +		if (IS_ERR_VALUE((long)result))
> +			return ERR_PTR((long)result);
> +
> +		vma = find_vma(vma->vm_mm, addr);
> +		if (vma == NULL)
> +			return ERR_PTR(-ENOENT);
> +
> +		/* corner case (again) */
> +		if (vma_pages(vma) == 1)
> +			return vma;
> +	}
> +
> +	result = split_vma(vma->vm_mm, vma, addr + PAGE_SIZE, true);
> +	if (IS_ERR_VALUE((long)result))
> +		return ERR_PTR((long)result);
> +
> +	vma = find_vma(vma->vm_mm, addr);
> +	if (vma == NULL)
> +		return ERR_PTR(-ENOENT);
> +
> +	BUG_ON(vma_pages(vma) != 1);
> +
> +	return vma;
> +}
> +
> +static int redirect_rmap(struct vm_area_struct *req_vma, struct page *req_page,
> +			 struct vm_area_struct *map_vma)
> +{
> +	int result;
> +
> +	unlink_anon_vmas(map_vma);
> +
> +	result = anon_vma_fork(map_vma, req_vma);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out;

Why not just return result here?
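
i.e. (sketch):

	result = anon_vma_fork(map_vma, req_vma);
	if (IS_ERR_VALUE((long)result))
		return result;

	page_dup_rmap(req_page, false);

	return 0;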

> +
> +	page_dup_rmap(req_page, false);
> +
> +out:
> +	return result;
> +}
> +
> +static int host_map_fix_ptes(struct vm_area_struct *map_vma, hva_t map_hva,
> +			     struct page *req_page, struct page *map_page)
> +{
> +	struct mm_struct *map_mm = map_vma->vm_mm;
> +
> +	pmd_t *pmd;
> +	pte_t *ptep;
> +	spinlock_t *ptl;
> +	pte_t newpte;
> +
> +	unsigned long mmun_start;
> +	unsigned long mmun_end;
> +
> +	/* classic replace_page() code */
> +	pmd = mm_find_pmd(map_mm, map_hva);
> +	if (!pmd)
> +		return -EFAULT;
> +
> +	mmun_start = map_hva;
> +	mmun_end = map_hva + PAGE_SIZE;
> +	mmu_notifier_invalidate_range_start(map_mm, mmun_start, mmun_end);
> +
> +	ptep = pte_offset_map_lock(map_mm, pmd, map_hva, &ptl);
> +
> +	/* create new PTE based on requested page */
> +	newpte = mk_pte(req_page, map_vma->vm_page_prot);
> +	newpte = pte_set_flags(newpte, pte_flags(*ptep));
> +
> +	flush_cache_page(map_vma, map_hva, pte_pfn(*ptep));
> +	ptep_clear_flush_notify(map_vma, map_hva, ptep);
> +	set_pte_at_notify(map_mm, map_hva, ptep, newpte);
> +
> +	pte_unmap_unlock(ptep, ptl);
> +
> +	mmu_notifier_invalidate_range_end(map_mm, mmun_start, mmun_end);
> +
> +	return 0;
> +}
> +
> +static void discard_page(struct page *map_page)
> +{
> +	lock_page(map_page);
> +	// TODO: put_anon_vma() ???? - should be here
> +	page_remove_rmap(map_page, false);
> +	if (!page_mapped(map_page))
> +		try_to_free_swap(map_page);
> +	unlock_page(map_page);
> +	put_page(map_page);
> +}
> +
> +static void kvmi_split_huge_pmd(struct vm_area_struct *req_vma,
> +				hva_t req_hva, struct page *req_page)
> +{
> +	bool tail = false;
> +
> +	/* move reference count from compound head... */
> +	if (PageTail(req_page)) {
> +		tail = true;
> +		put_page(req_page);
> +	}
> +
> +	if (PageCompound(req_page))
> +		split_huge_pmd_address(req_vma, req_hva, false, NULL);
> +
> +	/* ... to the actual page, after splitting */
> +	if (tail)
> +		get_page(req_page);
> +}
> +
> +static int kvmi_map_action(struct mm_struct *req_mm, hva_t req_hva,
> +			   struct mm_struct *map_mm, hva_t map_hva)
> +{
> +	struct vm_area_struct *req_vma;
> +	struct page *req_page = NULL;
> +
> +	struct vm_area_struct *map_vma;
> +	struct page *map_page;
> +
> +	long nrpages;
> +	int result = 0;
> +
> +	/* VMAs will be modified */
> +	down_write(&req_mm->mmap_sem);
> +	down_write(&map_mm->mmap_sem);
> +
> +	/* get host page corresponding to requested address */
> +	nrpages = get_user_pages_remote(NULL, req_mm,
> +		req_hva, 1, 0,
> +		&req_page, &req_vma, NULL);
> +	if (nrpages == 0) {
> +		kvm_err("kvmi: no page for req_hva %016lx\n", req_hva);
> +		result = -ENOENT;
> +		goto out_err;
> +	} else if (IS_ERR_VALUE(nrpages)) {
> +		result = nrpages;
> +		kvm_err("kvmi: get_user_pages_remote() failed with result %d\n",
> +			result);
> +		goto out_err;
> +	}
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(req_page, "req_page before remap");
> +
> +	/* find (not get) local page corresponding to target address */
> +	map_vma = find_vma(map_mm, map_hva);
> +	if (map_vma == NULL) {
> +		kvm_err("kvmi: no local VMA found for remapping\n");
> +		result = -ENOENT;
> +		goto out_err;
> +	}
> +
> +	map_page = follow_page(map_vma, map_hva, 0);
> +	if (IS_ERR_VALUE(map_page)) {
> +		result = PTR_ERR(map_page);
> +		kvm_debug("kvmi: follow_page() failed with result %d\n",
> +			result);
> +		goto out_err;
> +	} else if (map_page == NULL) {
> +		result = -ENOENT;
> +		kvm_debug("kvmi: follow_page() returned no page\n");
> +		goto out_err;
> +	}
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(map_page, "map_page before remap");
> +
> +	/* split local VMA for rmap redirecting */
> +	map_vma = isolate_page_vma(map_vma, map_hva);
> +	if (IS_ERR_VALUE(map_vma)) {
> +		result = PTR_ERR(map_vma);
> +		kvm_debug("kvmi: isolate_page_vma() failed with result %d\n",
> +			result);
> +		goto out_err;
> +	}
> +
> +	/* split remote huge page */
> +	kvmi_split_huge_pmd(req_vma, req_hva, req_page);
> +
> +	/* re-link VMAs */
> +	result = redirect_rmap(req_vma, req_page, map_vma);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out_err;
> +
> +	/* also redirect page tables */
> +	result = host_map_fix_ptes(map_vma, map_hva, req_page, map_page);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out_err;
> +
> +	/* the old page will be discarded */
> +	discard_page(map_page);
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(map_page, "map_page after being discarded");
> +
> +	/* done */
> +	goto out_finalize;
> +
> +out_err:
> +	/* get_user_pages_remote() incremented page reference count */
> +	if (req_page != NULL)
> +		put_page(req_page);
> +
> +out_finalize:
> +	/* release semaphores in reverse order */
> +	up_write(&map_mm->mmap_sem);
> +	up_write(&req_mm->mmap_sem);
> +
> +	return result;
> +}
> +
> +int kvmi_host_mem_map(struct kvm_vcpu *vcpu, gva_t tkn_gva,
> +	gpa_t req_gpa, gpa_t map_gpa)
> +{
> +	int result = 0;
> +	struct kvm *target_kvm;
> +
> +	gfn_t req_gfn;
> +	hva_t req_hva;
> +	struct mm_struct *req_mm;
> +
> +	gfn_t map_gfn;
> +	hva_t map_hva;
> +	struct mm_struct *map_mm = vcpu->kvm->mm;
> +
> +	kvm_debug("kvmi: mapping request req_gpa %016llx, map_gpa %016llx\n",
> +		  req_gpa, map_gpa);
> +
> +	/* get the struct kvm * corresponding to the token */
> +	target_kvm = find_machine_at(vcpu, tkn_gva);
> +	if (IS_ERR_VALUE(target_kvm))
> +		return PTR_ERR(target_kvm);

Since the else if block below has braces, this if block should have 
braces too.
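
i.e.:

	if (IS_ERR_VALUE(target_kvm)) {
		return PTR_ERR(target_kvm);
	} else if (target_kvm == NULL) {
		kvm_err("kvmi: unable to find target machine\n");
		return -ENOENT;
	}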

> +	else if (target_kvm == NULL) {
> +		kvm_err("kvmi: unable to find target machine\n");
> +		return -ENOENT;
> +	}
> +	kvm_get_kvm(target_kvm);
> +	req_mm = target_kvm->mm;
> +
> +	/* translate source addresses */
> +	req_gfn = gpa_to_gfn(req_gpa);
> +	req_hva = gfn_to_hva_safe(target_kvm, req_gfn);
> +	if (kvm_is_error_hva(req_hva)) {
> +		kvm_err("kvmi: invalid req HVA %016lx\n", req_hva);
> +		result = -EFAULT;
> +		goto out;
> +	}
> +
> +	kvm_debug("kvmi: req_gpa %016llx, req_gfn %016llx, req_hva %016lx\n",
> +		  req_gpa, req_gfn, req_hva);
> +
> +	/* translate destination addresses */
> +	map_gfn = gpa_to_gfn(map_gpa);
> +	map_hva = gfn_to_hva_safe(vcpu->kvm, map_gfn);
> +	if (kvm_is_error_hva(map_hva)) {
> +		kvm_err("kvmi: invalid map HVA %016lx\n", map_hva);
> +		result = -EFAULT;
> +		goto out;
> +	}
> +
> +	kvm_debug("kvmi: map_gpa %016llx, map_gfn %016llx, map_hva %016lx\n",
> +		map_gpa, map_gfn, map_hva);
> +
> +	/* go to step 2 */
> +	result = kvmi_map_action(req_mm, req_hva, map_mm, map_hva);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out;
> +
> +	/* add mapping to list */
> +	result = add_to_list(map_gpa, target_kvm, req_gpa);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out;
> +
> +	/* all fine */
> +	kvm_debug("kvmi: mapping of req_gpa %016llx successful\n", req_gpa);
> +
> +out:
> +	/* mandatory reference count decrement */
> +	kvm_put_kvm(target_kvm);
> +
> +	return result;
> +}
> +
> +
> +static int restore_rmap(struct vm_area_struct *map_vma, hva_t map_hva,
> +			struct page *req_page, struct page *new_page)
> +{
> +	int result;
> +
> +	/* decouple links to anon_vmas */
> +	unlink_anon_vmas(map_vma);
> +	map_vma->anon_vma = NULL;
> +
> +	/* allocate new anon_vma */
> +	result = anon_vma_prepare(map_vma);
> +	if (IS_ERR_VALUE((long)result))
> +		return result;
> +
> +	lock_page(new_page);
> +	page_add_new_anon_rmap(new_page, map_vma, map_hva, false);
> +	unlock_page(new_page);
> +
> +	/* decrease req_page mapcount */
> +	atomic_dec(&req_page->_mapcount);
> +
> +	return 0;
> +}
> +
> +static int host_unmap_fix_ptes(struct vm_area_struct *map_vma, hva_t map_hva,
> +			       struct page *new_page)
> +{
> +	struct mm_struct *map_mm = map_vma->vm_mm;
> +	pmd_t *pmd;
> +	pte_t *ptep;
> +	spinlock_t *ptl;
> +	pte_t newpte;
> +
> +	unsigned long mmun_start;
> +	unsigned long mmun_end;
> +
> +	/* page replacing code */
> +	pmd = mm_find_pmd(map_mm, map_hva);
> +	if (!pmd)
> +		return -EFAULT;
> +
> +	mmun_start = map_hva;
> +	mmun_end = map_hva + PAGE_SIZE;
> +	mmu_notifier_invalidate_range_start(map_mm, mmun_start, mmun_end);
> +
> +	ptep = pte_offset_map_lock(map_mm, pmd, map_hva, &ptl);
> +
> +	newpte = mk_pte(new_page, map_vma->vm_page_prot);
> +	newpte = pte_set_flags(newpte, pte_flags(*ptep));
> +
> +	/* clear cache & MMU notifier entries */
> +	flush_cache_page(map_vma, map_hva, pte_pfn(*ptep));
> +	ptep_clear_flush_notify(map_vma, map_hva, ptep);
> +	set_pte_at_notify(map_mm, map_hva, ptep, newpte);
> +
> +	pte_unmap_unlock(ptep, ptl);
> +
> +	mmu_notifier_invalidate_range_end(map_mm, mmun_start, mmun_end);
> +
> +	return 0;
> +}
> +
> +static int kvmi_unmap_action(struct mm_struct *req_mm,
> +			     struct mm_struct *map_mm, hva_t map_hva)
> +{
> +	struct vm_area_struct *map_vma;
> +	struct page *req_page = NULL;
> +	struct page *new_page = NULL;
> +
> +	int result;
> +
> +	/* VMAs will be modified */
> +	down_write(&req_mm->mmap_sem);
> +	down_write(&map_mm->mmap_sem);
> +
> +	/* find destination VMA for mapping */
> +	map_vma = find_vma(map_mm, map_hva);
> +	if (map_vma == NULL) {
> +		result = -ENOENT;
> +		kvm_err("kvmi: no local VMA found for unmapping\n");
> +		goto out_err;
> +	}
> +
> +	/* find (not get) page mapped to destination address */
> +	req_page = follow_page(map_vma, map_hva, 0);
> +	if (IS_ERR_VALUE(req_page)) {
> +		result = PTR_ERR(req_page);
> +		kvm_err("kvmi: follow_page() failed with result %d\n", result);
> +		goto out_err;
> +	} else if (req_page == NULL) {
> +		result = -ENOENT;
> +		kvm_err("kvmi: follow_page() returned no page\n");
> +		goto out_err;
> +	}
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(req_page, "req_page before decoupling");
> +
> +	/* Returns NULL when no page can be allocated. */
> +	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, map_vma, map_hva);
> +	if (new_page == NULL) {
> +		result = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(new_page, "new_page after allocation");
> +
> +	/* should fix the rmap tree */
> +	result = restore_rmap(map_vma, map_hva, req_page, new_page);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out_err;
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(req_page, "req_page after decoupling");
> +
> +	/* page table fixing here */
> +	result = host_unmap_fix_ptes(map_vma, map_hva, new_page);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out_err;
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(new_page, "new_page after unmapping");
> +
> +	goto out_finalize;
> +
> +out_err:
> +	if (new_page != NULL)
> +		put_page(new_page);
> +
> +out_finalize:
> +	/* reference count was inc during get_user_pages_remote() */
> +	if (req_page != NULL) {
> +		put_page(req_page);
> +
> +		if (IS_ENABLED(CONFIG_DEBUG_VM))
> +			dump_page(req_page, "req_page after release");
> +	}
> +
> +	/* release semaphores in reverse order */
> +	up_write(&map_mm->mmap_sem);
> +	up_write(&req_mm->mmap_sem);
> +
> +	return result;
> +}
> +
> +int kvmi_host_mem_unmap(struct kvm_vcpu *vcpu, gpa_t map_gpa)
> +{
> +	struct kvm *target_kvm;
> +	struct mm_struct *req_mm;
> +
> +	struct host_map *map;
> +	int result;
> +
> +	gfn_t map_gfn;
> +	hva_t map_hva;
> +	struct mm_struct *map_mm = vcpu->kvm->mm;
> +
> +	kvm_debug("kvmi: unmap request for map_gpa %016llx\n", map_gpa);
> +
> +	/* get the struct kvm * corresponding to map_gpa */
> +	map = extract_from_list(map_gpa);
> +	if (map == NULL) {
> +		kvm_err("kvmi: map_gpa %016llx not mapped\n", map_gpa);
> +		return -ENOENT;
> +	}
> +	target_kvm = map->machine;
> +	kvm_get_kvm(target_kvm);
> +	req_mm = target_kvm->mm;
> +
> +	kvm_debug("kvmi: req_gpa %016llx of machine %016lx mapped in map_gpa %016llx\n",
> +		  map->req_gpa, (unsigned long) map->machine, map->map_gpa);
> +
> +	/* address where we did the remapping */
> +	map_gfn = gpa_to_gfn(map_gpa);
> +	map_hva = gfn_to_hva_safe(vcpu->kvm, map_gfn);
> +	if (kvm_is_error_hva(map_hva)) {
> +		result = -EFAULT;
> +		kvm_err("kvmi: invalid HVA %016lx\n", map_hva);
> +		goto out;
> +	}
> +
> +	kvm_debug("kvmi: map_gpa %016llx, map_gfn %016llx, map_hva %016lx\n",
> +		  map_gpa, map_gfn, map_hva);
> +
> +	/* go to step 2 */
> +	result = kvmi_unmap_action(req_mm, map_mm, map_hva);
> +	if (IS_ERR_VALUE((long)result))
> +		goto out;
> +
> +	kvm_debug("kvmi: unmap of map_gpa %016llx successful\n", map_gpa);
> +
> +out:
> +	kvm_put_kvm(target_kvm);
> +
> +	/* remove entry whatever happens above */
> +	remove_entry(map);
> +
> +	return result;
> +}
> +
> +void kvmi_mem_destroy_vm(struct kvm *kvm)
> +{
> +	kvm_debug("kvmi: machine %016lx was torn down\n",
> +		(unsigned long) kvm);
> +
> +	remove_vm_from_list(kvm);
> +	remove_vm_token(kvm);
> +}
> +
> +
> +int kvm_intro_host_init(void)
> +{
> +	/* token database */
> +	INIT_LIST_HEAD(&token_list);
> +	spin_lock_init(&token_lock);
> +
> +	/* mapping database */
> +	INIT_LIST_HEAD(&mapping_list);
> +	spin_lock_init(&mapping_lock);
> +
> +	kvm_info("kvmi: initialized host memory introspection\n");
> +
> +	return 0;
> +}
> +
> +void kvm_intro_host_exit(void)
> +{
> +	// ...
> +}
> +
> +module_init(kvm_intro_host_init)
> +module_exit(kvm_intro_host_exit)
> diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
> new file mode 100644
> index 000000000000..b1b20eb6332d
> --- /dev/null
> +++ b/virt/kvm/kvmi_msg.c
> @@ -0,0 +1,1134 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KVM introspection
> + *
> + * Copyright (C) 2017 Bitdefender S.R.L.
> + *
> + */
> +#include <linux/file.h>
> +#include <linux/net.h>
> +#include <linux/kvm_host.h>
> +#include <linux/kvmi.h>
> +#include <asm/virtext.h>
> +
> +#include <uapi/linux/kvmi.h>
> +#include <uapi/asm/kvmi.h>
> +
> +#include "kvmi_int.h"
> +
> +#include <trace/events/kvmi.h>
> +
> +/*
> + * TODO: break these call paths
> + *   kvmi.c        work_cb
> + *   kvmi_msg.c    kvmi_dispatch_message
> + *   kvmi.c        kvmi_cmd_... / kvmi_make_request
> + *   kvmi_msg.c    kvmi_msg_reply
> + *
> + *   kvmi.c        kvmi_X_event
> + *   kvmi_msg.c    kvmi_send_event
> + *   kvmi.c        kvmi_handle_request
> + */
> +
> +/* TODO: move some of the code to arch/x86 */
> +
> +static atomic_t seq_ev = ATOMIC_INIT(0);
> +
> +static u32 new_seq(void)
> +{
> +	return atomic_inc_return(&seq_ev);
> +}
> +
> +static const char *event_str(unsigned int e)
> +{
> +	switch (e) {
> +	case KVMI_EVENT_CR:
> +		return "CR";
> +	case KVMI_EVENT_MSR:
> +		return "MSR";
> +	case KVMI_EVENT_XSETBV:
> +		return "XSETBV";
> +	case KVMI_EVENT_BREAKPOINT:
> +		return "BREAKPOINT";
> +	case KVMI_EVENT_HYPERCALL:
> +		return "HYPERCALL";
> +	case KVMI_EVENT_PAGE_FAULT:
> +		return "PAGE_FAULT";
> +	case KVMI_EVENT_TRAP:
> +		return "TRAP";
> +	case KVMI_EVENT_DESCRIPTOR:
> +		return "DESCRIPTOR";
> +	case KVMI_EVENT_CREATE_VCPU:
> +		return "CREATE_VCPU";
> +	case KVMI_EVENT_PAUSE_VCPU:
> +		return "PAUSE_VCPU";
> +	default:
> +		return "EVENT?";
> +	}
> +}
> +
> +static const char * const msg_IDs[] = {
> +	[KVMI_GET_VERSION]      = "KVMI_GET_VERSION",
> +	[KVMI_GET_GUEST_INFO]   = "KVMI_GET_GUEST_INFO",
> +	[KVMI_PAUSE_VCPU]       = "KVMI_PAUSE_VCPU",
> +	[KVMI_GET_REGISTERS]    = "KVMI_GET_REGISTERS",
> +	[KVMI_SET_REGISTERS]    = "KVMI_SET_REGISTERS",
> +	[KVMI_GET_PAGE_ACCESS]  = "KVMI_GET_PAGE_ACCESS",
> +	[KVMI_SET_PAGE_ACCESS]  = "KVMI_SET_PAGE_ACCESS",
> +	[KVMI_INJECT_EXCEPTION] = "KVMI_INJECT_EXCEPTION",
> +	[KVMI_READ_PHYSICAL]    = "KVMI_READ_PHYSICAL",
> +	[KVMI_WRITE_PHYSICAL]   = "KVMI_WRITE_PHYSICAL",
> +	[KVMI_GET_MAP_TOKEN]    = "KVMI_GET_MAP_TOKEN",
> +	[KVMI_CONTROL_EVENTS]   = "KVMI_CONTROL_EVENTS",
> +	[KVMI_CONTROL_CR]       = "KVMI_CONTROL_CR",
> +	[KVMI_CONTROL_MSR]      = "KVMI_CONTROL_MSR",
> +	[KVMI_EVENT]            = "KVMI_EVENT",
> +	[KVMI_EVENT_REPLY]      = "KVMI_EVENT_REPLY",
> +	[KVMI_GET_CPUID]        = "KVMI_GET_CPUID",
> +	[KVMI_GET_XSAVE]        = "KVMI_GET_XSAVE",
> +};
> +
> +static size_t sizeof_get_registers(const void *r)
> +{
> +	const struct kvmi_get_registers *req = r;
> +
> +	return sizeof(*req) + sizeof(req->msrs_idx[0]) * req->nmsrs;
> +}
> +
> +static size_t sizeof_get_page_access(const void *r)
> +{
> +	const struct kvmi_get_page_access *req = r;
> +
> +	return sizeof(*req) + sizeof(req->gpa[0]) * req->count;
> +}
> +
> +static size_t sizeof_set_page_access(const void *r)
> +{
> +	const struct kvmi_set_page_access *req = r;
> +
> +	return sizeof(*req) + sizeof(req->entries[0]) * req->count;
> +}
> +
> +static size_t sizeof_write_physical(const void *r)
> +{
> +	const struct kvmi_write_physical *req = r;
> +
> +	return sizeof(*req) + req->size;
> +}
> +
> +static const struct {
> +	size_t size;
> +	size_t (*cbk_full_size)(const void *msg);
> +} msg_bytes[] = {
> +	[KVMI_GET_VERSION]      = { 0, NULL },
> +	[KVMI_GET_GUEST_INFO]   = { sizeof(struct kvmi_get_guest_info), NULL },
> +	[KVMI_PAUSE_VCPU]       = { sizeof(struct kvmi_pause_vcpu), NULL },
> +	[KVMI_GET_REGISTERS]    = { sizeof(struct kvmi_get_registers),
> +						sizeof_get_registers },
> +	[KVMI_SET_REGISTERS]    = { sizeof(struct kvmi_set_registers), NULL },
> +	[KVMI_GET_PAGE_ACCESS]  = { sizeof(struct kvmi_get_page_access),
> +						sizeof_get_page_access },
> +	[KVMI_SET_PAGE_ACCESS]  = { sizeof(struct kvmi_set_page_access),
> +						sizeof_set_page_access },
> +	[KVMI_INJECT_EXCEPTION] = { sizeof(struct kvmi_inject_exception),
> +					NULL },
> +	[KVMI_READ_PHYSICAL]    = { sizeof(struct kvmi_read_physical), NULL },
> +	[KVMI_WRITE_PHYSICAL]   = { sizeof(struct kvmi_write_physical),
> +						sizeof_write_physical },
> +	[KVMI_GET_MAP_TOKEN]    = { 0, NULL },
> +	[KVMI_CONTROL_EVENTS]   = { sizeof(struct kvmi_control_events), NULL },
> +	[KVMI_CONTROL_CR]       = { sizeof(struct kvmi_control_cr), NULL },
> +	[KVMI_CONTROL_MSR]      = { sizeof(struct kvmi_control_msr), NULL },
> +	[KVMI_GET_CPUID]        = { sizeof(struct kvmi_get_cpuid), NULL },
> +	[KVMI_GET_XSAVE]        = { sizeof(struct kvmi_get_xsave), NULL },
> +};
> +
> +static int kvmi_sock_read(struct kvmi *ikvm, void *buf, size_t size)
> +{
> +	struct kvec i = {
> +		.iov_base = buf,
> +		.iov_len = size,
> +	};
> +	struct msghdr m = { };
> +	int rc;
> +
> +	read_lock(&ikvm->sock_lock);
> +
> +	if (likely(ikvm->sock))
> +		rc = kernel_recvmsg(ikvm->sock, &m, &i, 1, size, MSG_WAITALL);
> +	else
> +		rc = -EPIPE;
> +
> +	if (rc > 0)
> +		print_hex_dump_debug("read: ", DUMP_PREFIX_NONE, 32, 1,
> +					buf, rc, false);
> +
> +	read_unlock(&ikvm->sock_lock);
> +
> +	if (unlikely(rc != size)) {
> +		kvm_err("kernel_recvmsg: %d\n", rc);
> +		if (rc >= 0)
> +			rc = -EPIPE;
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +
> +static int kvmi_sock_write(struct kvmi *ikvm, struct kvec *i, size_t n,
> +			   size_t size)
> +{
> +	struct msghdr m = { };
> +	int rc, k;
> +
> +	read_lock(&ikvm->sock_lock);
> +
> +	if (likely(ikvm->sock))
> +		rc = kernel_sendmsg(ikvm->sock, &m, i, n, size);
> +	else
> +		rc = -EPIPE;
> +
> +	for (k = 0; k < n; k++)
> +		print_hex_dump_debug("write: ", DUMP_PREFIX_NONE, 32, 1,
> +				     i[k].iov_base, i[k].iov_len, false);
> +
> +	read_unlock(&ikvm->sock_lock);
> +
> +	if (unlikely(rc != size)) {
> +		kvm_err("kernel_sendmsg: %d\n", rc);
> +		if (rc >= 0)
> +			rc = -EPIPE;
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +
> +static const char *id2str(int i)
> +{
> +	return (i < ARRAY_SIZE(msg_IDs) && msg_IDs[i] ? msg_IDs[i] : "unknown");
> +}
> +
> +static struct kvmi_vcpu *kvmi_vcpu_waiting_for_reply(struct kvm *kvm, u32 seq)
> +{
> +	struct kvmi_vcpu *found = NULL;
> +	struct kvm_vcpu *vcpu;
> +	int i;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		/* kvmi_send_event */
> +		smp_rmb();
> +		if (READ_ONCE(IVCPU(vcpu)->ev_rpl_waiting)
> +		    && seq == IVCPU(vcpu)->ev_seq) {
> +			found = IVCPU(vcpu);
> +			break;
> +		}
> +	}
> +
> +	mutex_unlock(&kvm->lock);
> +
> +	return found;
> +}
> +
> +static bool kvmi_msg_dispatch_reply(struct kvmi *ikvm,
> +				    const struct kvmi_msg_hdr *msg)
> +{
> +	struct kvmi_vcpu *ivcpu;
> +	int err;
> +
> +	ivcpu = kvmi_vcpu_waiting_for_reply(ikvm->kvm, msg->seq);
> +	if (!ivcpu) {
> +		kvm_err("%s: unexpected event reply (seq=%u)\n", __func__,
> +			msg->seq);
> +		return false;
> +	}
> +
> +	if (msg->size == sizeof(ivcpu->ev_rpl) + ivcpu->ev_rpl_size) {
> +		err = kvmi_sock_read(ikvm, &ivcpu->ev_rpl,
> +					sizeof(ivcpu->ev_rpl));
> +		if (!err && ivcpu->ev_rpl_size)
> +			err = kvmi_sock_read(ikvm, ivcpu->ev_rpl_ptr,
> +						ivcpu->ev_rpl_size);
> +	} else {
> +		kvm_err("%s: invalid event reply size (max=%zu, recv=%u, expected=%zu)\n",
> +			__func__, ivcpu->ev_rpl_size, msg->size,
> +			sizeof(ivcpu->ev_rpl) + ivcpu->ev_rpl_size);
> +		err = -1;
> +	}
> +
> +	ivcpu->ev_rpl_received = err ? -1 : ivcpu->ev_rpl_size;
> +
> +	kvmi_make_request(ivcpu, REQ_REPLY);
> +
> +	return (err == 0);
> +}
> +
> +static bool consume_sock_bytes(struct kvmi *ikvm, size_t n)
> +{
> +	while (n) {
> +		u8 buf[256];
> +		size_t chunk = min(n, sizeof(buf));
> +
> +		if (kvmi_sock_read(ikvm, buf, chunk) != 0)
> +			return false;
> +
> +		n -= chunk;
> +	}
> +
> +	return true;
> +}
> +
> +static int kvmi_msg_reply(struct kvmi *ikvm,
> +			  const struct kvmi_msg_hdr *msg,
> +			  int err, const void *rpl, size_t rpl_size)
> +{
> +	struct kvmi_error_code ec;
> +	struct kvmi_msg_hdr h;
> +	struct kvec vec[3] = {
> +		{.iov_base = &h,           .iov_len = sizeof(h) },
> +		{.iov_base = &ec,          .iov_len = sizeof(ec)},
> +		{.iov_base = (void *) rpl, .iov_len = rpl_size  },
> +	};
> +	size_t size = sizeof(h) + sizeof(ec) + (err ? 0 : rpl_size);
> +	size_t n = err ? ARRAY_SIZE(vec)-1 : ARRAY_SIZE(vec);
> +
> +	memset(&h, 0, sizeof(h));
> +	h.id = msg->id;
> +	h.seq = msg->seq;
> +	h.size = size - sizeof(h);
> +
> +	memset(&ec, 0, sizeof(ec));
> +	ec.err = err;
> +
> +	return kvmi_sock_write(ikvm, vec, n, size);
> +}
> +
> +static int kvmi_msg_vcpu_reply(struct kvm_vcpu *vcpu,
> +				const struct kvmi_msg_hdr *msg,
> +				int err, const void *rpl, size_t size)
> +{
> +	/*
> +	 * As soon as we reply to this vCPU command, we can get another one,
> +	 * and we must signal that the incoming buffer (ivcpu->msg_buf)
> +	 * is ready by clearing this bit/request.
> +	 */
> +	kvmi_clear_request(IVCPU(vcpu), REQ_CMD);
> +
> +	return kvmi_msg_reply(IKVM(vcpu->kvm), msg, err, rpl, size);
> +}
> +
> +bool kvmi_msg_init(struct kvmi *ikvm, int fd)
> +{
> +	struct socket *sock;
> +	int r;
> +
> +	sock = sockfd_lookup(fd, &r);
> +
> +	if (!sock) {
> +		kvm_err("Invalid file handle: %d\n", fd);
> +		return false;
> +	}
> +
> +	WRITE_ONCE(ikvm->sock, sock);
> +
> +	return true;
> +}
> +
> +void kvmi_msg_uninit(struct kvmi *ikvm)
> +{
> +	kvm_info("Wake up the receiving thread\n");
> +
> +	read_lock(&ikvm->sock_lock);
> +
> +	if (ikvm->sock)
> +		kernel_sock_shutdown(ikvm->sock, SHUT_RDWR);
> +
> +	read_unlock(&ikvm->sock_lock);
> +
> +	kvm_info("Wait for the receiving thread to complete\n");
> +	wait_for_completion(&ikvm->finished);
> +}
> +
> +static int handle_get_version(struct kvmi *ikvm,
> +			      const struct kvmi_msg_hdr *msg, const void *req)
> +{
> +	struct kvmi_get_version_reply rpl;
> +
> +	memset(&rpl, 0, sizeof(rpl));
> +	rpl.version = KVMI_VERSION;
> +
> +	return kvmi_msg_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
> +}
> +
> +static struct kvm_vcpu *kvmi_get_vcpu(struct kvmi *ikvm, int vcpu_id)
> +{
> +	struct kvm *kvm = ikvm->kvm;
> +
> +	if (vcpu_id >= atomic_read(&kvm->online_vcpus))
> +		return NULL;
> +
> +	return kvm_get_vcpu(kvm, vcpu_id);
> +}
> +
> +static bool invalid_page_access(u64 gpa, u64 size)
> +{
> +	u64 off = gpa & ~PAGE_MASK;
> +
> +	return (size == 0 || size > PAGE_SIZE || off + size > PAGE_SIZE);
> +}
> +
> +static int handle_read_physical(struct kvmi *ikvm,
> +				const struct kvmi_msg_hdr *msg,
> +				const void *_req)
> +{
> +	const struct kvmi_read_physical *req = _req;
> +
> +	if (invalid_page_access(req->gpa, req->size))
> +		return -EINVAL;
> +
> +	return kvmi_cmd_read_physical(ikvm->kvm, req->gpa, req->size,
> +				      kvmi_msg_reply, msg);
> +}
> +
> +static int handle_write_physical(struct kvmi *ikvm,
> +				 const struct kvmi_msg_hdr *msg,
> +				 const void *_req)
> +{
> +	const struct kvmi_write_physical *req = _req;
> +	int ec;
> +
> +	if (invalid_page_access(req->gpa, req->size))
> +		return -EINVAL;
> +
> +	ec = kvmi_cmd_write_physical(ikvm->kvm, req->gpa, req->size, req->data);
> +
> +	return kvmi_msg_reply(ikvm, msg, ec, NULL, 0);
> +}
> +
> +static int handle_get_map_token(struct kvmi *ikvm,
> +				const struct kvmi_msg_hdr *msg,
> +				const void *_req)
> +{
> +	struct kvmi_get_map_token_reply rpl;
> +	int ec;
> +
> +	ec = kvmi_cmd_alloc_token(ikvm->kvm, &rpl.token);
> +
> +	return kvmi_msg_reply(ikvm, msg, ec, &rpl, sizeof(rpl));
> +}
> +
> +static int handle_control_cr(struct kvmi *ikvm,
> +			     const struct kvmi_msg_hdr *msg, const void *_req)
> +{
> +	const struct kvmi_control_cr *req = _req;
> +	int ec;
> +
> +	ec = kvmi_cmd_control_cr(ikvm, req->enable, req->cr);
> +
> +	return kvmi_msg_reply(ikvm, msg, ec, NULL, 0);
> +}
> +
> +static int handle_control_msr(struct kvmi *ikvm,
> +			      const struct kvmi_msg_hdr *msg, const void *_req)
> +{
> +	const struct kvmi_control_msr *req = _req;
> +	int ec;
> +
> +	ec = kvmi_cmd_control_msr(ikvm->kvm, req->enable, req->msr);
> +
> +	return kvmi_msg_reply(ikvm, msg, ec, NULL, 0);
> +}
> +
> +/*
> + * These commands are executed on the receiving thread/worker.
> + */
> +static int (*const msg_vm[])(struct kvmi *, const struct kvmi_msg_hdr *,
> +			     const void *) = {
> +	[KVMI_GET_VERSION]    = handle_get_version,
> +	[KVMI_READ_PHYSICAL]  = handle_read_physical,
> +	[KVMI_WRITE_PHYSICAL] = handle_write_physical,
> +	[KVMI_GET_MAP_TOKEN]  = handle_get_map_token,
> +	[KVMI_CONTROL_CR]     = handle_control_cr,
> +	[KVMI_CONTROL_MSR]    = handle_control_msr,
> +};
> +
> +static int handle_get_guest_info(struct kvm_vcpu *vcpu,
> +				 const struct kvmi_msg_hdr *msg,
> +				 const void *req)
> +{
> +	struct kvmi_get_guest_info_reply rpl;
> +
> +	memset(&rpl, 0, sizeof(rpl));
> +	kvmi_cmd_get_guest_info(vcpu, &rpl.vcpu_count, &rpl.tsc_speed);
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, 0, &rpl, sizeof(rpl));
> +}
> +
> +static int handle_pause_vcpu(struct kvm_vcpu *vcpu,
> +			     const struct kvmi_msg_hdr *msg,
> +			     const void *req)
> +{
> +	int ec = kvmi_cmd_pause_vcpu(vcpu);
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
> +}
> +
> +static void *alloc_get_registers_reply(const struct kvmi_msg_hdr *msg,
> +				       const struct kvmi_get_registers *req,
> +				       size_t *rpl_size)
> +{
> +	struct kvmi_get_registers_reply *rpl;
> +	u16 k, n = req->nmsrs;
> +
> +	*rpl_size = sizeof(*rpl) + sizeof(rpl->msrs.entries[0]) * n;
> +
> +	rpl = kzalloc(*rpl_size, GFP_KERNEL);
> +
> +	if (rpl) {
> +		rpl->msrs.nmsrs = n;
> +
> +		for (k = 0; k < n; k++)
> +			rpl->msrs.entries[k].index = req->msrs_idx[k];
> +	}
> +
> +	return rpl;
> +}
> +
> +static int handle_get_registers(struct kvm_vcpu *vcpu,
> +				const struct kvmi_msg_hdr *msg, const void *req)
> +{
> +	struct kvmi_get_registers_reply *rpl;
> +	size_t rpl_size = 0;
> +	int err, ec;
> +
> +	rpl = alloc_get_registers_reply(msg, req, &rpl_size);
> +
> +	if (!rpl)
> +		ec = -KVM_ENOMEM;
> +	else
> +		ec = kvmi_cmd_get_registers(vcpu, &rpl->mode,
> +						&rpl->regs, &rpl->sregs,
> +						&rpl->msrs);
> +
> +	err = kvmi_msg_vcpu_reply(vcpu, msg, ec, rpl, rpl_size);
> +	kfree(rpl);
> +	return err;
> +}
> +
> +static int handle_set_registers(struct kvm_vcpu *vcpu,
> +				const struct kvmi_msg_hdr *msg,
> +				const void *_req)
> +{
> +	const struct kvmi_set_registers *req = _req;
> +	int ec;
> +
> +	ec = kvmi_cmd_set_registers(vcpu, &req->regs);
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
> +}
> +
> +static int handle_get_page_access(struct kvm_vcpu *vcpu,
> +				  const struct kvmi_msg_hdr *msg,
> +				  const void *_req)
> +{
> +	const struct kvmi_get_page_access *req = _req;
> +	struct kvmi_get_page_access_reply *rpl = NULL;
> +	size_t rpl_size = 0;
> +	u16 k, n = req->count;
> +	int err, ec = 0;
> +
> +	if (req->view != 0 && !kvm_eptp_switching_supported) {
> +		ec = -KVM_ENOSYS;
> +		goto out;
> +	}
> +
> +	if (req->view != 0) { /* TODO */
> +		ec = -KVM_EINVAL;
> +		goto out;
> +	}
> +
> +	rpl_size = sizeof(*rpl) + sizeof(rpl->access[0]) * n;
> +	rpl = kzalloc(rpl_size, GFP_KERNEL);
> +
> +	if (!rpl) {
> +		ec = -KVM_ENOMEM;
> +		goto out;
> +	}
> +
> +	for (k = 0; k < n && ec == 0; k++)
> +		ec = kvmi_cmd_get_page_access(vcpu, req->gpa[k],
> +						&rpl->access[k]);
> +
> +out:
> +	err = kvmi_msg_vcpu_reply(vcpu, msg, ec, rpl, rpl_size);
> +	kfree(rpl);
> +	return err;
> +}
> +
> +static int handle_set_page_access(struct kvm_vcpu *vcpu,
> +				  const struct kvmi_msg_hdr *msg,
> +				  const void *_req)
> +{
> +	const struct kvmi_set_page_access *req = _req;
> +	struct kvm *kvm = vcpu->kvm;
> +	u16 k, n = req->count;
> +	int ec = 0;
> +
> +	if (req->view != 0) {
> +		if (!kvm_eptp_switching_supported)
> +			ec = -KVM_ENOSYS;
> +		else
> +			ec = -KVM_EINVAL; /* TODO */
> +	} else {
> +		for (k = 0; k < n; k++) {
> +			u64 gpa   = req->entries[k].gpa;
> +			u8 access = req->entries[k].access;
> +			int ec0;
> +
> +			if (access &  ~(KVMI_PAGE_ACCESS_R |
> +					KVMI_PAGE_ACCESS_W |
> +					KVMI_PAGE_ACCESS_X))
> +				ec0 = -KVM_EINVAL;
> +			else
> +				ec0 = kvmi_set_mem_access(kvm, gpa, access);
> +
> +			if (ec0 && !ec)
> +				ec = ec0;
> +
> +			trace_kvmi_set_mem_access(gpa_to_gfn(gpa), access, ec0);
> +		}
> +	}
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
> +}
> +
> +static int handle_inject_exception(struct kvm_vcpu *vcpu,
> +				   const struct kvmi_msg_hdr *msg,
> +				   const void *_req)
> +{
> +	const struct kvmi_inject_exception *req = _req;
> +	int ec;
> +
> +	ec = kvmi_cmd_inject_exception(vcpu, req->nr, req->has_error,
> +				       req->error_code, req->address);
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
> +}
> +
> +static int handle_control_events(struct kvm_vcpu *vcpu,
> +				 const struct kvmi_msg_hdr *msg,
> +				 const void *_req)
> +{
> +	const struct kvmi_control_events *req = _req;
> +	u32 not_allowed = ~IKVM(vcpu->kvm)->event_allow_mask;
> +	u32 unknown = ~KVMI_KNOWN_EVENTS;
> +	int ec;
> +
> +	if (req->events & unknown)
> +		ec = -KVM_EINVAL;
> +	else if (req->events & not_allowed)
> +		ec = -KVM_EPERM;
> +	else
> +		ec = kvmi_cmd_control_events(vcpu, req->events);
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, ec, NULL, 0);
> +}
> +
> +static int handle_get_cpuid(struct kvm_vcpu *vcpu,
> +			    const struct kvmi_msg_hdr *msg,
> +			    const void *_req)
> +{
> +	const struct kvmi_get_cpuid *req = _req;
> +	struct kvmi_get_cpuid_reply rpl;
> +	int ec;
> +
> +	memset(&rpl, 0, sizeof(rpl));
> +
> +	ec = kvmi_cmd_get_cpuid(vcpu, req->function, req->index,
> +					&rpl.eax, &rpl.ebx, &rpl.ecx,
> +					&rpl.edx);
> +
> +	return kvmi_msg_vcpu_reply(vcpu, msg, ec, &rpl, sizeof(rpl));
> +}
> +
> +static int handle_get_xsave(struct kvm_vcpu *vcpu,
> +			    const struct kvmi_msg_hdr *msg, const void *req)
> +{
> +	struct kvmi_get_xsave_reply *rpl;
> +	size_t rpl_size = sizeof(*rpl) + sizeof(struct kvm_xsave);
> +	int ec = 0, err;
> +
> +	rpl = kzalloc(rpl_size, GFP_KERNEL);
> +
> +	if (!rpl)
> +		ec = -KVM_ENOMEM;

Again, because the else block has braces, the if should too.

> +	else {
> +		struct kvm_xsave *area;
> +
> +		area = (struct kvm_xsave *)&rpl->region[0];
> +		kvm_vcpu_ioctl_x86_get_xsave(vcpu, area);
> +	}
> +
> +	err = kvmi_msg_vcpu_reply(vcpu, msg, ec, rpl, rpl_size);
> +	kfree(rpl);
> +	return err;
> +}
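
For reference, the braced form would be (sketch only, reusing the names
from the patch):

	if (!rpl) {
		ec = -KVM_ENOMEM;
	} else {
		struct kvm_xsave *area;

		area = (struct kvm_xsave *)&rpl->region[0];
		kvm_vcpu_ioctl_x86_get_xsave(vcpu, area);
	}
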
> +
> +/*
> + * These commands are executed on the vCPU thread. The receiving thread
> + * saves the command into kvmi_vcpu.msg_buf[] and signals the vCPU to handle
> + * the command (including sending back the reply).
> + */
> +static int (*const msg_vcpu[])(struct kvm_vcpu *,
> +			       const struct kvmi_msg_hdr *, const void *) = {
> +	[KVMI_GET_GUEST_INFO]   = handle_get_guest_info,
> +	[KVMI_PAUSE_VCPU]       = handle_pause_vcpu,
> +	[KVMI_GET_REGISTERS]    = handle_get_registers,
> +	[KVMI_SET_REGISTERS]    = handle_set_registers,
> +	[KVMI_GET_PAGE_ACCESS]  = handle_get_page_access,
> +	[KVMI_SET_PAGE_ACCESS]  = handle_set_page_access,
> +	[KVMI_INJECT_EXCEPTION] = handle_inject_exception,
> +	[KVMI_CONTROL_EVENTS]   = handle_control_events,
> +	[KVMI_GET_CPUID]        = handle_get_cpuid,
> +	[KVMI_GET_XSAVE]        = handle_get_xsave,
> +};
> +
> +void kvmi_msg_handle_vcpu_cmd(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +	struct kvmi_msg_hdr *msg = (void *) ivcpu->msg_buf;
> +	u8 *req = ivcpu->msg_buf + sizeof(*msg);
> +	int err;
> +
> +	err = msg_vcpu[msg->id](vcpu, msg, req);
> +
> +	if (err)
> +		kvm_err("%s: id:%u (%s) err:%d\n", __func__, msg->id,
> +			id2str(msg->id), err);
> +
> +	/*
> +	 * No error code is returned.
> +	 *
> +	 * The introspector gets its error code from the message handler
> +	 * or the socket is closed (and QEMU should reconnect).
> +	 */
> +}
> +
> +static int kvmi_msg_recv_varlen(struct kvmi *ikvm, size_t(*cbk) (const void *),
> +				size_t min_n, size_t msg_size)
> +{
> +	size_t extra_n;
> +	u8 *extra_buf;
> +	int err;
> +
> +	if (min_n > msg_size) {
> +		kvm_err("%s: got %zu bytes instead of min %zu\n",
> +			__func__, msg_size, min_n);
> +		return -EINVAL;
> +	}
> +
> +	if (!min_n)
> +		return 0;
> +
> +	err = kvmi_sock_read(ikvm, ikvm->msg_buf, min_n);
> +
> +	extra_buf = ikvm->msg_buf + min_n;
> +	extra_n = msg_size - min_n;
> +
> +	if (!err && extra_n) {
> +		if (cbk(ikvm->msg_buf) == msg_size)
> +			err = kvmi_sock_read(ikvm, extra_buf, extra_n);
> +		else
> +			err = -EINVAL;
> +	}
> +
> +	return err;
> +}
> +
> +static int kvmi_msg_recv_n(struct kvmi *ikvm, size_t n, size_t msg_size)
> +{
> +	if (n != msg_size) {
> +		kvm_err("%s: got %zu bytes instead of %zu\n",
> +			__func__, msg_size, n);
> +		return -EINVAL;
> +	}
> +
> +	if (!n)
> +		return 0;
> +
> +	return kvmi_sock_read(ikvm, ikvm->msg_buf, n);
> +}
> +
> +static int kvmi_msg_recv(struct kvmi *ikvm, const struct kvmi_msg_hdr *msg)
> +{
> +	size_t (*cbk)(const void *) = msg_bytes[msg->id].cbk_full_size;
> +	size_t expected = msg_bytes[msg->id].size;
> +
> +	if (cbk)
> +		return kvmi_msg_recv_varlen(ikvm, cbk, expected, msg->size);
> +	else
> +		return kvmi_msg_recv_n(ikvm, expected, msg->size);
> +}
> +
> +struct vcpu_msg_hdr {
> +	__u16 vcpu;
> +	__u16 padding[3];
> +};
> +
> +static int kvmi_msg_queue_to_vcpu(struct kvmi *ikvm,
> +				  const struct kvmi_msg_hdr *msg)
> +{
> +	struct vcpu_msg_hdr *vcpu_hdr = (struct vcpu_msg_hdr *)ikvm->msg_buf;
> +	struct kvmi_vcpu *ivcpu;
> +	struct kvm_vcpu *vcpu;
> +
> +	if (msg->size < sizeof(*vcpu_hdr)) {
> +		kvm_err("%s: invalid vcpu message: %d\n", __func__, msg->size);
> +		return -EINVAL; /* ABI error */
> +	}
> +
> +	vcpu = kvmi_get_vcpu(ikvm, vcpu_hdr->vcpu);
> +
> +	if (!vcpu) {
> +		kvm_err("%s: invalid vcpu: %d\n", __func__, vcpu_hdr->vcpu);
> +		return kvmi_msg_reply(ikvm, msg, -KVM_EINVAL, NULL, 0);
> +	}
> +
> +	ivcpu = vcpu->kvmi;
> +
> +	if (!ivcpu) {
> +		kvm_err("%s: not introspected vcpu: %d\n",
> +			__func__, vcpu_hdr->vcpu);
> +		return kvmi_msg_reply(ikvm, msg, -KVM_EAGAIN, NULL, 0);
> +	}
> +
> +	if (test_bit(REQ_CMD, &ivcpu->requests)) {
> +		kvm_err("%s: vcpu is busy: %d\n", __func__, vcpu_hdr->vcpu);
> +		return kvmi_msg_reply(ikvm, msg, -KVM_EBUSY, NULL, 0);
> +	}
> +
> +	memcpy(ivcpu->msg_buf, msg, sizeof(*msg));
> +	memcpy(ivcpu->msg_buf + sizeof(*msg), ikvm->msg_buf, msg->size);
> +
> +	kvmi_make_request(ivcpu, REQ_CMD);
> +	kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
> +	kvm_vcpu_kick(vcpu);
> +
> +	return 0;
> +}
> +
> +static bool kvmi_msg_dispatch_cmd(struct kvmi *ikvm,
> +				  const struct kvmi_msg_hdr *msg)
> +{
> +	int err = kvmi_msg_recv(ikvm, msg);
> +
> +	if (err)
> +		goto out;
> +
> +	if (!KVMI_ALLOWED_COMMAND(msg->id, ikvm->cmd_allow_mask)) {
> +		err = kvmi_msg_reply(ikvm, msg, -KVM_EPERM, NULL, 0);
> +		goto out;
> +	}
> +
> +	if (msg_vcpu[msg->id])
> +		err = kvmi_msg_queue_to_vcpu(ikvm, msg);
> +	else
> +		err = msg_vm[msg->id](ikvm, msg, ikvm->msg_buf);
> +
> +out:
> +	if (err)
> +		kvm_err("%s: id:%u (%s) err:%d\n", __func__, msg->id,
> +			id2str(msg->id), err);
> +
> +	return (err == 0);
> +}
> +
> +static bool handle_unsupported_msg(struct kvmi *ikvm,
> +				   const struct kvmi_msg_hdr *msg)
> +{
> +	int err;
> +
> +	kvm_err("%s: %u\n", __func__, msg->id);
> +
> +	err = consume_sock_bytes(ikvm, msg->size);
> +
> +	if (!err)
> +		err = kvmi_msg_reply(ikvm, msg, -KVM_ENOSYS, NULL, 0);
> +
> +	return (err == 0);
> +}
> +
> +static bool kvmi_msg_dispatch(struct kvmi *ikvm)
> +{
> +	struct kvmi_msg_hdr msg;
> +	int err;
> +
> +	err = kvmi_sock_read(ikvm, &msg, sizeof(msg));
> +
> +	if (err) {
> +		kvm_err("%s: can't read\n", __func__);
> +		return false;
> +	}
> +
> +	trace_kvmi_msg_dispatch(msg.id, msg.size);
> +
> +	kvm_debug("%s: id:%u (%s) size:%u\n", __func__, msg.id,
> +		  id2str(msg.id), msg.size);
> +
> +	if (msg.id == KVMI_EVENT_REPLY)
> +		return kvmi_msg_dispatch_reply(ikvm, &msg);
> +
> +	if (msg.id >= ARRAY_SIZE(msg_bytes)
> +	    || (!msg_vm[msg.id] && !msg_vcpu[msg.id]))
> +		return handle_unsupported_msg(ikvm, &msg);
> +
> +	return kvmi_msg_dispatch_cmd(ikvm, &msg);
> +}
> +
> +static void kvmi_sock_close(struct kvmi *ikvm)
> +{
> +	kvm_info("%s\n", __func__);
> +
> +	write_lock(&ikvm->sock_lock);
> +
> +	if (ikvm->sock) {
> +		kvm_info("Release the socket\n");
> +		sockfd_put(ikvm->sock);
> +
> +		ikvm->sock = NULL;
> +	}
> +
> +	write_unlock(&ikvm->sock_lock);
> +}
> +
> +bool kvmi_msg_process(struct kvmi *ikvm)
> +{
> +	if (!kvmi_msg_dispatch(ikvm)) {
> +		kvmi_sock_close(ikvm);
> +		return false;
> +	}
> +	return true;
> +}
> +
> +static void kvmi_setup_event(struct kvm_vcpu *vcpu, struct kvmi_event *ev,
> +			     u32 ev_id)
> +{
> +	memset(ev, 0, sizeof(*ev));
> +	ev->vcpu = vcpu->vcpu_id;
> +	ev->event = ev_id;
> +	kvm_arch_vcpu_ioctl_get_regs(vcpu, &ev->regs);
> +	kvm_arch_vcpu_ioctl_get_sregs(vcpu, &ev->sregs);
> +	ev->mode = kvmi_vcpu_mode(vcpu, &ev->sregs);
> +	kvmi_get_msrs(vcpu, ev);
> +}
> +
> +static bool kvmi_send_event(struct kvm_vcpu *vcpu, u32 ev_id,
> +			    void *ev,  size_t ev_size,
> +			    void *rpl, size_t rpl_size)
> +{
> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +	struct kvmi_event common;
> +	struct kvmi_msg_hdr h;
> +	struct kvec vec[3] = {
> +		{.iov_base = &h,      .iov_len = sizeof(h)     },
> +		{.iov_base = &common, .iov_len = sizeof(common)},
> +		{.iov_base = ev,      .iov_len = ev_size       },
> +	};
> +	size_t msg_size = sizeof(h) + sizeof(common) + ev_size;
> +	size_t n = ev_size ? ARRAY_SIZE(vec) : ARRAY_SIZE(vec)-1;
> +
> +	memset(&h, 0, sizeof(h));
> +	h.id = KVMI_EVENT;
> +	h.seq = new_seq();
> +	h.size = msg_size - sizeof(h);
> +
> +	kvmi_setup_event(vcpu, &common, ev_id);
> +
> +	ivcpu->ev_rpl_ptr = rpl;
> +	ivcpu->ev_rpl_size = rpl_size;
> +	ivcpu->ev_seq = h.seq;
> +	ivcpu->ev_rpl_received = -1;
> +	WRITE_ONCE(ivcpu->ev_rpl_waiting, true);
> +	/* kvmi_vcpu_waiting_for_reply() */
> +	smp_wmb();
> +
> +	trace_kvmi_send_event(ev_id);
> +
> +	kvm_debug("%s: %-11s(seq:%u) size:%lu vcpu:%d\n",
> +		  __func__, event_str(ev_id), h.seq, ev_size, vcpu->vcpu_id);
> +
> +	if (kvmi_sock_write(IKVM(vcpu->kvm), vec, n, msg_size) == 0)
> +		kvmi_handle_request(vcpu);
> +
> +	kvm_debug("%s: reply for %-11s(seq:%u) size:%lu vcpu:%d\n",
> +		  __func__, event_str(ev_id), h.seq, rpl_size, vcpu->vcpu_id);
> +
> +	return (ivcpu->ev_rpl_received >= 0);
> +}
> +
> +u32 kvmi_msg_send_cr(struct kvm_vcpu *vcpu, u32 cr, u64 old_value,
> +		     u64 new_value, u64 *ret_value)
> +{
> +	struct kvmi_event_cr e;
> +	struct kvmi_event_cr_reply r;
> +
> +	memset(&e, 0, sizeof(e));
> +	e.cr = cr;
> +	e.old_value = old_value;
> +	e.new_value = new_value;
> +
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_CR, &e, sizeof(e),
> +				&r, sizeof(r))) {
> +		*ret_value = new_value;
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +	}
> +
> +	*ret_value = r.new_val;
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_msr(struct kvm_vcpu *vcpu, u32 msr, u64 old_value,
> +		      u64 new_value, u64 *ret_value)
> +{
> +	struct kvmi_event_msr e;
> +	struct kvmi_event_msr_reply r;
> +
> +	memset(&e, 0, sizeof(e));
> +	e.msr = msr;
> +	e.old_value = old_value;
> +	e.new_value = new_value;
> +
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_MSR, &e, sizeof(e),
> +				&r, sizeof(r))) {
> +		*ret_value = new_value;
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +	}
> +
> +	*ret_value = r.new_val;
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_xsetbv(struct kvm_vcpu *vcpu)
> +{
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_XSETBV, NULL, 0, NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_bp(struct kvm_vcpu *vcpu, u64 gpa)
> +{
> +	struct kvmi_event_breakpoint e;
> +
> +	memset(&e, 0, sizeof(e));
> +	e.gpa = gpa;
> +
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_BREAKPOINT,
> +				&e, sizeof(e), NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_hypercall(struct kvm_vcpu *vcpu)
> +{
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_HYPERCALL, NULL, 0, NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +bool kvmi_msg_send_pf(struct kvm_vcpu *vcpu, u64 gpa, u64 gva, u32 mode,
> +		      u32 *action, bool *trap_access, u8 *ctx_data,
> +		      u32 *ctx_size)
> +{
> +	u32 max_ctx_size = *ctx_size;
> +	struct kvmi_event_page_fault e;
> +	struct kvmi_event_page_fault_reply r;
> +
> +	memset(&e, 0, sizeof(e));
> +	e.gpa = gpa;
> +	e.gva = gva;
> +	e.mode = mode;
> +
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_PAGE_FAULT, &e, sizeof(e),
> +				&r, sizeof(r)))
> +		return false;
> +
> +	*action = IVCPU(vcpu)->ev_rpl.action;
> +	*trap_access = r.trap_access;
> +	*ctx_size = 0;
> +
> +	if (r.ctx_size <= max_ctx_size) {
> +		*ctx_size = min_t(u32, r.ctx_size, sizeof(r.ctx_data));
> +		if (*ctx_size)
> +			memcpy(ctx_data, r.ctx_data, *ctx_size);
> +	} else {
> +		kvm_err("%s: ctx_size (recv:%u max:%u)\n", __func__,
> +			r.ctx_size, *ctx_size);
> +		/*
> +		 * TODO: This is an ABI error.
> +		 * We should shutdown the socket?
> +		 */
> +	}
> +
> +	return true;
> +}
> +
> +u32 kvmi_msg_send_trap(struct kvm_vcpu *vcpu, u32 vector, u32 type,
> +		       u32 error_code, u64 cr2)
> +{
> +	struct kvmi_event_trap e;
> +
> +	memset(&e, 0, sizeof(e));
> +	e.vector = vector;
> +	e.type = type;
> +	e.error_code = error_code;
> +	e.cr2 = cr2;
> +
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_TRAP, &e, sizeof(e), NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_descriptor(struct kvm_vcpu *vcpu, u32 info,
> +			     u64 exit_qualification, u8 descriptor, u8 write)
> +{
> +	struct kvmi_event_descriptor e;
> +
> +	memset(&e, 0, sizeof(e));
> +	e.descriptor = descriptor;
> +	e.write = write;
> +
> +	if (cpu_has_vmx()) {
> +		e.arch.vmx.instr_info = info;
> +		e.arch.vmx.exit_qualification = exit_qualification;
> +	} else {
> +		e.arch.svm.exit_info = info;
> +	}
> +
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_DESCRIPTOR,
> +				&e, sizeof(e), NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_create_vcpu(struct kvm_vcpu *vcpu)
> +{
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_CREATE_VCPU, NULL, 0, NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> +
> +u32 kvmi_msg_send_pause_vcpu(struct kvm_vcpu *vcpu)
> +{
> +	if (!kvmi_send_event(vcpu, KVMI_EVENT_PAUSE_VCPU, NULL, 0, NULL, 0))
> +		return KVMI_EVENT_ACTION_CONTINUE;
> +
> +	return IVCPU(vcpu)->ev_rpl.action;
> +}
> 


Patrick


* Re: [RFC PATCH v4 05/18] kvm: x86: add kvm_arch_vcpu_set_regs()
  2017-12-21 21:39     ` Patrick Colp
@ 2017-12-22  9:29       ` alazar
  -1 siblings, 0 replies; 79+ messages in thread
From: alazar @ 2017-12-22  9:29 UTC (permalink / raw)
  To: Patrick Colp, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu

On Thu, 21 Dec 2017 16:39:02 -0500, Patrick Colp <patrick.colp@oracle.com> wrote:
> On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> > From: Adalbert Lazar <alazar@bitdefender.com>
> > 
> > This is a version of kvm_arch_vcpu_ioctl_set_regs() which does not touch
> > the exceptions vector.
> > 
> > Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> > ---
> >   arch/x86/kvm/x86.c       | 34 ++++++++++++++++++++++++++++++++++
> >   include/linux/kvm_host.h |  1 +
> >   2 files changed, 35 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index e1a3c2c6ec08..4b0c3692386d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -7389,6 +7389,40 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
> >   	return 0;
> >   }
> >   
> > +/*
> > + * Similar to kvm_arch_vcpu_ioctl_set_regs() but it does not reset
> > + * the exceptions
> > + */
> > +void kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
> > +{
> > +	vcpu->arch.emulate_regs_need_sync_from_vcpu = true;
> > +	vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
> > +
> > +	kvm_register_write(vcpu, VCPU_REGS_RAX, regs->rax);
> > +	kvm_register_write(vcpu, VCPU_REGS_RBX, regs->rbx);
> > +	kvm_register_write(vcpu, VCPU_REGS_RCX, regs->rcx);
> > +	kvm_register_write(vcpu, VCPU_REGS_RDX, regs->rdx);
> > +	kvm_register_write(vcpu, VCPU_REGS_RSI, regs->rsi);
> > +	kvm_register_write(vcpu, VCPU_REGS_RDI, regs->rdi);
> > +	kvm_register_write(vcpu, VCPU_REGS_RSP, regs->rsp);
> > +	kvm_register_write(vcpu, VCPU_REGS_RBP, regs->rbp);
> > +#ifdef CONFIG_X86_64
> > +	kvm_register_write(vcpu, VCPU_REGS_R8, regs->r8);
> > +	kvm_register_write(vcpu, VCPU_REGS_R9, regs->r9);
> > +	kvm_register_write(vcpu, VCPU_REGS_R10, regs->r10);
> > +	kvm_register_write(vcpu, VCPU_REGS_R11, regs->r11);
> > +	kvm_register_write(vcpu, VCPU_REGS_R12, regs->r12);
> > +	kvm_register_write(vcpu, VCPU_REGS_R13, regs->r13);
> > +	kvm_register_write(vcpu, VCPU_REGS_R14, regs->r14);
> > +	kvm_register_write(vcpu, VCPU_REGS_R15, regs->r15);
> > +#endif
> > +
> > +	kvm_rip_write(vcpu, regs->rip);
> > +	kvm_set_rflags(vcpu, regs->rflags);
> > +
> > +	kvm_make_request(KVM_REQ_EVENT, vcpu);
> > +}
> > +
> 
> kvm_arch_vcpu_ioctl_set_regs() returns an int (so that, for e.g., in ARM 
> it can return an error to indicate that the function is not 
> supported/implemented). Is there a reason this function shouldn't do the 
> same (is it only ever going to be implemented for x86)?
> 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 6bdd4b9f6611..68e4d756f5c9 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -767,6 +767,7 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
> >   
> >   int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
> >   int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
> > +void kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
> >   int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
> >   				  struct kvm_sregs *sregs);
> 
> 
> Patrick

Hi Patrick,

Thank you for taking the time to review these patches.

You're right. This function should return an error code, regardless of
when ARM support arrives.
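
Probably something along these lines (just a sketch; the x86 body stays
the same as in the patch):

/*
 * Same register writes as in the patch, but with an int return so
 * architectures that don't implement it can report an error (e.g.
 * -EINVAL), like kvm_arch_vcpu_ioctl_set_regs() does.
 */
int kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
{
	/* ... kvm_register_write() calls exactly as in the patch ... */
	kvm_rip_write(vcpu, regs->rip);
	kvm_set_rflags(vcpu, regs->rflags);

	kvm_make_request(KVM_REQ_EVENT, vcpu);

	return 0;
}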

Adalbert


* Re: [RFC PATCH v4 07/18] kvm: page track: add support for preread, prewrite and preexec
  2017-12-21 22:01     ` Patrick Colp
@ 2017-12-22 10:01       ` alazar
  -1 siblings, 0 replies; 79+ messages in thread
From: alazar @ 2017-12-22 10:01 UTC (permalink / raw)
  To: Patrick Colp, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu

On Thu, 21 Dec 2017 17:01:02 -0500, Patrick Colp <patrick.colp@oracle.com> wrote:
> On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> > From: Adalbert Lazar <alazar@bitdefender.com>
> > 
> > These callbacks return a boolean value. If false, the emulation should
> > stop and the instruction should be reexecuted in guest. The preread
> > callback can return the bytes needed by the read operation.
> > 
> > The kvm_page_track_create_memslot() was extended in order to track gfn-s
> > as soon as the memory slots are created.
> > 
> > +/*
> > + * Notify the node that an instruction is about to be executed.
> > + * Returning false doesn't stop the other nodes from being called,
> > + * but it will stop the emulation with ?!.
> 
> With what?
> 
> > +bool kvm_page_track_preexec(struct kvm_vcpu *vcpu, gpa_t gpa)
> > +{
> 
> Patrick

With X86EMUL_RETRY_INSTR, or some return value, depending on the context.

Currently, we call this function when the instruction is fetched, to
give the introspection tool more options. Depending on its policies,
the introspector could:
 - skip the instruction (and retry to guest)
 - remove the tracking for the "current" page (and retry to guest)
 - change the instruction (and continue the emulation)
 - do nothing but log (and continue the emulation)
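
On the fetch path this boils down to roughly the following (simplified
sketch; the exact return value depends on the caller):

	if (!kvm_page_track_preexec(vcpu, gpa))
		return X86EMUL_RETRY_INSTR;	/* reexecute in guest */

	/* otherwise keep emulating (X86EMUL_CONTINUE) */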

Thanks for spotting this,
Adalbert


* RE: [RFC PATCH v4 02/18] add memory map/unmap support for VM introspection on the guest side
  2017-12-21 21:17     ` Patrick Colp
  (?)
@ 2017-12-22 10:44     ` Mircea CIRJALIU-MELIU
  2017-12-22 14:30       ` Patrick Colp
  -1 siblings, 1 reply; 79+ messages in thread
From: Mircea CIRJALIU-MELIU @ 2017-12-22 10:44 UTC (permalink / raw)
  To: Patrick Colp, Adalber Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář, Mihai Donțu

Fixed all the issues raised in the original comments, plus some more.
Please see the inline answers.

-----Original Message-----
From: Patrick Colp [mailto:patrick.colp@oracle.com] 
Sent: Thursday, 21 December 2017 23:17
To: Adalber Lazăr <alazar@bitdefender.com>; kvm@vger.kernel.org
Cc: linux-mm@kvack.org; Paolo Bonzini <pbonzini@redhat.com>; Radim Krčmář <rkrcmar@redhat.com>; Xiao Guangrong <guangrong.xiao@linux.intel.com>; Mihai Donțu <mdontu@bitdefender.com>; Mircea CIRJALIU-MELIU <mcirjaliu@bitdefender.com>
Subject: Re: [RFC PATCH v4 02/18] add memory map/unmap support for VM introspection on the guest side

On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> From: Adalbert Lazar <alazar@bitdefender.com>
> 
> An introspection tool running in a dedicated VM can use the new device
> (/dev/kvmmem) to map memory from other introspected VM-s.
> 
> Two ioctl operations are supported:
>    - KVM_INTRO_MEM_MAP/struct kvmi_mem_map
>    - KVM_INTRO_MEM_UNMAP/unsigned long
> 
> In order to map an introspected gpa to the local gva, the process 
> using this device needs to obtain a token from the host KVMI subsystem 
> (see Documentation/virtual/kvm/kvmi.rst - KVMI_GET_MAP_TOKEN).
> 
> Both operations use hypercalls (KVM_HC_MEM_MAP, KVM_HC_MEM_UNMAP)
> to pass the requests to the host kernel/KVMi (see hypercalls.txt).
> 
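
Typical userspace usage of the device, for reference (illustrative
sketch only; the field names are the ones from this patch, error
handling omitted):

	int fd = open("/dev/kvmmem", O_RDWR);
	struct kvmi_mem_map map = {
		.token = token,		/* obtained via KVMI_GET_MAP_TOKEN */
		.gpa   = remote_gpa,	/* page in the introspected guest */
		.gva   = (__u64)buf,	/* page-aligned local buffer */
	};

	ioctl(fd, KVM_INTRO_MEM_MAP, &map);
	/* ... access the remote page through buf ... */
	ioctl(fd, KVM_INTRO_MEM_UNMAP, (unsigned long)buf);
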
> Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
> ---
>   arch/x86/Kconfig                  |   9 +
>   arch/x86/include/asm/kvmi_guest.h |  10 +
>   arch/x86/kernel/Makefile          |   1 +
>   arch/x86/kernel/kvmi_mem_guest.c  |  26 +++
>   virt/kvm/kvmi_mem_guest.c         | 379 ++++++++++++++++++++++++++++++++++++++
>   5 files changed, 425 insertions(+)
>   create mode 100644 arch/x86/include/asm/kvmi_guest.h
>   create mode 100644 arch/x86/kernel/kvmi_mem_guest.c
>   create mode 100644 virt/kvm/kvmi_mem_guest.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 
> 8eed3f94bfc7..6e2548f4d44c 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -782,6 +782,15 @@ config KVM_DEBUG_FS
>   	  Statistics are displayed in debugfs filesystem. Enabling this option
>   	  may incur significant overhead.
>   
> +config KVMI_MEM_GUEST
> +	bool "KVM Memory Introspection support on Guest"
> +	depends on KVM_GUEST
> +	default n
> +	---help---
> +	  This option enables functions and hypercalls for security applications
> +	  running in a separate VM to control the execution of other VM-s, query
> +	  the state of the vCPU-s (GPR-s, MSR-s etc.).
> +
>   config PARAVIRT_TIME_ACCOUNTING
>   	bool "Paravirtual steal time accounting"
>   	depends on PARAVIRT
> diff --git a/arch/x86/include/asm/kvmi_guest.h 
> b/arch/x86/include/asm/kvmi_guest.h
> new file mode 100644
> index 000000000000..c7ed53a938e0
> --- /dev/null
> +++ b/arch/x86/include/asm/kvmi_guest.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */ #ifndef __KVMI_GUEST_H__ 
> +#define __KVMI_GUEST_H__
> +
> +long kvmi_arch_map_hc(struct kvmi_map_mem_token *tknp,
> +	gpa_t req_gpa, gpa_t map_gpa);
> +long kvmi_arch_unmap_hc(gpa_t map_gpa);
> +
> +
> +#endif
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 
> 81bb565f4497..fdb54b65e46e 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -111,6 +111,7 @@ obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
>   obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
>   obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
>   obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
> +obj-$(CONFIG_KVMI_MEM_GUEST)	+= kvmi_mem_guest.o ../../../virt/kvm/kvmi_mem_guest.o
>   
>   obj-$(CONFIG_EISA)		+= eisa.o
>   obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
> diff --git a/arch/x86/kernel/kvmi_mem_guest.c 
> b/arch/x86/kernel/kvmi_mem_guest.c
> new file mode 100644
> index 000000000000..c4e2613f90f3
> --- /dev/null
> +++ b/arch/x86/kernel/kvmi_mem_guest.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KVM introspection guest implementation
> + *
> + * Copyright (C) 2017 Bitdefender S.R.L.
> + *
> + * Author:
> + *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> + */
> +
> +#include <uapi/linux/kvmi.h>
> +#include <uapi/linux/kvm_para.h>
> +#include <linux/kvm_types.h>
> +#include <asm/kvm_para.h>
> +
> +long kvmi_arch_map_hc(struct kvmi_map_mem_token *tknp,
> +		       gpa_t req_gpa, gpa_t map_gpa) {
> +	return kvm_hypercall3(KVM_HC_MEM_MAP, (unsigned long)tknp,
> +			      req_gpa, map_gpa);
> +}
> +
> +long kvmi_arch_unmap_hc(gpa_t map_gpa) {
> +	return kvm_hypercall1(KVM_HC_MEM_UNMAP, map_gpa); }
> diff --git a/virt/kvm/kvmi_mem_guest.c b/virt/kvm/kvmi_mem_guest.c new 
> file mode 100644 index 000000000000..118c22ca47c5
> --- /dev/null
> +++ b/virt/kvm/kvmi_mem_guest.c
> @@ -0,0 +1,379 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KVM introspection guest implementation
> + *
> + * Copyright (C) 2017 Bitdefender S.R.L.
> + *
> + * Author:
> + *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/miscdevice.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include <linux/types.h>
> +#include <linux/kvm_host.h>
> +#include <linux/kvm_para.h>
> +#include <linux/uaccess.h>
> +#include <linux/slab.h>
> +#include <linux/rmap.h>
> +#include <linux/sched.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <uapi/linux/kvmi.h>
> +#include <asm/kvmi_guest.h>
> +
> +#define ASSERT(exp) BUG_ON(!(exp))
> +
> +
> +static struct list_head file_list;
> +static spinlock_t file_lock;
> +
> +struct file_map {
> +	struct list_head file_list;
> +	struct file *file;
> +	struct list_head map_list;
> +	struct mutex lock;
> +	int active;	/* for tearing down */
> +};
> +
> +struct page_map {
> +	struct list_head map_list;
> +	__u64 gpa;
> +	unsigned long vaddr;
> +	unsigned long paddr;
> +};
> +
> +
> +static int kvm_dev_open(struct inode *inodep, struct file *filp) {
> +	struct file_map *fmp;
> +
> +	pr_debug("kvmi: file %016lx opened by process %s\n",
> +		 (unsigned long) filp, current->comm);
> +
> +	/* link the file 1:1 with such a structure */
> +	fmp = kmalloc(sizeof(struct file_map), GFP_KERNEL);

I think this is supposed to be "kmalloc(sizeof(*fmp), GFP_KERNEL)".

Fixed all the kmalloc()s in kvmi_mem* files.

> +	if (fmp == NULL)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&fmp->file_list);
> +	fmp->file = filp;
> +	filp->private_data = fmp;
> +	INIT_LIST_HEAD(&fmp->map_list);
> +	mutex_init(&fmp->lock);
> +	fmp->active = 1;
> +
> +	/* add the entry to the global list */
> +	spin_lock(&file_lock);
> +	list_add_tail(&fmp->file_list, &file_list);
> +	spin_unlock(&file_lock);
> +
> +	return 0;
> +}
> +
> +/* actually does the mapping of a page */ static long 
> +_do_mapping(struct kvmi_mem_map *map_req, struct page_map *pmap)

Here you have a "struct page_map" and call it pmap. However, in the rest of the code, whenever there's a "struct page_map" it's called pmp. It seems that it would be good to stay consistent with the naming, so perhaps rename it here to pmp as well?

Done.

> +{
> +	unsigned long paddr;
> +	struct vm_area_struct *vma = NULL;
> +	struct page *page;

Out of curiosity, why do you set "*vma = NULL" but not "*page = NULL"?

Leftover from older code. Removed.

> +	long result;
> +
> +	pr_debug("kvmi: mapping remote GPA %016llx into %016llx\n",
> +		 map_req->gpa, map_req->gva);
> +
> +	/* check access to memory location */
> +	if (!access_ok(VERIFY_READ, map_req->gva, PAGE_SIZE)) {
> +		pr_err("kvmi: invalid virtual address for mapping\n");
> +		return -EINVAL;
> +	}
> +
> +	down_read(&current->mm->mmap_sem);
> +
> +	/* find the page to be replaced */
> +	vma = find_vma(current->mm, map_req->gva);
> +	if (IS_ERR_OR_NULL(vma)) {
> +		result = PTR_ERR(vma);
> +		pr_err("kvmi: find_vma() failed with result %ld\n", result);
> +		goto out;
> +	}
> +
> +	page = follow_page(vma, map_req->gva, 0);
> +	if (IS_ERR_OR_NULL(page)) {
> +		result = PTR_ERR(page);
> +		pr_err("kvmi: follow_page() failed with result %ld\n", result);
> +		goto out;
> +	}
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_VM))
> +		dump_page(page, "page to map_req into");
> +
> +	WARN(is_zero_pfn(page_to_pfn(page)), "zero-page still mapped");
> +
> +	/* get the physical address and store it in page_map */
> +	paddr = page_to_phys(page);
> +	pr_debug("kvmi: page phys addr %016lx\n", paddr);
> +	pmap->paddr = paddr;
> +
> +	/* last thing to do is host mapping */
> +	result = kvmi_arch_map_hc(&map_req->token, map_req->gpa, paddr);
> +	if (IS_ERR_VALUE(result)) {
> +		pr_err("kvmi: HC failed with result %ld\n", result);
> +		goto out;
> +	}
> +
> +out:
> +	up_read(&current->mm->mmap_sem);
> +
> +	return result;
> +}
> +
> +/* actually does the unmapping of a page */ static long 
> +_do_unmapping(unsigned long paddr) {
> +	long result;
> +
> +	pr_debug("kvmi: unmapping request for phys addr %016lx\n", paddr);
> +
> +	/* local GPA uniquely identifies the mapping on the host */
> +	result = kvmi_arch_unmap_hc(paddr);
> +	if (IS_ERR_VALUE(result))
> +		pr_warn("kvmi: HC failed with result %ld\n", result);
> +
> +	return result;
> +}
> +
> +static long kvm_dev_ioctl_map(struct file_map *fmp, struct 
> +kvmi_mem_map *map) {
> +	struct page_map *pmp;
> +	long result = 0;

Out of curiosity again, why do you set "result = 0" here when it's always set before it's used (and, e.g., _do_unmapping() doesn't do "result = 0")?

Same as above. Cleaned up unnecessary assignments to result in the rest of the code.

> +
> +	if (!access_ok(VERIFY_READ, map->gva, PAGE_SIZE))
> +		return -EINVAL;
> +	if (!access_ok(VERIFY_WRITE, map->gva, PAGE_SIZE))
> +		return -EINVAL;
> +
> +	/* prepare list entry */
> +	pmp = kmalloc(sizeof(struct page_map), GFP_KERNEL);

This should also probably be "kmalloc(sizeof(*pmp), GFP_KERNEL)".

Fixed.

> +	if (pmp == NULL)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&pmp->map_list);
> +	pmp->gpa = map->gpa;
> +	pmp->vaddr = map->gva;
> +
> +	/* acquire the file mapping */
> +	mutex_lock(&fmp->lock);
> +
> +	/* check if other thread is closing the file */
> +	if (!fmp->active) {
> +		result = -ENODEV;
> +		pr_warn("kvmi: unable to map, file is being closed\n");
> +		goto out_err;
> +	}
> +
> +	/* do the actual mapping */
> +	result = _do_mapping(map, pmp);
> +	if (IS_ERR_VALUE(result))
> +		goto out_err;
> +
> +	/* link to list */
> +	list_add_tail(&pmp->map_list, &fmp->map_list);
> +
> +	/* all fine */
> +	result = 0;
> +	goto out_finalize;
> +
> +out_err:
> +	kfree(pmp);
> +
> +out_finalize:
> +	mutex_unlock(&fmp->lock);
> +
> +	return result;
> +}
> +
> +static long kvm_dev_ioctl_unmap(struct file_map *fmp, unsigned long 
> +vaddr) {
> +	struct list_head *cur;
> +	struct page_map *pmp;
> +	bool found = false;
> +
> +	/* acquire the file */
> +	mutex_lock(&fmp->lock);
> +
> +	/* check if other thread is closing the file */
> +	if (!fmp->active) {
> +		mutex_unlock(&fmp->lock);

Wouldn't this be better replaced with a "goto out_err" like in kvm_dev_ioctl_map()?

Not really. I used the out_err/out_finalize recovery model for cases where all the action happens inside a critical section and the lock has to be released at the end.
In this case, more of the work can be done outside the lock.
But I refactored it anyway for the sake of consistency.
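
The pattern I'm referring to, as it appears in kvm_dev_ioctl_map(),
where the whole operation sits inside the critical section:

	mutex_lock(&fmp->lock);

	if (!fmp->active) {
		result = -ENODEV;
		goto out_err;
	}

	result = _do_mapping(map, pmp);
	if (IS_ERR_VALUE(result))
		goto out_err;

	list_add_tail(&pmp->map_list, &fmp->map_list);
	result = 0;
	goto out_finalize;

out_err:
	kfree(pmp);

out_finalize:
	mutex_unlock(&fmp->lock);	/* the lock is released in one place */

	return result;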

> +		pr_warn("kvmi: unable to unmap, file is being closed\n");
> +		return -ENODEV;
> +	}
> +
> +	/* check that this address belongs to us */
> +	list_for_each(cur, &fmp->map_list) {
> +		pmp = list_entry(cur, struct page_map, map_list);
> +
> +		/* found */
> +		if (pmp->vaddr == vaddr) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	/* not found ? */
> +	if (!found) {
> +		mutex_unlock(&fmp->lock);

Here too: "goto out_err".

Fixed for both cases. Only 1 call to mutex_unlock() done now.

> +		pr_err("kvmi: address %016lx not mapped\n", vaddr);
> +		return -ENOENT;
> +	}
> +
> +	/* decouple guest mapping */
> +	list_del(&pmp->map_list);
> +	mutex_unlock(&fmp->lock);

In kvm_dev_ioctl_map(), the fmp mutex is held across the _do_mapping() call. Is there any particular reason why here the mutex doesn't need to be held across the _do_unmapping() call? Or was that more an artifact of having a common "out_err" exit in kvm_dev_ioctl_map()?

The fmp mutex:
1. protects the fmp list against concurrent access.
2. protects against teardown (one thread tries to do a mapping while another closes the file).
The call to _do_mapping(), which can fail, must be done inside the critical section, before we add a valid pmp entry to the list.
On the other hand, inside kvm_dev_ioctl_unmap() we must extract a valid pmp entry from the list before calling _do_unmapping().
There is no strict need to protect the _do_mapping() call itself, but keeping it inside the lock means I don't have to revert the mapping if I hit the teardown case.
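
So the reworked unmap path ends up shaped roughly like this (abridged
sketch, not the final code):

static long kvm_dev_ioctl_unmap(struct file_map *fmp, unsigned long vaddr)
{
	struct page_map *pmp, *found = NULL;
	long result = 0;

	mutex_lock(&fmp->lock);

	if (!fmp->active) {
		result = -ENODEV;
		goto out_unlock;
	}

	/* extract a valid pmp entry while holding the lock */
	list_for_each_entry(pmp, &fmp->map_list, map_list) {
		if (pmp->vaddr == vaddr) {
			list_del(&pmp->map_list);
			found = pmp;
			break;
		}
	}
	if (!found)
		result = -ENOENT;

out_unlock:
	mutex_unlock(&fmp->lock);	/* the single unlock site */

	if (found) {
		/* the hypercall runs outside the critical section */
		_do_unmapping(found->paddr);
		kfree(found);
	}

	return result;
}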

> +
> +	/* unmap & ignore result */
> +	_do_unmapping(pmp->paddr);
> +
> +	/* free guest mapping */
> +	kfree(pmp);
> +
> +	return 0;
> +}
> +
> +static long kvm_dev_ioctl(struct file *filp,
> +			  unsigned int ioctl, unsigned long arg) {
> +	void __user *argp = (void __user *) arg;
> +	struct file_map *fmp;
> +	long result;
> +
> +	/* minor check */
> +	fmp = filp->private_data;
> +	ASSERT(fmp->file == filp);
> +
> +	switch (ioctl) {
> +	case KVM_INTRO_MEM_MAP: {
> +		struct kvmi_mem_map map;
> +
> +		result = -EFAULT;
> +		if (copy_from_user(&map, argp, sizeof(map)))
> +			break;
> +
> +		result = kvm_dev_ioctl_map(fmp, &map);
> +		if (IS_ERR_VALUE(result))
> +			break;
> +
> +		result = 0;
> +		break;
> +	}

Since kvm_dev_ioctl_map() either returns an error or 0, couldn't this just be reduced to:
		result = kvm_dev_ioctl_map(fmp, &map);
		break;
	}

Again, leftovers from older code. Also fixed that.

> +	case KVM_INTRO_MEM_UNMAP: {
> +		unsigned long vaddr = (unsigned long) arg;
> +
> +		result = kvm_dev_ioctl_unmap(fmp, vaddr);
> +		if (IS_ERR_VALUE(result))
> +			break;
> +
> +		result = 0;
> +		break;
> +	}

Ditto here.

Fixed.

> +	default:
> +		pr_err("kvmi: ioctl %d not implemented\n", ioctl);
> +		result = -ENOTTY;
> +	}
> +
> +	return result;
> +}
> +
> +static int kvm_dev_release(struct inode *inodep, struct file *filp) {
> +	int result = 0;

You set "result = 0" here, but result isn't used until the end, and just to return it.

Returned 0 directly.

> +	struct file_map *fmp;
> +	struct list_head *cur, *next;
> +	struct page_map *pmp;
> +
> +	pr_debug("kvmi: file %016lx closed by process %s\n",
> +		 (unsigned long) filp, current->comm);
> +
> +	/* acquire the file */
> +	fmp = filp->private_data;
> +	mutex_lock(&fmp->lock);
> +
> +	/* mark for teardown */
> +	fmp->active = 0;
> +
> +	/* release mappings taken on this instance of the file */
> +	list_for_each_safe(cur, next, &fmp->map_list) {
> +		pmp = list_entry(cur, struct page_map, map_list);
> +
> +		/* unmap address */
> +		_do_unmapping(pmp->paddr);
> +
> +		/* decouple & free guest mapping */
> +		list_del(&pmp->map_list);
> +		kfree(pmp);
> +	}
> +
> +	/* done processing this file mapping */
> +	mutex_unlock(&fmp->lock);
> +
> +	/* decouple file mapping */
> +	spin_lock(&file_lock);
> +	list_del(&fmp->file_list);
> +	spin_unlock(&file_lock);
> +
> +	/* free it */
> +	kfree(fmp);
> +
> +	return result;

This is the first time result is used. Couldn't this just be replaced with "return 0"?

Yes, it can.

> +}
> +
> +
> +static const struct file_operations kvmmem_ops = {
> +	.open		= kvm_dev_open,
> +	.unlocked_ioctl = kvm_dev_ioctl,
> +	.compat_ioctl   = kvm_dev_ioctl,
> +	.release	= kvm_dev_release,
> +};

Here you have all the rvals aligned...

> +
> +static struct miscdevice kvm_mem_dev = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "kvmmem",
> +	.fops = &kvmmem_ops,
> +};

...but here you don't. I'm not sure what the "proper" style is, but I think it should at least just be consistent.

Fixed.

> +
> +int __init kvm_intro_guest_init(void) {
> +	int result;
> +
> +	if (!kvm_para_available()) {
> +		pr_err("kvmi: paravirt not available\n");
> +		return -EPERM;
> +	}
> +
> +	result = misc_register(&kvm_mem_dev);
> +	if (result) {
> +		pr_err("kvmi: misc device register failed: %d\n", result);
> +		return result;
> +	}
> +
> +	INIT_LIST_HEAD(&file_list);
> +	spin_lock_init(&file_lock);
> +
> +	pr_info("kvmi: guest introspection device created\n");
> +
> +	return 0;
> +}
> +
> +void kvm_intro_guest_exit(void)
> +{
> +	misc_deregister(&kvm_mem_dev);
> +}
> +
> +module_init(kvm_intro_guest_init)
> +module_exit(kvm_intro_guest_exit)
> 


Patrick


* Re: [RFC PATCH v4 04/18] kvm: x86: add kvm_mmu_nested_guest_page_fault() and kvmi_mmu_fault_gla()
  2017-12-21 21:29     ` Patrick Colp
@ 2017-12-22 11:50       ` Mihai Donțu
  -1 siblings, 0 replies; 79+ messages in thread
From: Mihai Donțu @ 2017-12-22 11:50 UTC (permalink / raw)
  To: Patrick Colp, Adalber Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář, Xiao Guangrong

Hi Patrick,

On Thu, 2017-12-21 at 16:29 -0500, Patrick Colp wrote:
> On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> > From: Adalbert Lazar <alazar@bitdefender.com>
> > 
> > These are helper functions used by the VM introspection subsystem on the
> > PF call path.
> > 
> > Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> > ---
> >   arch/x86/include/asm/kvm_host.h |  7 +++++++
> >   arch/x86/include/asm/vmx.h      |  2 ++
> >   arch/x86/kvm/mmu.c              | 10 ++++++++++
> >   arch/x86/kvm/svm.c              |  8 ++++++++
> >   arch/x86/kvm/vmx.c              |  9 +++++++++
> >   5 files changed, 36 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 8842d8e1e4ee..239eb628f8fb 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -692,6 +692,9 @@ struct kvm_vcpu_arch {
> >   	/* set at EPT violation at this point */
> >   	unsigned long exit_qualification;
> >   
> > +	/* #PF translated error code from EPT/NPT exit reason */
> > +	u64 error_code;
> > +
> >   	/* pv related host specific info */
> >   	struct {
> >   		bool pv_unhalted;
> > @@ -1081,6 +1084,7 @@ struct kvm_x86_ops {
> >   	int (*enable_smi_window)(struct kvm_vcpu *vcpu);
> >   
> >   	void (*msr_intercept)(struct kvm_vcpu *vcpu, unsigned int msr, bool enable);
> > +	u64 (*fault_gla)(struct kvm_vcpu *vcpu);
> >   };
> >   
> >   struct kvm_arch_async_pf {
> > @@ -1455,4 +1459,7 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> >   
> >   void kvm_arch_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
> >   				bool enable);
> > +u64 kvm_mmu_fault_gla(struct kvm_vcpu *vcpu);
> > +bool kvm_mmu_nested_guest_page_fault(struct kvm_vcpu *vcpu);
> > +
> >   #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> > index 8b6780751132..7036125349dd 100644
> > --- a/arch/x86/include/asm/vmx.h
> > +++ b/arch/x86/include/asm/vmx.h
> > @@ -530,6 +530,7 @@ struct vmx_msr_entry {
> >   #define EPT_VIOLATION_READABLE_BIT	3
> >   #define EPT_VIOLATION_WRITABLE_BIT	4
> >   #define EPT_VIOLATION_EXECUTABLE_BIT	5
> > +#define EPT_VIOLATION_GLA_VALID_BIT	7
> >   #define EPT_VIOLATION_GVA_TRANSLATED_BIT 8
> >   #define EPT_VIOLATION_ACC_READ		(1 << EPT_VIOLATION_ACC_READ_BIT)
> >   #define EPT_VIOLATION_ACC_WRITE		(1 << EPT_VIOLATION_ACC_WRITE_BIT)
> > @@ -537,6 +538,7 @@ struct vmx_msr_entry {
> >   #define EPT_VIOLATION_READABLE		(1 << EPT_VIOLATION_READABLE_BIT)
> >   #define EPT_VIOLATION_WRITABLE		(1 << EPT_VIOLATION_WRITABLE_BIT)
> >   #define EPT_VIOLATION_EXECUTABLE	(1 << EPT_VIOLATION_EXECUTABLE_BIT)
> > +#define EPT_VIOLATION_GLA_VALID		(1 << EPT_VIOLATION_GLA_VALID_BIT)
> >   #define EPT_VIOLATION_GVA_TRANSLATED	(1 << EPT_VIOLATION_GVA_TRANSLATED_BIT)
> >   
> >   /*
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index c4deb1f34faa..55fcb0292724 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -5530,3 +5530,13 @@ void kvm_mmu_module_exit(void)
> >   	unregister_shrinker(&mmu_shrinker);
> >   	mmu_audit_disable();
> >   }
> > +
> > +u64 kvm_mmu_fault_gla(struct kvm_vcpu *vcpu)
> > +{
> > +	return kvm_x86_ops->fault_gla(vcpu);
> > +}
> > +
> > +bool kvm_mmu_nested_guest_page_fault(struct kvm_vcpu *vcpu)
> > +{
> > +	return !!(vcpu->arch.error_code & PFERR_GUEST_PAGE_MASK);
> > +}
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index 5f7482851223..f41e4d7008d7 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -2145,6 +2145,8 @@ static int pf_interception(struct vcpu_svm *svm)
> >   	u64 fault_address = svm->vmcb->control.exit_info_2;
> >   	u64 error_code = svm->vmcb->control.exit_info_1;
> >   
> > +	svm->vcpu.arch.error_code = error_code;
> > +
> >   	return kvm_handle_page_fault(&svm->vcpu, error_code, fault_address,
> >   			svm->vmcb->control.insn_bytes,
> >   			svm->vmcb->control.insn_len);
> > @@ -5514,6 +5516,11 @@ static void svm_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
> >   	set_msr_interception(msrpm, msr, enable, enable);
> >   }
> >   
> > +static u64 svm_fault_gla(struct kvm_vcpu *vcpu)
> > +{
> > +	return ~0ull;
> > +}
> > +
> >   static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
> >   	.cpu_has_kvm_support = has_svm,
> >   	.disabled_by_bios = is_disabled,
> > @@ -5631,6 +5638,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
> >   	.enable_smi_window = enable_smi_window,
> >   
> >   	.msr_intercept = svm_msr_intercept,
> > +	.fault_gla = svm_fault_gla
> 
> Minor nit, it seems like this line should probably end with a "," so 
> that future additions don't need to modify this line.

Will do.

> 
> >   };
> >   
> >   static int __init svm_init(void)
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 9c984bbe263e..5487e0242030 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -6541,6 +6541,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> >   	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
> >   
> >   	vcpu->arch.exit_qualification = exit_qualification;
> > +	vcpu->arch.error_code = error_code;
> >   	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> >   }
> >   
> > @@ -12120,6 +12121,13 @@ static void vmx_msr_intercept(struct kvm_vcpu *vcpu, unsigned int msr,
> >   	}
> >   }
> >   
> > +static u64 vmx_fault_gla(struct kvm_vcpu *vcpu)
> > +{
> > +	if (vcpu->arch.exit_qualification & EPT_VIOLATION_GLA_VALID)
> > +		return vmcs_readl(GUEST_LINEAR_ADDRESS);
> > +	return ~0ul;
> 
> Should this not be "return ~0ull" (like in svm_fault_gla())?

Yes, it should.

> > +}
> > +
> >   static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
> >   	.cpu_has_kvm_support = cpu_has_kvm_support,
> >   	.disabled_by_bios = vmx_disabled_by_bios,
> > @@ -12252,6 +12260,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
> >   	.enable_smi_window = enable_smi_window,
> >   
> >   	.msr_intercept = vmx_msr_intercept,
> > +	.fault_gla = vmx_fault_gla
> 
> Same deal here with the trailing ","

Will do.

> 
> >   };
> >   
> >   static int __init vmx_init(void)
> > 

Thank you for the review!

-- 
Mihai Donțu


* Re: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
  2017-12-22  7:34     ` Patrick Colp
@ 2017-12-22 14:11       ` Adalbert Lazăr
  -1 siblings, 0 replies; 79+ messages in thread
From: Adalbert Lazăr @ 2017-12-22 14:11 UTC (permalink / raw)
  To: Patrick Colp, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Mihai Donțu, Nicușor Cîțu,
	Mircea Cîrjaliu, Marian Rotariu

We've made changes in all the places you pointed out, but please read below.
Thanks again,
Adalbert

On Fri, 22 Dec 2017 02:34:45 -0500, Patrick Colp <patrick.colp@oracle.com> wrote:
> On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> > From: Adalbert Lazar <alazar@bitdefender.com>
> > 
> > This subsystem is split into three source files:
> >   - kvmi_msg.c - ABI and socket related functions
> >   - kvmi_mem.c - handle map/unmap requests from the introspector
> >   - kvmi.c - all the other
> > 
> > The new data used by this subsystem is attached to the 'kvm' and
> > 'kvm_vcpu' structures as opaque pointers (to 'kvmi' and 'kvmi_vcpu'
> > structures).
> > 
> > Besides the KVMI system, this patch exports the
> > kvm_vcpu_ioctl_x86_get_xsave() and the mm_find_pmd() functions,
> > adds a new vCPU request (KVM_REQ_INTROSPECTION) and a new VM ioctl
> > (KVM_INTROSPECTION) used to pass the connection file handle from QEMU.
> > 
> > Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> > Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> > Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
> > Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
> > Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com>
> > ---
> >   arch/x86/include/asm/kvm_host.h |    1 +
> >   arch/x86/kvm/Makefile           |    1 +
> >   arch/x86/kvm/x86.c              |    4 +-
> >   include/linux/kvm_host.h        |    4 +
> >   include/linux/kvmi.h            |   32 +
> >   include/linux/mm.h              |    3 +
> >   include/trace/events/kvmi.h     |  174 +++++
> >   include/uapi/linux/kvm.h        |    8 +
> >   mm/internal.h                   |    5 -
> >   virt/kvm/kvmi.c                 | 1410 +++++++++++++++++++++++++++++++++++++++
> >   virt/kvm/kvmi_int.h             |  121 ++++
> >   virt/kvm/kvmi_mem.c             |  730 ++++++++++++++++++++
> >   virt/kvm/kvmi_msg.c             | 1134 +++++++++++++++++++++++++++++++
> >   13 files changed, 3620 insertions(+), 7 deletions(-)
> >   create mode 100644 include/linux/kvmi.h
> >   create mode 100644 include/trace/events/kvmi.h
> >   create mode 100644 virt/kvm/kvmi.c
> >   create mode 100644 virt/kvm/kvmi_int.h
> >   create mode 100644 virt/kvm/kvmi_mem.c
> >   create mode 100644 virt/kvm/kvmi_msg.c
> > 
> > +int kvmi_set_mem_access(struct kvm *kvm, u64 gpa, u8 access)
> > +{
> > +	struct kvmi_mem_access *m;
> > +	struct kvmi_mem_access *__m;
> > +	struct kvmi *ikvm = IKVM(kvm);
> > +	gfn_t gfn = gpa_to_gfn(gpa);
> > +
> > +	if (kvm_is_error_hva(gfn_to_hva_safe(kvm, gfn)))
> > +		kvm_err("Invalid gpa %llx (or memslot not available yet)", gpa);
> 
> If there's an error, should this not return or something instead of 
> continuing as if nothing is wrong?

It was a debug message masquerading as an error message so that it gets
logged in dmesg. The page will be tracked once the memslot becomes available.
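
If we keep the message, it could be downgraded to something like
(sketch only):

	kvm_debug("Invalid gpa %llx (or memslot not available yet)", gpa);

so it shows up only when debugging is enabled.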

> > +static bool alloc_kvmi(struct kvm *kvm)
> > +{
> > +	bool done;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	done = (
> > +		maybe_delayed_init() == 0    &&
> > +		IKVM(kvm)            == NULL &&
> > +		__alloc_kvmi(kvm)    == true
> > +	);
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	return done;
> > +}
> > +
> > +static void alloc_all_kvmi_vcpu(struct kvm *kvm)
> > +{
> > +	struct kvm_vcpu *vcpu;
> > +	int i;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	kvm_for_each_vcpu(i, vcpu, kvm)
> > +		if (!IKVM(vcpu))
> > +			__alloc_vcpu_kvmi(vcpu);
> > +	mutex_unlock(&kvm->lock);
> > +}
> > +
> > +static bool setup_socket(struct kvm *kvm, struct kvm_introspection *qemu)
> > +{
> > +	struct kvmi *ikvm = IKVM(kvm);
> > +
> > +	if (is_introspected(ikvm)) {
> > +		kvm_err("Guest already introspected\n");
> > +		return false;
> > +	}
> > +
> > +	if (!kvmi_msg_init(ikvm, qemu->fd))
> > +		return false;
> 
> kvmi_msg_init assumes that ikvm is not NULL -- it makes no check and 
> then does "WRITE_ONCE(ikvm->sock, sock)". is_introspected() does check 
> if ikvm is NULL, but if it is, it returns false, which would still end 
> up here. There should be a check that ikvm is not NULL before this if 
> statement.

setup_socket() is called only when 'ikvm' is not NULL.

is_introspected() checks 'ikvm' because it is called from other contexts.
The real check is ikvm->sock (to see if the 'command channel' is 'active').
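
For reference, is_introspected() is essentially this (a simplified
sketch based on the description above; the actual helper may differ):

static bool is_introspected(struct kvmi *ikvm)
{
	/* the NULL test matters only for callers that may run before alloc_kvmi() */
	return ikvm && READ_ONCE(ikvm->sock) != NULL;
}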

> > +
> > +	ikvm->cmd_allow_mask = -1; /* TODO: qemu->commands; */
> > +	ikvm->event_allow_mask = -1; /* TODO: qemu->events; */
> > +
> > +	alloc_all_kvmi_vcpu(kvm);
> > +	queue_work(wq, &ikvm->work);
> > +
> > +	return true;
> > +}
> > +
> > +/*
> > + * When called from outside a page fault handler, this call should
> > + * return ~0ull
> > + */
> > +static u64 kvmi_mmu_fault_gla(struct kvm_vcpu *vcpu, gpa_t gpa)
> > +{
> > +	u64 gla;
> > +	u64 gla_val;
> > +	u64 v;
> > +
> > +	if (!vcpu->arch.gpa_available)
> > +		return ~0ull;
> > +
> > +	gla = kvm_mmu_fault_gla(vcpu);
> > +	if (gla == ~0ull)
> > +		return gla;
> > +	gla_val = gla;
> > +
> > +	/* Handle the potential overflow by returning ~0ull */
> > +	if (vcpu->arch.gpa_val > gpa) {
> > +		v = vcpu->arch.gpa_val - gpa;
> > +		if (v > gla)
> > +			gla = ~0ull;
> > +		else
> > +			gla -= v;
> > +	} else {
> > +		v = gpa - vcpu->arch.gpa_val;
> > +		if (v > (U64_MAX - gla))
> > +			gla = ~0ull;
> > +		else
> > +			gla += v;
> > +	}
> > +
> > +	return gla;
> > +}
> > +
> > +static bool kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa,
> > +			       u8 *new,
> > +			       int bytes,
> > +			       struct kvm_page_track_notifier_node *node,
> > +			       bool *data_ready)
> > +{
> > +	u64 gla;
> > +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> > +	bool ret = true;
> > +
> > +	if (kvm_mmu_nested_guest_page_fault(vcpu))
> > +		return ret;
> > +	gla = kvmi_mmu_fault_gla(vcpu, gpa);
> > +	ret = kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_R);
> 
> Should you not check the value of ret here before proceeding?
> 

Indeed. These 'track' functions are new additions and aren't integrated
well with kvmi_page_fault_event(). We'll change this. The code is ugly
but 'safe' (ctx_size will be non-zero only when ret == true).
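
The reworked handler will probably end up looking roughly like this
(a sketch, not the final code; the "Override custom data" warning is
left out):

static bool kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa, u8 *new,
			       int bytes,
			       struct kvm_page_track_notifier_node *node,
			       bool *data_ready)
{
	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
	u64 gla;

	if (kvm_mmu_nested_guest_page_fault(vcpu))
		return true;

	gla = kvmi_mmu_fault_gla(vcpu, gpa);
	if (!kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_R))
		return false;	/* no custom data unless the event was sent */

	if (ivcpu && ivcpu->ctx_size > 0) {
		int s = min_t(int, bytes, ivcpu->ctx_size);

		memcpy(new, ivcpu->ctx_data, s);
		ivcpu->ctx_size = 0;
		*data_ready = true;
	}

	return true;
}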

> > +	if (ivcpu && ivcpu->ctx_size > 0) {
> > +		int s = min_t(int, bytes, ivcpu->ctx_size);
> > +
> > +		memcpy(new, ivcpu->ctx_data, s);
> > +		ivcpu->ctx_size = 0;
> > +
> > +		if (*data_ready)
> > +			kvm_err("Override custom data");
> > +
> > +		*data_ready = true;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu)
> > +{
> > +	kvm_info("Hooking vm with fd: %d\n", qemu->fd);
> > +
> > +	kvm_page_track_register_notifier(kvm, &kptn_node);
> > +
> > +	return (alloc_kvmi(kvm) && setup_socket(kvm, qemu));
> 
> Is this safe? It could return false if the alloc fails (in which case 
> the caller has to do nothing) or if setting up the socket fails (in 
> which case the caller needs to free the allocated kvmi).
>

If the socket fails for any reason (e.g. the introspection tool was
stopped, so the socket is closed), 'the plan' is to signal QEMU to
reconnect (and call kvmi_hook() again) or else let the introspected VM
continue (and try to reconnect asynchronously).

I see that kvm_page_track_register_notifier() should not be called more
than once.

Maybe we should rename this to kvmi_rehook() or kvmi_reconnect().

> > +bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva)
> > +{
> > +	u32 action;
> > +	u64 gpa;
> > +
> > +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_BREAKPOINT))
> > +		/* qemu will automatically reinject the breakpoint */
> > +		return false;
> > +
> > +	gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
> > +
> > +	if (gpa == UNMAPPED_GVA)
> > +		kvm_err("%s: invalid gva: %llx", __func__, gva);
> 
> If the gpa is unmapped, shouldn't it return false rather than proceeding?
> 

This was just a debug message. I'm not sure it is possible for 'gpa'
to be unmapped. Even so, the introspection tool should still be notified.

> > +
> > +	action = kvmi_msg_send_bp(vcpu, gpa);
> > +
> > +	switch (action) {
> > +	case KVMI_EVENT_ACTION_CONTINUE:
> > +		break;
> > +	case KVMI_EVENT_ACTION_RETRY:
> > +		/* rip was most likely adjusted past the INT 3 instruction */
> > +		return true;
> > +	default:
> > +		handle_common_event_actions(vcpu, action);
> > +	}
> > +
> > +	/* qemu will automatically reinject the breakpoint */
> > +	return false;
> > +}
> > +EXPORT_SYMBOL(kvmi_breakpoint_event);

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 02/18] add memory map/unmap support for VM introspection on the guest side
  2017-12-22 10:44     ` Mircea CIRJALIU-MELIU
@ 2017-12-22 14:30       ` Patrick Colp
  0 siblings, 0 replies; 79+ messages in thread
From: Patrick Colp @ 2017-12-22 14:30 UTC (permalink / raw)
  To: Mircea CIRJALIU-MELIU, Adalber Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář, Mihai Donțu

*snip*

>> +		pr_err("kvmi: address %016lx not mapped\n", vaddr);
>> +		return -ENOENT;
>> +	}
>> +
>> +	/* decouple guest mapping */
>> +	list_del(&pmp->map_list);
>> +	mutex_unlock(&fmp->lock);
> 
> In kvm_dev_ioctl_map(), the fmp mutex is held across the _do_mapping() call. Is there any particular reason why here the mutex doesn't need to be held across the _do_unmapping() call? Or was that more an artifact of having a common "out_err" exit in kvm_dev_ioctl_map()?
> 
> The fmp mutex:
> 1. protects the fmp list against concurrent access.
> 2. protects against teardown (one thread tries to do a mapping while another closes the file).
> The call to _do_mapping() - which can fail, must be done inside the critical section before we add a valid pmp entry to the list.
> On the other hand, inside kvm_dev_ioctl_unmap() we must extract a valid pmp entry from the list before calling _do_unmapping().
> There is no real reason for protecting the _do_mapping() call, but I chose not to revert the mapping in case I hit the teardown case.
>

Gotcha. That makes sense.


Patrick

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
  2017-12-22 14:11       ` Adalbert Lazăr
@ 2017-12-22 15:12         ` Patrick Colp
  -1 siblings, 0 replies; 79+ messages in thread
From: Patrick Colp @ 2017-12-22 15:12 UTC (permalink / raw)
  To: Adalbert Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Mihai Donțu, Nicușor Cîțu,
	Mircea Cîrjaliu, Marian Rotariu

On 2017-12-22 09:11 AM, Adalbert Lazăr wrote:
> We've made changes in all the places pointed by you, but read below.
> Thanks again,
> Adalbert
> 
> On Fri, 22 Dec 2017 02:34:45 -0500, Patrick Colp <patrick.colp@oracle.com> wrote:
>> On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
>>> From: Adalbert Lazar <alazar@bitdefender.com>
>>>
>>> This subsystem is split into three source files:
>>>    - kvmi_msg.c - ABI and socket related functions
>>>    - kvmi_mem.c - handle map/unmap requests from the introspector
>>>    - kvmi.c - all the other
>>>
>>> The new data used by this subsystem is attached to the 'kvm' and
>>> 'kvm_vcpu' structures as opaque pointers (to 'kvmi' and 'kvmi_vcpu'
>>> structures).
>>>
>>> Besides the KVMI system, this patch exports the
>>> kvm_vcpu_ioctl_x86_get_xsave() and the mm_find_pmd() functions,
>>> adds a new vCPU request (KVM_REQ_INTROSPECTION) and a new VM ioctl
>>> (KVM_INTROSPECTION) used to pass the connection file handle from QEMU.
>>>
>>> Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
>>> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
>>> Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
>>> Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
>>> Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com>
>>> ---
>>>    arch/x86/include/asm/kvm_host.h |    1 +
>>>    arch/x86/kvm/Makefile           |    1 +
>>>    arch/x86/kvm/x86.c              |    4 +-
>>>    include/linux/kvm_host.h        |    4 +
>>>    include/linux/kvmi.h            |   32 +
>>>    include/linux/mm.h              |    3 +
>>>    include/trace/events/kvmi.h     |  174 +++++
>>>    include/uapi/linux/kvm.h        |    8 +
>>>    mm/internal.h                   |    5 -
>>>    virt/kvm/kvmi.c                 | 1410 +++++++++++++++++++++++++++++++++++++++
>>>    virt/kvm/kvmi_int.h             |  121 ++++
>>>    virt/kvm/kvmi_mem.c             |  730 ++++++++++++++++++++
>>>    virt/kvm/kvmi_msg.c             | 1134 +++++++++++++++++++++++++++++++
>>>    13 files changed, 3620 insertions(+), 7 deletions(-)
>>>    create mode 100644 include/linux/kvmi.h
>>>    create mode 100644 include/trace/events/kvmi.h
>>>    create mode 100644 virt/kvm/kvmi.c
>>>    create mode 100644 virt/kvm/kvmi_int.h
>>>    create mode 100644 virt/kvm/kvmi_mem.c
>>>    create mode 100644 virt/kvm/kvmi_msg.c
>>>
>>> +int kvmi_set_mem_access(struct kvm *kvm, u64 gpa, u8 access)
>>> +{
>>> +	struct kvmi_mem_access *m;
>>> +	struct kvmi_mem_access *__m;
>>> +	struct kvmi *ikvm = IKVM(kvm);
>>> +	gfn_t gfn = gpa_to_gfn(gpa);
>>> +
>>> +	if (kvm_is_error_hva(gfn_to_hva_safe(kvm, gfn)))
>>> +		kvm_err("Invalid gpa %llx (or memslot not available yet)", gpa);
>>
>> If there's an error, should this not return or something instead of
>> continuing as if nothing is wrong?
> 
> It was a debug message masqueraded as an error message to be logged in dmesg.
> The page will be tracked when the memslot becomes available.

I began to wonder if that's what was going on afterwards (I saw
kvm_err() used in some other places where it was more obvious that they
were debug messages).

>>> +static bool alloc_kvmi(struct kvm *kvm)
>>> +{
>>> +	bool done;
>>> +
>>> +	mutex_lock(&kvm->lock);
>>> +	done = (
>>> +		maybe_delayed_init() == 0    &&
>>> +		IKVM(kvm)            == NULL &&
>>> +		__alloc_kvmi(kvm)    == true
>>> +	);
>>> +	mutex_unlock(&kvm->lock);
>>> +
>>> +	return done;
>>> +}
>>> +
>>> +static void alloc_all_kvmi_vcpu(struct kvm *kvm)
>>> +{
>>> +	struct kvm_vcpu *vcpu;
>>> +	int i;
>>> +
>>> +	mutex_lock(&kvm->lock);
>>> +	kvm_for_each_vcpu(i, vcpu, kvm)
>>> +		if (!IKVM(vcpu))
>>> +			__alloc_vcpu_kvmi(vcpu);
>>> +	mutex_unlock(&kvm->lock);
>>> +}
>>> +
>>> +static bool setup_socket(struct kvm *kvm, struct kvm_introspection *qemu)
>>> +{
>>> +	struct kvmi *ikvm = IKVM(kvm);
>>> +
>>> +	if (is_introspected(ikvm)) {
>>> +		kvm_err("Guest already introspected\n");
>>> +		return false;
>>> +	}
>>> +
>>> +	if (!kvmi_msg_init(ikvm, qemu->fd))
>>> +		return false;
>>
>> kvmi_msg_init assumes that ikvm is not NULL -- it makes no check and
>> then does "WRITE_ONCE(ikvm->sock, sock)". is_introspected() does check
>> if ikvm is NULL, but if it is, it returns false, which would still end
>> up here. There should be a check that ikvm is not NULL before this if
>> statement.
> 
> setup_socket() is called only when 'ikvm' is not NULL.

Ah, right. Forgot to check that :)

> 
> is_introspected() checks 'ikvm' because it is called from other contexts.
> The real check is ikvm->sock (to see if the 'command channel' is 'active').

Yes, that makes more sense.

> 
>>> +
>>> +	ikvm->cmd_allow_mask = -1; /* TODO: qemu->commands; */
>>> +	ikvm->event_allow_mask = -1; /* TODO: qemu->events; */
>>> +
>>> +	alloc_all_kvmi_vcpu(kvm);
>>> +	queue_work(wq, &ikvm->work);
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +/*
>>> + * When called from outside a page fault handler, this call should
>>> + * return ~0ull
>>> + */
>>> +static u64 kvmi_mmu_fault_gla(struct kvm_vcpu *vcpu, gpa_t gpa)
>>> +{
>>> +	u64 gla;
>>> +	u64 gla_val;
>>> +	u64 v;
>>> +
>>> +	if (!vcpu->arch.gpa_available)
>>> +		return ~0ull;
>>> +
>>> +	gla = kvm_mmu_fault_gla(vcpu);
>>> +	if (gla == ~0ull)
>>> +		return gla;
>>> +	gla_val = gla;
>>> +
>>> +	/* Handle the potential overflow by returning ~0ull */
>>> +	if (vcpu->arch.gpa_val > gpa) {
>>> +		v = vcpu->arch.gpa_val - gpa;
>>> +		if (v > gla)
>>> +			gla = ~0ull;
>>> +		else
>>> +			gla -= v;
>>> +	} else {
>>> +		v = gpa - vcpu->arch.gpa_val;
>>> +		if (v > (U64_MAX - gla))
>>> +			gla = ~0ull;
>>> +		else
>>> +			gla += v;
>>> +	}
>>> +
>>> +	return gla;
>>> +}
>>> +
>>> +static bool kvmi_track_preread(struct kvm_vcpu *vcpu, gpa_t gpa,
>>> +			       u8 *new,
>>> +			       int bytes,
>>> +			       struct kvm_page_track_notifier_node *node,
>>> +			       bool *data_ready)
>>> +{
>>> +	u64 gla;
>>> +	struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
>>> +	bool ret = true;
>>> +
>>> +	if (kvm_mmu_nested_guest_page_fault(vcpu))
>>> +		return ret;
>>> +	gla = kvmi_mmu_fault_gla(vcpu, gpa);
>>> +	ret = kvmi_page_fault_event(vcpu, gpa, gla, KVMI_PAGE_ACCESS_R);
>>
>> Should you not check the value of ret here before proceeding?
>>
> 
> Indeed. These 'track' functions are new additions and aren't integrated
> well with kvmi_page_fault_event(). We'll change this. The code is ugly
> but 'safe' (ctx_size will be non-zero only with ret == true).
> 

Ah, yes, I can see that now. Not the most obvious interaction.

>>> +	if (ivcpu && ivcpu->ctx_size > 0) {
>>> +		int s = min_t(int, bytes, ivcpu->ctx_size);
>>> +
>>> +		memcpy(new, ivcpu->ctx_data, s);
>>> +		ivcpu->ctx_size = 0;
>>> +
>>> +		if (*data_ready)
>>> +			kvm_err("Override custom data");
>>> +
>>> +		*data_ready = true;
>>> +	}
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu)
>>> +{
>>> +	kvm_info("Hooking vm with fd: %d\n", qemu->fd);
>>> +
>>> +	kvm_page_track_register_notifier(kvm, &kptn_node);
>>> +
>>> +	return (alloc_kvmi(kvm) && setup_socket(kvm, qemu));
>>
>> Is this safe? It could return false if the alloc fails (in which case
>> the caller has to do nothing) or if setting up the socket fails (in
>> which case the caller needs to free the allocated kvmi).
>>
> 
> If the socket fails for any reason (eg. the introspection tool is
> stopped == socket closed) 'the plan' is to signal QEMU to reconnect
> (and call kvmi_hook() again) or else let the introspected VM continue (and
> try to reconnect asynchronously).
> 
> I see that kvm_page_track_register_notifier() should not be called more
> than once.
> 
> Maybe we should rename this to kvmi_rehook() or kvmi_reconnect().

I assume that a kvmi_rehook() function would then not call 
kvm_page_track_register_notifier() or at least have some check to make 
sure it only calls it once?

One approach would be to have separate kvmi_hook() and kvmi_rehook() 
functions. Another possibility is to have kvmi_hook() take an extra 
argument that's a boolean to specify if it's the first attempt at 
hooking or not.
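
e.g. something along these lines (purely illustrative, not patch code):

bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu, bool first_hook)
{
	kvm_info("Hooking vm with fd: %d\n", qemu->fd);

	/* register the page-track notifier only on the first hook attempt */
	if (first_hook)
		kvm_page_track_register_notifier(kvm, &kptn_node);

	return (alloc_kvmi(kvm) && setup_socket(kvm, qemu));
}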

>>> +bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gva)
>>> +{
>>> +	u32 action;
>>> +	u64 gpa;
>>> +
>>> +	if (!is_event_enabled(vcpu->kvm, KVMI_EVENT_BREAKPOINT))
>>> +		/* qemu will automatically reinject the breakpoint */
>>> +		return false;
>>> +
>>> +	gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
>>> +
>>> +	if (gpa == UNMAPPED_GVA)
>>> +		kvm_err("%s: invalid gva: %llx", __func__, gva);
>>
>> If the gpa is unmapped, shouldn't it return false rather than proceeding?
>>
> 
> This was just a debug message. I'm not sure if is possible for 'gpa'
> to be unmapped. Even so, the introspection tool should still be notified.
> 
>>> +
>>> +	action = kvmi_msg_send_bp(vcpu, gpa);
>>> +
>>> +	switch (action) {
>>> +	case KVMI_EVENT_ACTION_CONTINUE:
>>> +		break;
>>> +	case KVMI_EVENT_ACTION_RETRY:
>>> +		/* rip was most likely adjusted past the INT 3 instruction */
>>> +		return true;
>>> +	default:
>>> +		handle_common_event_actions(vcpu, action);
>>> +	}
>>> +
>>> +	/* qemu will automatically reinject the breakpoint */
>>> +	return false;
>>> +}
>>> +EXPORT_SYMBOL(kvmi_breakpoint_event);

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
  2017-12-22 15:12         ` Patrick Colp
@ 2017-12-22 15:51           ` alazar
  -1 siblings, 0 replies; 79+ messages in thread
From: alazar @ 2017-12-22 15:51 UTC (permalink / raw)
  To: Patrick Colp, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Mihai Donțu, Nicușor Cîțu,
	Mircea Cîrjaliu, Marian Rotariu

On Fri, 22 Dec 2017 10:12:35 -0500, Patrick Colp <patrick.colp@oracle.com> wrote:
> On 2017-12-22 09:11 AM, Adalbert Lazăr wrote:
> > We've made changes in all the places pointed by you, but read below.
> > Thanks again,
> > Adalbert
> > 
> > On Fri, 22 Dec 2017 02:34:45 -0500, Patrick Colp <patrick.colp@oracle.com> wrote:
> >> On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> >>> From: Adalbert Lazar <alazar@bitdefender.com>
> >>>
> >>> This subsystem is split into three source files:
> >>>    - kvmi_msg.c - ABI and socket related functions
> >>>    - kvmi_mem.c - handle map/unmap requests from the introspector
> >>>    - kvmi.c - all the other
> >>>
> >>> The new data used by this subsystem is attached to the 'kvm' and
> >>> 'kvm_vcpu' structures as opaque pointers (to 'kvmi' and 'kvmi_vcpu'
> >>> structures).
> >>>
> >>> Besides the KVMI system, this patch exports the
> >>> kvm_vcpu_ioctl_x86_get_xsave() and the mm_find_pmd() functions,
> >>> adds a new vCPU request (KVM_REQ_INTROSPECTION) and a new VM ioctl
> >>> (KVM_INTROSPECTION) used to pass the connection file handle from QEMU.
> >>>
> >>> Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
> >>> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> >>> Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
> >>> Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
> >>> Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com>
> >>> ---
> >>>    arch/x86/include/asm/kvm_host.h |    1 +
> >>>    arch/x86/kvm/Makefile           |    1 +
> >>>    arch/x86/kvm/x86.c              |    4 +-
> >>>    include/linux/kvm_host.h        |    4 +
> >>>    include/linux/kvmi.h            |   32 +
> >>>    include/linux/mm.h              |    3 +
> >>>    include/trace/events/kvmi.h     |  174 +++++
> >>>    include/uapi/linux/kvm.h        |    8 +
> >>>    mm/internal.h                   |    5 -
> >>>    virt/kvm/kvmi.c                 | 1410 +++++++++++++++++++++++++++++++++++++++
> >>>    virt/kvm/kvmi_int.h             |  121 ++++
> >>>    virt/kvm/kvmi_mem.c             |  730 ++++++++++++++++++++
> >>>    virt/kvm/kvmi_msg.c             | 1134 +++++++++++++++++++++++++++++++
> >>>    13 files changed, 3620 insertions(+), 7 deletions(-)
> >>>    create mode 100644 include/linux/kvmi.h
> >>>    create mode 100644 include/trace/events/kvmi.h
> >>>    create mode 100644 virt/kvm/kvmi.c
> >>>    create mode 100644 virt/kvm/kvmi_int.h
> >>>    create mode 100644 virt/kvm/kvmi_mem.c
> >>>    create mode 100644 virt/kvm/kvmi_msg.c
> >>>
> >>> +bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu)
> >>> +{
> >>> +	kvm_info("Hooking vm with fd: %d\n", qemu->fd);
> >>> +
> >>> +	kvm_page_track_register_notifier(kvm, &kptn_node);
> >>> +
> >>> +	return (alloc_kvmi(kvm) && setup_socket(kvm, qemu));
> >>
> >> Is this safe? It could return false if the alloc fails (in which case
> >> the caller has to do nothing) or if setting up the socket fails (in
> >> which case the caller needs to free the allocated kvmi).
> >>
> > 
> > If the socket fails for any reason (eg. the introspection tool is
> > stopped == socket closed) 'the plan' is to signal QEMU to reconnect
> > (and call kvmi_hook() again) or else let the introspected VM continue (and
> > try to reconnect asynchronously).
> > 
> > I see that kvm_page_track_register_notifier() should not be called more
> > than once.
> > 
> > Maybe we should rename this to kvmi_rehook() or kvmi_reconnect().
> 
> I assume that a kvmi_rehook() function would then not call 
> kvm_page_track_register_notifier() or at least have some check to make 
> sure it only calls it once?
> 
> One approach would be to have separate kvmi_hook() and kvmi_rehook() 
> functions. Another possibility is to have kvmi_hook() take an extra 
> argument that's a boolean to specify if it's the first attempt at 
> hooking or not.
> 

alloc_kvmi() didn't work with a second kvmi_hook() call.

For the moment I've changed the code to:

kvmi_hook
	return (alloc_kvmi && setup_socket)

alloc_kvmi
	return (IKVM(kvm) || __alloc_kvmi)

__alloc_kvmi
	kzalloc
	kvm_page_track_register_notifier

At least it works as 'advertised' :)
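
In C, the sketch is roughly this (simplified; the 'kvmi' field name on
struct kvm is assumed, the rest follows the patch):

static bool __alloc_kvmi(struct kvm *kvm)
{
	struct kvmi *ikvm = kzalloc(sizeof(*ikvm), GFP_KERNEL);

	if (!ikvm)
		return false;

	kvm->kvmi = ikvm;	/* field name assumed; IKVM() reads this opaque pointer */
	kvm_page_track_register_notifier(kvm, &kptn_node);

	return true;
}

static bool alloc_kvmi(struct kvm *kvm)
{
	bool done;

	mutex_lock(&kvm->lock);
	done = (IKVM(kvm) != NULL || __alloc_kvmi(kvm));
	mutex_unlock(&kvm->lock);

	return done;
}

bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu)
{
	kvm_info("Hooking vm with fd: %d\n", qemu->fd);

	return (alloc_kvmi(kvm) && setup_socket(kvm, qemu));
}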

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
  2017-12-18 19:06   ` Adalber Lazăr
@ 2017-12-22 16:02     ` Paolo Bonzini
  -1 siblings, 0 replies; 79+ messages in thread
From: Paolo Bonzini @ 2017-12-22 16:02 UTC (permalink / raw)
  To: Adalber Lazăr, kvm
  Cc: linux-mm, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Nicușor Cîțu,
	Mircea Cîrjaliu, Marian Rotariu

On 18/12/2017 20:06, Adalber Lazăr wrote:
> +	/* VMAs will be modified */
> +	down_write(&req_mm->mmap_sem);
> +	down_write(&map_mm->mmap_sem);
> +

Is there a locking rule when locking multiple mmap_sems at the same
time?  As it's written, this can cause deadlocks.

Paolo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
  2017-12-18 19:06   ` Adalber Lazăr
@ 2017-12-22 16:09     ` Paolo Bonzini
  -1 siblings, 0 replies; 79+ messages in thread
From: Paolo Bonzini @ 2017-12-22 16:09 UTC (permalink / raw)
  To: Adalber Lazăr, kvm
  Cc: linux-mm, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Nicușor Cîțu,
	Mircea Cîrjaliu, Marian Rotariu

On 18/12/2017 20:06, Adalber Lazăr wrote:
> +	print_hex_dump_debug("kvmi: new token ", DUMP_PREFIX_NONE,
> +			     32, 1, token, sizeof(struct kvmi_map_mem_token),
> +			     false);
> +
> +	tep = kmalloc(sizeof(struct token_entry), GFP_KERNEL);
> +	if (tep == NULL)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&tep->token_list);
> +	memcpy(&tep->token, token, sizeof(struct kvmi_map_mem_token));
> +	tep->kvm = kvm;
> +
> +	spin_lock(&token_lock);
> +	list_add_tail(&tep->token_list, &token_list);
> +	spin_unlock(&token_lock);
> +
> +	return 0;

This allows unlimited allocations on the host from the introspector
guest.  You must only allow a fixed number of unconsumed tokens (e.g. 64).
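
For example, something along these lines (a sketch; the limit and the
counter name are illustrative):

#define KVMI_MAX_TOKENS 64

	spin_lock(&token_lock);
	if (token_count >= KVMI_MAX_TOKENS) {
		spin_unlock(&token_lock);
		kfree(tep);
		return -EBUSY;
	}
	token_count++;
	list_add_tail(&tep->token_list, &token_list);
	spin_unlock(&token_lock);

with token_count decremented wherever a token is consumed or discarded.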

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
  2017-12-22 16:02     ` Paolo Bonzini
  (?)
@ 2017-12-22 16:18     ` Mircea CIRJALIU-MELIU
  2017-12-22 16:35         ` Paolo Bonzini
  -1 siblings, 1 reply; 79+ messages in thread
From: Mircea CIRJALIU-MELIU @ 2017-12-22 16:18 UTC (permalink / raw)
  To: Paolo Bonzini, Adalber Lazăr, kvm
  Cc: linux-mm, Radim Krčmář,
	Mihai Donțu, Nicusor CITU, Marian Cristian ROTARIU

> -----Original Message-----
> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Friday, 22 December 2017 18:02
> To: Adalber Lazăr <alazar@bitdefender.com>; kvm@vger.kernel.org
> Cc: linux-mm@kvack.org; Radim Krčmář <rkrcmar@redhat.com>; Xiao
> Guangrong <guangrong.xiao@linux.intel.com>; Mihai Donțu
> <mdontu@bitdefender.com>; Nicusor CITU <ncitu@bitdefender.com>;
> Mircea CIRJALIU-MELIU <mcirjaliu@bitdefender.com>; Marian Cristian
> ROTARIU <mrotariu@bitdefender.com>
> Subject: Re: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
> 
> On 18/12/2017 20:06, Adalber Lazăr wrote:
> > +	/* VMAs will be modified */
> > +	down_write(&req_mm->mmap_sem);
> > +	down_write(&map_mm->mmap_sem);
> > +
> 
> Is there a locking rule when locking multiple mmap_sems at the same
> time?  As it's written, this can cause deadlocks.

First req_mm, second map_mm.
The other function uses the same nesting.
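
If it helps, the intended nesting could also be spelled out for lockdep,
e.g. (a sketch; this only documents the order, it doesn't by itself
prove the req/map roles can never be reversed between callers):

	/* lock ordering: always req_mm first, then map_mm */
	down_write(&req_mm->mmap_sem);
	down_write_nested(&map_mm->mmap_sem, SINGLE_DEPTH_NESTING);

	/* ... modify the VMAs ... */

	up_write(&map_mm->mmap_sem);
	up_write(&req_mm->mmap_sem);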

> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
  2017-12-22 15:51           ` alazar
@ 2017-12-22 16:26             ` Patrick Colp
  -1 siblings, 0 replies; 79+ messages in thread
From: Patrick Colp @ 2017-12-22 16:26 UTC (permalink / raw)
  To: alazar, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Mihai Donțu, Nicușor Cîțu,
	Mircea Cîrjaliu, Marian Rotariu

On 2017-12-22 10:51 AM, alazar@bitdefender.com wrote:
> On Fri, 22 Dec 2017 10:12:35 -0500, Patrick Colp <patrick.colp@oracle.com> wrote:
>> On 2017-12-22 09:11 AM, Adalbert Lazăr wrote:
>>> We've made changes in all the places pointed by you, but read below.
>>> Thanks again,
>>> Adalbert
>>>
>>> On Fri, 22 Dec 2017 02:34:45 -0500, Patrick Colp <patrick.colp@oracle.com> wrote:
>>>> On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
>>>>> From: Adalbert Lazar <alazar@bitdefender.com>
>>>>>
>>>>> This subsystem is split into three source files:
>>>>>     - kvmi_msg.c - ABI and socket related functions
>>>>>     - kvmi_mem.c - handle map/unmap requests from the introspector
>>>>>     - kvmi.c - all the other
>>>>>
>>>>> The new data used by this subsystem is attached to the 'kvm' and
>>>>> 'kvm_vcpu' structures as opaque pointers (to 'kvmi' and 'kvmi_vcpu'
>>>>> structures).
>>>>>
>>>>> Besides the KVMI system, this patch exports the
>>>>> kvm_vcpu_ioctl_x86_get_xsave() and the mm_find_pmd() functions,
>>>>> adds a new vCPU request (KVM_REQ_INTROSPECTION) and a new VM ioctl
>>>>> (KVM_INTROSPECTION) used to pass the connection file handle from QEMU.
>>>>>
>>>>> Signed-off-by: Mihai Donțu <mdontu@bitdefender.com>
>>>>> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
>>>>> Signed-off-by: Nicușor Cîțu <ncitu@bitdefender.com>
>>>>> Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
>>>>> Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com>
>>>>> ---
>>>>>     arch/x86/include/asm/kvm_host.h |    1 +
>>>>>     arch/x86/kvm/Makefile           |    1 +
>>>>>     arch/x86/kvm/x86.c              |    4 +-
>>>>>     include/linux/kvm_host.h        |    4 +
>>>>>     include/linux/kvmi.h            |   32 +
>>>>>     include/linux/mm.h              |    3 +
>>>>>     include/trace/events/kvmi.h     |  174 +++++
>>>>>     include/uapi/linux/kvm.h        |    8 +
>>>>>     mm/internal.h                   |    5 -
>>>>>     virt/kvm/kvmi.c                 | 1410 +++++++++++++++++++++++++++++++++++++++
>>>>>     virt/kvm/kvmi_int.h             |  121 ++++
>>>>>     virt/kvm/kvmi_mem.c             |  730 ++++++++++++++++++++
>>>>>     virt/kvm/kvmi_msg.c             | 1134 +++++++++++++++++++++++++++++++
>>>>>     13 files changed, 3620 insertions(+), 7 deletions(-)
>>>>>     create mode 100644 include/linux/kvmi.h
>>>>>     create mode 100644 include/trace/events/kvmi.h
>>>>>     create mode 100644 virt/kvm/kvmi.c
>>>>>     create mode 100644 virt/kvm/kvmi_int.h
>>>>>     create mode 100644 virt/kvm/kvmi_mem.c
>>>>>     create mode 100644 virt/kvm/kvmi_msg.c
>>>>>
>>>>> +bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu)
>>>>> +{
>>>>> +	kvm_info("Hooking vm with fd: %d\n", qemu->fd);
>>>>> +
>>>>> +	kvm_page_track_register_notifier(kvm, &kptn_node);
>>>>> +
>>>>> +	return (alloc_kvmi(kvm) && setup_socket(kvm, qemu));
>>>>
>>>> Is this safe? It could return false if the alloc fails (in which case
>>>> the caller has to do nothing) or if setting up the socket fails (in
>>>> which case the caller needs to free the allocated kvmi).
>>>>
>>>
>>> If the socket fails for any reason (eg. the introspection tool is
>>> stopped == socket closed) 'the plan' is to signal QEMU to reconnect
>>> (and call kvmi_hook() again) or else let the introspected VM continue (and
>>> try to reconnect asynchronously).
>>>
>>> I see that kvm_page_track_register_notifier() should not be called more
>>> than once.
>>>
>>> Maybe we should rename this to kvmi_rehook() or kvmi_reconnect().
>>
>> I assume that a kvmi_rehook() function would then not call
>> kvm_page_track_register_notifier() or at least have some check to make
>> sure it only calls it once?
>>
>> One approach would be to have separate kvmi_hook() and kvmi_rehook()
>> functions. Another possibility is to have kvmi_hook() take an extra
>> argument that's a boolean to specify if it's the first attempt at
>> hooking or not.
>>
> 
> alloc_kvmi() didn't work with a second kvmi_hook() call.
> 
> For the moment I've changed the code to:
> 
> kvmi_hook
> 	return (alloc_kvmi && setup_socket)
> 
> alloc_kvmi
> 	return (IKVM(kvm) || __alloc_kvmi)
> 
> __alloc_kvmi
> 	kzalloc
> 	kvm_page_track_register_notifier
> 
> At least it works as 'advertised' :)

Yes, that seems a lot more clear :)
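
For reference, a minimal C sketch of the restructured flow described above
(names such as IKVM(), setup_socket(), kptn_node and the kvm->kvmi field are
taken or assumed from this discussion; it is only an illustration, not the
posted patch):

	static bool __alloc_kvmi(struct kvm *kvm)
	{
		struct kvmi *ikvm = kzalloc(sizeof(*ikvm), GFP_KERNEL);

		if (!ikvm)
			return false;

		kvm->kvmi = ikvm;	/* opaque pointer; field name assumed */

		/* registered only once, on the first successful hook */
		kvm_page_track_register_notifier(kvm, &kptn_node);

		return true;
	}

	static bool alloc_kvmi(struct kvm *kvm)
	{
		/* on a re-hook, reuse the already allocated context */
		return IKVM(kvm) || __alloc_kvmi(kvm);
	}

	bool kvmi_hook(struct kvm *kvm, struct kvm_introspection *qemu)
	{
		return alloc_kvmi(kvm) && setup_socket(kvm, qemu);
	}

This keeps kvm_page_track_register_notifier() behind the one-time allocation,
so a re-hook after a dropped socket only re-runs setup_socket().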

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
  2017-12-22 16:09     ` Paolo Bonzini
  (?)
@ 2017-12-22 16:34     ` Mircea CIRJALIU-MELIU
  -1 siblings, 0 replies; 79+ messages in thread
From: Mircea CIRJALIU-MELIU @ 2017-12-22 16:34 UTC (permalink / raw)
  To: Paolo Bonzini, Adalber Lazăr, kvm
  Cc: linux-mm, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Nicusor CITU,
	Marian Cristian ROTARIU



> -----Original Message-----
> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Friday, 22 December 2017 18:09
> To: Adalber Lazăr <alazar@bitdefender.com>; kvm@vger.kernel.org
> Cc: linux-mm@kvack.org; Radim Krčmář <rkrcmar@redhat.com>; Xiao
> Guangrong <guangrong.xiao@linux.intel.com>; Mihai Donțu
> <mdontu@bitdefender.com>; Nicusor CITU <ncitu@bitdefender.com>;
> Mircea CIRJALIU-MELIU <mcirjaliu@bitdefender.com>; Marian Cristian
> ROTARIU <mrotariu@bitdefender.com>
> Subject: Re: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
> 
> On 18/12/2017 20:06, Adalber Lazăr wrote:
> > +	print_hex_dump_debug("kvmi: new token ", DUMP_PREFIX_NONE,
> > +			     32, 1, token, sizeof(struct
> kvmi_map_mem_token),
> > +			     false);
> > +
> > +	tep = kmalloc(sizeof(struct token_entry), GFP_KERNEL);
> > +	if (tep == NULL)
> > +		return -ENOMEM;
> > +
> > +	INIT_LIST_HEAD(&tep->token_list);
> > +	memcpy(&tep->token, token, sizeof(struct
> kvmi_map_mem_token));
> > +	tep->kvm = kvm;
> > +
> > +	spin_lock(&token_lock);
> > +	list_add_tail(&tep->token_list, &token_list);
> > +	spin_unlock(&token_lock);
> > +
> > +	return 0;
> 
> This allows unlimited allocations on the host from the introspector
> guest.  You must only allow a fixed number of unconsumed tokens (e.g. 64).
> 

A few commits ago, Adalbert Lazar suggested having only one token for every VM (the introspected VM, I assume).
Original text here:
/* TODO: Should we limit the number of these tokens?
 * Have only one for every VM?
 */

I suggest using the token as an authentication key with finite life-time (similar to a banking token).
The introspector (process/thread) can request a new token as soon as the old one expires.
The introspected machine shouldn't be associated with the token in this case.
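
A rough sketch of how a bounded, expiring token list could look (the 64-entry
cap follows Paolo's suggestion; the 'expires' field and 'token_count' counter
are illustrative additions, not part of the posted patch):

	#define KVMI_MAX_TOKENS	64		/* cap on unconsumed tokens */
	#define KVMI_TOKEN_TTL	(60 * HZ)	/* illustrative lifetime */

	static int kvmi_store_token(struct kvm *kvm,
				    struct kvmi_map_mem_token *token)
	{
		struct token_entry *tep;

		tep = kmalloc(sizeof(*tep), GFP_KERNEL);
		if (!tep)
			return -ENOMEM;

		INIT_LIST_HEAD(&tep->token_list);
		memcpy(&tep->token, token, sizeof(tep->token));
		tep->kvm = kvm;
		tep->expires = jiffies + KVMI_TOKEN_TTL;	/* new field */

		spin_lock(&token_lock);
		if (token_count >= KVMI_MAX_TOKENS) {	/* new counter */
			spin_unlock(&token_lock);
			kfree(tep);
			return -EBUSY;
		}
		token_count++;
		list_add_tail(&tep->token_list, &token_list);
		spin_unlock(&token_lock);

		return 0;
	}

Lookup (or a periodic sweep) would then drop entries whose 'expires' time has
passed and decrement 'token_count', so an idle or dead introspector cannot pin
host resources indefinitely.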

> Thanks,
> 
> Paolo
> 
> ________________________
> This email was scanned by Bitdefender

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
  2017-12-22 16:18     ` Mircea CIRJALIU-MELIU
@ 2017-12-22 16:35         ` Paolo Bonzini
  0 siblings, 0 replies; 79+ messages in thread
From: Paolo Bonzini @ 2017-12-22 16:35 UTC (permalink / raw)
  To: Mircea CIRJALIU-MELIU, Adalber Lazăr, kvm
  Cc: linux-mm, Radim Krčmář,
	Mihai Donțu, Nicusor CITU, Marian Cristian ROTARIU,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Rik van Riel

On 22/12/2017 17:18, Mircea CIRJALIU-MELIU wrote:
> 
> 
>> -----Original Message-----
>> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
>> Sent: Friday, 22 December 2017 18:02
>> To: Adalber Lazăr <alazar@bitdefender.com>; kvm@vger.kernel.org
>> Cc: linux-mm@kvack.org; Radim Krčmář <rkrcmar@redhat.com>; Xiao
>> Guangrong <guangrong.xiao@linux.intel.com>; Mihai Donțu
>> <mdontu@bitdefender.com>; Nicusor CITU <ncitu@bitdefender.com>;
>> Mircea CIRJALIU-MELIU <mcirjaliu@bitdefender.com>; Marian Cristian
>> ROTARIU <mrotariu@bitdefender.com>
>> Subject: Re: [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem
>>
>> On 18/12/2017 20:06, Adalber Lazăr wrote:
>>> +	/* VMAs will be modified */
>>> +	down_write(&req_mm->mmap_sem);
>>> +	down_write(&map_mm->mmap_sem);
>>> +
>>
>> Is there a locking rule when locking multiple mmap_sems at the same
>> time?  As it's written, this can cause deadlocks.
> 
> First req_mm, second map_mm.
> The other function uses the same nesting.

You could have two tasks, both of which register themselves as the
introspector of the other.  That would cause a deadlock.  There may be
also other cases in the kernel that lock two VMAs at the same time, and
you have to be consistent with those.

Usually what you do is compare pointers and lock the lowest address
first.  Alternatively, you could have a separate lock that is taken by
everyone who needs more than one lock, that is:

	down_write(&some_mm->mmap_sem);

but:

	mutex_lock(&locking_many_mmaps);
	down_write(&some_mm->mmap_sem);
	down_write(&another_mm->mmap_sem);
	mutex_unlock(&locking_many_mmaps);
	...
	up_write(&some_mm->mmap_sem);
	up_write(&another_mm->mmap_sem);

However, I'm not sure how it works for mmap_sem.  We'll have to ask the
mm guys, let me Cc a few of them randomly.
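
For reference, the address-ordering variant mentioned above could look roughly
like this (a sketch only; it assumes the two mm_structs are always distinct, as
they are in the map/unmap paths):

	/*
	 * Sketch: take both mmap_sems in a stable order based on the
	 * mm_struct addresses, so two tasks introspecting each other
	 * cannot deadlock.  Assumes req_mm != map_mm.
	 */
	static void kvmi_lock_two_mms(struct mm_struct *req_mm,
				      struct mm_struct *map_mm)
	{
		struct mm_struct *first = req_mm, *second = map_mm;

		if (first > second)
			swap(first, second);

		down_write(&first->mmap_sem);
		down_write_nested(&second->mmap_sem, SINGLE_DEPTH_NESTING);
	}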

Paolo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 09/18] kvm: hook in the VM introspection subsystem
  2017-12-18 19:06   ` Adalber Lazăr
@ 2017-12-22 16:36     ` Patrick Colp
  -1 siblings, 0 replies; 79+ messages in thread
From: Patrick Colp @ 2017-12-22 16:36 UTC (permalink / raw)
  To: Adalber Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu

On 2017-12-18 02:06 PM, Adalber Lazăr wrote:
> From: Adalbert Lazar <alazar@bitdefender.com>
> 
> Handle the new KVM_INTROSPECTION ioctl and pass the socket from QEMU to
> the KVMI subsystem. Notify KVMI on vCPU create/destroy and VM destroy
> events. Also, the EPT AD bits feature is disabled by this patch.
> 
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> ---
>   arch/x86/kvm/vmx.c  |  3 ++-
>   virt/kvm/kvm_main.c | 19 +++++++++++++++++++
>   2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 093a2e1f7ea6..c03580abf9e8 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -34,6 +34,7 @@
>   #include <linux/tboot.h>
>   #include <linux/hrtimer.h>
>   #include <linux/frame.h>
> +#include <linux/kvmi.h>
>   #include "kvm_cache_regs.h"
>   #include "x86.h"
>   
> @@ -6785,7 +6786,7 @@ static __init int hardware_setup(void)
>   	    !cpu_has_vmx_invept_global())
>   		enable_ept = 0;
>   
> -	if (!cpu_has_vmx_ept_ad_bits() || !enable_ept)
> +	if (!cpu_has_vmx_ept_ad_bits() || !enable_ept || kvmi_is_present())
>   		enable_ept_ad_bits = 0;
>   
>   	if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 210bf820385a..7895d490bd71 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -51,6 +51,7 @@
>   #include <linux/slab.h>
>   #include <linux/sort.h>
>   #include <linux/bsearch.h>
> +#include <linux/kvmi.h>
>   
>   #include <asm/processor.h>
>   #include <asm/io.h>
> @@ -298,6 +299,9 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>   	r = kvm_arch_vcpu_init(vcpu);
>   	if (r < 0)
>   		goto fail_free_run;
> +
> +	kvmi_vcpu_init(vcpu);
> +
>   	return 0;
>   
>   fail_free_run:
> @@ -315,6 +319,7 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
>   	 * descriptors are already gone.
>   	 */
>   	put_pid(rcu_dereference_protected(vcpu->pid, 1));
> +	kvmi_vcpu_uninit(vcpu);
>   	kvm_arch_vcpu_uninit(vcpu);
>   	free_page((unsigned long)vcpu->run);
>   }
> @@ -711,6 +716,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>   	int i;
>   	struct mm_struct *mm = kvm->mm;
>   
> +	kvmi_destroy_vm(kvm);
>   	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
>   	kvm_destroy_vm_debugfs(kvm);
>   	kvm_arch_sync_events(kvm);
> @@ -3118,6 +3124,15 @@ static long kvm_vm_ioctl(struct file *filp,
>   	case KVM_CHECK_EXTENSION:
>   		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
>   		break;
> +	case KVM_INTROSPECTION: {
> +		struct kvm_introspection i;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&i, argp, sizeof(i)) || !kvmi_hook(kvm, &i))
> +			goto out;

Looking at this, I wonder if it would actually make more sense to have 
kvmi_hook() return an int? The check could then be broken into two 
separate if statements, which would match the rest of the kvm_vm_ioctl() 
code a bit better.
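
I.e., with an int-returning kvmi_hook() the case could read roughly like
this (a sketch of the suggestion, not the posted code):

	case KVM_INTROSPECTION: {
		struct kvm_introspection i;

		r = -EFAULT;
		if (copy_from_user(&i, argp, sizeof(i)))
			goto out;

		r = kvmi_hook(kvm, &i);	/* 0 or -errno */
		break;
	}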

> +		r = 0;
> +		break;
> +	}
>   	default:
>   		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>   	}
> @@ -4072,6 +4087,9 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
>   	r = kvm_vfio_ops_init();
>   	WARN_ON(r);
>   
> +	r = kvmi_init();
> +	WARN_ON(r);
> +
>   	return 0;
>   
>   out_undebugfs:
> @@ -4100,6 +4118,7 @@ EXPORT_SYMBOL_GPL(kvm_init);
>   
>   void kvm_exit(void)
>   {
> +	kvmi_uninit();
>   	debugfs_remove_recursive(kvm_debugfs_dir);
>   	misc_deregister(&kvm_dev);
>   	kmem_cache_destroy(kvm_vcpu_cache);
> 


Patrick

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 00/18] VM introspection
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2018-01-03  3:34   ` Xiao Guangrong
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Xiao Guangrong @ 2018-01-03  3:34 UTC (permalink / raw)
  To: Adalber Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu



On 12/19/2017 03:06 AM, Adalber Lazăr wrote:
> From: Adalbert Lazar <alazar@bitdefender.com>
> 
> This patch series proposes a VM introspection subsystem for KVM (KVMI).
> 
> The previous RFC can be read here: https://marc.info/?l=kvm&m=150514457912721
> 
> These patches were tested on kvm/master,
> commit 43aabca38aa9668eee3c3c1206207034614c0901 (Merge tag 'kvm-arm-fixes-for-v4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD).
> 
> In this iteration we refactored the code based on the feedback received
> from Paolo and others.

I am thinking if we can define some check points in KVM where
BPF programs are allowed to attach, then employ the policies
in BPFs instead...

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 00/18] VM introspection
  2018-01-03  3:34   ` Xiao Guangrong
@ 2018-01-03 14:32     ` Mihai Donțu
  -1 siblings, 0 replies; 79+ messages in thread
From: Mihai Donțu @ 2018-01-03 14:32 UTC (permalink / raw)
  To: Xiao Guangrong, Adalber Lazăr, kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář, Xiao Guangrong

On Wed, 2018-01-03 at 11:34 +0800, Xiao Guangrong wrote:
> On 12/19/2017 03:06 AM, Adalber Lazăr wrote:
> > From: Adalbert Lazar <alazar@bitdefender.com>
> > 
> > This patch series proposes a VM introspection subsystem for KVM (KVMI).
> > 
> > The previous RFC can be read here: https://marc.info/?l=kvm&m=150514457912721
> > 
> > These patches were tested on kvm/master,
> > commit 43aabca38aa9668eee3c3c1206207034614c0901 (Merge tag 'kvm-arm-fixes-for-v4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD).
> > 
> > In this iteration we refactored the code based on the feedback received
> > from Paolo and others.
> 
> I am thinking if we can define some check points in KVM where
> BPF programs are allowed to attach, then employ the policies
> in BPFs instead...

That would be a nice feature to have. For example, we could use it to
pre-filter the events (eg. drop EPT #PF events generated by A/D bit
updates). Also, sure, given how BPF has evolved in Linux these past few
years (see JIT) we could upload some pretty complex introspection
logic.

Regards,

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [RFC PATCH v4 00/18] VM introspection
  2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
@ 2018-01-03 18:52   ` Adalbert Lazăr
  2017-12-18 19:06   ` Adalber Lazăr
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 79+ messages in thread
From: Adalbert Lazăr @ 2018-01-03 18:52 UTC (permalink / raw)
  To: kvm
  Cc: linux-mm, Paolo Bonzini, Radim Krčmář,
	Xiao Guangrong, Mihai Donțu, Patrick Colp

On Mon, 18 Dec 2017 21:06:24 +0200, Adalber Lazăr <alazar@bitdefender.com> wrote:
> This patch series proposes a VM introspection subsystem for KVM (KVMI).

...

> We hope to make public our repositories (kernel, QEMU,
> userland/simple-introspector) in a couple of days ...

Thanks to Mathieu Tarral, these patches (updated with Patrick's
suggestions) can be found in the kvmi branch of the KVM-VMI project[1].
There is also a userland library and a simple demo/test program
in tools/kvm/kvmi[2]. The QEMU patch has its own kvmi[3] branch/repo.

[1]: https://github.com/KVM-VMI/kvm/tree/kvmi
[2]: https://github.com/KVM-VMI/kvm/tree/kvmi/tools/kvm/kvmi
[3]: https://github.com/KVM-VMI/qemu/tree/kvmi

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2018-01-03 18:52 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-18 19:06 [RFC PATCH v4 00/18] VM introspection Adalber Lazăr
2017-12-18 19:06 ` [RFC PATCH v4 01/18] kvm: add documentation and ABI/API headers for the VM introspection subsystem Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-18 19:06 ` [RFC PATCH v4 02/18] add memory map/unmap support for VM introspection on the guest side Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-21 21:17   ` Patrick Colp
2017-12-21 21:17     ` Patrick Colp
2017-12-22 10:44     ` Mircea CIRJALIU-MELIU
2017-12-22 14:30       ` Patrick Colp
2017-12-18 19:06 ` [RFC PATCH v4 03/18] kvm: x86: add kvm_arch_msr_intercept() Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-18 19:06 ` [RFC PATCH v4 04/18] kvm: x86: add kvm_mmu_nested_guest_page_fault() and kvmi_mmu_fault_gla() Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-21 21:29   ` Patrick Colp
2017-12-21 21:29     ` Patrick Colp
2017-12-22 11:50     ` Mihai Donțu
2017-12-22 11:50       ` Mihai Donțu
2017-12-18 19:06 ` [RFC PATCH v4 05/18] kvm: x86: add kvm_arch_vcpu_set_regs() Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-21 21:39   ` Patrick Colp
2017-12-21 21:39     ` Patrick Colp
2017-12-22  9:29     ` alazar
2017-12-22  9:29       ` alazar
2017-12-18 19:06 ` [RFC PATCH v4 06/18] kvm: vmx: export the availability of EPT views Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-18 19:06 ` [RFC PATCH v4 07/18] kvm: page track: add support for preread, prewrite and preexec Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-21 22:01   ` Patrick Colp
2017-12-21 22:01     ` Patrick Colp
2017-12-22 10:01     ` alazar
2017-12-22 10:01       ` alazar
2017-12-18 19:06 ` [RFC PATCH v4 08/18] kvm: add the VM introspection subsystem Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-22  7:34   ` Patrick Colp
2017-12-22  7:34     ` Patrick Colp
2017-12-22 14:11     ` Adalbert Lazăr
2017-12-22 14:11       ` Adalbert Lazăr
2017-12-22 15:12       ` Patrick Colp
2017-12-22 15:12         ` Patrick Colp
2017-12-22 15:51         ` alazar
2017-12-22 15:51           ` alazar
2017-12-22 16:26           ` Patrick Colp
2017-12-22 16:26             ` Patrick Colp
2017-12-22 16:02   ` Paolo Bonzini
2017-12-22 16:02     ` Paolo Bonzini
2017-12-22 16:18     ` Mircea CIRJALIU-MELIU
2017-12-22 16:35       ` Paolo Bonzini
2017-12-22 16:35         ` Paolo Bonzini
2017-12-22 16:09   ` Paolo Bonzini
2017-12-22 16:09     ` Paolo Bonzini
2017-12-22 16:34     ` Mircea CIRJALIU-MELIU
2017-12-18 19:06 ` [RFC PATCH v4 09/18] kvm: hook in " Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-22 16:36   ` Patrick Colp
2017-12-22 16:36     ` Patrick Colp
2017-12-18 19:06 ` [RFC PATCH v4 10/18] kvm: x86: handle the new vCPU request (KVM_REQ_INTROSPECTION) Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-18 19:06 ` [RFC PATCH v4 11/18] kvm: x86: hook in the page tracking Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-18 19:06 ` [RFC PATCH v4 12/18] kvm: x86: hook in kvmi_breakpoint_event() Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-18 19:06 ` [RFC PATCH v4 13/18] kvm: x86: hook in kvmi_descriptor_event() Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-18 19:06 ` [RFC PATCH v4 14/18] kvm: x86: hook in kvmi_cr_event() Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-18 19:06 ` [RFC PATCH v4 15/18] kvm: x86: hook in kvmi_xsetbv_event() Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-18 19:06 ` [RFC PATCH v4 16/18] kvm: x86: hook in kvmi_msr_event() Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-18 19:06 ` [RFC PATCH v4 17/18] kvm: x86: handle the introspection hypercalls Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2017-12-18 19:06 ` [RFC PATCH v4 18/18] kvm: x86: hook in kvmi_trap_event() Adalber Lazăr
2017-12-18 19:06   ` Adalber Lazăr
2018-01-03  3:34 ` [RFC PATCH v4 00/18] VM introspection Xiao Guangrong
2018-01-03  3:34   ` Xiao Guangrong
2018-01-03 14:32   ` Mihai Donțu
2018-01-03 14:32     ` Mihai Donțu
2018-01-03 18:52 ` Adalbert Lazăr
2018-01-03 18:52   ` Adalbert Lazăr

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.