kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v2 0/1] VM introspection
@ 2017-07-07 14:34 Adalbert Lazar
  2017-07-07 14:34 ` [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for " Adalbert Lazar
                   ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Adalbert Lazar @ 2017-07-07 14:34 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Mihai Dontu, Adalbert Lazar

The following patch adds the documentation for an introspection subsystem for
KVM (KVMi). It details the purpose and the use case that has shaped the
proposed API/ABI, as well as the wire protocol.

During the discussion that has developed around our previous RFC patchset a
number of TODO-s have been highlighted:

 * the integration in qemu: a socket-based protocol used to initiate the
   connection with an introspection tool and then passes control to KVM, the
   in-kernel mechanism taking over from there;

 * the integration of the SPT-handling mechanism into the KVM MMU;

 * the elaboration of a set of policies and a mechanism to better control
   this feature;

One bit of code that has passed (maybe) unnoticed in the RFC is a new
function added to Linux' mm called vm_replace_page() which, much like KSM's
replace_page(), gets two processes to share a page (read-write, no-COW):

https://marc.info/?l=kvm&m=149762056518799&w=2

This is used to quickly scan and patch the guest software.

The patch following this cover letter does not yet address the points above
but aims to clear with the community the overall ABI/API, with a focus on
x86.

Thank you,

v2:
  - add documentation and ABI [Paolo, Jan]
  - drop all the other patches for now [Paolo]
  - remove KVMI_GET_GUESTS, KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF,
    and let libvirt/qemu handle this [Stefan, Paolo]
  - change the license from LGPL to GPL [Jan]
  - remove KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL (not used anymore)
  - make the interface a little more consistent

Adalbert Lazar (1):
  kvm: Add documentation and ABI/API header for VM introspection

 Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvmi.h          | 310 ++++++++++++
 2 files changed, 1295 insertions(+)
 create mode 100644 Documentation/virtual/kvm/kvmi.rst
 create mode 100644 include/uapi/linux/kvmi.h

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-07 14:34 [RFC PATCH v2 0/1] VM introspection Adalbert Lazar
@ 2017-07-07 14:34 ` Adalbert Lazar
  2017-07-07 16:52   ` Paolo Bonzini
  2017-07-07 17:29 ` [RFC PATCH v2 0/1] " Paolo Bonzini
  2017-07-12 14:09 ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 50+ messages in thread
From: Adalbert Lazar @ 2017-07-07 14:34 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Mihai Dontu, Adalbert Lazar

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazar <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvmi.h          | 310 ++++++++++++
 2 files changed, 1295 insertions(+)
 create mode 100644 Documentation/virtual/kvm/kvmi.rst
 create mode 100644 include/uapi/linux/kvmi.h

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
new file mode 100644
index 000000000000..63d3a75d5ffc
--- /dev/null
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -0,0 +1,985 @@
+=========================================================
+KVMi - the kernel virtual machine introspection subsystem
+=========================================================
+
+The KVM introspection subsystem provides a facility for applications running
+on the host or in a separate VM, to control the execution of other VM-s
+(pause, resume, shutdown), query the state of the vCPU-s (GPR-s, MSR-s etc.),
+alter the page access bits in the shadow page tables (only for the hardware
+backed ones, eg. Intel's EPT) and receive notifications when events of
+interest have taken place (shadow page table level faults, key MSR writes,
+hypercalls etc.). Some notifications can be responded to with an action
+(like preveting an MSR from being written), others are mere informative
+(like breakpoint events which are used for execution tracing), though the
+option to alter the GPR-s is common to each of them (usually the program
+counter is advanced past the instruction that triggered the guest exit).
+All events are optional. An application using this subsystem will explicitly
+register for them.
+
+The use case that gave way for the creation of this subsystem is to monitor
+the guest OS and as such the ABI/API is higly influenced by how the guest
+software (kernel, applications) see the world. For example, some events
+provide information specific for the host CPU architecture
+(eg. MSR_IA32_SYSENTER_EIP) merely because its leveraged by guest software
+to implement a critical feature (fast system calls).
+
+At the moment, the target audience for VMI are security software authors
+that wish to perform forensics on newly discovered threats (exploits) or
+to implement another layer of security like preventing a large set of
+kernel rootkits simply by "locking" the kernel image in the shadow page
+tables (ie. enforce .text r-x, .rodata rw- etc.). It's the latter case that
+made VMI a separate subsystem, even though many of these features are
+available in the device manager (eg. qemu). The ability to build a security
+application that does not interfere (in terms of performance) with the
+guest software asks for a specialized interface that is designed for minimum
+overhead.
+
+API/ABI
+=======
+
+This chapter describes the VMI interface used to monitor and control local
+guests from an user application.
+
+Overview
+--------
+
+The interface is socket based, one connection for every VM. One end is in the
+host kernel while the other is held by the user application (introspection
+tool).
+
+The initial connection is established by an application running on the host
+(eg. qemu) that connects to the introspection tool and after a handshake the
+socket is passed to the host kernel making all further communication take
+place between it and the introspection tool. The initiating party (qemu) can
+close its end so that any potential exploits cannot take a hold of it.
+
+The socket protocol allows for commands and events to be multiplexed over
+the same connection. A such, it is possible for the introspection tool to
+receive an event while waiting for the result of a command. Also, it can
+send a command while the host kernel is waiting for a reply to an event.
+
+The kernel side of the socket communication is blocking and will wait for
+an answer from its peer indefinitely or until the guest is powered off
+(killed) at which point it will wake up and properly cleanup. If the peer
+goes away KVM will exit to user space and the device manager will try and
+reconnect. If it fails, the device manager will inform KVM to cleanup and
+continue normal guest execution as if the introspection subsystem has never
+been used on that guest.
+
+All events have a common header::
+
+	struct kvmi_socket_hdr {
+		__u16 msg_id;
+		__u16 size;
+		__u32 seq;
+	};
+
+and all need a reply with the same kind of header, having the same
+sequence number (seq) and the same message id (msg_id).
+
+Because events from different vCPU threads can send messages at the same
+time and the replies can come in any order, the receiver loop uses the
+sequence number (seq) to identify which reply belongs to which vCPU, in
+order to dispatch the message to the right thread waiting for it.
+
+After 'kvmi_socket_hdr', 'msg_id' specific data of 'kvmi_socket_hdr.size'
+bytes will follow.
+
+The message header and its data must be sent with one write() call
+to the socket (as a whole). This simplifies the receiver loop and avoids
+the recontruction of messages on the other side.
+
+The wire protocol uses the host native byte-order. The introspection tool
+must check this during the handshake and do the necessary conversion.
+
+Replies to commands have an error code (__s32) at offset 0 in the message
+data. Specific message data will follow this. If the error code is not
+zero, all the other data members will have undefined content (not random
+heap or stack data, but valid results at the time of the failure), unless
+otherwise specified.
+
+In case of an unsupported command, the message data will contain only
+the error code (-ENOSYS).
+
+The error code is related to the processing of the corresponding
+message. For all the other errors (socket errrors, incomplete messages,
+wrong sequence numbers etc.) the socket must be closed and the connection
+can be retried.
+
+While all commands will have a reply as soon as possible, the replies
+to events will probably be delayed until a set of (new) commands will
+complete::
+
+   Host kernel               Tool
+   -----------               --------
+   event 1 ->
+                             <- command 1
+   command 1 reply ->
+                             <- command 2
+   command 2 reply ->
+                             <- event 1 reply
+
+If both ends send a message "in the same time"::
+
+   KVMi                      Userland
+   ----                     --------
+   event X ->               <- command X
+
+the host kernel should reply to 'command X', regardless of the receive time
+(before or after the 'event X' was sent).
+
+As it can be seen below, the wire protocol specifies occasional padding. This
+is to permit working with the data by directly using C structures. The
+members should have the 'natural' alignment of the host.
+
+To describe the commands/events, we reuse some conventions from api.txt:
+
+  - Architectures: which instruction set architectures providing this command/event
+
+  - Versions: which versions provide this command/event
+
+  - Parameters: incoming message data
+
+  - Returns: outgoing/reply message data
+
+Handshake
+---------
+
+Allthough this falls out of the scope of the introspection subsystem, below
+is a proposal of a handshake that can be used by implementors.
+
+The device manager will connect to the introspection tool and wait for a
+cryptographic hash of a cookie that should be known by both peers. If the
+hash is correct (the destination has been "authenticated"), the device
+manager will send another cryptographic hash and random salt. The peer
+recomputes the hash of the cookie bytes including the salt and if they match,
+the device manager has been "authenticated" too. This is a rather crude
+system that makes it difficult for device manager exploits to trick the
+introspection tool into believing its working OK.
+
+The cookie would normally be generated by a management tool (eg. libvirt)
+and make it available to the device manager and to a properly authenticated
+client. It is the job of a third party to retrieve the cookie from the
+management application and pass it over a secure channel to the introspection
+tool.
+
+Once the basic "authentication" has taken place, the introspection tool
+can receive information on the guest (its UUID) and other flags (endianness
+or features supported by the host kernel).
+
+Introspection capabilities
+--------------------------
+
+TODO
+
+Commands
+--------
+
+The following C structures are meant to be used directly when communicating
+over the wire. The peer that detects any size mismatch should simply close
+the connection and report the error.
+
+1. KVMI_GET_VERSION
+-------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: {}
+:Returns: ↴
+
+::
+
+	struct kvmi_get_version_reply {
+		__s32 err;
+		__u32 version;
+	};
+
+Returns the introspection API version (the KVMI_VERSION constant) and the
+error code (zero). In case of an unlikely error, the version will have an
+undefined value.
+
+2. KVMI_GET_GUEST_INFO
+----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: {}
+:Returns: ↴
+
+::
+
+	struct kvmi_get_guest_info_reply {
+		__s32 err;
+		__u16 vcpu_count;
+		__u16 padding;
+		__u64 tsc_speed;
+	};
+
+Returns the number of online vcpus, and the TSC frequency in HZ, if supported
+by the architecture (otherwise is 0).
+
+3. KVMI_PAUSE_GUEST
+-------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: {}
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+This command will pause all vcpus threads, by getting them out of guest mode
+and put them in the "waiting introspection commands" state.
+
+4. KVMI_UNPAUSE_GUEST
+---------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: {}
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Resume the vcpu threads, or at least get them out of "waiting introspection
+commands" state.
+
+5. KVMI_SHUTDOWN_GUEST
+----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: {}
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Ungracefully shutdown the guest.
+
+6. KVMI_GET_REGISTERS
+---------------------
+
+:Architectures: x86 (could be all, but with different input/output)
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_get_registers_x86 {
+		__u16 vcpu;
+		__u16 nmsrs;
+		__u32 msrs_idx[0];
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_get_registers_x86_reply {
+		__s32 err;
+		__u32 mode;
+		struct kvm_regs  regs;
+		struct kvm_sregs sregs;
+		struct kvm_msrs  msrs;
+	};
+
+For the given vcpu_id and the nmsrs sized array of MSRs registers, returns
+the vCPU mode (in bytes: 2, 4 or 8), the general purpose registers,
+the special registers and the requested set of MSR-s.
+
+7. KVMI_SET_REGISTERS
+---------------------
+
+:Architectures: x86 (could be all, but with different input)
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_set_registers_x86 {
+		__u16 vcpu;
+		__u16 padding[3];
+		struct kvm_regs regs;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Sets the general purpose registers for the given vcpu_id.
+
+8. KVMI_GET_MTRR_TYPE
+---------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_mtrr_type {
+		__u64 gpa;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_mtrr_type_reply {
+		__s32 err;
+		__u32 type;
+	};
+
+Returns the guest memory type for a specific physical address.
+
+9. KVMI_GET_MTRRS
+-----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_mtrrs {
+		__u16 vcpu;
+		__u16 padding[3];
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_mtrrs_reply {
+		__s32 err;
+		__u32 padding;
+		__u64 pat;
+		__u64 cap;
+		__u64 type;
+	};
+
+Returns MSR_IA32_CR_PAT, MSR_MTRRcap and MSR_MTRRdefType for the specified
+vCPU.
+
+10. KVMI_GET_XSAVE_INFO
+-----------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_xsave_info {
+		__u16 vcpu;
+		__u16 padding[3];
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_xsave_info_reply {
+		__s32 err;
+		__u32 size;
+	};
+
+Returns the xstate size for the specified vCPU.
+
+11. KVMI_GET_PAGE_ACCESS
+------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_get_page_access {
+		__u16 vcpu;
+		__u16 padding[3];
+		__u64 gpa;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_get_page_access_reply {
+		__s32 err;
+		__u32 access;
+	};
+
+Returns the spte flags (rwx - present, write & user) for the specified
+vCPU and guest physical address.
+
+12. KVMI_SET_PAGE_ACCESS
+------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_set_page_access {
+		__u16 vcpu;
+		__u16 padding;
+		__u32 access;
+		__u64 gpa;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Sets the spte flags (rwx - present, write & user) - access - for the specified
+vCPU and guest physical address.
+
+13. KVMI_INJECT_PAGE_FAULT
+--------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_page_fault {
+		__u16 vcpu;
+		__u16 padding;
+		__u32 error;
+		__u64 gva;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Injects a vCPU page fault with the specified guest virtual address and
+error code.
+
+14. KVMI_INJECT_BREAKPOINT
+--------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_inject_breakpoint {
+		__u16 vcpu;
+		__u16 padding[3];
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Injects a breakpoint for the specified vCPU. This command is usually sent in
+response to an event and as such the proper GPR-s will be set with the reply.
+
+15. KVMI_MAP_PHYSICAL_PAGE_TO_GUEST
+-----------------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_map_physical_page_to_guest {
+		__u64 gpa_src;
+		__u64 gfn_dest;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Maps a page from an introspected guest memory (gpa_src) to the guest running
+the introspection tool. 'gfn_dest' points to an anonymous, locked mapping one
+page in size.
+
+This command is used to "read" the introspected guest memory and potentially
+place patches (eg. INT3-s).
+
+16. KVMI_UNMAP_PHYSICAL_PAGE_FROM_GUEST
+---------------------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_unmap_physical_page_from_guest {
+		__u64 gfn_dest;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Unmaps a previously mapped page.
+
+17. KVMI_CONTROL_EVENTS
+-----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_control_events {
+		__u16 vcpu;
+		__u16 padding;
+		__u32 events;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Enables/disables vCPU introspection events, by setting/clearing one or more
+of the following bits (see 'Events' below) :
+
+	KVMI_EVENT_CR
+	KVMI_EVENT_MSR
+	KVMI_EVENT_XSETBV
+	KVMI_EVENT_BREAKPOINT
+	KVMI_EVENT_USER_CALL
+	KVMI_EVENT_PAGE_FAULT
+	KVMI_EVENT_TRAP
+
+Trying to enable unsupported events (~KVMI_KNOWN_EVENTS) by the current
+architecture would fail and -EINVAL will be returned.
+
+18. KVMI_CR_CONTROL
+-------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_cr_control {
+		__u8 enable;
+		__u8 padding[3];
+		__u32 cr;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Enables/disables introspection for a specific CR register and must
+be used in addition to KVMI_CONTROL_EVENTS with the KVMI_EVENT_CR bit
+flag set.
+
+Eg. kvmi_cr_control { .enable=1, .cr=3 } will enable introspection
+for CR3.
+
+Currently, trying to set any register but CR0, CR3 and CR4 will return
+-EINVAL.
+
+19. KVMI_MSR_CONTROL
+--------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_msr_control {
+		__u8 enable;
+		__u8 padding[3];
+		__u32 msr;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Enables/disables introspection for a specific MSR, and must be used
+in addition to KVMI_CONTROL_EVENTS with the KVMI_EVENT_MSR bit flag set.
+
+Currently, only MSRs within the following 3 ranges are supported. Trying
+to control any other register will return -EINVAL. ::
+
+	0          ... 0x00001fff
+	0x40000000 ... 0x40001fff
+	0xc0000000 ... 0xc0001fff
+
+Events
+------
+
+All vcpu events are sent using the KVMI_EVENT_VCPU message id. No event will
+be sent unless enabled with a KVMI_CONTROL_EVENTS command.
+
+For x86, the message data starts with a common structure::
+
+	struct kvmi_event_x86 {
+		__u16 vcpu;
+		__u8 mode;
+		__u8 padding1;
+		__u32 event;
+		struct kvm_regs regs;
+		struct kvm_sregs sregs;
+		struct {
+			__u64 sysenter_cs;
+			__u64 sysenter_esp;
+			__u64 sysenter_eip;
+			__u64 efer;
+			__u64 star;
+			__u64 lstar;
+		} msrs;
+	};
+
+In order to help the introspection tool with the event analysis while
+avoiding unnecessary introspection commands, the message data holds some
+registers (kvm_regs, kvm_sregs and a couple of MSR-s) beside
+the vCPU id, its mode (in bytes) and the event (one of the flags set
+with the KVMI_CONTROL_EVENTS command).
+
+The replies to events also start with a common structure, having the
+KVMI_EVENT_VCPU_REPLY message id::
+
+	struct kvmi_event_x86_reply {
+		struct kvm_regs regs;
+		__u32 actions;
+		__u32 padding;
+	};
+
+The 'actions' member holds one or more flags. For example, if
+KVMI_EVENT_ACTION_SET_REGS is set, the general purpose registers will
+be overwritten with the new values (regs) from introspector.
+
+Specific data can follow these common structures.
+
+1. KVMI_EVENT_CR
+----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+	struct kvmi_event_cr {
+		__u16 cr;
+		__u16 padding[3];
+		__u64 old_value;
+		__u64 new_value;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+	struct kvmi_event_cr_reply {
+		__u64 new_val;
+	};
+
+This event is sent when a CR register was modified and the introspection
+has already been enabled for this kind of event (KVMI_CONTROL_EVENTS)
+and for this specific register (KVMI_CR_CONTROL).
+
+kvmi_event_x86, the CR number, the old value and the new value are
+sent to the introspector, which can respond with one or more action flags:
+
+   KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers
+   using the values from introspector (regs)
+
+   KVMI_EVENT_ACTION_ALLOW - allow the register modification with the
+   value from introspector (new_val), otherwise deny the modification
+   but allow the guest to proceed as if the register has been loaded
+   with the desired value.
+
+2. KVMI_EVENT_MSR
+-----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+	struct kvmi_event_msr {
+		__u32 msr;
+		__u32 padding;
+		__u64 old_value;
+		__u64 new_value;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+	struct kvmi_event_msr_reply {
+		__u64 new_val;
+	};
+
+This event is sent when a MSR was modified and the introspection has already
+been enabled for this kind of event (KVMI_CONTROL_EVENTS) and for this
+specific register (KVMI_MSR_CONTROL).
+
+kvmi_event_x86, the MSR number, the old value and the new value are
+sent to the introspector, which can respond with one or more action flags:
+
+   KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers
+   using the values from introspector (regs)
+
+   KVMI_EVENT_ACTION_ALLOW - allow the register modification with the
+   value from introspector (new_val), otherwise deny the modification
+   but allow the guest to proceed as if the register has been loaded
+   with the desired value.
+
+3. KVMI_EVENT_XSETBV
+--------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+	struct kvmi_event_xsetbv {
+		__u64 xcr0;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+
+This event is sent when the extended control register XCR0 was modified
+and the introspection has already been enabled for this kind of event
+(KVMI_CONTROL_EVENTS).
+
+kvmi_event_x86 and the new value are sent to the introspector, which
+can respond with the KVMI_EVENT_ACTION_SET_REGS bit set in 'actions',
+instructing KVMi to override the general purpose registers using the
+values from introspector (regs).
+
+4. KVMI_EVENT_BREAKPOINT
+------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+	struct kvmi_event_breakpoint {
+		__u64 gpa;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+
+This event is sent when a breakpoint was reached and the introspection has
+already been enabled for this kind of event (KVMI_CONTROL_EVENTS).
+
+kvmi_event_x86 and the guest physical address are sent to the introspector,
+which can respond with one or more action flags:
+
+   KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers
+   using the values from introspector (regs)
+
+   KVMI_EVENT_ACTION_ALLOW - is implied if not specified
+
+5. KVMI_EVENT_USER_CALL
+-----------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+
+This event is sent on a user hypercall and the introspection has already
+already been enabled for this kind of event (KVMI_CONTROL_EVENTS).
+
+kvmi_event_x86 is sent to the introspector, which can respond with the
+KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host
+kernel to override the general purpose registers using the values from
+introspector (regs).
+
+6. KVMI_EVENT_PAGE_FAULT
+------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+	struct kvmi_event_page_fault {
+		__u64 gva;
+		__u64 gpa;
+		__u32 mode;
+		__u32 padding;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+	struct kvmi_event_page_fault_reply {
+		__u32 ctx_size;
+		__u8 ctx_data[256];
+	};
+
+This event is sent if a hypervisor page fault was encountered, the
+introspection has already enabled the reports for this kind of event
+(KVMI_CONTROL_EVENTS), and it was generated for a page for which the
+introspector has shown interest (ie. has previously touched it by
+adjusting the permissions).
+
+kvmi_event_x86, guest virtual address, guest physical address and
+the exit qualification (mode) are sent to the introspector, which
+can respond with one or more action flags:
+
+   KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers
+   using the values from introspector (regs)
+
+   (KVMI_EVENT_ALLOW | KVMI_EVENT_NOEMU) - let the guest re-trigger
+   the page fault
+
+   (KVMI_EVENT_ALLOW | KVMI_EVENT_SET_CTX) - allow the page fault
+   via emulation but with custom input (ctx_data, ctx_size). This is
+   used to trick the guest software into believing it has read
+   certain data. In practice it is used to hide the contents of certain
+   memory areas
+
+   KVMI_EVENT_ALLOW - allow the page fault via emulation
+
+If KVMI_EVENT_ALLOW is not set, it will fall back to the page fault handler
+which usually implies overwriting any spte page access changes made before.
+An introspection tool will always set this flag and prevent unwanted changes
+to memory by skipping the instruction. It is up to the tool to adjust the
+program counter in order to achieve this result.
+
+7. KVMI_EVENT_TRAP
+------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+	struct kvmi_event_trap {
+		__u32 vector;
+		__u32 type;
+		__u32 err;
+		__u32 padding;
+		__u64 cr2;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+
+This event is sent if a trap will be delivered to the guest (page fault,
+breakpoint, etc.) and the introspection has already enabled the reports
+for this kind of event (KVMI_CONTROL_EVENTS).
+
+This is used to inform the introspector of all pending traps giving it
+a chance to determine if it should try again later in case a previous
+KVMI_INJECT_PAGE_FAULT/KVMI_INJECT_BREAKPOINT command has been overwritten
+by an interrupt picked up during guest reentry.
+
+kvmi_event_x86, exception/interrupt number (vector), exception/interrupt
+type, exception code (err) and CR2 are sent to the introspector, which can
+respond with the KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing
+the host kernel to override the general purpose registers using the values
+from introspector (regs).
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
new file mode 100644
index 000000000000..54a2d8ebf649
--- /dev/null
+++ b/include/uapi/linux/kvmi.h
@@ -0,0 +1,310 @@
+/*
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ * The KVMI Library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the named License, or any later version.
+ *
+ * The KVMI Library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with the KVMI Library; if not, see <http://www.gnu.org/licenses/>
+ */
+#ifndef __KVMI_H_INCLUDED__
+#define __KVMI_H_INCLUDED__
+
+#include "asm/kvm.h"
+#include <linux/types.h>
+
+#define KVMI_VERSION 0x00000001
+
+#define KVMI_EVENT_CR         (1 << 1)	/* control register was modified */
+#define KVMI_EVENT_MSR        (1 << 2)	/* model specific reg. was modified */
+#define KVMI_EVENT_XSETBV     (1 << 3)	/* ext. control register was modified */
+#define KVMI_EVENT_BREAKPOINT (1 << 4)	/* breakpoint was reached */
+#define KVMI_EVENT_USER_CALL  (1 << 5)	/* user hypercall */
+#define KVMI_EVENT_PAGE_FAULT (1 << 6)	/* hyp. page fault was encountered */
+#define KVMI_EVENT_TRAP       (1 << 7)	/* trap was injected */
+
+#define KVMI_KNOWN_EVENTS (KVMI_EVENT_CR | \
+			   KVMI_EVENT_MSR | \
+			   KVMI_EVENT_XSETBV | \
+			   KVMI_EVENT_BREAKPOINT | \
+			   KVMI_EVENT_USER_CALL | \
+			   KVMI_EVENT_PAGE_FAULT | \
+			   KVMI_EVENT_TRAP)
+
+#define KVMI_EVENT_ACTION_ALLOW      (1 << 0)	/* used in replies */
+#define KVMI_EVENT_ACTION_SET_REGS   (1 << 1)	/* registers need to be written back */
+#define KVMI_EVENT_ACTION_SET_CTX    (1 << 2)	/* set the emulation context */
+#define KVMI_EVENT_ACTION_NOEMU      (1 << 3)	/* return to guest without emulation */
+
+#define KVMI_GET_VERSION                    1
+#define KVMI_GET_GUESTS                     2 /* TODO: remove me */
+#define KVMI_GET_GUEST_INFO                 3
+#define KVMI_PAUSE_GUEST                    4
+#define KVMI_UNPAUSE_GUEST                  5
+#define KVMI_GET_REGISTERS                  6
+#define KVMI_SET_REGISTERS                  7
+#define KVMI_SHUTDOWN_GUEST                 8
+#define KVMI_GET_MTRR_TYPE                  9
+#define KVMI_GET_MTRRS                      10
+#define KVMI_GET_XSAVE_INFO                 11
+#define KVMI_GET_PAGE_ACCESS                12
+#define KVMI_SET_PAGE_ACCESS                13
+#define KVMI_INJECT_PAGE_FAULT              14
+#define KVMI_READ_PHYSICAL                  15 /* TODO: remove me */
+#define KVMI_WRITE_PHYSICAL                 16 /* TODO: remove me */
+#define KVMI_MAP_PHYSICAL_PAGE_TO_GUEST     17
+#define KVMI_UNMAP_PHYSICAL_PAGE_FROM_GUEST 18
+#define KVMI_CONTROL_EVENTS                 19
+#define KVMI_CR_CONTROL                     20
+#define KVMI_MSR_CONTROL                    21
+#define KVMI_INJECT_BREAKPOINT              22
+#define KVMI_EVENT_GUEST_ON                 23 /* TODO: remove me */
+#define KVMI_EVENT_GUEST_OFF                24 /* TODO: remove me */
+#define KVMI_EVENT_VCPU                     25
+#define KVMI_EVENT_VCPU_REPLY               26
+
+/* TODO: remove me */
+struct kvmi_guest {
+	__u8 uuid[16];
+};
+
+/* TODO: remove me */
+struct kvmi_guests {
+	__u32 size;		/* in: the size of the entire structure */
+	struct kvmi_guest guests[1];
+};
+
+/* TODO: remove me */
+struct kvmi_read_physical {
+	__u64 gpa;
+	__u64 size;
+};
+
+/* TODO: remove me */
+struct kvmi_read_physical_reply {
+	__s32 err;
+	__u8 bytes[0];
+};
+
+/* TODO: remove me */
+struct kvmi_write_physical {
+	__u64 gpa;
+	__u64 size;
+	__u8 bytes[0];
+};
+
+
+struct kvmi_socket_hdr {
+	__u16 msg_id;
+	__u16 size;
+	__u32 seq;
+};
+
+struct kvmi_error_code {
+	__s32 err;
+	__u32 padding;
+};
+
+struct kvmi_get_version_reply {
+	__s32 err;
+	__u32 version;
+};
+
+struct kvmi_get_guest_info_reply {
+	__s32 err;
+	__u16 vcpu_count;
+	__u16 padding;
+	__u64 tsc_speed;
+};
+
+struct kvmi_get_registers_x86 {
+	__u16 vcpu;
+	__u16 nmsrs;
+	__u32 msrs_idx[0];
+};
+
+struct kvmi_get_registers_x86_reply {
+	__s32 err;
+	__u32 mode;
+	struct kvm_regs regs;
+	struct kvm_sregs sregs;
+	struct kvm_msrs msrs;
+};
+
+struct kvmi_set_registers_x86 {
+	__u16 vcpu;
+	__u16 padding[3];
+	struct kvm_regs regs;
+};
+
+struct kvmi_mtrr_type {
+	__u64 gpa;
+};
+
+struct kvmi_mtrr_type_reply {
+	__s32 err;
+	__u32 padding;
+	__u64 type;
+};
+
+struct kvmi_mtrrs {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+struct kvmi_mtrrs_reply {
+	__s32 err;
+	__u32 padding;
+	__u64 pat;
+	__u64 cap;
+	__u64 type;
+};
+
+struct kvmi_xsave_info {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+struct kvmi_xsave_info_reply {
+	__s32 err;
+	__u32 size;
+};
+
+struct kvmi_get_page_access {
+	__u16 vcpu;
+	__u16 padding[3];
+	__u64 gpa;
+};
+
+struct kvmi_get_page_access_reply {
+	__s32 err;
+	__u32 access;
+};
+
+struct kvmi_set_page_access {
+	__u16 vcpu;
+	__u16 padding;
+	__u32 access;
+	__u64 gpa;
+};
+
+struct kvmi_page_fault {
+	__u16 vcpu;
+	__u16 padding;
+	__u32 error;
+	__u64 gva;
+};
+
+struct kvmi_inject_breakpoint {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+struct kvmi_map_physical_page_to_guest {
+	__u64 gpa_src;
+	__u64 gfn_dest;
+};
+
+struct kvmi_unmap_physical_page_from_guest {
+	__u64 gfn_dest;
+};
+
+struct kvmi_control_events {
+	__u16 vcpu;
+	__u16 padding;
+	__u32 events;
+};
+
+struct kvmi_cr_control {
+	__u8 enable;
+	__u8 padding[3];
+	__u32 cr;
+};
+
+struct kvmi_msr_control {
+	__u8 enable;
+	__u8 padding[3];
+	__u32 msr;
+};
+
+struct kvmi_event_x86 {
+	__u16 vcpu;
+	__u8 mode;
+	__u8 padding1;
+	__u32 event;
+	struct kvm_regs regs;
+	struct kvm_sregs sregs;
+	struct {
+		__u64 sysenter_cs;
+		__u64 sysenter_esp;
+		__u64 sysenter_eip;
+		__u64 efer;
+		__u64 star;
+		__u64 lstar;
+	} msrs;
+};
+
+struct kvmi_event_x86_reply {
+	struct kvm_regs regs;
+	__u32 actions;
+	__u32 padding;
+};
+
+struct kvmi_event_cr {
+	__u16 cr;
+	__u16 padding[3];
+	__u64 old_value;
+	__u64 new_value;
+};
+
+struct kvmi_event_cr_reply {
+	__u64 new_val;
+};
+
+struct kvmi_event_msr {
+	__u32 msr;
+	__u32 padding;
+	__u64 old_value;
+	__u64 new_value;
+};
+
+struct kvmi_event_msr_reply {
+	__u64 new_val;
+};
+
+struct kvmi_event_xsetbv {
+	__u64 xcr0;
+};
+
+struct kvmi_event_breakpoint {
+	__u64 gpa;
+};
+
+struct kvmi_event_page_fault {
+	__u64 gva;
+	__u64 gpa;
+	__u32 mode;
+	__u32 padding;
+};
+
+struct kvmi_event_page_fault_reply {
+	__u32 ctx_size;
+	__u8 ctx_data[256];
+};
+
+struct kvmi_event_trap {
+	__u32 vector;
+	__u32 type;
+	__u32 err;
+	__u32 padding;
+	__u64 cr2;
+};
+
+#endif /* __KVMI_H_INCLUDED__ */

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-07 14:34 ` [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for " Adalbert Lazar
@ 2017-07-07 16:52   ` Paolo Bonzini
  2017-07-10 15:32     ` alazar
                       ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Paolo Bonzini @ 2017-07-07 16:52 UTC (permalink / raw)
  To: Adalbert Lazar, kvm
  Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu

I have a lot of comments, but it's a very good start.  Thanks for
writing down the specification.

On 07/07/2017 16:34, Adalbert Lazar wrote:
> +1. KVMI_GET_VERSION
> +-------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: {}
> +:Returns: ↴

What is this character? :)

> +::
> +
> +	struct kvmi_get_version_reply {
> +		__s32 err;
> +		__u32 version;
> +	};
> +
> +Returns the introspection API version (the KVMI_VERSION constant) and the
> +error code (zero). In case of an unlikely error, the version will have an
> +undefined value.

Would it make sense to return a few more information fields, for example
the set of supported KVMI_CONTROL_EVENTS events?

> +2. KVMI_GET_GUEST_INFO
> +----------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: {}
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_get_guest_info_reply {
> +		__s32 err;
> +		__u16 vcpu_count;
> +		__u16 padding;
> +		__u64 tsc_speed;
> +	};
> +
> +Returns the number of online vcpus, and the TSC frequency in HZ, if supported
> +by the architecture (otherwise is 0).

Can the TSC frequency be specified only if the guest is using TSC
scaling?  Defining the TSC frequency on older hosts is a bit tricky.  0
would be the host.

Maybe define the second padding to be

	__u16 arch;

(0 = x86) followed by an arch-specific payload.

> +3. KVMI_PAUSE_GUEST
> +-------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: {}
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +This command will pause all vcpus threads, by getting them out of guest mode
> +and put them in the "waiting introspection commands" state.

Pausing all threads synchronously is a tricky concept.  The main issue
is that the guest could be paused at the userspace level.  Then KVM
would not be running (userspace is stuck in pthread_cond_wait,
typically) and could not acknowledge the pause.

Worse, KVM is not able to distinguish userspace that has paused the VM
from userspace that is doing MMIO or userspace that has a bug and hung
somewhere.  And even worse, there are cases where userspace wants to
modify registers while doing port I/O (the awful VMware RPCI port).  So
I'd rather avoid this.

> +4. KVMI_UNPAUSE_GUEST
> +---------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: {}
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Resume the vcpu threads, or at least get them out of "waiting introspection
> +commands" state.

And this as well, of course.

> +5. KVMI_SHUTDOWN_GUEST
> +----------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: {}
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Ungracefully shutdown the guest.

Note that all KVM can do here is pass down a request asynchronously to
userspace, failing further KVM_RUN ioctls.  I'm suggesting below an
alternative action.

> +6. KVMI_GET_REGISTERS
> +---------------------
> +
> +:Architectures: x86 (could be all, but with different input/output)
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_get_registers_x86 {
> +		__u16 vcpu;
> +		__u16 nmsrs;
> +		__u32 msrs_idx[0];
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_get_registers_x86_reply {
> +		__s32 err;
> +		__u32 mode;
> +		struct kvm_regs  regs;
> +		struct kvm_sregs sregs;
> +		struct kvm_msrs  msrs;
> +	};
> +
> +For the given vcpu_id and the nmsrs sized array of MSRs registers, returns
> +the vCPU mode (in bytes: 2, 4 or 8), the general purpose registers,
> +the special registers and the requested set of MSR-s.
> +
> +7. KVMI_SET_REGISTERS
> +---------------------
> +
> +:Architectures: x86 (could be all, but with different input)
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_set_registers_x86 {
> +		__u16 vcpu;
> +		__u16 padding[3];
> +		struct kvm_regs regs;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Sets the general purpose registers for the given vcpu_id.
> +
> +8. KVMI_GET_MTRR_TYPE
> +---------------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_mtrr_type {
> +		__u64 gpa;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_mtrr_type_reply {
> +		__s32 err;
> +		__u32 type;
> +	};
> +
> +Returns the guest memory type for a specific physical address.

What is this used for?  KVM ignores the guest MTRRs, so if possible I'd
rather avoid it.

> +9. KVMI_GET_MTRRS
> +-----------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_mtrrs {
> +		__u16 vcpu;
> +		__u16 padding[3];
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_mtrrs_reply {
> +		__s32 err;
> +		__u32 padding;
> +		__u64 pat;
> +		__u64 cap;
> +		__u64 type;
> +	};
> +
> +Returns MSR_IA32_CR_PAT, MSR_MTRRcap and MSR_MTRRdefType for the specified
> +vCPU.

This could use KVM_GET_REGISTERS, couldn't it?

> +10. KVMI_GET_XSAVE_INFO
> +-----------------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_xsave_info {
> +		__u16 vcpu;
> +		__u16 padding[3];
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_xsave_info_reply {
> +		__s32 err;
> +		__u32 size;
> +	};
> +
> +Returns the xstate size for the specified vCPU.

Could this be replaced by a generic CPUID accessor?

> +11. KVMI_GET_PAGE_ACCESS
> +------------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_get_page_access {
> +		__u16 vcpu;
> +		__u16 padding[3];
> +		__u64 gpa;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_get_page_access_reply {
> +		__s32 err;
> +		__u32 access;
> +	};
> +
> +Returns the spte flags (rwx - present, write & user) for the specified
> +vCPU and guest physical address.
> +
> +12. KVMI_SET_PAGE_ACCESS
> +------------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_set_page_access {
> +		__u16 vcpu;
> +		__u16 padding;
> +		__u32 access;
> +		__u64 gpa;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Sets the spte flags (rwx - present, write & user) - access - for the specified
> +vCPU and guest physical address.

rwx or pwu?  I suppose RWX.  Maybe #define the constants in the
documentation.

Also, it should be noted here that the spte flags are ANDed with
whatever is provided by userspace, because there could be readonly
memslots, and that some access combinations could fail (--x) or will
surely fail (-wx for example).

> +13. KVMI_INJECT_PAGE_FAULT
> +--------------------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_page_fault {
> +		__u16 vcpu;
> +		__u16 padding;
> +		__u32 error;

error_code, I guess?  Why not a generic inject exception message?

> +		__u64 gva;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Injects a vCPU page fault with the specified guest virtual address and
> +error code.
> +
> +14. KVMI_INJECT_BREAKPOINT
> +--------------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_inject_breakpoint {
> +		__u16 vcpu;
> +		__u16 padding[3];
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Injects a breakpoint for the specified vCPU. This command is usually sent in
> +response to an event and as such the proper GPR-s will be set with the reply.

What is a "breakpoint" in this context?

> +15. KVMI_MAP_PHYSICAL_PAGE_TO_GUEST
> +-----------------------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_map_physical_page_to_guest {
> +		__u64 gpa_src;
> +		__u64 gfn_dest;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Maps a page from an introspected guest memory (gpa_src) to the guest running
> +the introspection tool. 'gfn_dest' points to an anonymous, locked mapping one
> +page in size.
> +
> +This command is used to "read" the introspected guest memory and potentially
> +place patches (eg. INT3-s).

Two problems here:

1) The gpa->hva mapping can change as the guest runs.  I'd prefer a
read/write API to map/unmap, where read replies and events provide a
"cookie" (generation count).  Write (including breakpoints?  it's not
clear to me who physically writes the 0xCC) then return an error if the
incoming generation count doesn't match the current one.

2) There can be multiple address spaces, notably SMM and non-SMM.

> +17. KVMI_CONTROL_EVENTS
> +-----------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_control_events {
> +		__u16 vcpu;
> +		__u16 padding;
> +		__u32 events;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Enables/disables vCPU introspection events, by setting/clearing one or more
> +of the following bits (see 'Events' below) :
> +
> +	KVMI_EVENT_CR
> +	KVMI_EVENT_MSR
> +	KVMI_EVENT_XSETBV
> +	KVMI_EVENT_BREAKPOINT
> +	KVMI_EVENT_USER_CALL
> +	KVMI_EVENT_PAGE_FAULT
> +	KVMI_EVENT_TRAP
> +
> +Trying to enable unsupported events (~KVMI_KNOWN_EVENTS) by the current
> +architecture would fail and -EINVAL will be returned.

I would prefer the interface to allow enable (set bits), disable (clear
bits) in addition to set (what you have here) the events.

The return value could indicate the set of enabled events.


> +19. KVMI_MSR_CONTROL
> +--------------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_msr_control {
> +		__u8 enable;
> +		__u8 padding[3];
> +		__u32 msr;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Enables/disables introspection for a specific MSR, and must be used
> +in addition to KVMI_CONTROL_EVENTS with the KVMI_EVENT_MSR bit flag set.
> +
> +Currently, only MSRs within the following 3 ranges are supported. Trying
> +to control any other register will return -EINVAL. ::
> +
> +	0          ... 0x00001fff
> +	0x40000000 ... 0x40001fff
> +	0xc0000000 ... 0xc0001fff

Note that KVM custom MSRs fall in the range 0x4b564d00-0x4b564dff.

> +Events
> +------
> +
> +All vcpu events are sent using the KVMI_EVENT_VCPU message id. No event will
> +be sent unless enabled with a KVMI_CONTROL_EVENTS command.
> +
> +For x86, the message data starts with a common structure::
> +
> +	struct kvmi_event_x86 {
> +		__u16 vcpu;
> +		__u8 mode;
> +		__u8 padding1;
> +		__u32 event;
> +		struct kvm_regs regs;
> +		struct kvm_sregs sregs;
> +		struct {
> +			__u64 sysenter_cs;
> +			__u64 sysenter_esp;
> +			__u64 sysenter_eip;
> +			__u64 efer;
> +			__u64 star;
> +			__u64 lstar;

cstar too, for AMD CPUs.

> +		} msrs;
> +	};
> +
> +In order to help the introspection tool with the event analysis while
> +avoiding unnecessary introspection commands, the message data holds some
> +registers (kvm_regs, kvm_sregs and a couple of MSR-s) beside
> +the vCPU id, its mode (in bytes) and the event (one of the flags set
> +with the KVMI_CONTROL_EVENTS command).
> +
> +The replies to events also start with a common structure, having the
> +KVMI_EVENT_VCPU_REPLY message id::
> +
> +	struct kvmi_event_x86_reply {
> +		struct kvm_regs regs;

Put regs last?

> +		__u32 actions;
> +		__u32 padding;
> +	};
> +
> +The 'actions' member holds one or more flags. For example, if
> +KVMI_EVENT_ACTION_SET_REGS is set, the general purpose registers will
> +be overwritten with the new values (regs) from introspector.
> +
> +Specific data can follow these common structures.
> +
> +1. KVMI_EVENT_CR
> +----------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_event_x86;
> +	struct kvmi_event_cr {
> +		__u16 cr;
> +		__u16 padding[3];
> +		__u64 old_value;
> +		__u64 new_value;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_event_x86_reply;
> +	struct kvmi_event_cr_reply {
> +		__u64 new_val;
> +	};
> +
> +This event is sent when a CR register was modified and the introspection
> +has already been enabled for this kind of event (KVMI_CONTROL_EVENTS)
> +and for this specific register (KVMI_CR_CONTROL).
> +
> +kvmi_event_x86, the CR number, the old value and the new value are
> +sent to the introspector, which can respond with one or more action flags:
> +
> +   KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers
> +   using the values from introspector (regs)

I think the action should be KVMI_EVENT_ACTION_SKIP.  Setting registers
can be done separately with a command.

> +   KVMI_EVENT_ACTION_ALLOW - allow the register modification with the
> +   value from introspector (new_val), otherwise deny the modification
> +   but allow the guest to proceed as if the register has been loaded
> +   with the desired value.

Maybe add an extra action for exit to userspace (which would presumably
shutdown the guest)?  This would apply to all events.

Also, a retry action can be useful in case e.g. the write cookie becomes
stale.

> +5. KVMI_EVENT_USER_CALL
> +-----------------------

Please rename this to KVMI_EVENT_HCALL or HYPERCALL or VMCALL.

> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_event_x86;
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_event_x86_reply;
> +
> +This event is sent on a user hypercall and the introspection has already
> +already been enabled for this kind of event (KVMI_CONTROL_EVENTS).
> +
> +kvmi_event_x86 is sent to the introspector, which can respond with the
> +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host
> +kernel to override the general purpose registers using the values from
> +introspector (regs).

Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall?  Why is
KVMI_EVENT_ACTION_ALLOW not allowed?  As before, I'd prefer
SKIP/RETRY/ALLOW/CRASH as the actions.

> +6. KVMI_EVENT_PAGE_FAULT
> +------------------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_event_x86;
> +	struct kvmi_event_page_fault {
> +		__u64 gva;
> +		__u64 gpa;
> +		__u32 mode;
> +		__u32 padding;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_event_x86_reply;
> +	struct kvmi_event_page_fault_reply {
> +		__u32 ctx_size;
> +		__u8 ctx_data[256];
> +	};
> +
> +This event is sent if a hypervisor page fault was encountered, the
> +introspection has already enabled the reports for this kind of event
> +(KVMI_CONTROL_EVENTS), and it was generated for a page for which the
> +introspector has shown interest (ie. has previously touched it by
> +adjusting the permissions).

If the introspector sets permissions to r-- and you get a read page
fault because the page is swapped out, should the introspector get the
event?

> +kvmi_event_x86, guest virtual address, guest physical address and
> +the exit qualification (mode) are sent to the introspector, which
> +can respond with one or more action flags:
> +
> +   KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers
> +   using the values from introspector (regs)
> +
> +   (KVMI_EVENT_ALLOW | KVMI_EVENT_NOEMU) - let the guest re-trigger
> +   the page fault

"re-execute the instruction"?  Maybe call it KVMI_EVENT_RETRY?

> +   (KVMI_EVENT_ALLOW | KVMI_EVENT_SET_CTX) - allow the page fault
> +   via emulation but with custom input (ctx_data, ctx_size). This is
> +   used to trick the guest software into believing it has read
> +   certain data. In practice it is used to hide the contents of certain
> +   memory areas

This is tricky.  The instruction could be, for example, a string read.
I think there should be a separate event for I/O from a trapping page
(so perhaps return KVMI_EVENT_ALLOW | KVMI_EVENT_TRAP_ACCESS).

> +   KVMI_EVENT_ALLOW - allow the page fault via emulation
>
> +If KVMI_EVENT_ALLOW is not set, it will fall back to the page fault handler
> +which usually implies overwriting any spte page access changes made before.
> +An introspection tool will always set this flag and prevent unwanted changes
> +to memory by skipping the instruction. It is up to the tool to adjust the
> +program counter in order to achieve this result.

I don't like this.  It should be what KVMI_EVENT_RETRY is for.  But
maybe I'm missing the difference between "0" and KVMI_EVENT_RETRY.

Here, the actions would be SKIP/RETRY/ALLOW/CRASH, with an optional
KVMI_EVENT_TRAP_ACCESS flag for "allow".


> +7. KVMI_EVENT_TRAP
> +------------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_event_x86;
> +	struct kvmi_event_trap {
> +		__u32 vector;
> +		__u32 type;
> +		__u32 err;

error_code?

> +		__u32 padding;
> +		__u64 cr2;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_event_x86_reply;
> +
> +This event is sent if a trap will be delivered to the guest (page fault,
> +breakpoint, etc.) and the introspection has already enabled the reports
> +for this kind of event (KVMI_CONTROL_EVENTS).
> +
> +This is used to inform the introspector of all pending traps giving it
> +a chance to determine if it should try again later in case a previous
> +KVMI_INJECT_PAGE_FAULT/KVMI_INJECT_BREAKPOINT command has been overwritten
> +by an interrupt picked up during guest reentry.
> +
> +kvmi_event_x86, exception/interrupt number (vector), exception/interrupt
> +type, exception code (err) and CR2 are sent to the introspector, which can
> +respond with the KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing
> +the host kernel to override the general purpose registers using the values
> +from introspector (regs).

Here I think the actions could be RETRY/ALLOW/CRASH only, with "set
regs" done as a separate command.

Some x86 exceptions are faults, others are traps.  Traps should not
allow RETRY as an action.  There should be an input telling the
introspector if retrying is allowed.

How are dr6/dr7 handled around breakpoints?

Thanks,

Paolo

> diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
> new file mode 100644
> index 000000000000..54a2d8ebf649
> --- /dev/null
> +++ b/include/uapi/linux/kvmi.h
> @@ -0,0 +1,310 @@
> +/*
> + * Copyright (C) 2017 Bitdefender S.R.L.
> + *
> + * The KVMI Library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the named License, or any later version.
> + *
> + * The KVMI Library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with the KVMI Library; if not, see <http://www.gnu.org/licenses/>
> + */
> +#ifndef __KVMI_H_INCLUDED__
> +#define __KVMI_H_INCLUDED__
> +
> +#include "asm/kvm.h"
> +#include <linux/types.h>
> +
> +#define KVMI_VERSION 0x00000001
> +
> +#define KVMI_EVENT_CR         (1 << 1)	/* control register was modified */
> +#define KVMI_EVENT_MSR        (1 << 2)	/* model specific reg. was modified */
> +#define KVMI_EVENT_XSETBV     (1 << 3)	/* ext. control register was modified */
> +#define KVMI_EVENT_BREAKPOINT (1 << 4)	/* breakpoint was reached */
> +#define KVMI_EVENT_USER_CALL  (1 << 5)	/* user hypercall */
> +#define KVMI_EVENT_PAGE_FAULT (1 << 6)	/* hyp. page fault was encountered */
> +#define KVMI_EVENT_TRAP       (1 << 7)	/* trap was injected */
> +
> +#define KVMI_KNOWN_EVENTS (KVMI_EVENT_CR | \
> +			   KVMI_EVENT_MSR | \
> +			   KVMI_EVENT_XSETBV | \
> +			   KVMI_EVENT_BREAKPOINT | \
> +			   KVMI_EVENT_USER_CALL | \
> +			   KVMI_EVENT_PAGE_FAULT | \
> +			   KVMI_EVENT_TRAP)
> +
> +#define KVMI_EVENT_ACTION_ALLOW      (1 << 0)	/* used in replies */
> +#define KVMI_EVENT_ACTION_SET_REGS   (1 << 1)	/* registers need to be written back */
> +#define KVMI_EVENT_ACTION_SET_CTX    (1 << 2)	/* set the emulation context */
> +#define KVMI_EVENT_ACTION_NOEMU      (1 << 3)	/* return to guest without emulation */
> +
> +#define KVMI_GET_VERSION                    1
> +#define KVMI_GET_GUESTS                     2 /* TODO: remove me */
> +#define KVMI_GET_GUEST_INFO                 3
> +#define KVMI_PAUSE_GUEST                    4
> +#define KVMI_UNPAUSE_GUEST                  5
> +#define KVMI_GET_REGISTERS                  6
> +#define KVMI_SET_REGISTERS                  7
> +#define KVMI_SHUTDOWN_GUEST                 8
> +#define KVMI_GET_MTRR_TYPE                  9
> +#define KVMI_GET_MTRRS                      10
> +#define KVMI_GET_XSAVE_INFO                 11
> +#define KVMI_GET_PAGE_ACCESS                12
> +#define KVMI_SET_PAGE_ACCESS                13
> +#define KVMI_INJECT_PAGE_FAULT              14
> +#define KVMI_READ_PHYSICAL                  15 /* TODO: remove me */
> +#define KVMI_WRITE_PHYSICAL                 16 /* TODO: remove me */
> +#define KVMI_MAP_PHYSICAL_PAGE_TO_GUEST     17
> +#define KVMI_UNMAP_PHYSICAL_PAGE_FROM_GUEST 18
> +#define KVMI_CONTROL_EVENTS                 19
> +#define KVMI_CR_CONTROL                     20
> +#define KVMI_MSR_CONTROL                    21
> +#define KVMI_INJECT_BREAKPOINT              22
> +#define KVMI_EVENT_GUEST_ON                 23 /* TODO: remove me */
> +#define KVMI_EVENT_GUEST_OFF                24 /* TODO: remove me */
> +#define KVMI_EVENT_VCPU                     25
> +#define KVMI_EVENT_VCPU_REPLY               26

Errnos values are not portable, I'd prefer to have them defined
explicitly in the header.

Paolo

> +/* TODO: remove me */
> +struct kvmi_guest {
> +	__u8 uuid[16];
> +};
> +
> +/* TODO: remove me */
> +struct kvmi_guests {
> +	__u32 size;		/* in: the size of the entire structure */
> +	struct kvmi_guest guests[1];
> +};
> +
> +/* TODO: remove me */
> +struct kvmi_read_physical {
> +	__u64 gpa;
> +	__u64 size;
> +};
> +
> +/* TODO: remove me */
> +struct kvmi_read_physical_reply {
> +	__s32 err;
> +	__u8 bytes[0];
> +};
> +
> +/* TODO: remove me */
> +struct kvmi_write_physical {
> +	__u64 gpa;
> +	__u64 size;
> +	__u8 bytes[0];
> +};
> +
> +
> +struct kvmi_socket_hdr {
> +	__u16 msg_id;
> +	__u16 size;
> +	__u32 seq;
> +};
> +
> +struct kvmi_error_code {
> +	__s32 err;
> +	__u32 padding;
> +};
> +
> +struct kvmi_get_version_reply {
> +	__s32 err;
> +	__u32 version;
> +};
> +
> +struct kvmi_get_guest_info_reply {
> +	__s32 err;
> +	__u16 vcpu_count;
> +	__u16 padding;
> +	__u64 tsc_speed;
> +};
> +
> +struct kvmi_get_registers_x86 {
> +	__u16 vcpu;
> +	__u16 nmsrs;
> +	__u32 msrs_idx[0];
> +};
> +
> +struct kvmi_get_registers_x86_reply {
> +	__s32 err;
> +	__u32 mode;
> +	struct kvm_regs regs;
> +	struct kvm_sregs sregs;
> +	struct kvm_msrs msrs;
> +};
> +
> +struct kvmi_set_registers_x86 {
> +	__u16 vcpu;
> +	__u16 padding[3];
> +	struct kvm_regs regs;
> +};
> +
> +struct kvmi_mtrr_type {
> +	__u64 gpa;
> +};
> +
> +struct kvmi_mtrr_type_reply {
> +	__s32 err;
> +	__u32 padding;
> +	__u64 type;
> +};
> +
> +struct kvmi_mtrrs {
> +	__u16 vcpu;
> +	__u16 padding[3];
> +};
> +
> +struct kvmi_mtrrs_reply {
> +	__s32 err;
> +	__u32 padding;
> +	__u64 pat;
> +	__u64 cap;
> +	__u64 type;
> +};
> +
> +struct kvmi_xsave_info {
> +	__u16 vcpu;
> +	__u16 padding[3];
> +};
> +
> +struct kvmi_xsave_info_reply {
> +	__s32 err;
> +	__u32 size;
> +};
> +
> +struct kvmi_get_page_access {
> +	__u16 vcpu;
> +	__u16 padding[3];
> +	__u64 gpa;
> +};
> +
> +struct kvmi_get_page_access_reply {
> +	__s32 err;
> +	__u32 access;
> +};
> +
> +struct kvmi_set_page_access {
> +	__u16 vcpu;
> +	__u16 padding;
> +	__u32 access;
> +	__u64 gpa;
> +};
> +
> +struct kvmi_page_fault {
> +	__u16 vcpu;
> +	__u16 padding;
> +	__u32 error;
> +	__u64 gva;
> +};
> +
> +struct kvmi_inject_breakpoint {
> +	__u16 vcpu;
> +	__u16 padding[3];
> +};
> +
> +struct kvmi_map_physical_page_to_guest {
> +	__u64 gpa_src;
> +	__u64 gfn_dest;
> +};
> +
> +struct kvmi_unmap_physical_page_from_guest {
> +	__u64 gfn_dest;
> +};
> +
> +struct kvmi_control_events {
> +	__u16 vcpu;
> +	__u16 padding;
> +	__u32 events;
> +};
> +
> +struct kvmi_cr_control {
> +	__u8 enable;
> +	__u8 padding[3];
> +	__u32 cr;
> +};
> +
> +struct kvmi_msr_control {
> +	__u8 enable;
> +	__u8 padding[3];
> +	__u32 msr;
> +};
> +
> +struct kvmi_event_x86 {
> +	__u16 vcpu;
> +	__u8 mode;
> +	__u8 padding1;
> +	__u32 event;
> +	struct kvm_regs regs;
> +	struct kvm_sregs sregs;
> +	struct {
> +		__u64 sysenter_cs;
> +		__u64 sysenter_esp;
> +		__u64 sysenter_eip;
> +		__u64 efer;
> +		__u64 star;
> +		__u64 lstar;
> +	} msrs;
> +};
> +
> +struct kvmi_event_x86_reply {
> +	struct kvm_regs regs;
> +	__u32 actions;
> +	__u32 padding;
> +};
> +
> +struct kvmi_event_cr {
> +	__u16 cr;
> +	__u16 padding[3];
> +	__u64 old_value;
> +	__u64 new_value;
> +};
> +
> +struct kvmi_event_cr_reply {
> +	__u64 new_val;
> +};
> +
> +struct kvmi_event_msr {
> +	__u32 msr;
> +	__u32 padding;
> +	__u64 old_value;
> +	__u64 new_value;
> +};
> +
> +struct kvmi_event_msr_reply {
> +	__u64 new_val;
> +};
> +
> +struct kvmi_event_xsetbv {
> +	__u64 xcr0;
> +};
> +
> +struct kvmi_event_breakpoint {
> +	__u64 gpa;
> +};
> +
> +struct kvmi_event_page_fault {
> +	__u64 gva;
> +	__u64 gpa;
> +	__u32 mode;
> +	__u32 padding;
> +};
> +
> +struct kvmi_event_page_fault_reply {
> +	__u32 ctx_size;
> +	__u8 ctx_data[256];
> +};
> +
> +struct kvmi_event_trap {
> +	__u32 vector;
> +	__u32 type;
> +	__u32 err;
> +	__u32 padding;
> +	__u64 cr2;
> +};
> +
> +#endif /* __KVMI_H_INCLUDED__ */

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/1] VM introspection
  2017-07-07 14:34 [RFC PATCH v2 0/1] VM introspection Adalbert Lazar
  2017-07-07 14:34 ` [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for " Adalbert Lazar
@ 2017-07-07 17:29 ` Paolo Bonzini
  2017-08-07 15:28   ` Mihai Donțu
  2017-07-12 14:09 ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-07-07 17:29 UTC (permalink / raw)
  To: Adalbert Lazar
  Cc: Radim Krčmář,
	Mihai Dontu, Jan Kiszka, Stefan Hajnoczi, KVM list,
	Andy Lutomirski



On 07/07/2017 16:34, Adalbert Lazar wrote:
> One bit of code that has passed (maybe) unnoticed in the RFC is a new
> function added to Linux' mm called vm_replace_page() which, much like KSM's
> replace_page(), gets two processes to share a page (read-write, no-COW):
> 
> https://marc.info/?l=kvm&m=149762056518799&w=2
> 
> This is used to quickly scan and patch the guest software.

Thanks for pointing this out.

In my review of patch 1 I suggested using only read/write, but it's slow.

I think we need to figure out a safe way to map foreign memory, as I'm
worried of TOC/TOU races for obvious reasons.

One thing I was thinking about (but didn't have much time to completely
think through) is a special /dev/kvmmem device, where you could do

    kvmmem_fd = open("/dev/kvmmem", O_RDWR);
    ptr = ioctl(kvmmem_fd, KVMMEM_MAP_MEMORY, { token, size });
    ioctl(kvmmem_fd, KVMMEM_UNMAP_MEMORY, { ptr, size });

The map/unmap memory operation would be a hypercall, not a socket
command, but the random "token" would be returned on the socket via some
KVMI_MAP_PHYSICAL_PAGE_TO_GUEST command (or more accurately, a
replacement accepting {gpa, size} instead of {gpa, gfn_dest}).  Handles
can be short lived, e.g. you could have at most a small number tokens
per host created (and passed back via KVMI) but not yet used by the
hypercall.  Once it's used by the hypercall, the token is not needed
anymore, so this is not a strong limitation.

After KVMMEM_MAP_MEMORY, you'd get a SIGSEGV if the guest memory layout
changes (userfaultfd can be used by the introspector to simplify the
handling and retry).  You'd have to re-map the memory explicitly.

Alas I have no idea how to verify the handle securely on the host, since
the host is not supposed to know which guests are introspectors and
which host got which token.  But maybe if the token namespace is big
enough (256 bits?) and random, it's okay to ignore the possibility that
a guest tries to guess.  (This idea is roughly based on how SCSI
offloaded copies work).

Andy, does it look like utter BS or could it have some merit?

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-07 16:52   ` Paolo Bonzini
@ 2017-07-10 15:32     ` alazar
  2017-07-10 17:03       ` Paolo Bonzini
  2017-07-13  8:36     ` Mihai Donțu
  2017-07-27 17:06     ` Mihai Donțu
  2 siblings, 1 reply; 50+ messages in thread
From: alazar @ 2017-07-10 15:32 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu

Thanks for your review, Paolo.
Below are some answers.
Will have to chew on the others.

On Fri, 7 Jul 2017 18:52:45 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 07/07/2017 16:34, Adalbert Lazar wrote:
> > +:Returns: ↴
> 
> What is this character? :)

I thought using a nice right-down arrow instead of a boring/inexpressive
ASCII char in order to avoid trailing spaces will be cool. At least it
was fun :)

> > +	struct kvmi_get_version_reply {
> > +		__s32 err;
> > +		__u32 version;
> > +	};
> > +
> > +Returns the introspection API version (the KVMI_VERSION constant) and the
> > +error code (zero). In case of an unlikely error, the version will have an
> > +undefined value.
> 
> Would it make sense to return a few more information fields, for example
> the set of supported KVMI_CONTROL_EVENTS events?

Sure.

> > +	struct kvmi_get_guest_info_reply {
> > +		__s32 err;
> > +		__u16 vcpu_count;
> > +		__u16 padding;
> > +		__u64 tsc_speed;
> > +	};
> > +
> > +Returns the number of online vcpus, and the TSC frequency in HZ, if supported
> > +by the architecture (otherwise is 0).
> 
> Can the TSC frequency be specified only if the guest is using TSC
> scaling?  Defining the TSC frequency on older hosts is a bit tricky.  0
> would be the host.
> 
> Maybe define the second padding to be
> 
> 	__u16 arch;
> 
> (0 = x86) followed by an arch-specific payload.

Sure.

> > +5. KVMI_SHUTDOWN_GUEST
> > +----------------------
> > +
> > +Ungracefully shutdown the guest.
> 
> Note that all KVM can do here is pass down a request asynchronously to
> userspace, failing further KVM_RUN ioctls.  I'm suggesting below an
> alternative action.

We were thinking about sending an "ungraceful" TERM signal :D

Adding another reponse (exit-to-userspace-with-shutdown) on events,
as you suggested below, makes sense.

> > +12. KVMI_SET_PAGE_ACCESS
> > +------------------------
> > +
> > +Sets the spte flags (rwx - present, write & user) - access - for the specified
> > +vCPU and guest physical address.
> 
> rwx or pwu?  I suppose RWX.  Maybe #define the constants in the
> documentation.

RWX.
Will do.

> > +17. KVMI_CONTROL_EVENTS
> > +-----------------------
> > +
> > +:Architectures: all
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_control_events {
> > +		__u16 vcpu;
> > +		__u16 padding;
> > +		__u32 events;
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_error_code {
> > +		__s32 err;
> > +		__u32 padding;
> > +	};
> > +
> > +Enables/disables vCPU introspection events, by setting/clearing one or more
> > +of the following bits (see 'Events' below) :
> > +
> > +	KVMI_EVENT_CR
> > +	KVMI_EVENT_MSR
> > +	KVMI_EVENT_XSETBV
> > +	KVMI_EVENT_BREAKPOINT
> > +	KVMI_EVENT_USER_CALL
> > +	KVMI_EVENT_PAGE_FAULT
> > +	KVMI_EVENT_TRAP
> > +
> > +Trying to enable unsupported events (~KVMI_KNOWN_EVENTS) by the current
> > +architecture would fail and -EINVAL will be returned.
> 
> I would prefer the interface to allow enable (set bits), disable (clear
> bits) in addition to set (what you have here) the events.

Calling KVMI_CONTROL_EVENTS with event=KVMI_EVENT_CR|KVMI_EVENT_MSR will
enable CR and MSR events and disable all the other events. It functions like
a set and clear, in the same time.

It will be better to have KVMI_ENABLE_CONTROL_EVENTS and
KVMI_DISABLE_CONTROL_EVENTS instead?

> The return value could indicate the set of enabled events.

Indeed.

> > +Events
> > +------
> > +
> > +All vcpu events are sent using the KVMI_EVENT_VCPU message id. No event will
> > +be sent unless enabled with a KVMI_CONTROL_EVENTS command.
> > +
> > +For x86, the message data starts with a common structure::
> > +
> > +	struct kvmi_event_x86 {
> > +		__u16 vcpu;
> > +		__u8 mode;
> > +		__u8 padding1;
> > +		__u32 event;
> > +		struct kvm_regs regs;
> > +		struct kvm_sregs sregs;
> > +		struct {
> > +			__u64 sysenter_cs;
> > +			__u64 sysenter_esp;
> > +			__u64 sysenter_eip;
> > +			__u64 efer;
> > +			__u64 star;
> > +			__u64 lstar;
> 
> cstar too, for AMD CPUs.

Thanks. Will do.

> > +		} msrs;
> > +	};
> > +
> > +In order to help the introspection tool with the event analysis while
> > +avoiding unnecessary introspection commands, the message data holds some
> > +registers (kvm_regs, kvm_sregs and a couple of MSR-s) beside
> > +the vCPU id, its mode (in bytes) and the event (one of the flags set
> > +with the KVMI_CONTROL_EVENTS command).
> > +
> > +The replies to events also start with a common structure, having the
> > +KVMI_EVENT_VCPU_REPLY message id::
> > +
> > +	struct kvmi_event_x86_reply {
> > +		struct kvm_regs regs;
> 
> Put regs last?
> 
> > +		__u32 actions;
> > +		__u32 padding;
> > +	};

Right.

> > +5. KVMI_EVENT_USER_CALL
> > +-----------------------
> 
> Please rename this to KVMI_EVENT_HCALL or HYPERCALL or VMCALL.

Sure.
 
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_event_x86;
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_event_x86_reply;
> > +
> > +This event is sent on a user hypercall and the introspection has already
> > +already been enabled for this kind of event (KVMI_CONTROL_EVENTS).
> > +
> > +kvmi_event_x86 is sent to the introspector, which can respond with the
> > +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host
> > +kernel to override the general purpose registers using the values from
> > +introspector (regs).
> 
> Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall?  Why is
> KVMI_EVENT_ACTION_ALLOW not allowed?  As before, I'd prefer
> SKIP/RETRY/ALLOW/CRASH as the actions.

We've used this as way for guest code to communicate with the introspector.
KVMI_EVENT_ACTION_ALLOW is implied here.

> > --- /dev/null
> > +++ b/include/uapi/linux/kvmi.h
...
> 
> Errnos values are not portable, I'd prefer to have them defined
> explicitly in the header.

We were thinking that the introspection tool will take care of this,
as with the endianness.

Surely, it will be much more clear to have the errnos defined here.

Thanks again,

Adalbert

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-10 15:32     ` alazar
@ 2017-07-10 17:03       ` Paolo Bonzini
  2017-07-11 16:48         ` Adalbert Lazar
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-07-10 17:03 UTC (permalink / raw)
  To: alazar, kvm
  Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu

On 10/07/2017 17:32, alazar@bitdefender.com wrote:
> Thanks for your review, Paolo.
> Below are some answers.
> Will have to chew on the others.

I'm not sure what you think of removing KVMI_EVENT_ACTION_SET_REGS and
more or less standardizing on actions SKIP/RETRY/ALLOW/CRASH.

The main remaining issue seems to be map/unmap.

Paolo

> On Fri, 7 Jul 2017 18:52:45 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 07/07/2017 16:34, Adalbert Lazar wrote:
>>> +:Returns: ↴
>>
>> What is this character? :)
> 
> I thought using a nice right-down arrow instead of a boring/inexpressive
> ASCII char in order to avoid trailing spaces will be cool. At least it
> was fun :)
> 
>>> +	struct kvmi_get_version_reply {
>>> +		__s32 err;
>>> +		__u32 version;
>>> +	};
>>> +
>>> +Returns the introspection API version (the KVMI_VERSION constant) and the
>>> +error code (zero). In case of an unlikely error, the version will have an
>>> +undefined value.
>>
>> Would it make sense to return a few more information fields, for example
>> the set of supported KVMI_CONTROL_EVENTS events?
> 
> Sure.
> 
>>> +	struct kvmi_get_guest_info_reply {
>>> +		__s32 err;
>>> +		__u16 vcpu_count;
>>> +		__u16 padding;
>>> +		__u64 tsc_speed;
>>> +	};
>>> +
>>> +Returns the number of online vcpus, and the TSC frequency in HZ, if supported
>>> +by the architecture (otherwise is 0).
>>
>> Can the TSC frequency be specified only if the guest is using TSC
>> scaling?  Defining the TSC frequency on older hosts is a bit tricky.  0
>> would be the host.
>>
>> Maybe define the second padding to be
>>
>> 	__u16 arch;
>>
>> (0 = x86) followed by an arch-specific payload.
> 
> Sure.
> 
>>> +5. KVMI_SHUTDOWN_GUEST
>>> +----------------------
>>> +
>>> +Ungracefully shutdown the guest.
>>
>> Note that all KVM can do here is pass down a request asynchronously to
>> userspace, failing further KVM_RUN ioctls.  I'm suggesting below an
>> alternative action.
> 
> We were thinking about sending an "ungraceful" TERM signal :D
> 
> Adding another reponse (exit-to-userspace-with-shutdown) on events,
> as you suggested below, makes sense.
> 
>>> +12. KVMI_SET_PAGE_ACCESS
>>> +------------------------
>>> +
>>> +Sets the spte flags (rwx - present, write & user) - access - for the specified
>>> +vCPU and guest physical address.
>>
>> rwx or pwu?  I suppose RWX.  Maybe #define the constants in the
>> documentation.
> 
> RWX.
> Will do.
> 
>>> +17. KVMI_CONTROL_EVENTS
>>> +-----------------------
>>> +
>>> +:Architectures: all
>>> +:Versions: >= 1
>>> +:Parameters: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_control_events {
>>> +		__u16 vcpu;
>>> +		__u16 padding;
>>> +		__u32 events;
>>> +	};
>>> +
>>> +:Returns: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_error_code {
>>> +		__s32 err;
>>> +		__u32 padding;
>>> +	};
>>> +
>>> +Enables/disables vCPU introspection events, by setting/clearing one or more
>>> +of the following bits (see 'Events' below) :
>>> +
>>> +	KVMI_EVENT_CR
>>> +	KVMI_EVENT_MSR
>>> +	KVMI_EVENT_XSETBV
>>> +	KVMI_EVENT_BREAKPOINT
>>> +	KVMI_EVENT_USER_CALL
>>> +	KVMI_EVENT_PAGE_FAULT
>>> +	KVMI_EVENT_TRAP
>>> +
>>> +Trying to enable unsupported events (~KVMI_KNOWN_EVENTS) by the current
>>> +architecture would fail and -EINVAL will be returned.
>>
>> I would prefer the interface to allow enable (set bits), disable (clear
>> bits) in addition to set (what you have here) the events.
> 
> Calling KVMI_CONTROL_EVENTS with event=KVMI_EVENT_CR|KVMI_EVENT_MSR will
> enable CR and MSR events and disable all the other events. It functions like
> a set and clear, in the same time.
> 
> It will be better to have KVMI_ENABLE_CONTROL_EVENTS and
> KVMI_DISABLE_CONTROL_EVENTS instead?
> 
>> The return value could indicate the set of enabled events.
> 
> Indeed.
> 
>>> +Events
>>> +------
>>> +
>>> +All vcpu events are sent using the KVMI_EVENT_VCPU message id. No event will
>>> +be sent unless enabled with a KVMI_CONTROL_EVENTS command.
>>> +
>>> +For x86, the message data starts with a common structure::
>>> +
>>> +	struct kvmi_event_x86 {
>>> +		__u16 vcpu;
>>> +		__u8 mode;
>>> +		__u8 padding1;
>>> +		__u32 event;
>>> +		struct kvm_regs regs;
>>> +		struct kvm_sregs sregs;
>>> +		struct {
>>> +			__u64 sysenter_cs;
>>> +			__u64 sysenter_esp;
>>> +			__u64 sysenter_eip;
>>> +			__u64 efer;
>>> +			__u64 star;
>>> +			__u64 lstar;
>>
>> cstar too, for AMD CPUs.
> 
> Thanks. Will do.
> 
>>> +		} msrs;
>>> +	};
>>> +
>>> +In order to help the introspection tool with the event analysis while
>>> +avoiding unnecessary introspection commands, the message data holds some
>>> +registers (kvm_regs, kvm_sregs and a couple of MSR-s) beside
>>> +the vCPU id, its mode (in bytes) and the event (one of the flags set
>>> +with the KVMI_CONTROL_EVENTS command).
>>> +
>>> +The replies to events also start with a common structure, having the
>>> +KVMI_EVENT_VCPU_REPLY message id::
>>> +
>>> +	struct kvmi_event_x86_reply {
>>> +		struct kvm_regs regs;
>>
>> Put regs last?
>>
>>> +		__u32 actions;
>>> +		__u32 padding;
>>> +	};
> 
> Right.
> 
>>> +5. KVMI_EVENT_USER_CALL
>>> +-----------------------
>>
>> Please rename this to KVMI_EVENT_HCALL or HYPERCALL or VMCALL.
> 
> Sure.
>  
>>> +
>>> +:Architectures: x86
>>> +:Versions: >= 1
>>> +:Parameters: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_event_x86;
>>> +
>>> +:Returns: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_event_x86_reply;
>>> +
>>> +This event is sent on a user hypercall and the introspection has already
>>> +already been enabled for this kind of event (KVMI_CONTROL_EVENTS).
>>> +
>>> +kvmi_event_x86 is sent to the introspector, which can respond with the
>>> +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host
>>> +kernel to override the general purpose registers using the values from
>>> +introspector (regs).
>>
>> Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall?  Why is
>> KVMI_EVENT_ACTION_ALLOW not allowed?  As before, I'd prefer
>> SKIP/RETRY/ALLOW/CRASH as the actions.
> 
> We've used this as way for guest code to communicate with the introspector.
> KVMI_EVENT_ACTION_ALLOW is implied here.
> 
>>> --- /dev/null
>>> +++ b/include/uapi/linux/kvmi.h
> ...
>>
>> Errnos values are not portable, I'd prefer to have them defined
>> explicitly in the header.
> 
> We were thinking that the introspection tool will take care of this,
> as with the endianness.
> 
> Surely, it will be much more clear to have the errnos defined here.
> 
> Thanks again,
> 
> Adalbert
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-10 17:03       ` Paolo Bonzini
@ 2017-07-11 16:48         ` Adalbert Lazar
  2017-07-11 16:51           ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Adalbert Lazar @ 2017-07-11 16:48 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu

On Mon, 10 Jul 2017 19:03:06 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> I'm not sure what you think of removing KVMI_EVENT_ACTION_SET_REGS and
> more or less standardizing on actions SKIP/RETRY/ALLOW/CRASH.
> 
> The main remaining issue seems to be map/unmap.

Definitely, SKIP/RETRY/ALLOW/CRASH looks better, but SET_REGS helps
performance wise. Maybe we could have it as an optional flag for ALLOW? Or
at least for the hot paths?

Summarily, on events, besides CRASH (vs SHUTDOWN cmd) and any other
additional flag:

  * CR, MSR - ALLOW with untouched new_value will let the guest continue,
              but with "new_value = old_value" is a "deny"

  * xsetbv - ALLOW is implied

  * breakpoint - SKIP means the BP is processed by the introspector,
                 ALLOW means let the guest handle it

  * hypercall - ALLOW is implied

  * page_fault - ALLOW means emulate, RETRY means let guest re-trigger the PF,
                 ALLOW with adjusted PC is a "skip" (done by the tool
                 for the moment).

Adalbert

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-11 16:48         ` Adalbert Lazar
@ 2017-07-11 16:51           ` Paolo Bonzini
  2017-07-13  5:57             ` Mihai Donțu
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-07-11 16:51 UTC (permalink / raw)
  To: Adalbert Lazar, kvm
  Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu

On 11/07/2017 18:48, Adalbert Lazar wrote:
> On Mon, 10 Jul 2017 19:03:06 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> I'm not sure what you think of removing KVMI_EVENT_ACTION_SET_REGS and
>> more or less standardizing on actions SKIP/RETRY/ALLOW/CRASH.
>>
>> The main remaining issue seems to be map/unmap.
> 
> Definitely, SKIP/RETRY/ALLOW/CRASH looks better, but SET_REGS helps
> performance wise. Maybe we could have it as an optional flag for ALLOW?

Actually it makes more sense for SKIP, I think, where the introspector
is actually doing emulation?

But why is KVMI_SET_REGS slower than a set regs command followed by an
action?

> Or at least for the hot paths?
> 
> Summarily, on events, besides CRASH (vs SHUTDOWN cmd) and any other
> additional flag:
> 
>   * CR, MSR - ALLOW with untouched new_value will let the guest continue,
>               but with "new_value = old_value" is a "deny"
> 
>   * xsetbv - ALLOW is implied
> 
>   * breakpoint - SKIP means the BP is processed by the introspector,
>                  ALLOW means let the guest handle it
> 
>   * hypercall - ALLOW is implied
> 
>   * page_fault - ALLOW means emulate, RETRY means let guest re-trigger the PF,
>                  ALLOW with adjusted PC is a "skip" (done by the tool
>                  for the moment).

This is more complicated than necessary.  I would just make it simple:

- SKIP, adjust RIP to point to the next instruction and enter the guest

- RETRY, enter the guest

- ALLOW, emulate the instruction with information coming from the
response packet

- CRASH, self-explanatory

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/1] VM introspection
  2017-07-07 14:34 [RFC PATCH v2 0/1] VM introspection Adalbert Lazar
  2017-07-07 14:34 ` [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for " Adalbert Lazar
  2017-07-07 17:29 ` [RFC PATCH v2 0/1] " Paolo Bonzini
@ 2017-07-12 14:09 ` Konrad Rzeszutek Wilk
  2017-07-13  5:37   ` Mihai Donțu
  2 siblings, 1 reply; 50+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-07-12 14:09 UTC (permalink / raw)
  To: Adalbert Lazar
  Cc: kvm, Paolo Bonzini, Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Mihai Dontu

On Fri, Jul 07, 2017 at 05:34:15PM +0300, Adalbert Lazar wrote:
> The following patch adds the documentation for an introspection subsystem for
> KVM (KVMi). It details the purpose and the use case that has shaped the
> proposed API/ABI, as well as the wire protocol.
> 
> During the discussion that has developed around our previous RFC patchset a
> number of TODO-s have been highlighted:
> 
>  * the integration in qemu: a socket-based protocol used to initiate the
>    connection with an introspection tool and then passes control to KVM, the
>    in-kernel mechanism taking over from there;
> 
>  * the integration of the SPT-handling mechanism into the KVM MMU;
> 
>  * the elaboration of a set of policies and a mechanism to better control
>    this feature;
> 
> One bit of code that has passed (maybe) unnoticed in the RFC is a new
> function added to Linux' mm called vm_replace_page() which, much like KSM's
> replace_page(), gets two processes to share a page (read-write, no-COW):
> 
> https://marc.info/?l=kvm&m=149762056518799&w=2
> 
> This is used to quickly scan and patch the guest software.
> 
> The patch following this cover letter does not yet address the points above
> but aims to clear with the community the overall ABI/API, with a focus on
> x86.

Are there thoughts on making this work with libvmi? Or would this
interface be used by a paid product?

Thanks.
> 
> Thank you,
> 
> v2:
>   - add documentation and ABI [Paolo, Jan]
>   - drop all the other patches for now [Paolo]
>   - remove KVMI_GET_GUESTS, KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF,
>     and let libvirt/qemu handle this [Stefan, Paolo]
>   - change the license from LGPL to GPL [Jan]
>   - remove KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL (not used anymore)
>   - make the interface a little more consistent
> 
> Adalbert Lazar (1):
>   kvm: Add documentation and ABI/API header for VM introspection
> 
>  Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/kvmi.h          | 310 ++++++++++++
>  2 files changed, 1295 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/kvmi.rst
>  create mode 100644 include/uapi/linux/kvmi.h
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/1] VM introspection
  2017-07-12 14:09 ` Konrad Rzeszutek Wilk
@ 2017-07-13  5:37   ` Mihai Donțu
  2017-07-14 16:13     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 50+ messages in thread
From: Mihai Donțu @ 2017-07-13  5:37 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: kvm, Paolo Bonzini, Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar

Hi Konrad,

On Wed, 2017-07-12 at 10:09 -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Jul 07, 2017 at 05:34:15PM +0300, Adalbert Lazar wrote:
> > The following patch adds the documentation for an introspection subsystem for
> > KVM (KVMi). It details the purpose and the use case that has shaped the
> > proposed API/ABI, as well as the wire protocol.
> > 
> > During the discussion that has developed around our previous RFC patchset a
> > number of TODO-s have been highlighted:
> > 
> >  * the integration in qemu: a socket-based protocol used to initiate the
> >    connection with an introspection tool and then passes control to KVM, the
> >    in-kernel mechanism taking over from there;
> > 
> >  * the integration of the SPT-handling mechanism into the KVM MMU;
> > 
> >  * the elaboration of a set of policies and a mechanism to better control
> >    this feature;
> > 
> > One bit of code that has passed (maybe) unnoticed in the RFC is a new
> > function added to Linux' mm called vm_replace_page() which, much like KSM's
> > replace_page(), gets two processes to share a page (read-write, no-COW):
> > 
> > https://marc.info/?l=kvm&m=149762056518799&w=2
> > 
> > This is used to quickly scan and patch the guest software.
> > 
> > The patch following this cover letter does not yet address the points above
> > but aims to clear with the community the overall ABI/API, with a focus on
> > x86.
> 
> Are there thoughts on making this work with libvmi? Or would this
> interface be used by a paid product?

We have not looked at how we would go about adding support for this API
in libvmi, but a quick look tells me a driver shouldn't be that hard to
write.

We are, however, looking at publishing the sources for a small library
called libkvmi that would expose an easy-to-use interface and on top of
which we will add support for KVM VMI in:

https://github.com/razvan-cojocaru/libbdvmi

Other than that, the major user will indeed be a commercial product
that, right now, uses Xen's VMI infrastructure:

https://citrixready.citrix.com/bitdefender/bitdefender-hypervisor-introspection.html

Thanks,

> > v2:
> >   - add documentation and ABI [Paolo, Jan]
> >   - drop all the other patches for now [Paolo]
> >   - remove KVMI_GET_GUESTS, KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF,
> >     and let libvirt/qemu handle this [Stefan, Paolo]
> >   - change the license from LGPL to GPL [Jan]
> >   - remove KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL (not used anymore)
> >   - make the interface a little more consistent
> > 
> > Adalbert Lazar (1):
> >   kvm: Add documentation and ABI/API header for VM introspection
> > 
> >  Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/kvmi.h          | 310 ++++++++++++
> >  2 files changed, 1295 insertions(+)
> >  create mode 100644 Documentation/virtual/kvm/kvmi.rst
> >  create mode 100644 include/uapi/linux/kvmi.h
> > 

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-11 16:51           ` Paolo Bonzini
@ 2017-07-13  5:57             ` Mihai Donțu
  2017-07-13  7:32               ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Mihai Donțu @ 2017-07-13  5:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On Tue, 2017-07-11 at 18:51 +0200, Paolo Bonzini wrote:
> On 11/07/2017 18:48, Adalbert Lazar wrote:
> > On Mon, 10 Jul 2017 19:03:06 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > I'm not sure what you think of removing KVMI_EVENT_ACTION_SET_REGS and
> > > more or less standardizing on actions SKIP/RETRY/ALLOW/CRASH.
> > > 
> > > The main remaining issue seems to be map/unmap.
> > 
> > Definitely, SKIP/RETRY/ALLOW/CRASH looks better, but SET_REGS helps
> > performance wise. Maybe we could have it as an optional flag for ALLOW?
> 
> Actually it makes more sense for SKIP, I think, where the introspector
> is actually doing emulation?

I'm afraid I don't undestand your question, however we were looking at
using KVM's x86 emulator rather than putting together our own, as such
software might be fun to write but they take a very long time to get
right. I'd argue that KVM's emulator has already seen a lot of
coverage.

In the future we are looking at maybe moving away from it on Intel-s,
by way of VMFUNC and #VE.

> But why is KVMI_SET_REGS slower than a set regs command followed by an
> action?

To be honest, we just looked at the Xen implementation which gates
writing back the registers to VMCS on them actually having been
changed. I just figured the less VMWRITE-s, the better. I'm also
looking over some older stats that show the registers being written
back quite often. Allow me 1-2 days to gather new ones and followup on
this thread.

> > Or at least for the hot paths?
> > 
> > Summarily, on events, besides CRASH (vs SHUTDOWN cmd) and any other
> > additional flag:
> > 
> >   * CR, MSR - ALLOW with untouched new_value will let the guest continue,
> >               but with "new_value = old_value" is a "deny"
> > 
> >   * xsetbv - ALLOW is implied
> > 
> >   * breakpoint - SKIP means the BP is processed by the introspector,
> >                  ALLOW means let the guest handle it
> > 
> >   * hypercall - ALLOW is implied
> > 
> >   * page_fault - ALLOW means emulate, RETRY means let guest re-trigger the PF,
> >                  ALLOW with adjusted PC is a "skip" (done by the tool
> >                  for the moment).
> 
> This is more complicated than necessary.  I would just make it simple:
> 
> - SKIP, adjust RIP to point to the next instruction and enter the guest
> 
> - RETRY, enter the guest
> 
> - ALLOW, emulate the instruction with information coming from the
> response packet
> 
> - CRASH, self-explanatory

I see no major problem with your version. The reason we removed SKIP
from the #PF description is that the RIP adjustment is done by the
introspection tool after it decodes the instruction and computes its
opcode length. So when the event response reaches KVM, it all looks
like ALLOW was requested as the host-side code would be oblivous to the
RIP change.

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-13  5:57             ` Mihai Donțu
@ 2017-07-13  7:32               ` Paolo Bonzini
  2017-07-18 11:51                 ` Mihai Donțu
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-07-13  7:32 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On 13/07/2017 07:57, Mihai Donțu wrote:
>> Actually it makes more sense for SKIP, I think, where the introspector
>> is actually doing emulation?
>
> I'm afraid I don't undestand your question, however we were looking at
> using KVM's x86 emulator rather than putting together our own, as such
> software might be fun to write but they take a very long time to get
> right. I'd argue that KVM's emulator has already seen a lot of
> coverage.

Of course!  But there could be some special cases (e.g. hypercalls)
where you do emulation on your own.  In that case, KVMI_SET_REGS + SKIP
is the right thing to do.

> In the future we are looking at maybe moving away from it on Intel-s,
> by way of VMFUNC and #VE.
> 
>> But why is KVMI_SET_REGS slower than a set regs command followed by an
>> action?
> To be honest, we just looked at the Xen implementation which gates
> writing back the registers to VMCS on them actually having been
> changed.

That would be possible on KVMI too.  Just don't do the KVMI_SET_REGS
unless the registers have changed.

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-07 16:52   ` Paolo Bonzini
  2017-07-10 15:32     ` alazar
@ 2017-07-13  8:36     ` Mihai Donțu
  2017-07-13  9:15       ` Paolo Bonzini
  2017-07-27 17:06     ` Mihai Donțu
  2 siblings, 1 reply; 50+ messages in thread
From: Mihai Donțu @ 2017-07-13  8:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
> > +3. KVMI_PAUSE_GUEST
> > +-------------------
> > +
> > +:Architectures: all
> > +:Versions: >= 1
> > +:Parameters: {}
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_error_code {
> > +		__s32 err;
> > +		__u32 padding;
> > +	};
> > +
> > +This command will pause all vcpus threads, by getting them out of guest mode
> > +and put them in the "waiting introspection commands" state.
> 
> Pausing all threads synchronously is a tricky concept.  The main issue
> is that the guest could be paused at the userspace level.  Then KVM
> would not be running (userspace is stuck in pthread_cond_wait,
> typically) and could not acknowledge the pause.
> 
> Worse, KVM is not able to distinguish userspace that has paused the VM
> from userspace that is doing MMIO or userspace that has a bug and hung
> somewhere.  And even worse, there are cases where userspace wants to
> modify registers while doing port I/O (the awful VMware RPCI port).  So
> I'd rather avoid this.

I should give more details here: we don't need to pause the vCPU-s in
the sense widely understood but just prevent them from entering the
guest for a short period of time. In our particular case, we need this
when we start introspecting a VM that's already running. For this we
kick the vCPU-s out of the guest so that our scan of the memory does
not race with the guest kernel/applications.

Another use case is when we inject applications into a running guest.
We also kick the vCPU-s out while we atomically make changes to kernel
specific structures.

> > +8. KVMI_GET_MTRR_TYPE
> > +---------------------
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_mtrr_type {
> > +		__u64 gpa;
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_mtrr_type_reply {
> > +		__s32 err;
> > +		__u32 type;
> > +	};
> > +
> > +Returns the guest memory type for a specific physical address.
> 
> What is this used for?  KVM ignores the guest MTRRs, so if possible I'd
> rather avoid it.

We use it do identify cacheable memory which usually indicates device
memory, something we don't want to touch. We are also looking into
making use of the page attributes (PAT) or other PTE-bits instead of
MTRR, but for the time being MTRR-s are still being relied upon.

> > +9. KVMI_GET_MTRRS
> > +-----------------
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_mtrrs {
> > +		__u16 vcpu;
> > +		__u16 padding[3];
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_mtrrs_reply {
> > +		__s32 err;
> > +		__u32 padding;
> > +		__u64 pat;
> > +		__u64 cap;
> > +		__u64 type;
> > +	};
> > +
> > +Returns MSR_IA32_CR_PAT, MSR_MTRRcap and MSR_MTRRdefType for the specified
> > +vCPU.
> 
> This could use KVM_GET_REGISTERS, couldn't it?

Yes, I belive it can be folded into that command.

> > +14. KVMI_INJECT_BREAKPOINT
> > +--------------------------
> > +
> > +:Architectures: all
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_inject_breakpoint {
> > +		__u16 vcpu;
> > +		__u16 padding[3];
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_error_code {
> > +		__s32 err;
> > +		__u32 padding;
> > +	};
> > +
> > +Injects a breakpoint for the specified vCPU. This command is usually sent in
> > +response to an event and as such the proper GPR-s will be set with the reply.
> 
> What is a "breakpoint" in this context?

A simple INT3. It's what most of our patches consist of. We keep track
of them and handle the ones we own while pass (reinject) the ones used
by the guest (debuggers or whatnot).

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-13  8:36     ` Mihai Donțu
@ 2017-07-13  9:15       ` Paolo Bonzini
  2017-07-27 16:23         ` Mihai Donțu
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-07-13  9:15 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On 13/07/2017 10:36, Mihai Donțu wrote:
> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
>> Worse, KVM is not able to distinguish userspace that has paused the VM
>> from userspace that is doing MMIO or userspace that has a bug and hung
>> somewhere.  And even worse, there are cases where userspace wants to
>> modify registers while doing port I/O (the awful VMware RPCI port).  So
>> I'd rather avoid this.
> 
> I should give more details here: we don't need to pause the vCPU-s in
> the sense widely understood but just prevent them from entering the
> guest for a short period of time. In our particular case, we need this
> when we start introspecting a VM that's already running. For this we
> kick the vCPU-s out of the guest so that our scan of the memory does
> not race with the guest kernel/applications.
> 
> Another use case is when we inject applications into a running guest.
> We also kick the vCPU-s out while we atomically make changes to kernel
> specific structures.

This is not possible to do in KVM, because KVM doesn't control what
happens to the memory outside KVM_RUN (and of course it doesn't control
devices doing DMA).  You need to talk to QEMU in order to do this.

To do atomic changes to kernel specific structures, I would change the
page tables to inaccessible instead, but that also doesn't protect them
from devices doing DMA into them.

Another issue: say a VM is waiting for a reply from the introspector,
and the reply is delayed so the VM gets a signal and needs to get out to
QEMU with EINTR.  I don't think it is always possible to retry the
instruction on the next KVM_RUN, because the introspector might be
making partial changes.  Add live migration to the mix if you want to
make things even more complicated. :)

I think we need a way to mark a set of commands for atomic application.
That is, the structure of the command stream needs to be

    command 1
    command 2
    event reply 1
    transaction end marker
    command 3
    transaction end marker
    command 4
    event reply 2
    transaction end marker

>>> +8. KVMI_GET_MTRR_TYPE
>>> +---------------------
>>
>> What is this used for?  KVM ignores the guest MTRRs, so if possible I'd
>> rather avoid it.
> 
> We use it do identify cacheable memory which usually indicates device
> memory, something we don't want to touch. We are also looking into
> making use of the page attributes (PAT) or other PTE-bits instead of
> MTRR, but for the time being MTRR-s are still being relied upon.

Fair enough.  But you can compute it yourself from the MTRRs, can't you?
 A separate command is just adding attack surface in the hypervisor.

>>> +14. KVMI_INJECT_BREAKPOINT
>>> +--------------------------
>>> +
>>> +:Architectures: all
>>> +:Versions: >= 1
>>> +:Parameters: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_inject_breakpoint {
>>> +		__u16 vcpu;
>>> +		__u16 padding[3];
>>> +	};
>>> +
>>> +:Returns: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_error_code {
>>> +		__s32 err;
>>> +		__u32 padding;
>>> +	};
>>> +
>>> +Injects a breakpoint for the specified vCPU. This command is usually sent in
>>> +response to an event and as such the proper GPR-s will be set with the reply.
>>
>> What is a "breakpoint" in this context?
> 
> A simple INT3. It's what most of our patches consist of. We keep track
> of them and handle the ones we own while pass (reinject) the ones used
> by the guest (debuggers or whatnot).

Why can't they be written with KVMI_READ/WRITE_PHYSICAL?  (I would keep
those two as they provide a more direct interface than map/unmap, and
they work even with introspectors that are not sibling guests of the
introspected VM).

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/1] VM introspection
  2017-07-13  5:37   ` Mihai Donțu
@ 2017-07-14 16:13     ` Konrad Rzeszutek Wilk
  2017-07-18  8:55       ` Mihai Donțu
  0 siblings, 1 reply; 50+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-07-14 16:13 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Konrad Rzeszutek Wilk, kvm, Paolo Bonzini,
	Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar

On Thu, Jul 13, 2017 at 08:37:07AM +0300, Mihai Donțu wrote:
> Hi Konrad,
> 
> On Wed, 2017-07-12 at 10:09 -0400, Konrad Rzeszutek Wilk wrote:
> > On Fri, Jul 07, 2017 at 05:34:15PM +0300, Adalbert Lazar wrote:
> > > The following patch adds the documentation for an introspection subsystem for
> > > KVM (KVMi). It details the purpose and the use case that has shaped the
> > > proposed API/ABI, as well as the wire protocol.
> > > 
> > > During the discussion that has developed around our previous RFC patchset a
> > > number of TODO-s have been highlighted:
> > > 
> > >  * the integration in qemu: a socket-based protocol used to initiate the
> > >    connection with an introspection tool and then passes control to KVM, the
> > >    in-kernel mechanism taking over from there;
> > > 
> > >  * the integration of the SPT-handling mechanism into the KVM MMU;
> > > 
> > >  * the elaboration of a set of policies and a mechanism to better control
> > >    this feature;
> > > 
> > > One bit of code that has passed (maybe) unnoticed in the RFC is a new
> > > function added to Linux' mm called vm_replace_page() which, much like KSM's
> > > replace_page(), gets two processes to share a page (read-write, no-COW):
> > > 
> > > https://marc.info/?l=kvm&m=149762056518799&w=2
> > > 
> > > This is used to quickly scan and patch the guest software.
> > > 
> > > The patch following this cover letter does not yet address the points above
> > > but aims to clear with the community the overall ABI/API, with a focus on
> > > x86.
> > 
> > Are there thoughts on making this work with libvmi? Or would this
> > interface be used by a paid product?
> 
> We have not looked at how we would go about adding support for this API
> in libvmi, but a quick look tells me a driver shouldn't be that hard to
> write.

OK. Asking as the libvmi interface is quite fantastic. It has the drivers
for VMWare, Xen, and others - and a pretty cool library to do most of
the introspection through simple application linking to it.

And more importantly (to me at least), it does allow academia
to test different ideas.
> 
> We are, however, looking at publishing the sources for a small library
> called libkvmi that would expose an easy-to-use interface and on top of
> which we will add support for KVM VMI in:
> 
> https://github.com/razvan-cojocaru/libbdvmi

And thoughts of testing API that will be part of KVM test unit?

> 
> Other than that, the major user will indeed be a commercial product
> that, right now, uses Xen's VMI infrastructure:
> 
> https://citrixready.citrix.com/bitdefender/bitdefender-hypervisor-introspection.html
> 
> Thanks,
> 
> > > v2:
> > >   - add documentation and ABI [Paolo, Jan]
> > >   - drop all the other patches for now [Paolo]
> > >   - remove KVMI_GET_GUESTS, KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF,
> > >     and let libvirt/qemu handle this [Stefan, Paolo]
> > >   - change the license from LGPL to GPL [Jan]
> > >   - remove KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL (not used anymore)
> > >   - make the interface a little more consistent
> > > 
> > > Adalbert Lazar (1):
> > >   kvm: Add documentation and ABI/API header for VM introspection
> > > 
> > >  Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/kvmi.h          | 310 ++++++++++++
> > >  2 files changed, 1295 insertions(+)
> > >  create mode 100644 Documentation/virtual/kvm/kvmi.rst
> > >  create mode 100644 include/uapi/linux/kvmi.h
> > > 
> 
> -- 
> Mihai Donțu
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/1] VM introspection
  2017-07-14 16:13     ` Konrad Rzeszutek Wilk
@ 2017-07-18  8:55       ` Mihai Donțu
  0 siblings, 0 replies; 50+ messages in thread
From: Mihai Donțu @ 2017-07-18  8:55 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Konrad Rzeszutek Wilk, kvm, Paolo Bonzini,
	Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar

On Fri, 2017-07-14 at 12:13 -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Jul 13, 2017 at 08:37:07AM +0300, Mihai Donțu wrote:
> > Hi Konrad,
> > 
> > On Wed, 2017-07-12 at 10:09 -0400, Konrad Rzeszutek Wilk wrote:
> > > On Fri, Jul 07, 2017 at 05:34:15PM +0300, Adalbert Lazar wrote:
> > > > The following patch adds the documentation for an introspection subsystem for
> > > > KVM (KVMi). It details the purpose and the use case that has shaped the
> > > > proposed API/ABI, as well as the wire protocol.
> > > > 
> > > > During the discussion that has developed around our previous RFC patchset a
> > > > number of TODO-s have been highlighted:
> > > > 
> > > >  * the integration in qemu: a socket-based protocol used to initiate the
> > > >    connection with an introspection tool and then passes control to KVM, the
> > > >    in-kernel mechanism taking over from there;
> > > > 
> > > >  * the integration of the SPT-handling mechanism into the KVM MMU;
> > > > 
> > > >  * the elaboration of a set of policies and a mechanism to better control
> > > >    this feature;
> > > > 
> > > > One bit of code that has passed (maybe) unnoticed in the RFC is a new
> > > > function added to Linux' mm called vm_replace_page() which, much like KSM's
> > > > replace_page(), gets two processes to share a page (read-write, no-COW):
> > > > 
> > > > https://marc.info/?l=kvm&m=149762056518799&w=2
> > > > 
> > > > This is used to quickly scan and patch the guest software.
> > > > 
> > > > The patch following this cover letter does not yet address the points above
> > > > but aims to clear with the community the overall ABI/API, with a focus on
> > > > x86.
> > > 
> > > Are there thoughts on making this work with libvmi? Or would this
> > > interface be used by a paid product?
> > 
> > We have not looked at how we would go about adding support for this API
> > in libvmi, but a quick look tells me a driver shouldn't be that hard to
> > write.
> 
> OK. Asking as the libvmi interface is quite fantastic. It has the drivers
> for VMWare, Xen, and others - and a pretty cool library to do most of
> the introspection through simple application linking to it.
> 
> And more importantly (to me at least), it does allow academia
> to test different ideas.
> > 
> > We are, however, looking at publishing the sources for a small library
> > called libkvmi that would expose an easy-to-use interface and on top of
> > which we will add support for KVM VMI in:
> > 
> > https://github.com/razvan-cojocaru/libbdvmi
> 
> And thoughts of testing API that will be part of KVM test unit?

We have not yet put enough though into how we will integrate with KVM's
unit tests, partly because we are still making adjustments to the
design. All we have on the board right now is that we must create unit
tests. :-)

Regards,

> > Other than that, the major user will indeed be a commercial product
> > that, right now, uses Xen's VMI infrastructure:
> > 
> > https://citrixready.citrix.com/bitdefender/bitdefender-hypervisor-introspection.html
> > 
> > > > v2:
> > > >   - add documentation and ABI [Paolo, Jan]
> > > >   - drop all the other patches for now [Paolo]
> > > >   - remove KVMI_GET_GUESTS, KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF,
> > > >     and let libvirt/qemu handle this [Stefan, Paolo]
> > > >   - change the license from LGPL to GPL [Jan]
> > > >   - remove KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL (not used anymore)
> > > >   - make the interface a little more consistent
> > > > 
> > > > Adalbert Lazar (1):
> > > >   kvm: Add documentation and ABI/API header for VM introspection
> > > > 
> > > >  Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++
> > > >  include/uapi/linux/kvmi.h          | 310 ++++++++++++
> > > >  2 files changed, 1295 insertions(+)
> > > >  create mode 100644 Documentation/virtual/kvm/kvmi.rst
> > > >  create mode 100644 include/uapi/linux/kvmi.h

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-13  7:32               ` Paolo Bonzini
@ 2017-07-18 11:51                 ` Mihai Donțu
  2017-07-18 12:02                   ` Mihai Donțu
  2017-07-23 13:02                   ` Paolo Bonzini
  0 siblings, 2 replies; 50+ messages in thread
From: Mihai Donțu @ 2017-07-18 11:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On Thu, 2017-07-13 at 09:32 +0200, Paolo Bonzini wrote:
> On 13/07/2017 07:57, Mihai Donțu wrote:
> > > Actually it makes more sense for SKIP, I think, where the introspector
> > > is actually doing emulation?
> > 
> > I'm afraid I don't undestand your question, however we were looking at
> > using KVM's x86 emulator rather than putting together our own, as such
> > software might be fun to write but they take a very long time to get
> > right. I'd argue that KVM's emulator has already seen a lot of
> > coverage.
> 
> Of course!  But there could be some special cases (e.g. hypercalls)
> where you do emulation on your own.  In that case, KVMI_SET_REGS + SKIP
> is the right thing to do.

I think I finally understand what you're saying. That SKIP would tell
the introspection subsystem to just write back the registers and enter
the guest, no in-host emulation needed. So, to reiterate, the possible
actions would be:

 * SKIP - re-enter the guest (the introspection tool has adjusted all
   registers)
 * RETRY - re-enter the guest
 * ALLOW - use the emulator
 * CRASH - kill the guest process

It seems that SKIP requires a variant of KVMI_SET_REGS (_FULL?) that
sets all registers that might have been affected by the emulation
(control, MSR-s, MMX/SSE/AVX). I guess there can be an usecase for
that. It also looks like its identical with KVMI_SET_REGS_FULL + RETRY.

> > In the future we are looking at maybe moving away from it on Intel-s,
> > by way of VMFUNC and #VE.
> > 
> > > But why is KVMI_SET_REGS slower than a set regs command followed by an
> > > action?
> > 
> > To be honest, we just looked at the Xen implementation which gates
> > writing back the registers to VMCS on them actually having been
> > changed.
> 
> That would be possible on KVMI too.  Just don't do the KVMI_SET_REGS
> unless the registers have changed.

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-18 11:51                 ` Mihai Donțu
@ 2017-07-18 12:02                   ` Mihai Donțu
  2017-07-23 13:02                   ` Paolo Bonzini
  1 sibling, 0 replies; 50+ messages in thread
From: Mihai Donțu @ 2017-07-18 12:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On Tue, 2017-07-18 at 14:51 +0300, Mihai Donțu wrote:
> On Thu, 2017-07-13 at 09:32 +0200, Paolo Bonzini wrote:
> > On 13/07/2017 07:57, Mihai Donțu wrote:
> > > > Actually it makes more sense for SKIP, I think, where the introspector
> > > > is actually doing emulation?
> > > 
> > > I'm afraid I don't undestand your question, however we were looking at
> > > using KVM's x86 emulator rather than putting together our own, as such
> > > software might be fun to write but they take a very long time to get
> > > right. I'd argue that KVM's emulator has already seen a lot of
> > > coverage.
> > 
> > Of course!  But there could be some special cases (e.g. hypercalls)
> > where you do emulation on your own.  In that case, KVMI_SET_REGS + SKIP
> > is the right thing to do.
> 
> I think I finally understand what you're saying. That SKIP would tell
> the introspection subsystem to just write back the registers and enter
> the guest, no in-host emulation needed. So, to reiterate, the possible
> actions would be:
> 
>  * SKIP - re-enter the guest (the introspection tool has adjusted all
>    registers)
>  * RETRY - re-enter the guest
>  * ALLOW - use the emulator
>  * CRASH - kill the guest process
> 
> It seems that SKIP requires a variant of KVMI_SET_REGS (_FULL?) that
> sets all registers that might have been affected by the emulation
> (control, MSR-s, MMX/SSE/AVX). I guess there can be an usecase for
> that. It also looks like its identical with KVMI_SET_REGS_FULL + RETRY.

I mentioned in a previous email something about statistics. The
following are a sort of that:

eventCount: 13.2 eventsMemAccess: 12.4 eventsWriteCtrlReg: 0.8 partialContext: 0.2 partialCpu: 0.2 xcGetMemAccess: 116 xcMapPage: 191.4 xcSetMemAccessMulti: 1.6 
eventCount: 4261.8 eventsBreakPoint: 4170.2 eventsMemAccess: 84.4 eventsWriteCtrlReg: 7.2 setRegisters: 4170.2 xcMapPage: 3.4 xcSetMemAccessMulti: 0.2 
eventCount: 8098.6 eventsBreakPoint: 4783.4 eventsMemAccess: 3302.4 eventsWriteCtrlReg: 12.8 setRegisters: 4783.4 xcMapPage: 2.6 
eventCount: 3495 eventsBreakPoint: 2631.6 eventsMemAccess: 846.6 eventsWriteCtrlReg: 16.8 setRegisters: 2631.6 xcMapPage: 2.6 xcSetMemAccessMulti: 0.4 
eventCount: 538.4 eventsBreakPoint: 248 eventsMemAccess: 279.2 eventsWriteCtrlReg: 11.2 setRegisters: 248 xcMapPage: 1.2 
eventCount: 4876.2 eventsBreakPoint: 3683.4 eventsMemAccess: 1172.8 eventsWriteCtrlReg: 20 setRegisters: 3683.4 xcMapPage: 4.8 
eventCount: 5937.4 eventsBreakPoint: 4403.8 eventsMemAccess: 1507.2 eventsWriteCtrlReg: 26.4 setRegisters: 4403.8 xcMapPage: 4.8 
eventCount: 9992.4 eventsBreakPoint: 7948.6 eventsMemAccess: 2019.8 eventsWriteCtrlReg: 24 setRegisters: 7948.6 xcMapPage: 1.4 xcSetMemAccessMulti: 5 
eventCount: 5150.6 eventsBreakPoint: 2175 eventsMemAccess: 2902.8 eventsWriteCtrlReg: 72.8 setRegisters: 2175 xcGetMemAccess: 0.4 xcMapPage: 8 xcSetMemAccessMulti: 1 
eventCount: 5422.2 eventsBreakPoint: 4362.2 eventsMemAccess: 1012.8 eventsWriteCtrlReg: 47.2 setRegisters: 4362.2 xcGetMemAccess: 2.8 xcMapPage: 10.2 xcSetMemAccessMulti: 3.4 
eventCount: 1910.2 eventsBreakPoint: 1665.6 eventsMemAccess: 231.8 eventsWriteCtrlReg: 12.8 setRegisters: 1665.6 xcGetMemAccess: 0.2 xcMapPage: 2.2 xcSetMemAccessMulti: 0.6 
eventCount: 1834.4 eventsBreakPoint: 1357.6 eventsMemAccess: 462.4 eventsWriteCtrlReg: 14.4 setRegisters: 1357.6 xcGetMemAccess: 0.2 xcMapPage: 4.8 xcSetMemAccessMulti: 0.4 
eventCount: 6081.2 eventsBreakPoint: 4855.6 eventsMemAccess: 1208.8 eventsWriteCtrlReg: 16.8 setRegisters: 4855.6 xcMapPage: 4 
eventCount: 1105.4 eventsBreakPoint: 855 eventsMemAccess: 226.4 eventsWriteCtrlReg: 24 setRegisters: 855 xcMapPage: 1.6 
eventCount: 8362.8 eventsBreakPoint: 4409.2 eventsMemAccess: 3917.6 eventsWriteCtrlReg: 36 setRegisters: 4409.2 xcGetMemAccess: 117.4 xcMapPage: 9 xcSetMemAccessMulti: 254.6 
eventCount: 2222.2 eventsBreakPoint: 32.2 eventsMemAccess: 2169.2 eventsWriteCtrlReg: 20.8 setRegisters: 32.2 xcGetMemAccess: 2.8 xcMapPage: 5 xcSetMemAccessMulti: 104.4 
eventCount: 2889.2 eventsBreakPoint: 1447.8 eventsMemAccess: 1419.8 eventsWriteCtrlReg: 21.6 partialContext: 0.8 partialCpu: 0.8 setRegisters: 1447.8 xcGetMemAccess: 1.4 xcMapPage: 16.6 xcSetMemAccessMulti: 2.2 
eventCount: 1698.8 eventsBreakPoint: 1031.8 eventsMemAccess: 655 eventsWriteCtrlReg: 12 setRegisters: 1031.8 xcGetMemAccess: 0.6 xcMapPage: 5 xcSetMemAccessMulti: 0.8 
eventCount: 691.8 eventsBreakPoint: 250.4 eventsMemAccess: 435.8 eventsWriteCtrlReg: 5.6 setRegisters: 250.4 xcGetMemAccess: 0.4 xcMapPage: 4 xcSetMemAccessMulti: 0.4 
eventCount: 756.8 eventsBreakPoint: 293.2 eventsMemAccess: 454.8 eventsWriteCtrlReg: 8.8 setRegisters: 293.2 xcGetMemAccess: 1.4 xcMapPage: 3.4 xcSetMemAccessMulti: 1.4 

These represent the number events/calls per second during a normal
introspection session (start Windows 10 x64, open Edge). The breakpoint
events correspond to the various API calls invoked and, as it can be
seen, for each of them we do a 'setRegisters'. We were hoping we can
reduce the overhead by a bit by bundling KVMI_SET_REGISTERS with the
event response.

If I have not managed to convince you, I think we can go ahead and keep
them separate, have an initial implementation and see some actual
performance numbers. Should be no hassle. :-)

> > > In the future we are looking at maybe moving away from it on Intel-s,
> > > by way of VMFUNC and #VE.
> > > 
> > > > But why is KVMI_SET_REGS slower than a set regs command followed by an
> > > > action?
> > > 
> > > To be honest, we just looked at the Xen implementation which gates
> > > writing back the registers to VMCS on them actually having been
> > > changed.
> > 
> > That would be possible on KVMI too.  Just don't do the KVMI_SET_REGS
> > unless the registers have changed.

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-18 11:51                 ` Mihai Donțu
  2017-07-18 12:02                   ` Mihai Donțu
@ 2017-07-23 13:02                   ` Paolo Bonzini
  2017-07-26 17:04                     ` Mihai Donțu
  1 sibling, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-07-23 13:02 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On 18/07/2017 13:51, Mihai Donțu wrote:
> On Thu, 2017-07-13 at 09:32 +0200, Paolo Bonzini wrote:
>> On 13/07/2017 07:57, Mihai Donțu wrote:
>>>> Actually it makes more sense for SKIP, I think, where the introspector
>>>> is actually doing emulation?
>>>
>>> I'm afraid I don't undestand your question, however we were looking at
>>> using KVM's x86 emulator rather than putting together our own, as such
>>> software might be fun to write but they take a very long time to get
>>> right. I'd argue that KVM's emulator has already seen a lot of
>>> coverage.
>>
>> Of course!  But there could be some special cases (e.g. hypercalls)
>> where you do emulation on your own.  In that case, KVMI_SET_REGS + SKIP
>> is the right thing to do.
> 
> I think I finally understand what you're saying. That SKIP would tell
> the introspection subsystem to just write back the registers and enter
> the guest, no in-host emulation needed. So, to reiterate, the possible
> actions would be:
> 
>  * SKIP - re-enter the guest (the introspection tool has adjusted all
>    registers)
>  * RETRY - re-enter the guest
>  * ALLOW - use the emulator
>  * CRASH - kill the guest process
> 
> It seems that SKIP requires a variant of KVMI_SET_REGS (_FULL?) that
> sets all registers that might have been affected by the emulation
> (control, MSR-s, MMX/SSE/AVX). I guess there can be an usecase for
> that. It also looks like its identical with KVMI_SET_REGS_FULL + RETRY.

One difference that comes to mind between SKIP and RETRY is that SKIP
would inject a singlestep exception if TF was 1 on entry, and clear the
interrupt shadow.  RETRY would not do either of those.

(The name for SKIP comes from the KVM function
kvm_skip_emulated_instruction).

> We were hoping we can
> reduce the overhead by a bit by bundling KVMI_SET_REGISTERS with the
> event response.
> 
> If I have not managed to convince you, I think we can go ahead and keep
> them separate, have an initial implementation and see some actual
> performance numbers. Should be no hassle. 

I think you should implement transactions in the protocol, so
effectively KVMI_SET_REGISTERS would be bundled with the event response
anyway.

Paolo

>>> In the future we are looking at maybe moving away from it on Intel-s,
>>> by way of VMFUNC and #VE.
>>>
>>>> But why is KVMI_SET_REGS slower than a set regs command followed by an
>>>> action?
>>>
>>> To be honest, we just looked at the Xen implementation which gates
>>> writing back the registers to VMCS on them actually having been
>>> changed.
>>
>> That would be possible on KVMI too.  Just don't do the KVMI_SET_REGS
>> unless the registers have changed.
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-23 13:02                   ` Paolo Bonzini
@ 2017-07-26 17:04                     ` Mihai Donțu
  2017-07-26 17:25                       ` Tamas K Lengyel
  2017-07-27 13:33                       ` Paolo Bonzini
  0 siblings, 2 replies; 50+ messages in thread
From: Mihai Donțu @ 2017-07-26 17:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On Sun, 2017-07-23 at 15:02 +0200, Paolo Bonzini wrote:
> On 18/07/2017 13:51, Mihai Donțu wrote:
> > On Thu, 2017-07-13 at 09:32 +0200, Paolo Bonzini wrote:
> > > On 13/07/2017 07:57, Mihai Donțu wrote:
> > > > > Actually it makes more sense for SKIP, I think, where the introspector
> > > > > is actually doing emulation?
> > > > 
> > > > I'm afraid I don't undestand your question, however we were looking at
> > > > using KVM's x86 emulator rather than putting together our own, as such
> > > > software might be fun to write but they take a very long time to get
> > > > right. I'd argue that KVM's emulator has already seen a lot of
> > > > coverage.
> > > 
> > > Of course!  But there could be some special cases (e.g. hypercalls)
> > > where you do emulation on your own.  In that case, KVMI_SET_REGS + SKIP
> > > is the right thing to do.
> > 
> > I think I finally understand what you're saying. That SKIP would tell
> > the introspection subsystem to just write back the registers and enter
> > the guest, no in-host emulation needed. So, to reiterate, the possible
> > actions would be:
> > 
> >  * SKIP - re-enter the guest (the introspection tool has adjusted all
> >    registers)
> >  * RETRY - re-enter the guest
> >  * ALLOW - use the emulator
> >  * CRASH - kill the guest process
> > 
> > It seems that SKIP requires a variant of KVMI_SET_REGS (_FULL?) that
> > sets all registers that might have been affected by the emulation
> > (control, MSR-s, MMX/SSE/AVX). I guess there can be an usecase for
> > that. It also looks like its identical with KVMI_SET_REGS_FULL + RETRY.
> 
> One difference that comes to mind between SKIP and RETRY is that SKIP
> would inject a singlestep exception if TF was 1 on entry, and clear the
> interrupt shadow.  RETRY would not do either of those.
> 
> (The name for SKIP comes from the KVM function
> kvm_skip_emulated_instruction).

OK.

> > We were hoping we can
> > reduce the overhead by a bit by bundling KVMI_SET_REGISTERS with the
> > event response.
> > 
> > If I have not managed to convince you, I think we can go ahead and keep
> > them separate, have an initial implementation and see some actual
> > performance numbers. Should be no hassle. 
> 
> I think you should implement transactions in the protocol, so
> effectively KVMI_SET_REGISTERS would be bundled with the event response
> anyway.

I see. Then maybe we should provide a way for commands to specify an
event ID. If zero, then the command is satisfied using data straight
from the vCPU (when making changes), otherwise a structure associated
with the event will be used as cache for all get-s/set-s and apply them
all in one go when the event reply arrives. This should work nicely
since we read a good deal of the register set anyway when sending the
event.

> > > > In the future we are looking at maybe moving away from it on Intel-s,
> > > > by way of VMFUNC and #VE.
> > > > 
> > > > > But why is KVMI_SET_REGS slower than a set regs command followed by an
> > > > > action?
> > > > 
> > > > To be honest, we just looked at the Xen implementation which gates
> > > > writing back the registers to VMCS on them actually having been
> > > > changed.
> > > 
> > > That would be possible on KVMI too.  Just don't do the KVMI_SET_REGS
> > > unless the registers have changed.

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-26 17:04                     ` Mihai Donțu
@ 2017-07-26 17:25                       ` Tamas K Lengyel
  2017-07-27 14:41                         ` Mihai Donțu
  2017-07-27 13:33                       ` Paolo Bonzini
  1 sibling, 1 reply; 50+ messages in thread
From: Tamas K Lengyel @ 2017-07-26 17:25 UTC (permalink / raw)
  To: Mihai Donțu; +Cc: kvm

Hi Mihai,
this series was just brought to my attention (and this is the first
message in my inbox I can reply to) and I'm very happy to see you guys
working on this! I'm still reviewing the series and the only question
I have at this point is whether you will be also adding support for
setting the MTF through KVMI?

Thanks,
Tamas

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-26 17:04                     ` Mihai Donțu
  2017-07-26 17:25                       ` Tamas K Lengyel
@ 2017-07-27 13:33                       ` Paolo Bonzini
  2017-07-27 14:46                         ` Mihai Donțu
  1 sibling, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-07-27 13:33 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On 26/07/2017 19:04, Mihai Donțu wrote:
>> I think you should implement transactions in the protocol, so
>> effectively KVMI_SET_REGISTERS would be bundled with the event response
>> anyway.
> 
> I see. Then maybe we should provide a way for commands to specify an
> event ID. If zero, then the command is satisfied using data straight
> from the vCPU (when making changes), otherwise a structure associated
> with the event will be used as cache for all get-s/set-s and apply them
> all in one go when the event reply arrives. This should work nicely
> since we read a good deal of the register set anyway when sending the
> event.

Yes, that is okay.  Just a question, why would the event ID provide more
information than just the vCPU id?  How can there be more than one event
active on the same vCPU?

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-26 17:25                       ` Tamas K Lengyel
@ 2017-07-27 14:41                         ` Mihai Donțu
  0 siblings, 0 replies; 50+ messages in thread
From: Mihai Donțu @ 2017-07-27 14:41 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: kvm

Hi Tamas,

On Wed, 2017-07-26 at 11:25 -0600, Tamas K Lengyel wrote:
> this series was just brought to my attention (and this is the first
> message in my inbox I can reply to) and I'm very happy to see you guys
> working on this! I'm still reviewing the series and the only question
> I have at this point is whether you will be also adding support for
> setting the MTF through KVMI?

Thanks for taking a look over this.

Currently we have no code in place for guest single stepping, but since
we're looking into making use of Intel's VMFUNC and #VE, I'm sure a
need for a feature like this will arrise. Or _maybe_ we can make it
part of the #PF handling code and have the introspection tool respond
with ALLOW and an EPT view ID and have KVM automatically switch, single
step, switch back and resume, if that makes sense for everybody.

Regards,

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-27 13:33                       ` Paolo Bonzini
@ 2017-07-27 14:46                         ` Mihai Donțu
  0 siblings, 0 replies; 50+ messages in thread
From: Mihai Donțu @ 2017-07-27 14:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On Thu, 2017-07-27 at 15:33 +0200, Paolo Bonzini wrote:
> On 26/07/2017 19:04, Mihai Donțu wrote:
> > > I think you should implement transactions in the protocol, so
> > > effectively KVMI_SET_REGISTERS would be bundled with the event response
> > > anyway.
> > 
> > I see. Then maybe we should provide a way for commands to specify an
> > event ID. If zero, then the command is satisfied using data straight
> > from the vCPU (when making changes), otherwise a structure associated
> > with the event will be used as cache for all get-s/set-s and apply them
> > all in one go when the event reply arrives. This should work nicely
> > since we read a good deal of the register set anyway when sending the
> > event.
> 
> Yes, that is okay.  Just a question, why would the event ID provide more
> information than just the vCPU id?  How can there be more than one event
> active on the same vCPU?

You're right. There can't. We'll just add some logic that says if the
vCPU has an event pending (or currently being handled), to use that
cache to satisfy all commands.

Thanks,

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-13  9:15       ` Paolo Bonzini
@ 2017-07-27 16:23         ` Mihai Donțu
  2017-07-27 16:52           ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Mihai Donțu @ 2017-07-27 16:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote:
> On 13/07/2017 10:36, Mihai Donțu wrote:
> > On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
> > > Worse, KVM is not able to distinguish userspace that has paused the VM
> > > from userspace that is doing MMIO or userspace that has a bug and hung
> > > somewhere.  And even worse, there are cases where userspace wants to
> > > modify registers while doing port I/O (the awful VMware RPCI port).  So
> > > I'd rather avoid this.
> > 
> > I should give more details here: we don't need to pause the vCPU-s in
> > the sense widely understood but just prevent them from entering the
> > guest for a short period of time. In our particular case, we need this
> > when we start introspecting a VM that's already running. For this we
> > kick the vCPU-s out of the guest so that our scan of the memory does
> > not race with the guest kernel/applications.
> > 
> > Another use case is when we inject applications into a running guest.
> > We also kick the vCPU-s out while we atomically make changes to kernel
> > specific structures.
> 
> This is not possible to do in KVM, because KVM doesn't control what
> happens to the memory outside KVM_RUN (and of course it doesn't control
> devices doing DMA).  You need to talk to QEMU in order to do this.

Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on
the already opened file descriptor to /dev/kvm for an event?

> To do atomic changes to kernel specific structures, I would change the
> page tables to inaccessible instead, but that also doesn't protect them
> from devices doing DMA into them.

If we have qemu pull out of the guest all vCPU-s and wait for a sign
from the KVMI subsystem, then that looks sufficient. Devices acessing
the memory (passedthrough devices, I assume) should be no problem as
we're never interested in device memory.

> Another issue: say a VM is waiting for a reply from the introspector,
> and the reply is delayed so the VM gets a signal and needs to get out to
> QEMU with EINTR.  I don't think it is always possible to retry the
> instruction on the next KVM_RUN, because the introspector might be
> making partial changes.  Add live migration to the mix if you want to
> make things even more complicated. :)
> 
> I think we need a way to mark a set of commands for atomic application.
> That is, the structure of the command stream needs to be
> 
>     command 1
>     command 2
>     event reply 1
>     transaction end marker
>     command 3
>     transaction end marker
>     command 4
>     event reply 2
>     transaction end marker

This should be covered by a previous email exchange. Commands targeting
vCPU-s for which events are pending or are currently being handled
should be done via a structure that acts like a cache, so that when the
event reply reaches KVM, all potential modifications are applied in one
go. This affects all register manipulations. Commands targeting memory
are unaffected by the state of the vCPU, though this might change when
we factor EPT views. But, alas, we have no clear view on this last
topic as of yet.

> > > > +8. KVMI_GET_MTRR_TYPE
> > > > +---------------------
> > > 
> > > What is this used for?  KVM ignores the guest MTRRs, so if possible I'd
> > > rather avoid it.
> > 
> > We use it do identify cacheable memory which usually indicates device
> > memory, something we don't want to touch. We are also looking into
> > making use of the page attributes (PAT) or other PTE-bits instead of
> > MTRR, but for the time being MTRR-s are still being relied upon.
> 
> Fair enough.  But you can compute it yourself from the MTRRs, can't you?
> A separate command is just adding attack surface in the hypervisor.

I think we can make some basic MTRR info available via GET_REGISTERS
and do the rest in the introspection tool.

> > > > +14. KVMI_INJECT_BREAKPOINT
> > > > +--------------------------
> > > > +
> > > > +:Architectures: all
> > > > +:Versions: >= 1
> > > > +:Parameters: ↴
> > > > +
> > > > +::
> > > > +
> > > > +	struct kvmi_inject_breakpoint {
> > > > +		__u16 vcpu;
> > > > +		__u16 padding[3];
> > > > +	};
> > > > +
> > > > +:Returns: ↴
> > > > +
> > > > +::
> > > > +
> > > > +	struct kvmi_error_code {
> > > > +		__s32 err;
> > > > +		__u32 padding;
> > > > +	};
> > > > +
> > > > +Injects a breakpoint for the specified vCPU. This command is usually sent in
> > > > +response to an event and as such the proper GPR-s will be set with the reply.
> > > 
> > > What is a "breakpoint" in this context?
> > 
> > A simple INT3. It's what most of our patches consist of. We keep track
> > of them and handle the ones we own while pass (reinject) the ones used
> > by the guest (debuggers or whatnot).
> 
> Why can't they be written with KVMI_READ/WRITE_PHYSICAL?  (I would keep
> those two as they provide a more direct interface than map/unmap, and
> they work even with introspectors that are not sibling guests of the
> introspected VM).

They can, nothing is stopping that. Also, we can keep the plain
read/write interfaces around. It just seemed easier to implement them
on top of an eventual mmap/munmap interface.

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-27 16:23         ` Mihai Donțu
@ 2017-07-27 16:52           ` Paolo Bonzini
  2017-07-27 17:19             ` Mihai Donțu
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-07-27 16:52 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On 27/07/2017 18:23, Mihai Donțu wrote:
> On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote:
>> On 13/07/2017 10:36, Mihai Donțu wrote:
>>> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
>>>> Worse, KVM is not able to distinguish userspace that has paused the VM
>>>> from userspace that is doing MMIO or userspace that has a bug and hung
>>>> somewhere.  And even worse, there are cases where userspace wants to
>>>> modify registers while doing port I/O (the awful VMware RPCI port).  So
>>>> I'd rather avoid this.
>>>
>>> I should give more details here: we don't need to pause the vCPU-s in
>>> the sense widely understood but just prevent them from entering the
>>> guest for a short period of time. In our particular case, we need this
>>> when we start introspecting a VM that's already running. For this we
>>> kick the vCPU-s out of the guest so that our scan of the memory does
>>> not race with the guest kernel/applications.
>>>
>>> Another use case is when we inject applications into a running guest.
>>> We also kick the vCPU-s out while we atomically make changes to kernel
>>> specific structures.
>>
>> This is not possible to do in KVM, because KVM doesn't control what
>> happens to the memory outside KVM_RUN (and of course it doesn't control
>> devices doing DMA).  You need to talk to QEMU in order to do this.
> 
> Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on
> the already opened file descriptor to /dev/kvm for an event?

Nope.  QEMU might be running and writing to memory in another thread.  I
don't see how this can be reliable on other hypervisors too, actually.

>> To do atomic changes to kernel specific structures, I would change the
>> page tables to inaccessible instead, but that also doesn't protect them
>> from devices doing DMA into them.
> 
> If we have qemu pull out of the guest all vCPU-s and wait for a sign
> from the KVMI subsystem, then that looks sufficient. Devices acessing
> the memory (passedthrough devices, I assume) should be no problem as
> we're never interested in device memory.

You're certainly interested in bus-master DMA from those devices though.

>> Another issue: say a VM is waiting for a reply from the introspector,
>> and the reply is delayed so the VM gets a signal and needs to get out to
>> QEMU with EINTR.  I don't think it is always possible to retry the
>> instruction on the next KVM_RUN, because the introspector might be
>> making partial changes.  Add live migration to the mix if you want to
>> make things even more complicated. :)
>>
>> I think we need a way to mark a set of commands for atomic application.
>> That is, the structure of the command stream needs to be
>>
>>     command 1
>>     command 2
>>     event reply 1
>>     transaction end marker
>>     command 3
>>     transaction end marker
>>     command 4
>>     event reply 2
>>     transaction end marker
> 
> This should be covered by a previous email exchange.

Correct.

>>>>> +8. KVMI_GET_MTRR_TYPE
>>>>> +---------------------
>>>>
>>>> What is this used for?  KVM ignores the guest MTRRs, so if possible I'd
>>>> rather avoid it.
>>>
>>> We use it do identify cacheable memory which usually indicates device
>>> memory, something we don't want to touch. We are also looking into
>>> making use of the page attributes (PAT) or other PTE-bits instead of
>>> MTRR, but for the time being MTRR-s are still being relied upon.
>>
>> Fair enough.  But you can compute it yourself from the MTRRs, can't you?
>> A separate command is just adding attack surface in the hypervisor.
> 
> I think we can make some basic MTRR info available via GET_REGISTERS
> and do the rest in the introspection tool.

Ok.

>>>>> +Injects a breakpoint for the specified vCPU. This command is usually sent in
>>>>> +response to an event and as such the proper GPR-s will be set with the reply.
>>>>
>>>> What is a "breakpoint" in this context?
>>>
>>> A simple INT3. It's what most of our patches consist of. We keep track
>>> of them and handle the ones we own while pass (reinject) the ones used
>>> by the guest (debuggers or whatnot).
>>
>> Why can't they be written with KVMI_READ/WRITE_PHYSICAL?  (I would keep
>> those two as they provide a more direct interface than map/unmap, and
>> they work even with introspectors that are not sibling guests of the
>> introspected VM).
> 
> They can, nothing is stopping that. Also, we can keep the plain
> read/write interfaces around. It just seemed easier to implement them
> on top of an eventual mmap/munmap interface.

I prefer to keep the simple interface and drop the breakpoint one.

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-07 16:52   ` Paolo Bonzini
  2017-07-10 15:32     ` alazar
  2017-07-13  8:36     ` Mihai Donțu
@ 2017-07-27 17:06     ` Mihai Donțu
  2017-07-27 17:18       ` Paolo Bonzini
  2 siblings, 1 reply; 50+ messages in thread
From: Mihai Donțu @ 2017-07-27 17:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
> > +2. KVMI_GET_GUEST_INFO
> > +----------------------
> > +
> > +:Architectures: all
> > +:Versions: >= 1
> > +:Parameters: {}
> > +:Returns: ↴
> > +
> > +::
> > +
> > > > +	struct kvmi_get_guest_info_reply {
> > > > +		__s32 err;
> > > > +		__u16 vcpu_count;
> > > > +		__u16 padding;
> > > > +		__u64 tsc_speed;
> > > > +	};
> > +
> > +Returns the number of online vcpus, and the TSC frequency in HZ, if supported
> > +by the architecture (otherwise is 0).
> 
> Can the TSC frequency be specified only if the guest is using TSC
> scaling?  Defining the TSC frequency on older hosts is a bit tricky.  0
> would be the host.
> 
> Maybe define the second padding to be
> 
> 	__u16 arch;
> 
> (0 = x86) followed by an arch-specific payload.

We use the TSC to compute some performance numbers, but only when we're
debugging reported issues. Should be OK if the TSC information is not
available. We'll manage in some other way.

> > +10. KVMI_GET_XSAVE_INFO
> > +-----------------------
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_xsave_info {
> > +		__u16 vcpu;
> > +		__u16 padding[3];
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_xsave_info_reply {
> > +		__s32 err;
> > +		__u32 size;
> > +	};
> > +
> > +Returns the xstate size for the specified vCPU.
> 
> Could this be replaced by a generic CPUID accessor?

Absolutely. I wonder if we should only made certain leafs acessible,
part of the whole "make the least amount of information available"
security philosophy, though I'm not aware of any attacks that can be
mounted on the host just by knowing too many things about a guest.

> > +11. KVMI_GET_PAGE_ACCESS
> > +------------------------
> > +
> > +:Architectures: all
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_get_page_access {
> > +		__u16 vcpu;
> > +		__u16 padding[3];
> > +		__u64 gpa;
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_get_page_access_reply {
> > +		__s32 err;
> > +		__u32 access;
> > +	};
> > +
> > +Returns the spte flags (rwx - present, write & user) for the specified
> > +vCPU and guest physical address.
> > +
> > +12. KVMI_SET_PAGE_ACCESS
> > +------------------------
> > +
> > +:Architectures: all
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_set_page_access {
> > +		__u16 vcpu;
> > +		__u16 padding;
> > +		__u32 access;
> > +		__u64 gpa;
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_error_code {
> > +		__s32 err;
> > +		__u32 padding;
> > +	};
> > +
> > +Sets the spte flags (rwx - present, write & user) - access - for the specified
> > +vCPU and guest physical address.
> 
> rwx or pwu?  I suppose RWX.  Maybe #define the constants in the
> documentation.
> 
> Also, it should be noted here that the spte flags are ANDed with
> whatever is provided by userspace, because there could be readonly
> memslots, and that some access combinations could fail (--x) or will
> surely fail (-wx for example).

We are closely looking into how KVM's MMU works and see how we'd go
about extending it so as to make everyone happy (qemu and a possible
introspection tool).

As for the permitted page access flags, we have (thus far) kept away
from invalid combinations on older arch-es (--x on pre-SandyBridge, for
example). However, it would be very nice to have arch-specific code
that does a validation ahead of time (ie. before the guest is re-
entered) so that an introspection tool can use an alternative approach.

> > +13. KVMI_INJECT_PAGE_FAULT
> > +--------------------------
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_page_fault {
> > +		__u16 vcpu;
> > +		__u16 padding;
> > +		__u32 error;
> 
> error_code, I guess?  Why not a generic inject exception message?

OK, we will rework the interface to be used for generic exception
injection. Maybe we can fold the breakpoint injection into it too.

> > +		__u64 gva;
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_error_code {
> > +		__s32 err;
> > +		__u32 padding;
> > +	};
> > +
> > +Injects a vCPU page fault with the specified guest virtual address and
> > +error code.
> > +
> > +5. KVMI_EVENT_USER_CALL
> > +-----------------------
> 
> Please rename this to KVMI_EVENT_HCALL or HYPERCALL or VMCALL.
> 
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_event_x86;
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_event_x86_reply;
> > +
> > +This event is sent on a user hypercall and the introspection has already
> > +already been enabled for this kind of event (KVMI_CONTROL_EVENTS).
> > +
> > +kvmi_event_x86 is sent to the introspector, which can respond with the
> > +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host
> > +kernel to override the general purpose registers using the values from
> > +introspector (regs).
> 
> Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall?  Why is
> KVMI_EVENT_ACTION_ALLOW not allowed?  As before, I'd prefer
> SKIP/RETRY/ALLOW/CRASH as the actions.

No, we use SET_REGS only as a means to communicate with the guest code
that did the hypercall. We don't specify an action because, in our
specific case, saw no use for it.

Apropos: if possible, we'd like to keep the Xen convention for this
hypercall. It doesn't appear to interfere with either KVM or Hyper-V
(rax = 34 [__HYPERVISOR_hvm_op], rbx/rdi = 24
[HVMOP_guest_request_vm_event] - depends on wether long mode is
enabled).

> > +7. KVMI_EVENT_TRAP
> > +------------------
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_event_x86;
> > +	struct kvmi_event_trap {
> > +		__u32 vector;
> > +		__u32 type;
> > +		__u32 err;
> 
> error_code?
> 
> > +		__u32 padding;
> > +		__u64 cr2;
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_event_x86_reply;
> > +
> > +This event is sent if a trap will be delivered to the guest (page fault,
> > +breakpoint, etc.) and the introspection has already enabled the reports
> > +for this kind of event (KVMI_CONTROL_EVENTS).
> > +
> > +This is used to inform the introspector of all pending traps giving it
> > +a chance to determine if it should try again later in case a previous
> > +KVMI_INJECT_PAGE_FAULT/KVMI_INJECT_BREAKPOINT command has been overwritten
> > +by an interrupt picked up during guest reentry.
> > +
> > +kvmi_event_x86, exception/interrupt number (vector), exception/interrupt
> > +type, exception code (err) and CR2 are sent to the introspector, which can
> > +respond with the KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing
> > +the host kernel to override the general purpose registers using the values
> > +from introspector (regs).
> 
> Here I think the actions could be RETRY/ALLOW/CRASH only, with "set
> regs" done as a separate command.
> 
> Some x86 exceptions are faults, others are traps.  Traps should not
> allow RETRY as an action.  There should be an input telling the
> introspector if retrying is allowed.
> 
> How are dr6/dr7 handled around breakpoints?

Afaik, the debug registers should remain untouched, but I will ask my
colleagues with better knowledge of this.

Thanks,

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-27 17:06     ` Mihai Donțu
@ 2017-07-27 17:18       ` Paolo Bonzini
  0 siblings, 0 replies; 50+ messages in thread
From: Paolo Bonzini @ 2017-07-27 17:18 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On 27/07/2017 19:06, Mihai Donțu wrote:
>>> +
>>> +	struct kvmi_xsave_info_reply {
>>> +		__s32 err;
>>> +		__u32 size;
>>> +	};
>>> +
>>> +Returns the xstate size for the specified vCPU.
>> Could this be replaced by a generic CPUID accessor?
> 
> Absolutely. I wonder if we should only made certain leafs acessible,
> part of the whole "make the least amount of information available"
> security philosophy, though I'm not aware of any attacks that can be
> mounted on the host just by knowing too many things about a guest.

That should be safe.

>> Also, it should be noted here that the spte flags are ANDed with
>> whatever is provided by userspace, because there could be readonly
>> memslots, and that some access combinations could fail (--x) or will
>> surely fail (-wx for example).
> We are closely looking into how KVM's MMU works and see how we'd go
> about extending it so as to make everyone happy (qemu and a possible
> introspection tool).
> 
> As for the permitted page access flags, we have (thus far) kept away
> from invalid combinations on older arch-es (--x on pre-SandyBridge, for
> example). However, it would be very nice to have arch-specific code
> that does a validation ahead of time (ie. before the guest is re-
> entered) so that an introspection tool can use an alternative approach.

We can make the command fail if you end up with --x (and it is not
supported) or -wx.

>> Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall?  Why is
>> KVMI_EVENT_ACTION_ALLOW not allowed?  As before, I'd prefer
>> SKIP/RETRY/ALLOW/CRASH as the actions.
>
> No, we use SET_REGS only as a means to communicate with the guest code
> that did the hypercall. We don't specify an action because, in our
> specific case, saw no use for it.

Ok, I'd prefer the usual set of actions just for consistency.  ALLOW
would fall through to KVM's handling of the hypercall.

> Apropos: if possible, we'd like to keep the Xen convention for this
> hypercall. It doesn't appear to interfere with either KVM or Hyper-V
> (rax = 34 [__HYPERVISOR_hvm_op], rbx/rdi = 24
> [HVMOP_guest_request_vm_event] - depends on wether long mode is
> enabled).

I was actually thinking that you'd trap all hypercalls to the
introspector, hence SKIP/RETRY/ALLOW/CRASH.  Reserving RAX = 34 for you
is fine (but please document the ABI and subfunctions).

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-27 16:52           ` Paolo Bonzini
@ 2017-07-27 17:19             ` Mihai Donțu
  2017-08-01 10:40               ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Mihai Donțu @ 2017-07-27 17:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On Thu, 2017-07-27 at 18:52 +0200, Paolo Bonzini wrote:
> On 27/07/2017 18:23, Mihai Donțu wrote:
> > On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote:
> > > On 13/07/2017 10:36, Mihai Donțu wrote:
> > > > On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
> > > > > Worse, KVM is not able to distinguish userspace that has paused the VM
> > > > > from userspace that is doing MMIO or userspace that has a bug and hung
> > > > > somewhere.  And even worse, there are cases where userspace wants to
> > > > > modify registers while doing port I/O (the awful VMware RPCI port).  So
> > > > > I'd rather avoid this.
> > > > 
> > > > I should give more details here: we don't need to pause the vCPU-s in
> > > > the sense widely understood but just prevent them from entering the
> > > > guest for a short period of time. In our particular case, we need this
> > > > when we start introspecting a VM that's already running. For this we
> > > > kick the vCPU-s out of the guest so that our scan of the memory does
> > > > not race with the guest kernel/applications.
> > > > 
> > > > Another use case is when we inject applications into a running guest.
> > > > We also kick the vCPU-s out while we atomically make changes to kernel
> > > > specific structures.
> > > 
> > > This is not possible to do in KVM, because KVM doesn't control what
> > > happens to the memory outside KVM_RUN (and of course it doesn't control
> > > devices doing DMA).  You need to talk to QEMU in order to do this.
> > 
> > Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on
> > the already opened file descriptor to /dev/kvm for an event?
> 
> Nope.  QEMU might be running and writing to memory in another thread.  I
> don't see how this can be reliable on other hypervisors too, actually.

I assume it largely depends on knowing what's possible to do and what
not with the guest memory even while the vCPU-s are suspended. The
price of breaking this rule will be something any KVMI user will have
to be very aware of.

We did quite a bit of fine-tunning to our application to avoid
interferring with the device manager (or passedthrough devices) and
also it helps when the hypervisor refuses to grant access to certain
memory areas that it knows are not safe to touch (even read) from
remote code.

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-07-27 17:19             ` Mihai Donțu
@ 2017-08-01 10:40               ` Paolo Bonzini
  2017-08-01 16:33                 ` Tamas K Lengyel
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-08-01 10:40 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On 27/07/2017 19:19, Mihai Donțu wrote:
> On Thu, 2017-07-27 at 18:52 +0200, Paolo Bonzini wrote:
>> On 27/07/2017 18:23, Mihai Donțu wrote:
>>> On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote:
>>>> On 13/07/2017 10:36, Mihai Donțu wrote:
>>>>> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
>>>>>> Worse, KVM is not able to distinguish userspace that has paused the VM
>>>>>> from userspace that is doing MMIO or userspace that has a bug and hung
>>>>>> somewhere.  And even worse, there are cases where userspace wants to
>>>>>> modify registers while doing port I/O (the awful VMware RPCI port).  So
>>>>>> I'd rather avoid this.
>>>>>
>>>>> I should give more details here: we don't need to pause the vCPU-s in
>>>>> the sense widely understood but just prevent them from entering the
>>>>> guest for a short period of time. In our particular case, we need this
>>>>> when we start introspecting a VM that's already running. For this we
>>>>> kick the vCPU-s out of the guest so that our scan of the memory does
>>>>> not race with the guest kernel/applications.
>>>>>
>>>>> Another use case is when we inject applications into a running guest.
>>>>> We also kick the vCPU-s out while we atomically make changes to kernel
>>>>> specific structures.
>>>>
>>>> This is not possible to do in KVM, because KVM doesn't control what
>>>> happens to the memory outside KVM_RUN (and of course it doesn't control
>>>> devices doing DMA).  You need to talk to QEMU in order to do this.
>>>
>>> Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on
>>> the already opened file descriptor to /dev/kvm for an event?
>>
>> Nope.  QEMU might be running and writing to memory in another thread.  I
>> don't see how this can be reliable on other hypervisors too, actually.
> 
> I assume it largely depends on knowing what's possible to do and what
> not with the guest memory even while the vCPU-s are suspended. The
> price of breaking this rule will be something any KVMI user will have
> to be very aware of.

If you actually pause the whole VM (through QEMU's monitor commands
"stop" and "cont") everything should be safe.  Of course there can be
bugs and PCI passthrough devices should be problematic, but in general
the device emulation is quiescent.  This however is not the case when
only the VCPUs are paused.

Paolo

> We did quite a bit of fine-tunning to our application to avoid
> interferring with the device manager (or passedthrough devices) and
> also it helps when the hypervisor refuses to grant access to certain
> memory areas that it knows are not safe to touch (even read) from
> remote code.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-01 10:40               ` Paolo Bonzini
@ 2017-08-01 16:33                 ` Tamas K Lengyel
  2017-08-01 20:47                   ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Tamas K Lengyel @ 2017-08-01 16:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Mihai Donțu, Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On Tue, Aug 1, 2017 at 4:40 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 27/07/2017 19:19, Mihai Donțu wrote:
>> On Thu, 2017-07-27 at 18:52 +0200, Paolo Bonzini wrote:
>>> On 27/07/2017 18:23, Mihai Donțu wrote:
>>>> On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote:
>>>>> On 13/07/2017 10:36, Mihai Donțu wrote:
>>>>>> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
>>>>>>> Worse, KVM is not able to distinguish userspace that has paused the VM
>>>>>>> from userspace that is doing MMIO or userspace that has a bug and hung
>>>>>>> somewhere.  And even worse, there are cases where userspace wants to
>>>>>>> modify registers while doing port I/O (the awful VMware RPCI port).  So
>>>>>>> I'd rather avoid this.
>>>>>>
>>>>>> I should give more details here: we don't need to pause the vCPU-s in
>>>>>> the sense widely understood but just prevent them from entering the
>>>>>> guest for a short period of time. In our particular case, we need this
>>>>>> when we start introspecting a VM that's already running. For this we
>>>>>> kick the vCPU-s out of the guest so that our scan of the memory does
>>>>>> not race with the guest kernel/applications.
>>>>>>
>>>>>> Another use case is when we inject applications into a running guest.
>>>>>> We also kick the vCPU-s out while we atomically make changes to kernel
>>>>>> specific structures.
>>>>>
>>>>> This is not possible to do in KVM, because KVM doesn't control what
>>>>> happens to the memory outside KVM_RUN (and of course it doesn't control
>>>>> devices doing DMA).  You need to talk to QEMU in order to do this.
>>>>
>>>> Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on
>>>> the already opened file descriptor to /dev/kvm for an event?
>>>
>>> Nope.  QEMU might be running and writing to memory in another thread.  I
>>> don't see how this can be reliable on other hypervisors too, actually.
>>
>> I assume it largely depends on knowing what's possible to do and what
>> not with the guest memory even while the vCPU-s are suspended. The
>> price of breaking this rule will be something any KVMI user will have
>> to be very aware of.
>
> If you actually pause the whole VM (through QEMU's monitor commands
> "stop" and "cont") everything should be safe.  Of course there can be
> bugs and PCI passthrough devices should be problematic, but in general
> the device emulation is quiescent.  This however is not the case when
> only the VCPUs are paused.

IMHO for some use-cases it is sufficient to have the guest itself be
limited in the modifications it makes to memory. So for example if
just a vCPU is paused there are areas of memory that you can interact
with without having to worry about it changing underneath the
introspecting application (ie. thread-specific datastructures like the
KPCR, etc..). If the introspecting application needs access to areas
that non-paused vCPUs may touch, or QEMU, or a pass-through device,
then it should be a decision for the introspecting app whether to
pause the VM completely. It may still choose to instead do some
error-detection on reads/writes to detect inconsistent accesses and
perhaps just re-try the operation till it succeeds. This may have less
of an impact on the performance of the VM as no full VM pause had to
be performed. It is all very application specific, so having options
is always a good thing.

Tamas

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-01 16:33                 ` Tamas K Lengyel
@ 2017-08-01 20:47                   ` Paolo Bonzini
  2017-08-02 11:52                     ` Mihai Donțu
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-08-01 20:47 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Mihai Donțu, Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm

On 01/08/2017 18:33, Tamas K Lengyel wrote:
>> If you actually pause the whole VM (through QEMU's monitor commands
>> "stop" and "cont") everything should be safe.  Of course there can be
>> bugs and PCI passthrough devices should be problematic, but in general
>> the device emulation is quiescent.  This however is not the case when
>> only the VCPUs are paused.
> IMHO for some use-cases it is sufficient to have the guest itself be
> limited in the modifications it makes to memory. So for example if
> just a vCPU is paused there are areas of memory that you can interact
> with without having to worry about it changing underneath the
> introspecting application (ie. thread-specific datastructures like the
> KPCR, etc..). If the introspecting application needs access to areas
> that non-paused vCPUs may touch, or QEMU, or a pass-through device,
> then it should be a decision for the introspecting app whether to
> pause the VM completely. It may still choose to instead do some
> error-detection on reads/writes to detect inconsistent accesses and
> perhaps just re-try the operation till it succeeds. This may have less
> of an impact on the performance of the VM as no full VM pause had to
> be performed. It is all very application specific, so having options
> is always a good thing.

Fair enough.  There is another issue however.

If a guest is runnnig in the kernel, it can be easily paused while KVMI
processes events and the like.

While a guest is outside the kernel, it could be running or paused.

If running, the value of a register might change before the VM reenters
execution (due to a reset, or due to ugly features such as the VMware
magic I/O port 0x5658).  So the introspector would probably prefer
anyway to do any changes while the guest is in the kernel: one idea I
had was a KVMI_PAUSE_VCPU command that replies with a KVMI_VCPU_PAUSED
event---then the introspector can send commands that do the required
patching and then restart the guest by replying to the event.

But if the guest is paused, KVMI_PAUSE_VCPU would never be processed.
So how could the introspector distinguish the two cases and avoid the
KVMI_PAUSE_VCPU if the guest is paused?

(There is another complication: the guest could be running with the APIC
emulated in userspace.  In that case, a VCPU doing "cli;hlt" spends
infinite time in userspace even though it's running, and KVM has no idea
why.  This is less common, but it's worth mentioning too).

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-01 20:47                   ` Paolo Bonzini
@ 2017-08-02 11:52                     ` Mihai Donțu
  2017-08-02 12:27                       ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Mihai Donțu @ 2017-08-02 11:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel

On Tue, 2017-08-01 at 22:47 +0200, Paolo Bonzini wrote:
> On 01/08/2017 18:33, Tamas K Lengyel wrote:
> > > If you actually pause the whole VM (through QEMU's monitor commands
> > > "stop" and "cont") everything should be safe.  Of course there can be
> > > bugs and PCI passthrough devices should be problematic, but in general
> > > the device emulation is quiescent.  This however is not the case when
> > > only the VCPUs are paused.
> > 
> > IMHO for some use-cases it is sufficient to have the guest itself be
> > limited in the modifications it makes to memory. So for example if
> > just a vCPU is paused there are areas of memory that you can interact
> > with without having to worry about it changing underneath the
> > introspecting application (ie. thread-specific datastructures like the
> > KPCR, etc..). If the introspecting application needs access to areas
> > that non-paused vCPUs may touch, or QEMU, or a pass-through device,
> > then it should be a decision for the introspecting app whether to
> > pause the VM completely. It may still choose to instead do some
> > error-detection on reads/writes to detect inconsistent accesses and
> > perhaps just re-try the operation till it succeeds. This may have less
> > of an impact on the performance of the VM as no full VM pause had to
> > be performed. It is all very application specific, so having options
> > is always a good thing.
> 
> Fair enough.  There is another issue however.
> 
> If a guest is runnnig in the kernel, it can be easily paused while KVMI
> processes events and the like.
> 
> While a guest is outside the kernel, it could be running or paused.
> 
> If running, the value of a register might change before the VM reenters
> execution (due to a reset, or due to ugly features such as the VMware
> magic I/O port 0x5658).  So the introspector would probably prefer
> anyway to do any changes while the guest is in the kernel: one idea I
> had was a KVMI_PAUSE_VCPU command that replies with a KVMI_VCPU_PAUSED
> event---then the introspector can send commands that do the required
> patching and then restart the guest by replying to the event.
> 
> But if the guest is paused, KVMI_PAUSE_VCPU would never be processed.
> So how could the introspector distinguish the two cases and avoid the
> KVMI_PAUSE_VCPU if the guest is paused?
> 
> (There is another complication: the guest could be running with the APIC
> emulated in userspace.  In that case, a VCPU doing "cli;hlt" spends
> infinite time in userspace even though it's running, and KVM has no idea
> why.  This is less common, but it's worth mentioning too).

I think it might help to distinguish two situations in which we require
the guest _or_ a single vCPU to be paused. Our initial KVMI_PAUSE_GUEST
command can be translated into a qemu pause. In our particular usecase
we made special arrangements to call it as few times as possible
assuming it's very costly. The other is needed only by the internal KVM
code for situations similar to:

  kvm_pause_vcpu(vcpu);
  vcpu_load(vcpu);
  kvm_arch_vcpu_ioctl_get_regs(vcpu, regs);
  vcpu_put(vcpu);
  kvm_unpause_vcpu(vcpu);

or more generally put, for accesses that involve the vCPU state
(registers, MSR-s, exceptions etc.), no guest memory involved.

Here kvm_pause_vcpu() will only pull the vCPU out of the guest and, if
so, make it somehow available for quick re-entry with
kvm_unpause_vcpu(). If said vCPU is already out, then the function will
be a no-op. Obviously, kvm_{pause,unpause}_vcpu() will do nothing if
we're currently handling an event or one is pending.

I hope this narrows down further the exact requirements.

One exception that might have a better solution is:

  kvm_pause_all_vcpus(kvm);
  kvm_set_page_access(kvm, gfn); /* pause for get too? */
  kvm_unpause_all_vcpus(kvm);

There might be a way to make the change and then IPI all vCPU-s without
pulling them out of the guest.

Regards,

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-02 11:52                     ` Mihai Donțu
@ 2017-08-02 12:27                       ` Paolo Bonzini
  2017-08-02 13:32                         ` Mihai Donțu
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-08-02 12:27 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel

On 02/08/2017 13:52, Mihai Donțu wrote:
> I think it might help to distinguish two situations in which we require
> the guest _or_ a single vCPU to be paused. Our initial KVMI_PAUSE_GUEST
> command can be translated into a qemu pause. In our particular usecase
> we made special arrangements to call it as few times as possible
> assuming it's very costly. The other is needed only by the internal KVM
> code for situations similar to:
> 
>   kvm_pause_vcpu(vcpu);
>   vcpu_load(vcpu);
>   kvm_arch_vcpu_ioctl_get_regs(vcpu, regs);
>   vcpu_put(vcpu);
>   kvm_unpause_vcpu(vcpu);
> 
> or more generally put, for accesses that involve the vCPU state
> (registers, MSR-s, exceptions etc.), no guest memory involved.
> 
> Here kvm_pause_vcpu() will only pull the vCPU out of the guest and, if
> so, make it somehow available for quick re-entry with
> kvm_unpause_vcpu(). If said vCPU is already out, then the function will
> be a no-op. Obviously, kvm_{pause,unpause}_vcpu() will do nothing if
> we're currently handling an event or one is pending.

Understood.  The issue is that there is an inherent race between
anything userspace is doing and get_regs.  What are the cases where you
need to get regs or similar outside an event?

> One exception that might have a better solution is:
> 
>   kvm_pause_all_vcpus(kvm);
>   kvm_set_page_access(kvm, gfn); /* pause for get too? */
>   kvm_unpause_all_vcpus(kvm);
> 
> There might be a way to make the change and then IPI all vCPU-s without
> pulling them out of the guest.

For that I think KVMI should define a VM-wide "mask" layered over the
actual memory map permissions.  Such a command can be implemented
relatively easily by hooking into the callers of __gfn_to_pfn_memslot
and kvm_vcpu_gfn_to_hva_prot.

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-02 12:27                       ` Paolo Bonzini
@ 2017-08-02 13:32                         ` Mihai Donțu
  2017-08-02 13:51                           ` Paolo Bonzini
  2017-08-02 13:53                           ` Mihai Donțu
  0 siblings, 2 replies; 50+ messages in thread
From: Mihai Donțu @ 2017-08-02 13:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel

On Wed, 2017-08-02 at 14:27 +0200, Paolo Bonzini wrote:
> On 02/08/2017 13:52, Mihai Donțu wrote:
> > I think it might help to distinguish two situations in which we require
> > the guest _or_ a single vCPU to be paused. Our initial KVMI_PAUSE_GUEST
> > command can be translated into a qemu pause. In our particular usecase
> > we made special arrangements to call it as few times as possible
> > assuming it's very costly. The other is needed only by the internal KVM
> > code for situations similar to:
> > 
> >   kvm_pause_vcpu(vcpu);
> >   vcpu_load(vcpu);
> >   kvm_arch_vcpu_ioctl_get_regs(vcpu, regs);
> >   vcpu_put(vcpu);
> >   kvm_unpause_vcpu(vcpu);
> > 
> > or more generally put, for accesses that involve the vCPU state
> > (registers, MSR-s, exceptions etc.), no guest memory involved.
> > 
> > Here kvm_pause_vcpu() will only pull the vCPU out of the guest and, if
> > so, make it somehow available for quick re-entry with
> > kvm_unpause_vcpu(). If said vCPU is already out, then the function will
> > be a no-op. Obviously, kvm_{pause,unpause}_vcpu() will do nothing if
> > we're currently handling an event or one is pending.
> 
> Understood.  The issue is that there is an inherent race between
> anything userspace is doing and get_regs.  What are the cases where you
> need to get regs or similar outside an event?

We have currently identified three cases:

 * initial hooking of a guest
 * periodically checking the integrity of data that is not properly
   placed into a page and thus cannot be efficiently tracked via SPT
 * injecting processes

> > One exception that might have a better solution is:
> > 
> >   kvm_pause_all_vcpus(kvm);
> >   kvm_set_page_access(kvm, gfn); /* pause for get too? */
> >   kvm_unpause_all_vcpus(kvm);
> > 
> > There might be a way to make the change and then IPI all vCPU-s without
> > pulling them out of the guest.
> 
> For that I think KVMI should define a VM-wide "mask" layered over the
> actual memory map permissions.  Such a command can be implemented
> relatively easily by hooking into the callers of __gfn_to_pfn_memslot
> and kvm_vcpu_gfn_to_hva_prot.

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-02 13:32                         ` Mihai Donțu
@ 2017-08-02 13:51                           ` Paolo Bonzini
  2017-08-02 14:17                             ` Mihai Donțu
  2017-08-02 13:53                           ` Mihai Donțu
  1 sibling, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-08-02 13:51 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel

On 02/08/2017 15:32, Mihai Donțu wrote:
> We have currently identified three cases:
> 
>  * initial hooking of a guest

What triggers the initial hooking, and how is it done?

>  * periodically checking the integrity of data that is not properly
>    placed into a page and thus cannot be efficiently tracked via SPT

This only needs read memory (and it's okay for it to race against DMA
because it's periodic).

>  * injecting processes

This also doesn't need pause.  IIRC you put a breakpoint somewhere, or
make a page non-executable, to ensure the guest doesn't get in the way.
DMA can still get in the way, but that can happen anyway right after
process injection so it's not an issue.

Have you thought about monitoring hardware registers, for example in
order to check that IOMMU page tables protect from overwriting the kernel?

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-02 13:32                         ` Mihai Donțu
  2017-08-02 13:51                           ` Paolo Bonzini
@ 2017-08-02 13:53                           ` Mihai Donțu
  1 sibling, 0 replies; 50+ messages in thread
From: Mihai Donțu @ 2017-08-02 13:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel

On Wed, 2017-08-02 at 16:32 +0300, Mihai Donțu wrote:
> On Wed, 2017-08-02 at 14:27 +0200, Paolo Bonzini wrote:
> > On 02/08/2017 13:52, Mihai Donțu wrote:
> > > I think it might help to distinguish two situations in which we require
> > > the guest _or_ a single vCPU to be paused. Our initial KVMI_PAUSE_GUEST
> > > command can be translated into a qemu pause. In our particular usecase
> > > we made special arrangements to call it as few times as possible
> > > assuming it's very costly. The other is needed only by the internal KVM
> > > code for situations similar to:
> > > 
> > >   kvm_pause_vcpu(vcpu);
> > >   vcpu_load(vcpu);
> > >   kvm_arch_vcpu_ioctl_get_regs(vcpu, regs);
> > >   vcpu_put(vcpu);
> > >   kvm_unpause_vcpu(vcpu);
> > > 
> > > or more generally put, for accesses that involve the vCPU state
> > > (registers, MSR-s, exceptions etc.), no guest memory involved.
> > > 
> > > Here kvm_pause_vcpu() will only pull the vCPU out of the guest and, if
> > > so, make it somehow available for quick re-entry with
> > > kvm_unpause_vcpu(). If said vCPU is already out, then the function will
> > > be a no-op. Obviously, kvm_{pause,unpause}_vcpu() will do nothing if
> > > we're currently handling an event or one is pending.
> > 
> > Understood.  The issue is that there is an inherent race between
> > anything userspace is doing and get_regs.  What are the cases where you
> > need to get regs or similar outside an event?
> 
> We have currently identified three cases:
> 
>  * initial hooking of a guest
>  * periodically checking the integrity of data that is not properly
>    placed into a page and thus cannot be efficiently tracked via SPT
>  * injecting processes

A few more details are required here, taken from our current
application: cases (a) and (c) involve poking the vCPU-s a couple of
times before doing KVMI_PAUSE_GUEST() after which everything should be
OK, though we should make sure qemu sync-s the states before entering
the paused state.

Case (b), as far as I can see, is implemented only via memory maps (eg.
the IDT is mapped and kept around for quick checking), but a need
_might_ arrise to poke the vCPU a few times before concluding that an
area has really been corrupted.

> > > One exception that might have a better solution is:
> > > 
> > >   kvm_pause_all_vcpus(kvm);
> > >   kvm_set_page_access(kvm, gfn); /* pause for get too? */
> > >   kvm_unpause_all_vcpus(kvm);
> > > 
> > > There might be a way to make the change and then IPI all vCPU-s without
> > > pulling them out of the guest.
> > 
> > For that I think KVMI should define a VM-wide "mask" layered over the
> > actual memory map permissions.  Such a command can be implemented
> > relatively easily by hooking into the callers of __gfn_to_pfn_memslot
> > and kvm_vcpu_gfn_to_hva_prot.

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-02 13:51                           ` Paolo Bonzini
@ 2017-08-02 14:17                             ` Mihai Donțu
  2017-08-04  8:35                               ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Mihai Donțu @ 2017-08-02 14:17 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel, Andrei LUTAS

On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote:
> On 02/08/2017 15:32, Mihai Donțu wrote:
> > We have currently identified three cases:
> > 
> >  * initial hooking of a guest
> 
> What triggers the initial hooking, and how is it done?

There are two types o hooks: dynamic (the guest is hooked as it boots)
and static (a fully booted guest is being hooked). They both start with
a notification from qemu or some other application that a guest is
available for introspection. After that we poke its vCPU-s a few times
to determine what type of hooking to continue with. I belive the
syscall entry point MSR-s are key here.

> >  * periodically checking the integrity of data that is not properly
> >    placed into a page and thus cannot be efficiently tracked via SPT
> 
> This only needs read memory (and it's okay for it to race against DMA
> because it's periodic).

I just looked through some traces (the logic changed quite a bit since
I last checked) and looks entirely based on memory reads right now.

> >  * injecting processes
> 
> This also doesn't need pause.  IIRC you put a breakpoint somewhere, or
> make a page non-executable, to ensure the guest doesn't get in the way.
> DMA can still get in the way, but that can happen anyway right after
> process injection so it's not an issue.

That might be a very possible approach. The code we have in place now
pauses the guest entirely, though.

I have added in CC a colleague of mine, Andrei Luțaș. He leads the
development of the introspection technology, irrespective of the
hypervisor. Adalbert and I only work on bridging it with KVM. :-) I'll
kindly ask him to help with more details wherever you feel my
explanations were not sufficient.

> Have you thought about monitoring hardware registers, for example in
> order to check that IOMMU page tables protect from overwriting the kernel?

Sorry, but I'm not sure I understand: are you thinking at a way to make
sure none of the IOMMU grups are configured with a "too generous" DMA
window?

Regards,

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-02 14:17                             ` Mihai Donțu
@ 2017-08-04  8:35                               ` Paolo Bonzini
  2017-08-04 15:29                                 ` Mihai Donțu
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-08-04  8:35 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel, Andrei LUTAS

On 02/08/2017 16:17, Mihai Donțu wrote:
> On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote:
>> On 02/08/2017 15:32, Mihai Donțu wrote:
>>> We have currently identified three cases:
>>>
>>>  * initial hooking of a guest
>>
>> What triggers the initial hooking, and how is it done?
> 
> There are two types o hooks: dynamic (the guest is hooked as it boots)
> and static (a fully booted guest is being hooked). They both start with
> a notification from qemu or some other application that a guest is
> available for introspection. After that we poke its vCPU-s a few times
> to determine what type of hooking to continue with. I belive the
> syscall entry point MSR-s are key here.

Reads need not be transactional here, and the syscall entry point MSRs
are generally immutable so I think it is fine not to pause.

I'm curious how the introspector decides that the guest is ready to be
hooked in, that is, that it's far enough in the boot process.

I think a command to "pause" KVM_RUN is okay to add.  That is, if in
KVM_RUN it would e.g. return 1, trigger a 'paused' event immediately and
halt the vCPU.  If not in KVM_RUN, the command would return 0 and
trigger the 'paused' event as soon as the hypervisor invokes KVM_RUN.

The introspector then can decide whether to wait if the commands return
0.  There is no need for an "unpause" command, as that is achieved
simply by completing the 'paused' event.

>> Have you thought about monitoring hardware registers, for example in
>> order to check that IOMMU page tables protect from overwriting the kernel?
> 
> Sorry, but I'm not sure I understand: are you thinking at a way to make
> sure none of the IOMMU grups are configured with a "too generous" DMA
> window?

Yes, exactly.  Optional, of course.

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-04  8:35                               ` Paolo Bonzini
@ 2017-08-04 15:29                                 ` Mihai Donțu
  2017-08-04 15:37                                   ` Paolo Bonzini
  2017-08-05  8:00                                   ` Andrei Vlad LUTAS
  0 siblings, 2 replies; 50+ messages in thread
From: Mihai Donțu @ 2017-08-04 15:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel, Andrei LUTAS

On Fri, 2017-08-04 at 10:35 +0200, Paolo Bonzini wrote:
> On 02/08/2017 16:17, Mihai Donțu wrote:
> > On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote:
> > > On 02/08/2017 15:32, Mihai Donțu wrote:
> > > > We have currently identified three cases:
> > > > 
> > > >  * initial hooking of a guest
> > > 
> > > What triggers the initial hooking, and how is it done?
> > 
> > There are two types o hooks: dynamic (the guest is hooked as it boots)
> > and static (a fully booted guest is being hooked). They both start with
> > a notification from qemu or some other application that a guest is
> > available for introspection. After that we poke its vCPU-s a few times
> > to determine what type of hooking to continue with. I belive the
> > syscall entry point MSR-s are key here.
> 
> Reads need not be transactional here, and the syscall entry point MSRs
> are generally immutable so I think it is fine not to pause.

I might be misunderstanding. Entry point MSR-s (and maybe others) are
generally immutable in well behaving guests. We are, however, looking
for signs that something is breaking this pattern.

> I'm curious how the introspector decides that the guest is ready to be
> hooked in, that is, that it's far enough in the boot process.

I will let Andrei add some details here.

> I think a command to "pause" KVM_RUN is okay to add.  That is, if in
> KVM_RUN it would e.g. return 1, trigger a 'paused' event immediately and
> halt the vCPU.  If not in KVM_RUN, the command would return 0 and
> trigger the 'paused' event as soon as the hypervisor invokes KVM_RUN.
> 
> The introspector then can decide whether to wait if the commands return
> 0.  There is no need for an "unpause" command, as that is achieved
> simply by completing the 'paused' event.

This mechanism will allow exposing a KVMI_PAUSE_VCPU to introspectors,
something that maybe some future users can leverage. I, however, was
trying to justify a "slow" KVMI_PAUSE_VM and "fast" kvm_pause_vcpu(),
the latter existing only in KVM (kernel). The event-based mecanism for
pausing a vCPU feels like it has too much overhead for what one usually
needs it (read some registers).

> > > Have you thought about monitoring hardware registers, for example in
> > > order to check that IOMMU page tables protect from overwriting the kernel?
> > 
> > Sorry, but I'm not sure I understand: are you thinking at a way to make
> > sure none of the IOMMU grups are configured with a "too generous" DMA
> > window?
> 
> Yes, exactly.  Optional, of course.

We could add a command that returns the list of DMA ranges which can
then be used by an introspector to check if the OS has made a mistake
and placed sensible data in that rage.

Also, I belive we have refined a number of ideas on this thread and
more remain to clarify. I would like to update the design document with
what we have agreed on so far, add some more details to what felt
'under explained' and continue again from there. Is it OK for you?

Regards,

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-04 15:29                                 ` Mihai Donțu
@ 2017-08-04 15:37                                   ` Paolo Bonzini
  2017-08-05  8:00                                   ` Andrei Vlad LUTAS
  1 sibling, 0 replies; 50+ messages in thread
From: Paolo Bonzini @ 2017-08-04 15:37 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel, Andrei LUTAS

On 04/08/2017 17:29, Mihai Donțu wrote:
> On Fri, 2017-08-04 at 10:35 +0200, Paolo Bonzini wrote:
>> On 02/08/2017 16:17, Mihai Donțu wrote:
>>> On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote:
>>>> On 02/08/2017 15:32, Mihai Donțu wrote:
>>>>> We have currently identified three cases:
>>>>>
>>>>>  * initial hooking of a guest
>>>>
>>>> What triggers the initial hooking, and how is it done?
>>>
>>> There are two types o hooks: dynamic (the guest is hooked as it boots)
>>> and static (a fully booted guest is being hooked). They both start with
>>> a notification from qemu or some other application that a guest is
>>> available for introspection. After that we poke its vCPU-s a few times
>>> to determine what type of hooking to continue with. I belive the
>>> syscall entry point MSR-s are key here.
>>
>> Reads need not be transactional here, and the syscall entry point MSRs
>> are generally immutable so I think it is fine not to pause.
> 
> I might be misunderstanding. Entry point MSR-s (and maybe others) are
> generally immutable in well behaving guests. We are, however, looking
> for signs that something is breaking this pattern.

Right but you can always set up an MSR write intercept even before the
guest starts.  The MSRs are written rarely-enough that the intercept
doesn't add any noticeable overhead.

Reading the MSR while the guest runs is okay.  If races are important,
the introspectors can detect them through the intercepts it sets up.  In
many cases races are not an issue even (for example, in MTRRs).

>> I think a command to "pause" KVM_RUN is okay to add.  That is, if in
>> KVM_RUN it would e.g. return 1, trigger a 'paused' event immediately and
>> halt the vCPU.  If not in KVM_RUN, the command would return 0 and
>> trigger the 'paused' event as soon as the hypervisor invokes KVM_RUN.
>>
>> The introspector then can decide whether to wait if the commands return
>> 0.  There is no need for an "unpause" command, as that is achieved
>> simply by completing the 'paused' event.
> 
> This mechanism will allow exposing a KVMI_PAUSE_VCPU to introspectors,
> something that maybe some future users can leverage. I, however, was
> trying to justify a "slow" KVMI_PAUSE_VM and "fast" kvm_pause_vcpu(),
> the latter existing only in KVM (kernel). The event-based mecanism for
> pausing a vCPU feels like it has too much overhead for what one usually
> needs it (read some registers).

My point is that usually even register reads (actually MSRs) do not
require a pause.  The event-based mechanism would be used only for the
initial hooking.

KVMI_PAUSE_VM would require QEMU communication, wouldn't it?

> Also, I belive we have refined a number of ideas on this thread and
> more remain to clarify. I would like to update the design document with
> what we have agreed on so far, add some more details to what felt
> 'under explained' and continue again from there. Is it OK for you?

Yes.  Pausing is the last issue that needs to be clarified.

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-04 15:29                                 ` Mihai Donțu
  2017-08-04 15:37                                   ` Paolo Bonzini
@ 2017-08-05  8:00                                   ` Andrei Vlad LUTAS
  2017-08-07 12:18                                     ` Paolo Bonzini
  1 sibling, 1 reply; 50+ messages in thread
From: Andrei Vlad LUTAS @ 2017-08-05  8:00 UTC (permalink / raw)
  To: Mihai Donțu, Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel

> -----Original Message-----
> From: Mihai Donțu [mailto:mdontu@bitdefender.com]
> Sent: 4 August, 2017 18:30
> To: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>; Jan Kiszka
> <jan.kiszka@siemens.com>; Stefan Hajnoczi <stefanha@redhat.com>;
> Adalbert Lazar <alazar@bitdefender.com>; kvm@vger.kernel.org; Tamas K
> Lengyel <tamas.k.lengyel@gmail.com>; Andrei Vlad LUTAS
> <vlutas@bitdefender.com>
> Subject: Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API
> header for VM introspection
>
> On Fri, 2017-08-04 at 10:35 +0200, Paolo Bonzini wrote:
> > On 02/08/2017 16:17, Mihai Donțu wrote:
> > > On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote:
> > > > On 02/08/2017 15:32, Mihai Donțu wrote:
> > > > > We have currently identified three cases:
> > > > >
> > > > >  * initial hooking of a guest
> > > >
> > > > What triggers the initial hooking, and how is it done?
> > >
> > > There are two types o hooks: dynamic (the guest is hooked as it
> > > boots) and static (a fully booted guest is being hooked). They both
> > > start with a notification from qemu or some other application that a
> > > guest is available for introspection. After that we poke its vCPU-s
> > > a few times to determine what type of hooking to continue with. I
> > > belive the syscall entry point MSR-s are key here.
> >
> > Reads need not be transactional here, and the syscall entry point MSRs
> > are generally immutable so I think it is fine not to pause.
>
> I might be misunderstanding. Entry point MSR-s (and maybe others) are
> generally immutable in well behaving guests. We are, however, looking for
> signs that something is breaking this pattern.
>
> > I'm curious how the introspector decides that the guest is ready to be
> > hooked in, that is, that it's far enough in the boot process.
>
> I will let Andrei add some details here.
>
The first steps of enabling the protection for a guest depends on hardware events (like Mihai said, MSR writes, to give one example). Depending on what MSRs are written & the values they are written with, HVI will be able to identify, based on analyzing the guest memory, the type  of the running OS (Windows x86, Windows x64, Linux, etc.) - of course, how exactly the protection is enabled from that point on (analyzing the SYSCALL/interrupt handlers, other events such as XSETBV, etc.) is OS dependent, and there isn't a single correct answer.

The necessity to pause the running VCPUs for short intervals of time comes in several cases:
- when receiving the first relevant events, such as the SYSCALL MSR writes, we want to ensure no other VCPU could be running that might interfere with memory that we analyse (that might include not-yet-known writable pages)
- when establishing certain intercepts inside the guest memory, we need to briefly modify it, and to do so safely, we pause all the VCPUs, to make sure we don't race with the guest
- when ensuring the integrity of certain structures that already lie inside writable pages (so intercepting writes via an EPT/NPT is not doable, because the induced performance overhead would be huge), so that no other VCPU modifies the said pages while we make sure they haven't been tampered with

Of course, just how Paolo suggested, we can place finer-grained intercepts (such as execute-protect a page in order to ensure no VCPU runs code from it while we modify it), but this is a more complicated solution and we've never had to think for something other than simply pausing the VCPUs, since that was always available so far.

Hope this piece of info helps.

> > I think a command to "pause" KVM_RUN is okay to add.  That is, if in
> > KVM_RUN it would e.g. return 1, trigger a 'paused' event immediately
> > and halt the vCPU.  If not in KVM_RUN, the command would return 0 and
> > trigger the 'paused' event as soon as the hypervisor invokes KVM_RUN.
> >
> > The introspector then can decide whether to wait if the commands
> > return 0.  There is no need for an "unpause" command, as that is
> > achieved simply by completing the 'paused' event.
>
> This mechanism will allow exposing a KVMI_PAUSE_VCPU to introspectors,
> something that maybe some future users can leverage. I, however, was
> trying to justify a "slow" KVMI_PAUSE_VM and "fast" kvm_pause_vcpu(),
> the latter existing only in KVM (kernel). The event-based mecanism for
> pausing a vCPU feels like it has too much overhead for what one usually
> needs it (read some registers).
>
> > > > Have you thought about monitoring hardware registers, for example
> > > > in order to check that IOMMU page tables protect from overwriting the
> kernel?
> > >
> > > Sorry, but I'm not sure I understand: are you thinking at a way to
> > > make sure none of the IOMMU grups are configured with a "too
> > > generous" DMA window?
> >
> > Yes, exactly.  Optional, of course.
>
> We could add a command that returns the list of DMA ranges which can then
> be used by an introspector to check if the OS has made a mistake and placed
> sensible data in that rage.
>
> Also, I belive we have refined a number of ideas on this thread and more
> remain to clarify. I would like to update the design document with what we
> have agreed on so far, add some more details to what felt 'under explained'
> and continue again from there. Is it OK for you?
>
> Regards,
>
> --
> Mihai Donțu
>
>
> ________________________
> This email was scanned by Bitdefender

Best regards,
Andrei.

________________________
This email was scanned by Bitdefender

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-05  8:00                                   ` Andrei Vlad LUTAS
@ 2017-08-07 12:18                                     ` Paolo Bonzini
  2017-08-07 13:25                                       ` Mihai Donțu
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-08-07 12:18 UTC (permalink / raw)
  To: Andrei Vlad LUTAS, Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel

On 05/08/2017 10:00, Andrei Vlad LUTAS wrote:
> Of course, just how Paolo suggested, we can place finer-grained
> intercepts (such as execute-protect a page in order to ensure no VCPU
> runs code from it while we modify it), but this is a more complicated
> solution and we've never had to think for something other than simply
> pausing the VCPUs, since that was always available so far.
> 
> Hope this piece of info helps.

We can certainly add a "pause the VCPU with a given id" command.  The
command reports its success with an event, and replying to the event
restarts the VCPU.  If the VCPU is currently in userspace, the event
would be delayed until the next time KVM is re-entered, but this should
not be an issue in general.  The introspector can operate as if the VCPU
was paused.

"Pause all VCPUs and stop all DMA" would definitely be a layering
violation, so it cannot be added.

"Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
a given id" commands.  I lean towards omitting it.

However, now that I'm thinking of it, we need a new event for "new VCPU
created".  When the event is enabled, newly-created VCPUs should be in
paused mode.

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-07 12:18                                     ` Paolo Bonzini
@ 2017-08-07 13:25                                       ` Mihai Donțu
  2017-08-07 13:49                                         ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Mihai Donțu @ 2017-08-07 13:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel, Andrei Vlad LUTAS

On Mon, 2017-08-07 at 14:18 +0200, Paolo Bonzini wrote:
> On 05/08/2017 10:00, Andrei Vlad LUTAS wrote:
> > Of course, just how Paolo suggested, we can place finer-grained
> > intercepts (such as execute-protect a page in order to ensure no VCPU
> > runs code from it while we modify it), but this is a more complicated
> > solution and we've never had to think for something other than simply
> > pausing the VCPUs, since that was always available so far.
> > 
> > Hope this piece of info helps.
> 
> We can certainly add a "pause the VCPU with a given id" command.  The
> command reports its success with an event, and replying to the event
> restarts the VCPU.  If the VCPU is currently in userspace, the event
> would be delayed until the next time KVM is re-entered, but this should
> not be an issue in general.  The introspector can operate as if the VCPU
> was paused.

I have a plan to modify our application a bit and see how often we
query a vCPU outside an event handler. If it's seldom enough, a command
that pauses the vCPU and triggers an event should be just fine.

> "Pause all VCPUs and stop all DMA" would definitely be a layering
> violation, so it cannot be added.
> 
> "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
> a given id" commands.  I lean towards omitting it.

The case where the introspector wants to scan the guest memory needs a
KVMI_PAUSE_VM, which as discussed in a previous email, can be the
actual qemu 'pause' command. However, we would like to limit the
communication channels we have with the host and not use qmp (or
libvirt/etc. if qmp is not exposed). Instead, have a command that
triggers a KVM_RUN exit to qemu which in turn will call the underlying
pause function used by qmp. Would that be OK with you?

> However, now that I'm thinking of it, we need a new event for "new VCPU
> created".  When the event is enabled, newly-created VCPUs should be in
> paused mode.

I assume you are thinking about vCPU hotplug here. If so, yes, an event
that gives the introspector the chance to update its internal
bookkeeping would be useful.

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-07 13:25                                       ` Mihai Donțu
@ 2017-08-07 13:49                                         ` Paolo Bonzini
  2017-08-07 14:12                                           ` Mihai Donțu
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-08-07 13:49 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel, Andrei Vlad LUTAS

On 07/08/2017 15:25, Mihai Donțu wrote:
>> "Pause all VCPUs and stop all DMA" would definitely be a layering
>> violation, so it cannot be added.
>>
>> "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
>> a given id" commands.  I lean towards omitting it.
> 
> The case where the introspector wants to scan the guest memory needs a
> KVMI_PAUSE_VM, which as discussed in a previous email, can be the
> actual qemu 'pause' command.

Do you mean it needs to stop DMA as well?

> However, we would like to limit the
> communication channels we have with the host and not use qmp (or
> libvirt/etc. if qmp is not exposed). Instead, have a command that
> triggers a KVM_RUN exit to qemu which in turn will call the underlying
> pause function used by qmp. Would that be OK with you?

You would have to send back something on completion, and then I am
worried of races and deadlocks.  Plus, pausing a VM at the QEMU level is
a really expensive operation, so I don't think it's a good idea to let
the introspector do this.  You can pause all VCPUs, or use memory page
permissions.

>> However, now that I'm thinking of it, we need a new event for "new VCPU
>> created".  When the event is enabled, newly-created VCPUs should be in
>> paused mode.
> 
> I assume you are thinking about vCPU hotplug here. If so, yes, an event
> that gives the introspector the chance to update its internal
> bookkeeping would be useful.

Yes, exactly.

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-07 13:49                                         ` Paolo Bonzini
@ 2017-08-07 14:12                                           ` Mihai Donțu
  2017-08-07 15:56                                             ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Mihai Donțu @ 2017-08-07 14:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel, Andrei Vlad LUTAS

On Mon, 2017-08-07 at 15:49 +0200, Paolo Bonzini wrote:
> On 07/08/2017 15:25, Mihai Donțu wrote:
> > > "Pause all VCPUs and stop all DMA" would definitely be a layering
> > > violation, so it cannot be added.
> > > 
> > > "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
> > > a given id" commands.  I lean towards omitting it.
> > 
> > The case where the introspector wants to scan the guest memory needs a
> > KVMI_PAUSE_VM, which as discussed in a previous email, can be the
> > actual qemu 'pause' command.
> 
> Do you mean it needs to stop DMA as well?

No, DMA can proceed normally. I remain of the opinion that KVMI users
must know what guest memory ranges are OK to access by looking at MTRR-
s, PAT or guest kernel structures, or a combination of all three.

> > However, we would like to limit the
> > communication channels we have with the host and not use qmp (or
> > libvirt/etc. if qmp is not exposed). Instead, have a command that
> > triggers a KVM_RUN exit to qemu which in turn will call the underlying
> > pause function used by qmp. Would that be OK with you?
> 
> You would have to send back something on completion, and then I am
> worried of races and deadlocks.  Plus, pausing a VM at the QEMU level is
> a really expensive operation, so I don't think it's a good idea to let
> the introspector do this.  You can pause all VCPUs, or use memory page
> permissions.

Pausing all vCPU-s was my first thought, I was just trying to follow
your statement: "I lean towards omitting it". :-)

It will take a bit of user-space-fu, in that after issuing N vCPU pause
commands in a row we will have to wait for N events, which might race
with other events (MSR, CRx etc.) which need handling otherwise the
pause ones will not arrive ... I wonder if there's a way to do this
cleanly in kernel (piggyback on the code for pausing a single vCPU and
then somehow 'coalesce' all pause events into a single
KVMI_EVENT_VCPUS_PAUSED).

> > > However, now that I'm thinking of it, we need a new event for "new VCPU
> > > created".  When the event is enabled, newly-created VCPUs should be in
> > > paused mode.
> > 
> > I assume you are thinking about vCPU hotplug here. If so, yes, an event
> > that gives the introspector the chance to update its internal
> > bookkeeping would be useful.
> 
> Yes, exactly.

OK. We will update the design document.

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/1] VM introspection
  2017-07-07 17:29 ` [RFC PATCH v2 0/1] " Paolo Bonzini
@ 2017-08-07 15:28   ` Mihai Donțu
  2017-08-07 15:44     ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Mihai Donțu @ 2017-08-07 15:28 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, KVM list, Andy Lutomirski,
	Adalbert Lazar

On Fri, 2017-07-07 at 19:29 +0200, Paolo Bonzini wrote:
> On 07/07/2017 16:34, Adalbert Lazar wrote:
> > One bit of code that has passed (maybe) unnoticed in the RFC is a new
> > function added to Linux' mm called vm_replace_page() which, much like KSM's
> > replace_page(), gets two processes to share a page (read-write, no-COW):
> > 
> > https://marc.info/?l=kvm&m=149762056518799&w=2
> > 
> > This is used to quickly scan and patch the guest software.
> 
> Thanks for pointing this out.
> 
> In my review of patch 1 I suggested using only read/write, but it's slow.
> 
> I think we need to figure out a safe way to map foreign memory, as I'm
> worried of TOC/TOU races for obvious reasons.

Would it be possible to describe the race you are referring to? I would
imagine that grabbing a hold of the task and/or mm descriptor would be
enough to ensure the gfn is available for mapping (and locking), though
it's true that qemu might (does it ever?) shuffle things around so the
introspector would end up with a mapping pointing to the wrong thing.

> One thing I was thinking about (but didn't have much time to completely
> think through) is a special /dev/kvmmem device, where you could do
> 
>     kvmmem_fd = open("/dev/kvmmem", O_RDWR);
>     ptr = ioctl(kvmmem_fd, KVMMEM_MAP_MEMORY, { token, size });
>     ioctl(kvmmem_fd, KVMMEM_UNMAP_MEMORY, { ptr, size });

The idea with a device solves one of our problems with
KVMI_MAP_PHYSICAL_PAGE_TO_GUEST: the introspector has to undo the
mapping, otherwise the VM will crash when the kernel reuses the gfn.
This does not happen if the introspector crashes ...

We can make it so that when kvmmem_fd is closed, all mappings are
automatically undone.

> The map/unmap memory operation would be a hypercall, not a socket
> command, but the random "token" would be returned on the socket via some
> KVMI_MAP_PHYSICAL_PAGE_TO_GUEST command (or more accurately, a
> replacement accepting {gpa, size} instead of {gpa, gfn_dest}).  Handles
> can be short lived, e.g. you could have at most a small number tokens
> per host created (and passed back via KVMI) but not yet used by the
> hypercall.  Once it's used by the hypercall, the token is not needed
> anymore, so this is not a strong limitation.

I understand that the token is used to make sure the mapping hypercall
is only used by VM-s and applications doing introspection.

> After KVMMEM_MAP_MEMORY, you'd get a SIGSEGV if the guest memory layout
> changes (userfaultfd can be used by the introspector to simplify the
> handling and retry). You'd have to re-map the memory explicitly.

"guest memory layout changes" - is this the guest _being_ introspected?
If so, how would an introspector running in a separate VM get a
SIGSEGV? Have KVM inject a page fault if the host encounters a #PF for
the VMA representing the foreign mapping (presumably it knows the gva
as well)?

> Alas I have no idea how to verify the handle securely on the host, since
> the host is not supposed to know which guests are introspectors and
> which host got which token.  But maybe if the token namespace is big
> enough (256 bits?) and random, it's okay to ignore the possibility that
> a guest tries to guess.  (This idea is roughly based on how SCSI
> offloaded copies work).
> 
> Andy, does it look like utter BS or could it have some merit?

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 0/1] VM introspection
  2017-08-07 15:28   ` Mihai Donțu
@ 2017-08-07 15:44     ` Paolo Bonzini
  0 siblings, 0 replies; 50+ messages in thread
From: Paolo Bonzini @ 2017-08-07 15:44 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, KVM list, Andy Lutomirski,
	Adalbert Lazar

On 07/08/2017 17:28, Mihai Donțu wrote:
> On Fri, 2017-07-07 at 19:29 +0200, Paolo Bonzini wrote:
>> On 07/07/2017 16:34, Adalbert Lazar wrote:
>>> One bit of code that has passed (maybe) unnoticed in the RFC is a new
>>> function added to Linux' mm called vm_replace_page() which, much like KSM's
>>> replace_page(), gets two processes to share a page (read-write, no-COW):
>>>
>>> https://marc.info/?l=kvm&m=149762056518799&w=2
>>>
>>> This is used to quickly scan and patch the guest software.
>>
>> Thanks for pointing this out.
>>
>> In my review of patch 1 I suggested using only read/write, but it's slow.
>>
>> I think we need to figure out a safe way to map foreign memory, as I'm
>> worried of TOC/TOU races for obvious reasons.
> 
> Would it be possible to describe the race you are referring to? I would
> imagine that grabbing a hold of the task and/or mm descriptor would be
> enough to ensure the gfn is available for mapping (and locking), though
> it's true that qemu might (does it ever?) shuffle things around so the
> introspector would end up with a mapping pointing to the wrong thing.

Yes, QEMU can shuffle things around.  Usually it's BARs that are
shuffled around, but there can be special cases where RAM changes in the
memory map (for example SMRAM).

>> One thing I was thinking about (but didn't have much time to completely
>> think through) is a special /dev/kvmmem device, where you could do
>>
>>     kvmmem_fd = open("/dev/kvmmem", O_RDWR);
>>     ptr = ioctl(kvmmem_fd, KVMMEM_MAP_MEMORY, { token, size });
>>     ioctl(kvmmem_fd, KVMMEM_UNMAP_MEMORY, { ptr, size });
> 
> The idea with a device solves one of our problems with
> KVMI_MAP_PHYSICAL_PAGE_TO_GUEST: the introspector has to undo the
> mapping, otherwise the VM will crash when the kernel reuses the gfn.
> This does not happen if the introspector crashes ...
> 
> We can make it so that when kvmmem_fd is closed, all mappings are
> automatically undone.

Yes, that's a nice side effect.

>> The map/unmap memory operation would be a hypercall, not a socket
>> command, but the random "token" would be returned on the socket via some
>> KVMI_MAP_PHYSICAL_PAGE_TO_GUEST command (or more accurately, a
>> replacement accepting {gpa, size} instead of {gpa, gfn_dest}).  Handles
>> can be short lived, e.g. you could have at most a small number tokens
>> per host created (and passed back via KVMI) but not yet used by the
>> hypercall.  Once it's used by the hypercall, the token is not needed
>> anymore, so this is not a strong limitation.
> 
> I understand that the token is used to make sure the mapping hypercall
> is only used by VM-s and applications doing introspection.

Yes.

>> After KVMMEM_MAP_MEMORY, you'd get a SIGSEGV if the guest memory layout
>> changes (userfaultfd can be used by the introspector to simplify the
>> handling and retry). You'd have to re-map the memory explicitly.
> 
> "guest memory layout changes" - is this the guest _being_ introspected?

Yes.

> If so, how would an introspector running in a separate VM get a
> SIGSEGV? Have KVM inject a page fault if the host encounters a #PF for
> the VMA representing the foreign mapping (presumably it knows the gva
> as well)?

When the foreign mapping is invalidated by the actions of the guest
being introspected, the pages become inaccessible in the EPT page tables
of the introspector VM.

Then, an access in the introspector VM will result in an EPT violation,
which KVM can translate to a page fault---or actually a #VE would be
even better.  (Even if you don't modify KVM to use a hardware #VE, KVM's
MMU can choose to inject the exception as a #VE rather than #PF).

That said, I think this should come later, and for the beginning only
read/write should be included.

Paolo

>> Alas I have no idea how to verify the handle securely on the host, since
>> the host is not supposed to know which guests are introspectors and
>> which host got which token.  But maybe if the token namespace is big
>> enough (256 bits?) and random, it's okay to ignore the possibility that
>> a guest tries to guess.  (This idea is roughly based on how SCSI
>> offloaded copies work).

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-07 14:12                                           ` Mihai Donțu
@ 2017-08-07 15:56                                             ` Paolo Bonzini
  2017-08-07 16:44                                               ` Mihai Donțu
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2017-08-07 15:56 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel, Andrei Vlad LUTAS

On 07/08/2017 16:12, Mihai Donțu wrote:
> On Mon, 2017-08-07 at 15:49 +0200, Paolo Bonzini wrote:
>> On 07/08/2017 15:25, Mihai Donțu wrote:
>>>> "Pause all VCPUs and stop all DMA" would definitely be a layering
>>>> violation, so it cannot be added.
>>>>
>>>> "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
>>>> a given id" commands.  I lean towards omitting it.
>>>
>>> The case where the introspector wants to scan the guest memory needs a
>>> KVMI_PAUSE_VM, which as discussed in a previous email, can be the
>>> actual qemu 'pause' command.
>>
>> Do you mean it needs to stop DMA as well?
> 
> No, DMA can proceed normally. I remain of the opinion that KVMI users
> must know what guest memory ranges are OK to access by looking at MTRR-
> s, PAT or guest kernel structures, or a combination of all three.

Ok, good.  Sorry if I am dense on the DMA/no-DMA cases.  (But, I don't
understand your remark about guest memory ranges; the point of
bus-master DMA is that it works on any memory, and cache snooping makes
it even easier for hypothetical malware to do memory writes via
bus-master DMA).

>>> However, we would like to limit the
>>> communication channels we have with the host and not use qmp (or
>>> libvirt/etc. if qmp is not exposed). Instead, have a command that
>>> triggers a KVM_RUN exit to qemu which in turn will call the underlying
>>> pause function used by qmp. Would that be OK with you?
>>
>> You would have to send back something on completion, and then I am
>> worried of races and deadlocks.  Plus, pausing a VM at the QEMU level is
>> a really expensive operation, so I don't think it's a good idea to let
>> the introspector do this.  You can pause all VCPUs, or use memory page
>> permissions.
> 
> Pausing all vCPU-s was my first thought, I was just trying to follow
> your statement: "I lean towards omitting it". :-)

Yes, and I still do because a hypothetical "pause all VCPUs" command
still has the issue that you could get other events before the command
completes.  So I am not convinced that a specialized command actually
makes the introspector code much simpler.

I hope you understand that I want to keep the trusted base (not just the
code I maintain, though that is a secondary benefit ;)) as simple as
possible.

> It will take a bit of user-space-fu, in that after issuing N vCPU pause
> commands in a row we will have to wait for N events, which might race
> with other events (MSR, CRx etc.) which need handling otherwise the
> pause ones will not arrive

The same issue would be there in QEMU or KVM though.

If you can always request "pause all vCPUs" from an event handler,
avoiding deadlocks is relatively easy.  If you cannot ensure that, for
example because of work that is scheduled periodically, you can send a
KVM_PAUSE command to ensure the work is done in a safe condition.

Then you get the following pseudocode algorithm:

    // a vCPU is not executing guest code, and it's going to check
    // num_pause_vm_requests before going back to guest code
    vcpu_not_running(id) {
        unmark vCPU "id" as running
        if (num vcpus running == 0)
            cond_broadcast(no_running_cpus)
    }

    pause_vcpu(id) {
        mark vCPU "id" as being-paused
        send KVMI_PAUSED for the vcpu
    }

    // return only when no vCPU is in KVM_RUN
    pause_vm() {
        if this vCPU is running
            if not in an event handler
                // caller should do pause_vcpu and defer the work
                return

            // we know this vCPU is not KVM_RUN
            vcpu_not_running()

        num_pause_vm_requests++
        if (num vcpus running > 0)
            for each vCPU that is running and not being-paused
                pause_vcpu(id)
            while (num vcpus running > 0)
                cond_wait(no_running_vcpus)
    }

    // tell paused vCPUs that they can resume
    resume_vm() {
        num_pause_vm_requests--
        if (num_pause_all_requests == 0)
            cond_broadcast(no_pending_pause_vm_requests)
        // either we're in an event handler, or a "pause" command was
        // sent for this vCPU.  in any case we're guaranteed to do an
        // event_reply sooner or later, which will again mark the vCPU
        // as running
    }

    // after an event reply, the vCPU goes back to KVM_RUN.  therefore
    // an event reply can act as a synchronization point for pause-vm
    // requests: delay the reply if there's such a request
    event_reply(id, data) {
        if (num_pause_vm_requests > 0) {
            if vCPU "id" is running
                vcpu_not_running(id)
            while (num_pause_vm_requests > 0)
                cond_wait(no_pending_pause_vm_requests)
        }
        mark vCPU "id" as running
        send event reply on KVMI socket
    }

    // this is what you do when KVM tells you that the guest is either
    // in userspace, or waiting to be woken up ("paused" event).  from
    // the introspector POV the two are the same.
    vcpu_ack_pause(id) {
        vcpu_not_running(id)
        unmark vCPU "id" as being-paused

        // deferred work presumably calls pause_vm/resumve_vm, and this
        // vCPU is not running now, so this is a nice point to flush it
        if any deferred work exists, do it now
    }

and on the KVMI read handler:

    on reply to "pause" command:
        if reply says the vCPU is currently in userspace
            // we'll get a KVMI_PAUSED event as soon as the host
            // reenters KVM with KVM_RUN, but we can already say the
            // CPU is not running
            vcpu_ack_pause()

    on "paused" event:
        vcpu_ack_pause()
        event_reply()

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection
  2017-08-07 15:56                                             ` Paolo Bonzini
@ 2017-08-07 16:44                                               ` Mihai Donțu
  0 siblings, 0 replies; 50+ messages in thread
From: Mihai Donțu @ 2017-08-07 16:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm,
	Tamas K Lengyel, Andrei Vlad LUTAS

On Mon, 2017-08-07 at 17:56 +0200, Paolo Bonzini wrote:
> On 07/08/2017 16:12, Mihai Donțu wrote:
> > On Mon, 2017-08-07 at 15:49 +0200, Paolo Bonzini wrote:
> > > On 07/08/2017 15:25, Mihai Donțu wrote:
> > > > > "Pause all VCPUs and stop all DMA" would definitely be a layering
> > > > > violation, so it cannot be added.
> > > > > 
> > > > > "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
> > > > > a given id" commands.  I lean towards omitting it.
> > > > 
> > > > The case where the introspector wants to scan the guest memory needs a
> > > > KVMI_PAUSE_VM, which as discussed in a previous email, can be the
> > > > actual qemu 'pause' command.
> > > 
> > > Do you mean it needs to stop DMA as well?
> > 
> > No, DMA can proceed normally. I remain of the opinion that KVMI users
> > must know what guest memory ranges are OK to access by looking at MTRR-
> > s, PAT or guest kernel structures, or a combination of all three.
> 
> Ok, good.  Sorry if I am dense on the DMA/no-DMA cases.

I think it's OK to restate things, especially since my (our) view on
these matters might not match the KVM reality that you know far
better.

> (But, I don't understand your remark about guest memory ranges; the point of
> bus-master DMA is that it works on any memory, and cache snooping makes
> it even easier for hypothetical malware to do memory writes via
> bus-master DMA).

This is where I separated things in my head: I merely limited myself to
accessing memory, while leaving the reality of DMA-based attacks a
problem to be solved separately. There is some reasearch work being
tested internally on that, but I have yet to get in touch with the
people involved in it. As soon as I get some details maybe we can
connect something in KVM.

> > > > However, we would like to limit the
> > > > communication channels we have with the host and not use qmp (or
> > > > libvirt/etc. if qmp is not exposed). Instead, have a command that
> > > > triggers a KVM_RUN exit to qemu which in turn will call the underlying
> > > > pause function used by qmp. Would that be OK with you?
> > > 
> > > You would have to send back something on completion, and then I am
> > > worried of races and deadlocks.  Plus, pausing a VM at the QEMU level is
> > > a really expensive operation, so I don't think it's a good idea to let
> > > the introspector do this.  You can pause all VCPUs, or use memory page
> > > permissions.
> > 
> > Pausing all vCPU-s was my first thought, I was just trying to follow
> > your statement: "I lean towards omitting it". :-)
> 
> Yes, and I still do because a hypothetical "pause all VCPUs" command
> still has the issue that you could get other events before the command
> completes.  So I am not convinced that a specialized command actually
> makes the introspector code much simpler.
> 
> I hope you understand that I want to keep the trusted base (not just the
> code I maintain, though that is a secondary benefit ;)) as simple as
> possible.
> 
> > It will take a bit of user-space-fu, in that after issuing N vCPU pause
> > commands in a row we will have to wait for N events, which might race
> > with other events (MSR, CRx etc.) which need handling otherwise the
> > pause ones will not arrive
> 
> The same issue would be there in QEMU or KVM though.
> 
> If you can always request "pause all vCPUs" from an event handler,
> avoiding deadlocks is relatively easy.  If you cannot ensure that, for
> example because of work that is scheduled periodically, you can send a
> KVM_PAUSE command to ensure the work is done in a safe condition.
> 
> Then you get the following pseudocode algorithm:
> 
>     // a vCPU is not executing guest code, and it's going to check
>     // num_pause_vm_requests before going back to guest code
>     vcpu_not_running(id) {
>         unmark vCPU "id" as running
>         if (num vcpus running == 0)
>             cond_broadcast(no_running_cpus)
>     }
> 
>     pause_vcpu(id) {
>         mark vCPU "id" as being-paused
>         send KVMI_PAUSED for the vcpu
>     }
> 
>     // return only when no vCPU is in KVM_RUN
>     pause_vm() {
>         if this vCPU is running
>             if not in an event handler
>                 // caller should do pause_vcpu and defer the work
>                 return
> 
>             // we know this vCPU is not KVM_RUN
>             vcpu_not_running()
> 
>         num_pause_vm_requests++
>         if (num vcpus running > 0)
>             for each vCPU that is running and not being-paused
>                 pause_vcpu(id)
>             while (num vcpus running > 0)
>                 cond_wait(no_running_vcpus)
>     }
> 
>     // tell paused vCPUs that they can resume
>     resume_vm() {
>         num_pause_vm_requests--
>         if (num_pause_all_requests == 0)
>             cond_broadcast(no_pending_pause_vm_requests)
>         // either we're in an event handler, or a "pause" command was
>         // sent for this vCPU.  in any case we're guaranteed to do an
>         // event_reply sooner or later, which will again mark the vCPU
>         // as running
>     }
> 
>     // after an event reply, the vCPU goes back to KVM_RUN.  therefore
>     // an event reply can act as a synchronization point for pause-vm
>     // requests: delay the reply if there's such a request
>     event_reply(id, data) {
>         if (num_pause_vm_requests > 0) {
>             if vCPU "id" is running
>                 vcpu_not_running(id)
>             while (num_pause_vm_requests > 0)
>                 cond_wait(no_pending_pause_vm_requests)
>         }
>         mark vCPU "id" as running
>         send event reply on KVMI socket
>     }
> 
>     // this is what you do when KVM tells you that the guest is either
>     // in userspace, or waiting to be woken up ("paused" event).  from
>     // the introspector POV the two are the same.
>     vcpu_ack_pause(id) {
>         vcpu_not_running(id)
>         unmark vCPU "id" as being-paused
> 
>         // deferred work presumably calls pause_vm/resumve_vm, and this
>         // vCPU is not running now, so this is a nice point to flush it
>         if any deferred work exists, do it now
>     }
> 
> and on the KVMI read handler:
> 
>     on reply to "pause" command:
>         if reply says the vCPU is currently in userspace
>             // we'll get a KVMI_PAUSED event as soon as the host
>             // reenters KVM with KVM_RUN, but we can already say the
>             // CPU is not running
>             vcpu_ack_pause()
> 
>     on "paused" event:
>         vcpu_ack_pause()
>         event_reply()

Thank you for this!

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2017-08-07 16:44 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-07 14:34 [RFC PATCH v2 0/1] VM introspection Adalbert Lazar
2017-07-07 14:34 ` [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for " Adalbert Lazar
2017-07-07 16:52   ` Paolo Bonzini
2017-07-10 15:32     ` alazar
2017-07-10 17:03       ` Paolo Bonzini
2017-07-11 16:48         ` Adalbert Lazar
2017-07-11 16:51           ` Paolo Bonzini
2017-07-13  5:57             ` Mihai Donțu
2017-07-13  7:32               ` Paolo Bonzini
2017-07-18 11:51                 ` Mihai Donțu
2017-07-18 12:02                   ` Mihai Donțu
2017-07-23 13:02                   ` Paolo Bonzini
2017-07-26 17:04                     ` Mihai Donțu
2017-07-26 17:25                       ` Tamas K Lengyel
2017-07-27 14:41                         ` Mihai Donțu
2017-07-27 13:33                       ` Paolo Bonzini
2017-07-27 14:46                         ` Mihai Donțu
2017-07-13  8:36     ` Mihai Donțu
2017-07-13  9:15       ` Paolo Bonzini
2017-07-27 16:23         ` Mihai Donțu
2017-07-27 16:52           ` Paolo Bonzini
2017-07-27 17:19             ` Mihai Donțu
2017-08-01 10:40               ` Paolo Bonzini
2017-08-01 16:33                 ` Tamas K Lengyel
2017-08-01 20:47                   ` Paolo Bonzini
2017-08-02 11:52                     ` Mihai Donțu
2017-08-02 12:27                       ` Paolo Bonzini
2017-08-02 13:32                         ` Mihai Donțu
2017-08-02 13:51                           ` Paolo Bonzini
2017-08-02 14:17                             ` Mihai Donțu
2017-08-04  8:35                               ` Paolo Bonzini
2017-08-04 15:29                                 ` Mihai Donțu
2017-08-04 15:37                                   ` Paolo Bonzini
2017-08-05  8:00                                   ` Andrei Vlad LUTAS
2017-08-07 12:18                                     ` Paolo Bonzini
2017-08-07 13:25                                       ` Mihai Donțu
2017-08-07 13:49                                         ` Paolo Bonzini
2017-08-07 14:12                                           ` Mihai Donțu
2017-08-07 15:56                                             ` Paolo Bonzini
2017-08-07 16:44                                               ` Mihai Donțu
2017-08-02 13:53                           ` Mihai Donțu
2017-07-27 17:06     ` Mihai Donțu
2017-07-27 17:18       ` Paolo Bonzini
2017-07-07 17:29 ` [RFC PATCH v2 0/1] " Paolo Bonzini
2017-08-07 15:28   ` Mihai Donțu
2017-08-07 15:44     ` Paolo Bonzini
2017-07-12 14:09 ` Konrad Rzeszutek Wilk
2017-07-13  5:37   ` Mihai Donțu
2017-07-14 16:13     ` Konrad Rzeszutek Wilk
2017-07-18  8:55       ` Mihai Donțu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).