* [RFC PATCH v2 0/1] VM introspection @ 2017-07-07 14:34 Adalbert Lazar 2017-07-07 14:34 ` [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for " Adalbert Lazar ` (2 more replies) 0 siblings, 3 replies; 50+ messages in thread From: Adalbert Lazar @ 2017-07-07 14:34 UTC (permalink / raw) To: kvm Cc: Paolo Bonzini, Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu, Adalbert Lazar The following patch adds the documentation for an introspection subsystem for KVM (KVMi). It details the purpose and the use case that has shaped the proposed API/ABI, as well as the wire protocol. During the discussion that has developed around our previous RFC patchset a number of TODO-s have been highlighted: * the integration in qemu: a socket-based protocol used to initiate the connection with an introspection tool and then passes control to KVM, the in-kernel mechanism taking over from there; * the integration of the SPT-handling mechanism into the KVM MMU; * the elaboration of a set of policies and a mechanism to better control this feature; One bit of code that has passed (maybe) unnoticed in the RFC is a new function added to Linux' mm called vm_replace_page() which, much like KSM's replace_page(), gets two processes to share a page (read-write, no-COW): https://marc.info/?l=kvm&m=149762056518799&w=2 This is used to quickly scan and patch the guest software. The patch following this cover letter does not yet address the points above but aims to clear with the community the overall ABI/API, with a focus on x86. Thank you, v2: - add documentation and ABI [Paolo, Jan] - drop all the other patches for now [Paolo] - remove KVMI_GET_GUESTS, KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, and let libvirt/qemu handle this [Stefan, Paolo] - change the license from LGPL to GPL [Jan] - remove KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL (not used anymore) - make the interface a little more consistent Adalbert Lazar (1): kvm: Add documentation and ABI/API header for VM introspection Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++ include/uapi/linux/kvmi.h | 310 ++++++++++++ 2 files changed, 1295 insertions(+) create mode 100644 Documentation/virtual/kvm/kvmi.rst create mode 100644 include/uapi/linux/kvmi.h ^ permalink raw reply [flat|nested] 50+ messages in thread
* [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-07 14:34 [RFC PATCH v2 0/1] VM introspection Adalbert Lazar @ 2017-07-07 14:34 ` Adalbert Lazar 2017-07-07 16:52 ` Paolo Bonzini 2017-07-07 17:29 ` [RFC PATCH v2 0/1] " Paolo Bonzini 2017-07-12 14:09 ` Konrad Rzeszutek Wilk 2 siblings, 1 reply; 50+ messages in thread From: Adalbert Lazar @ 2017-07-07 14:34 UTC (permalink / raw) To: kvm Cc: Paolo Bonzini, Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu, Adalbert Lazar Signed-off-by: Mihai Dontu <mdontu@bitdefender.com> Signed-off-by: Adalbert Lazar <alazar@bitdefender.com> --- Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++ include/uapi/linux/kvmi.h | 310 ++++++++++++ 2 files changed, 1295 insertions(+) create mode 100644 Documentation/virtual/kvm/kvmi.rst create mode 100644 include/uapi/linux/kvmi.h diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst new file mode 100644 index 000000000000..63d3a75d5ffc --- /dev/null +++ b/Documentation/virtual/kvm/kvmi.rst @@ -0,0 +1,985 @@ +========================================================= +KVMi - the kernel virtual machine introspection subsystem +========================================================= + +The KVM introspection subsystem provides a facility for applications running +on the host or in a separate VM, to control the execution of other VM-s +(pause, resume, shutdown), query the state of the vCPU-s (GPR-s, MSR-s etc.), +alter the page access bits in the shadow page tables (only for the hardware +backed ones, eg. Intel's EPT) and receive notifications when events of +interest have taken place (shadow page table level faults, key MSR writes, +hypercalls etc.). Some notifications can be responded to with an action +(like preveting an MSR from being written), others are mere informative +(like breakpoint events which are used for execution tracing), though the +option to alter the GPR-s is common to each of them (usually the program +counter is advanced past the instruction that triggered the guest exit). +All events are optional. An application using this subsystem will explicitly +register for them. + +The use case that gave way for the creation of this subsystem is to monitor +the guest OS and as such the ABI/API is higly influenced by how the guest +software (kernel, applications) see the world. For example, some events +provide information specific for the host CPU architecture +(eg. MSR_IA32_SYSENTER_EIP) merely because its leveraged by guest software +to implement a critical feature (fast system calls). + +At the moment, the target audience for VMI are security software authors +that wish to perform forensics on newly discovered threats (exploits) or +to implement another layer of security like preventing a large set of +kernel rootkits simply by "locking" the kernel image in the shadow page +tables (ie. enforce .text r-x, .rodata rw- etc.). It's the latter case that +made VMI a separate subsystem, even though many of these features are +available in the device manager (eg. qemu). The ability to build a security +application that does not interfere (in terms of performance) with the +guest software asks for a specialized interface that is designed for minimum +overhead. + +API/ABI +======= + +This chapter describes the VMI interface used to monitor and control local +guests from an user application. + +Overview +-------- + +The interface is socket based, one connection for every VM. One end is in the +host kernel while the other is held by the user application (introspection +tool). + +The initial connection is established by an application running on the host +(eg. qemu) that connects to the introspection tool and after a handshake the +socket is passed to the host kernel making all further communication take +place between it and the introspection tool. The initiating party (qemu) can +close its end so that any potential exploits cannot take a hold of it. + +The socket protocol allows for commands and events to be multiplexed over +the same connection. A such, it is possible for the introspection tool to +receive an event while waiting for the result of a command. Also, it can +send a command while the host kernel is waiting for a reply to an event. + +The kernel side of the socket communication is blocking and will wait for +an answer from its peer indefinitely or until the guest is powered off +(killed) at which point it will wake up and properly cleanup. If the peer +goes away KVM will exit to user space and the device manager will try and +reconnect. If it fails, the device manager will inform KVM to cleanup and +continue normal guest execution as if the introspection subsystem has never +been used on that guest. + +All events have a common header:: + + struct kvmi_socket_hdr { + __u16 msg_id; + __u16 size; + __u32 seq; + }; + +and all need a reply with the same kind of header, having the same +sequence number (seq) and the same message id (msg_id). + +Because events from different vCPU threads can send messages at the same +time and the replies can come in any order, the receiver loop uses the +sequence number (seq) to identify which reply belongs to which vCPU, in +order to dispatch the message to the right thread waiting for it. + +After 'kvmi_socket_hdr', 'msg_id' specific data of 'kvmi_socket_hdr.size' +bytes will follow. + +The message header and its data must be sent with one write() call +to the socket (as a whole). This simplifies the receiver loop and avoids +the recontruction of messages on the other side. + +The wire protocol uses the host native byte-order. The introspection tool +must check this during the handshake and do the necessary conversion. + +Replies to commands have an error code (__s32) at offset 0 in the message +data. Specific message data will follow this. If the error code is not +zero, all the other data members will have undefined content (not random +heap or stack data, but valid results at the time of the failure), unless +otherwise specified. + +In case of an unsupported command, the message data will contain only +the error code (-ENOSYS). + +The error code is related to the processing of the corresponding +message. For all the other errors (socket errrors, incomplete messages, +wrong sequence numbers etc.) the socket must be closed and the connection +can be retried. + +While all commands will have a reply as soon as possible, the replies +to events will probably be delayed until a set of (new) commands will +complete:: + + Host kernel Tool + ----------- -------- + event 1 -> + <- command 1 + command 1 reply -> + <- command 2 + command 2 reply -> + <- event 1 reply + +If both ends send a message "in the same time":: + + KVMi Userland + ---- -------- + event X -> <- command X + +the host kernel should reply to 'command X', regardless of the receive time +(before or after the 'event X' was sent). + +As it can be seen below, the wire protocol specifies occasional padding. This +is to permit working with the data by directly using C structures. The +members should have the 'natural' alignment of the host. + +To describe the commands/events, we reuse some conventions from api.txt: + + - Architectures: which instruction set architectures providing this command/event + + - Versions: which versions provide this command/event + + - Parameters: incoming message data + + - Returns: outgoing/reply message data + +Handshake +--------- + +Allthough this falls out of the scope of the introspection subsystem, below +is a proposal of a handshake that can be used by implementors. + +The device manager will connect to the introspection tool and wait for a +cryptographic hash of a cookie that should be known by both peers. If the +hash is correct (the destination has been "authenticated"), the device +manager will send another cryptographic hash and random salt. The peer +recomputes the hash of the cookie bytes including the salt and if they match, +the device manager has been "authenticated" too. This is a rather crude +system that makes it difficult for device manager exploits to trick the +introspection tool into believing its working OK. + +The cookie would normally be generated by a management tool (eg. libvirt) +and make it available to the device manager and to a properly authenticated +client. It is the job of a third party to retrieve the cookie from the +management application and pass it over a secure channel to the introspection +tool. + +Once the basic "authentication" has taken place, the introspection tool +can receive information on the guest (its UUID) and other flags (endianness +or features supported by the host kernel). + +Introspection capabilities +-------------------------- + +TODO + +Commands +-------- + +The following C structures are meant to be used directly when communicating +over the wire. The peer that detects any size mismatch should simply close +the connection and report the error. + +1. KVMI_GET_VERSION +------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: {} +:Returns: ↴ + +:: + + struct kvmi_get_version_reply { + __s32 err; + __u32 version; + }; + +Returns the introspection API version (the KVMI_VERSION constant) and the +error code (zero). In case of an unlikely error, the version will have an +undefined value. + +2. KVMI_GET_GUEST_INFO +---------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: {} +:Returns: ↴ + +:: + + struct kvmi_get_guest_info_reply { + __s32 err; + __u16 vcpu_count; + __u16 padding; + __u64 tsc_speed; + }; + +Returns the number of online vcpus, and the TSC frequency in HZ, if supported +by the architecture (otherwise is 0). + +3. KVMI_PAUSE_GUEST +------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: {} +:Returns: ↴ + +:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + }; + +This command will pause all vcpus threads, by getting them out of guest mode +and put them in the "waiting introspection commands" state. + +4. KVMI_UNPAUSE_GUEST +--------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: {} +:Returns: ↴ + +:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + }; + +Resume the vcpu threads, or at least get them out of "waiting introspection +commands" state. + +5. KVMI_SHUTDOWN_GUEST +---------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: {} +:Returns: ↴ + +:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + }; + +Ungracefully shutdown the guest. + +6. KVMI_GET_REGISTERS +--------------------- + +:Architectures: x86 (could be all, but with different input/output) +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_get_registers_x86 { + __u16 vcpu; + __u16 nmsrs; + __u32 msrs_idx[0]; + }; + +:Returns: ↴ + +:: + + struct kvmi_get_registers_x86_reply { + __s32 err; + __u32 mode; + struct kvm_regs regs; + struct kvm_sregs sregs; + struct kvm_msrs msrs; + }; + +For the given vcpu_id and the nmsrs sized array of MSRs registers, returns +the vCPU mode (in bytes: 2, 4 or 8), the general purpose registers, +the special registers and the requested set of MSR-s. + +7. KVMI_SET_REGISTERS +--------------------- + +:Architectures: x86 (could be all, but with different input) +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_set_registers_x86 { + __u16 vcpu; + __u16 padding[3]; + struct kvm_regs regs; + }; + +:Returns: ↴ + +:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + }; + +Sets the general purpose registers for the given vcpu_id. + +8. KVMI_GET_MTRR_TYPE +--------------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_mtrr_type { + __u64 gpa; + }; + +:Returns: ↴ + +:: + + struct kvmi_mtrr_type_reply { + __s32 err; + __u32 type; + }; + +Returns the guest memory type for a specific physical address. + +9. KVMI_GET_MTRRS +----------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_mtrrs { + __u16 vcpu; + __u16 padding[3]; + }; + +:Returns: ↴ + +:: + + struct kvmi_mtrrs_reply { + __s32 err; + __u32 padding; + __u64 pat; + __u64 cap; + __u64 type; + }; + +Returns MSR_IA32_CR_PAT, MSR_MTRRcap and MSR_MTRRdefType for the specified +vCPU. + +10. KVMI_GET_XSAVE_INFO +----------------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_xsave_info { + __u16 vcpu; + __u16 padding[3]; + }; + +:Returns: ↴ + +:: + + struct kvmi_xsave_info_reply { + __s32 err; + __u32 size; + }; + +Returns the xstate size for the specified vCPU. + +11. KVMI_GET_PAGE_ACCESS +------------------------ + +:Architectures: all +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_get_page_access { + __u16 vcpu; + __u16 padding[3]; + __u64 gpa; + }; + +:Returns: ↴ + +:: + + struct kvmi_get_page_access_reply { + __s32 err; + __u32 access; + }; + +Returns the spte flags (rwx - present, write & user) for the specified +vCPU and guest physical address. + +12. KVMI_SET_PAGE_ACCESS +------------------------ + +:Architectures: all +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_set_page_access { + __u16 vcpu; + __u16 padding; + __u32 access; + __u64 gpa; + }; + +:Returns: ↴ + +:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + }; + +Sets the spte flags (rwx - present, write & user) - access - for the specified +vCPU and guest physical address. + +13. KVMI_INJECT_PAGE_FAULT +-------------------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_page_fault { + __u16 vcpu; + __u16 padding; + __u32 error; + __u64 gva; + }; + +:Returns: ↴ + +:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + }; + +Injects a vCPU page fault with the specified guest virtual address and +error code. + +14. KVMI_INJECT_BREAKPOINT +-------------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_inject_breakpoint { + __u16 vcpu; + __u16 padding[3]; + }; + +:Returns: ↴ + +:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + }; + +Injects a breakpoint for the specified vCPU. This command is usually sent in +response to an event and as such the proper GPR-s will be set with the reply. + +15. KVMI_MAP_PHYSICAL_PAGE_TO_GUEST +----------------------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_map_physical_page_to_guest { + __u64 gpa_src; + __u64 gfn_dest; + }; + +:Returns: ↴ + +:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + }; + +Maps a page from an introspected guest memory (gpa_src) to the guest running +the introspection tool. 'gfn_dest' points to an anonymous, locked mapping one +page in size. + +This command is used to "read" the introspected guest memory and potentially +place patches (eg. INT3-s). + +16. KVMI_UNMAP_PHYSICAL_PAGE_FROM_GUEST +--------------------------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_unmap_physical_page_from_guest { + __u64 gfn_dest; + }; + +:Returns: ↴ + +:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + }; + +Unmaps a previously mapped page. + +17. KVMI_CONTROL_EVENTS +----------------------- + +:Architectures: all +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_control_events { + __u16 vcpu; + __u16 padding; + __u32 events; + }; + +:Returns: ↴ + +:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + }; + +Enables/disables vCPU introspection events, by setting/clearing one or more +of the following bits (see 'Events' below) : + + KVMI_EVENT_CR + KVMI_EVENT_MSR + KVMI_EVENT_XSETBV + KVMI_EVENT_BREAKPOINT + KVMI_EVENT_USER_CALL + KVMI_EVENT_PAGE_FAULT + KVMI_EVENT_TRAP + +Trying to enable unsupported events (~KVMI_KNOWN_EVENTS) by the current +architecture would fail and -EINVAL will be returned. + +18. KVMI_CR_CONTROL +------------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_cr_control { + __u8 enable; + __u8 padding[3]; + __u32 cr; + }; + +:Returns: ↴ + +:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + }; + +Enables/disables introspection for a specific CR register and must +be used in addition to KVMI_CONTROL_EVENTS with the KVMI_EVENT_CR bit +flag set. + +Eg. kvmi_cr_control { .enable=1, .cr=3 } will enable introspection +for CR3. + +Currently, trying to set any register but CR0, CR3 and CR4 will return +-EINVAL. + +19. KVMI_MSR_CONTROL +-------------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_msr_control { + __u8 enable; + __u8 padding[3]; + __u32 msr; + }; + +:Returns: ↴ + +:: + + struct kvmi_error_code { + __s32 err; + __u32 padding; + }; + +Enables/disables introspection for a specific MSR, and must be used +in addition to KVMI_CONTROL_EVENTS with the KVMI_EVENT_MSR bit flag set. + +Currently, only MSRs within the following 3 ranges are supported. Trying +to control any other register will return -EINVAL. :: + + 0 ... 0x00001fff + 0x40000000 ... 0x40001fff + 0xc0000000 ... 0xc0001fff + +Events +------ + +All vcpu events are sent using the KVMI_EVENT_VCPU message id. No event will +be sent unless enabled with a KVMI_CONTROL_EVENTS command. + +For x86, the message data starts with a common structure:: + + struct kvmi_event_x86 { + __u16 vcpu; + __u8 mode; + __u8 padding1; + __u32 event; + struct kvm_regs regs; + struct kvm_sregs sregs; + struct { + __u64 sysenter_cs; + __u64 sysenter_esp; + __u64 sysenter_eip; + __u64 efer; + __u64 star; + __u64 lstar; + } msrs; + }; + +In order to help the introspection tool with the event analysis while +avoiding unnecessary introspection commands, the message data holds some +registers (kvm_regs, kvm_sregs and a couple of MSR-s) beside +the vCPU id, its mode (in bytes) and the event (one of the flags set +with the KVMI_CONTROL_EVENTS command). + +The replies to events also start with a common structure, having the +KVMI_EVENT_VCPU_REPLY message id:: + + struct kvmi_event_x86_reply { + struct kvm_regs regs; + __u32 actions; + __u32 padding; + }; + +The 'actions' member holds one or more flags. For example, if +KVMI_EVENT_ACTION_SET_REGS is set, the general purpose registers will +be overwritten with the new values (regs) from introspector. + +Specific data can follow these common structures. + +1. KVMI_EVENT_CR +---------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_event_x86; + struct kvmi_event_cr { + __u16 cr; + __u16 padding[3]; + __u64 old_value; + __u64 new_value; + }; + +:Returns: ↴ + +:: + + struct kvmi_event_x86_reply; + struct kvmi_event_cr_reply { + __u64 new_val; + }; + +This event is sent when a CR register was modified and the introspection +has already been enabled for this kind of event (KVMI_CONTROL_EVENTS) +and for this specific register (KVMI_CR_CONTROL). + +kvmi_event_x86, the CR number, the old value and the new value are +sent to the introspector, which can respond with one or more action flags: + + KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers + using the values from introspector (regs) + + KVMI_EVENT_ACTION_ALLOW - allow the register modification with the + value from introspector (new_val), otherwise deny the modification + but allow the guest to proceed as if the register has been loaded + with the desired value. + +2. KVMI_EVENT_MSR +----------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_event_x86; + struct kvmi_event_msr { + __u32 msr; + __u32 padding; + __u64 old_value; + __u64 new_value; + }; + +:Returns: ↴ + +:: + + struct kvmi_event_x86_reply; + struct kvmi_event_msr_reply { + __u64 new_val; + }; + +This event is sent when a MSR was modified and the introspection has already +been enabled for this kind of event (KVMI_CONTROL_EVENTS) and for this +specific register (KVMI_MSR_CONTROL). + +kvmi_event_x86, the MSR number, the old value and the new value are +sent to the introspector, which can respond with one or more action flags: + + KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers + using the values from introspector (regs) + + KVMI_EVENT_ACTION_ALLOW - allow the register modification with the + value from introspector (new_val), otherwise deny the modification + but allow the guest to proceed as if the register has been loaded + with the desired value. + +3. KVMI_EVENT_XSETBV +-------------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_event_x86; + struct kvmi_event_xsetbv { + __u64 xcr0; + }; + +:Returns: ↴ + +:: + + struct kvmi_event_x86_reply; + +This event is sent when the extended control register XCR0 was modified +and the introspection has already been enabled for this kind of event +(KVMI_CONTROL_EVENTS). + +kvmi_event_x86 and the new value are sent to the introspector, which +can respond with the KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', +instructing KVMi to override the general purpose registers using the +values from introspector (regs). + +4. KVMI_EVENT_BREAKPOINT +------------------------ + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_event_x86; + struct kvmi_event_breakpoint { + __u64 gpa; + }; + +:Returns: ↴ + +:: + + struct kvmi_event_x86_reply; + +This event is sent when a breakpoint was reached and the introspection has +already been enabled for this kind of event (KVMI_CONTROL_EVENTS). + +kvmi_event_x86 and the guest physical address are sent to the introspector, +which can respond with one or more action flags: + + KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers + using the values from introspector (regs) + + KVMI_EVENT_ACTION_ALLOW - is implied if not specified + +5. KVMI_EVENT_USER_CALL +----------------------- + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_event_x86; + +:Returns: ↴ + +:: + + struct kvmi_event_x86_reply; + +This event is sent on a user hypercall and the introspection has already +already been enabled for this kind of event (KVMI_CONTROL_EVENTS). + +kvmi_event_x86 is sent to the introspector, which can respond with the +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host +kernel to override the general purpose registers using the values from +introspector (regs). + +6. KVMI_EVENT_PAGE_FAULT +------------------------ + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_event_x86; + struct kvmi_event_page_fault { + __u64 gva; + __u64 gpa; + __u32 mode; + __u32 padding; + }; + +:Returns: ↴ + +:: + + struct kvmi_event_x86_reply; + struct kvmi_event_page_fault_reply { + __u32 ctx_size; + __u8 ctx_data[256]; + }; + +This event is sent if a hypervisor page fault was encountered, the +introspection has already enabled the reports for this kind of event +(KVMI_CONTROL_EVENTS), and it was generated for a page for which the +introspector has shown interest (ie. has previously touched it by +adjusting the permissions). + +kvmi_event_x86, guest virtual address, guest physical address and +the exit qualification (mode) are sent to the introspector, which +can respond with one or more action flags: + + KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers + using the values from introspector (regs) + + (KVMI_EVENT_ALLOW | KVMI_EVENT_NOEMU) - let the guest re-trigger + the page fault + + (KVMI_EVENT_ALLOW | KVMI_EVENT_SET_CTX) - allow the page fault + via emulation but with custom input (ctx_data, ctx_size). This is + used to trick the guest software into believing it has read + certain data. In practice it is used to hide the contents of certain + memory areas + + KVMI_EVENT_ALLOW - allow the page fault via emulation + +If KVMI_EVENT_ALLOW is not set, it will fall back to the page fault handler +which usually implies overwriting any spte page access changes made before. +An introspection tool will always set this flag and prevent unwanted changes +to memory by skipping the instruction. It is up to the tool to adjust the +program counter in order to achieve this result. + +7. KVMI_EVENT_TRAP +------------------ + +:Architectures: x86 +:Versions: >= 1 +:Parameters: ↴ + +:: + + struct kvmi_event_x86; + struct kvmi_event_trap { + __u32 vector; + __u32 type; + __u32 err; + __u32 padding; + __u64 cr2; + }; + +:Returns: ↴ + +:: + + struct kvmi_event_x86_reply; + +This event is sent if a trap will be delivered to the guest (page fault, +breakpoint, etc.) and the introspection has already enabled the reports +for this kind of event (KVMI_CONTROL_EVENTS). + +This is used to inform the introspector of all pending traps giving it +a chance to determine if it should try again later in case a previous +KVMI_INJECT_PAGE_FAULT/KVMI_INJECT_BREAKPOINT command has been overwritten +by an interrupt picked up during guest reentry. + +kvmi_event_x86, exception/interrupt number (vector), exception/interrupt +type, exception code (err) and CR2 are sent to the introspector, which can +respond with the KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing +the host kernel to override the general purpose registers using the values +from introspector (regs). diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h new file mode 100644 index 000000000000..54a2d8ebf649 --- /dev/null +++ b/include/uapi/linux/kvmi.h @@ -0,0 +1,310 @@ +/* + * Copyright (C) 2017 Bitdefender S.R.L. + * + * The KVMI Library is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the named License, or any later version. + * + * The KVMI Library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with the KVMI Library; if not, see <http://www.gnu.org/licenses/> + */ +#ifndef __KVMI_H_INCLUDED__ +#define __KVMI_H_INCLUDED__ + +#include "asm/kvm.h" +#include <linux/types.h> + +#define KVMI_VERSION 0x00000001 + +#define KVMI_EVENT_CR (1 << 1) /* control register was modified */ +#define KVMI_EVENT_MSR (1 << 2) /* model specific reg. was modified */ +#define KVMI_EVENT_XSETBV (1 << 3) /* ext. control register was modified */ +#define KVMI_EVENT_BREAKPOINT (1 << 4) /* breakpoint was reached */ +#define KVMI_EVENT_USER_CALL (1 << 5) /* user hypercall */ +#define KVMI_EVENT_PAGE_FAULT (1 << 6) /* hyp. page fault was encountered */ +#define KVMI_EVENT_TRAP (1 << 7) /* trap was injected */ + +#define KVMI_KNOWN_EVENTS (KVMI_EVENT_CR | \ + KVMI_EVENT_MSR | \ + KVMI_EVENT_XSETBV | \ + KVMI_EVENT_BREAKPOINT | \ + KVMI_EVENT_USER_CALL | \ + KVMI_EVENT_PAGE_FAULT | \ + KVMI_EVENT_TRAP) + +#define KVMI_EVENT_ACTION_ALLOW (1 << 0) /* used in replies */ +#define KVMI_EVENT_ACTION_SET_REGS (1 << 1) /* registers need to be written back */ +#define KVMI_EVENT_ACTION_SET_CTX (1 << 2) /* set the emulation context */ +#define KVMI_EVENT_ACTION_NOEMU (1 << 3) /* return to guest without emulation */ + +#define KVMI_GET_VERSION 1 +#define KVMI_GET_GUESTS 2 /* TODO: remove me */ +#define KVMI_GET_GUEST_INFO 3 +#define KVMI_PAUSE_GUEST 4 +#define KVMI_UNPAUSE_GUEST 5 +#define KVMI_GET_REGISTERS 6 +#define KVMI_SET_REGISTERS 7 +#define KVMI_SHUTDOWN_GUEST 8 +#define KVMI_GET_MTRR_TYPE 9 +#define KVMI_GET_MTRRS 10 +#define KVMI_GET_XSAVE_INFO 11 +#define KVMI_GET_PAGE_ACCESS 12 +#define KVMI_SET_PAGE_ACCESS 13 +#define KVMI_INJECT_PAGE_FAULT 14 +#define KVMI_READ_PHYSICAL 15 /* TODO: remove me */ +#define KVMI_WRITE_PHYSICAL 16 /* TODO: remove me */ +#define KVMI_MAP_PHYSICAL_PAGE_TO_GUEST 17 +#define KVMI_UNMAP_PHYSICAL_PAGE_FROM_GUEST 18 +#define KVMI_CONTROL_EVENTS 19 +#define KVMI_CR_CONTROL 20 +#define KVMI_MSR_CONTROL 21 +#define KVMI_INJECT_BREAKPOINT 22 +#define KVMI_EVENT_GUEST_ON 23 /* TODO: remove me */ +#define KVMI_EVENT_GUEST_OFF 24 /* TODO: remove me */ +#define KVMI_EVENT_VCPU 25 +#define KVMI_EVENT_VCPU_REPLY 26 + +/* TODO: remove me */ +struct kvmi_guest { + __u8 uuid[16]; +}; + +/* TODO: remove me */ +struct kvmi_guests { + __u32 size; /* in: the size of the entire structure */ + struct kvmi_guest guests[1]; +}; + +/* TODO: remove me */ +struct kvmi_read_physical { + __u64 gpa; + __u64 size; +}; + +/* TODO: remove me */ +struct kvmi_read_physical_reply { + __s32 err; + __u8 bytes[0]; +}; + +/* TODO: remove me */ +struct kvmi_write_physical { + __u64 gpa; + __u64 size; + __u8 bytes[0]; +}; + + +struct kvmi_socket_hdr { + __u16 msg_id; + __u16 size; + __u32 seq; +}; + +struct kvmi_error_code { + __s32 err; + __u32 padding; +}; + +struct kvmi_get_version_reply { + __s32 err; + __u32 version; +}; + +struct kvmi_get_guest_info_reply { + __s32 err; + __u16 vcpu_count; + __u16 padding; + __u64 tsc_speed; +}; + +struct kvmi_get_registers_x86 { + __u16 vcpu; + __u16 nmsrs; + __u32 msrs_idx[0]; +}; + +struct kvmi_get_registers_x86_reply { + __s32 err; + __u32 mode; + struct kvm_regs regs; + struct kvm_sregs sregs; + struct kvm_msrs msrs; +}; + +struct kvmi_set_registers_x86 { + __u16 vcpu; + __u16 padding[3]; + struct kvm_regs regs; +}; + +struct kvmi_mtrr_type { + __u64 gpa; +}; + +struct kvmi_mtrr_type_reply { + __s32 err; + __u32 padding; + __u64 type; +}; + +struct kvmi_mtrrs { + __u16 vcpu; + __u16 padding[3]; +}; + +struct kvmi_mtrrs_reply { + __s32 err; + __u32 padding; + __u64 pat; + __u64 cap; + __u64 type; +}; + +struct kvmi_xsave_info { + __u16 vcpu; + __u16 padding[3]; +}; + +struct kvmi_xsave_info_reply { + __s32 err; + __u32 size; +}; + +struct kvmi_get_page_access { + __u16 vcpu; + __u16 padding[3]; + __u64 gpa; +}; + +struct kvmi_get_page_access_reply { + __s32 err; + __u32 access; +}; + +struct kvmi_set_page_access { + __u16 vcpu; + __u16 padding; + __u32 access; + __u64 gpa; +}; + +struct kvmi_page_fault { + __u16 vcpu; + __u16 padding; + __u32 error; + __u64 gva; +}; + +struct kvmi_inject_breakpoint { + __u16 vcpu; + __u16 padding[3]; +}; + +struct kvmi_map_physical_page_to_guest { + __u64 gpa_src; + __u64 gfn_dest; +}; + +struct kvmi_unmap_physical_page_from_guest { + __u64 gfn_dest; +}; + +struct kvmi_control_events { + __u16 vcpu; + __u16 padding; + __u32 events; +}; + +struct kvmi_cr_control { + __u8 enable; + __u8 padding[3]; + __u32 cr; +}; + +struct kvmi_msr_control { + __u8 enable; + __u8 padding[3]; + __u32 msr; +}; + +struct kvmi_event_x86 { + __u16 vcpu; + __u8 mode; + __u8 padding1; + __u32 event; + struct kvm_regs regs; + struct kvm_sregs sregs; + struct { + __u64 sysenter_cs; + __u64 sysenter_esp; + __u64 sysenter_eip; + __u64 efer; + __u64 star; + __u64 lstar; + } msrs; +}; + +struct kvmi_event_x86_reply { + struct kvm_regs regs; + __u32 actions; + __u32 padding; +}; + +struct kvmi_event_cr { + __u16 cr; + __u16 padding[3]; + __u64 old_value; + __u64 new_value; +}; + +struct kvmi_event_cr_reply { + __u64 new_val; +}; + +struct kvmi_event_msr { + __u32 msr; + __u32 padding; + __u64 old_value; + __u64 new_value; +}; + +struct kvmi_event_msr_reply { + __u64 new_val; +}; + +struct kvmi_event_xsetbv { + __u64 xcr0; +}; + +struct kvmi_event_breakpoint { + __u64 gpa; +}; + +struct kvmi_event_page_fault { + __u64 gva; + __u64 gpa; + __u32 mode; + __u32 padding; +}; + +struct kvmi_event_page_fault_reply { + __u32 ctx_size; + __u8 ctx_data[256]; +}; + +struct kvmi_event_trap { + __u32 vector; + __u32 type; + __u32 err; + __u32 padding; + __u64 cr2; +}; + +#endif /* __KVMI_H_INCLUDED__ */ ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-07 14:34 ` [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for " Adalbert Lazar @ 2017-07-07 16:52 ` Paolo Bonzini 2017-07-10 15:32 ` alazar ` (2 more replies) 0 siblings, 3 replies; 50+ messages in thread From: Paolo Bonzini @ 2017-07-07 16:52 UTC (permalink / raw) To: Adalbert Lazar, kvm Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu I have a lot of comments, but it's a very good start. Thanks for writing down the specification. On 07/07/2017 16:34, Adalbert Lazar wrote: > +1. KVMI_GET_VERSION > +------------------- > + > +:Architectures: all > +:Versions: >= 1 > +:Parameters: {} > +:Returns: ↴ What is this character? :) > +:: > + > + struct kvmi_get_version_reply { > + __s32 err; > + __u32 version; > + }; > + > +Returns the introspection API version (the KVMI_VERSION constant) and the > +error code (zero). In case of an unlikely error, the version will have an > +undefined value. Would it make sense to return a few more information fields, for example the set of supported KVMI_CONTROL_EVENTS events? > +2. KVMI_GET_GUEST_INFO > +---------------------- > + > +:Architectures: all > +:Versions: >= 1 > +:Parameters: {} > +:Returns: ↴ > + > +:: > + > + struct kvmi_get_guest_info_reply { > + __s32 err; > + __u16 vcpu_count; > + __u16 padding; > + __u64 tsc_speed; > + }; > + > +Returns the number of online vcpus, and the TSC frequency in HZ, if supported > +by the architecture (otherwise is 0). Can the TSC frequency be specified only if the guest is using TSC scaling? Defining the TSC frequency on older hosts is a bit tricky. 0 would be the host. Maybe define the second padding to be __u16 arch; (0 = x86) followed by an arch-specific payload. > +3. KVMI_PAUSE_GUEST > +------------------- > + > +:Architectures: all > +:Versions: >= 1 > +:Parameters: {} > +:Returns: ↴ > + > +:: > + > + struct kvmi_error_code { > + __s32 err; > + __u32 padding; > + }; > + > +This command will pause all vcpus threads, by getting them out of guest mode > +and put them in the "waiting introspection commands" state. Pausing all threads synchronously is a tricky concept. The main issue is that the guest could be paused at the userspace level. Then KVM would not be running (userspace is stuck in pthread_cond_wait, typically) and could not acknowledge the pause. Worse, KVM is not able to distinguish userspace that has paused the VM from userspace that is doing MMIO or userspace that has a bug and hung somewhere. And even worse, there are cases where userspace wants to modify registers while doing port I/O (the awful VMware RPCI port). So I'd rather avoid this. > +4. KVMI_UNPAUSE_GUEST > +--------------------- > + > +:Architectures: all > +:Versions: >= 1 > +:Parameters: {} > +:Returns: ↴ > + > +:: > + > + struct kvmi_error_code { > + __s32 err; > + __u32 padding; > + }; > + > +Resume the vcpu threads, or at least get them out of "waiting introspection > +commands" state. And this as well, of course. > +5. KVMI_SHUTDOWN_GUEST > +---------------------- > + > +:Architectures: all > +:Versions: >= 1 > +:Parameters: {} > +:Returns: ↴ > + > +:: > + > + struct kvmi_error_code { > + __s32 err; > + __u32 padding; > + }; > + > +Ungracefully shutdown the guest. Note that all KVM can do here is pass down a request asynchronously to userspace, failing further KVM_RUN ioctls. I'm suggesting below an alternative action. > +6. KVMI_GET_REGISTERS > +--------------------- > + > +:Architectures: x86 (could be all, but with different input/output) > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_get_registers_x86 { > + __u16 vcpu; > + __u16 nmsrs; > + __u32 msrs_idx[0]; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_get_registers_x86_reply { > + __s32 err; > + __u32 mode; > + struct kvm_regs regs; > + struct kvm_sregs sregs; > + struct kvm_msrs msrs; > + }; > + > +For the given vcpu_id and the nmsrs sized array of MSRs registers, returns > +the vCPU mode (in bytes: 2, 4 or 8), the general purpose registers, > +the special registers and the requested set of MSR-s. > + > +7. KVMI_SET_REGISTERS > +--------------------- > + > +:Architectures: x86 (could be all, but with different input) > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_set_registers_x86 { > + __u16 vcpu; > + __u16 padding[3]; > + struct kvm_regs regs; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_error_code { > + __s32 err; > + __u32 padding; > + }; > + > +Sets the general purpose registers for the given vcpu_id. > + > +8. KVMI_GET_MTRR_TYPE > +--------------------- > + > +:Architectures: x86 > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_mtrr_type { > + __u64 gpa; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_mtrr_type_reply { > + __s32 err; > + __u32 type; > + }; > + > +Returns the guest memory type for a specific physical address. What is this used for? KVM ignores the guest MTRRs, so if possible I'd rather avoid it. > +9. KVMI_GET_MTRRS > +----------------- > + > +:Architectures: x86 > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_mtrrs { > + __u16 vcpu; > + __u16 padding[3]; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_mtrrs_reply { > + __s32 err; > + __u32 padding; > + __u64 pat; > + __u64 cap; > + __u64 type; > + }; > + > +Returns MSR_IA32_CR_PAT, MSR_MTRRcap and MSR_MTRRdefType for the specified > +vCPU. This could use KVM_GET_REGISTERS, couldn't it? > +10. KVMI_GET_XSAVE_INFO > +----------------------- > + > +:Architectures: x86 > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_xsave_info { > + __u16 vcpu; > + __u16 padding[3]; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_xsave_info_reply { > + __s32 err; > + __u32 size; > + }; > + > +Returns the xstate size for the specified vCPU. Could this be replaced by a generic CPUID accessor? > +11. KVMI_GET_PAGE_ACCESS > +------------------------ > + > +:Architectures: all > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_get_page_access { > + __u16 vcpu; > + __u16 padding[3]; > + __u64 gpa; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_get_page_access_reply { > + __s32 err; > + __u32 access; > + }; > + > +Returns the spte flags (rwx - present, write & user) for the specified > +vCPU and guest physical address. > + > +12. KVMI_SET_PAGE_ACCESS > +------------------------ > + > +:Architectures: all > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_set_page_access { > + __u16 vcpu; > + __u16 padding; > + __u32 access; > + __u64 gpa; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_error_code { > + __s32 err; > + __u32 padding; > + }; > + > +Sets the spte flags (rwx - present, write & user) - access - for the specified > +vCPU and guest physical address. rwx or pwu? I suppose RWX. Maybe #define the constants in the documentation. Also, it should be noted here that the spte flags are ANDed with whatever is provided by userspace, because there could be readonly memslots, and that some access combinations could fail (--x) or will surely fail (-wx for example). > +13. KVMI_INJECT_PAGE_FAULT > +-------------------------- > + > +:Architectures: x86 > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_page_fault { > + __u16 vcpu; > + __u16 padding; > + __u32 error; error_code, I guess? Why not a generic inject exception message? > + __u64 gva; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_error_code { > + __s32 err; > + __u32 padding; > + }; > + > +Injects a vCPU page fault with the specified guest virtual address and > +error code. > + > +14. KVMI_INJECT_BREAKPOINT > +-------------------------- > + > +:Architectures: all > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_inject_breakpoint { > + __u16 vcpu; > + __u16 padding[3]; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_error_code { > + __s32 err; > + __u32 padding; > + }; > + > +Injects a breakpoint for the specified vCPU. This command is usually sent in > +response to an event and as such the proper GPR-s will be set with the reply. What is a "breakpoint" in this context? > +15. KVMI_MAP_PHYSICAL_PAGE_TO_GUEST > +----------------------------------- > + > +:Architectures: all > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_map_physical_page_to_guest { > + __u64 gpa_src; > + __u64 gfn_dest; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_error_code { > + __s32 err; > + __u32 padding; > + }; > + > +Maps a page from an introspected guest memory (gpa_src) to the guest running > +the introspection tool. 'gfn_dest' points to an anonymous, locked mapping one > +page in size. > + > +This command is used to "read" the introspected guest memory and potentially > +place patches (eg. INT3-s). Two problems here: 1) The gpa->hva mapping can change as the guest runs. I'd prefer a read/write API to map/unmap, where read replies and events provide a "cookie" (generation count). Write (including breakpoints? it's not clear to me who physically writes the 0xCC) then return an error if the incoming generation count doesn't match the current one. 2) There can be multiple address spaces, notably SMM and non-SMM. > +17. KVMI_CONTROL_EVENTS > +----------------------- > + > +:Architectures: all > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_control_events { > + __u16 vcpu; > + __u16 padding; > + __u32 events; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_error_code { > + __s32 err; > + __u32 padding; > + }; > + > +Enables/disables vCPU introspection events, by setting/clearing one or more > +of the following bits (see 'Events' below) : > + > + KVMI_EVENT_CR > + KVMI_EVENT_MSR > + KVMI_EVENT_XSETBV > + KVMI_EVENT_BREAKPOINT > + KVMI_EVENT_USER_CALL > + KVMI_EVENT_PAGE_FAULT > + KVMI_EVENT_TRAP > + > +Trying to enable unsupported events (~KVMI_KNOWN_EVENTS) by the current > +architecture would fail and -EINVAL will be returned. I would prefer the interface to allow enable (set bits), disable (clear bits) in addition to set (what you have here) the events. The return value could indicate the set of enabled events. > +19. KVMI_MSR_CONTROL > +-------------------- > + > +:Architectures: x86 > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_msr_control { > + __u8 enable; > + __u8 padding[3]; > + __u32 msr; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_error_code { > + __s32 err; > + __u32 padding; > + }; > + > +Enables/disables introspection for a specific MSR, and must be used > +in addition to KVMI_CONTROL_EVENTS with the KVMI_EVENT_MSR bit flag set. > + > +Currently, only MSRs within the following 3 ranges are supported. Trying > +to control any other register will return -EINVAL. :: > + > + 0 ... 0x00001fff > + 0x40000000 ... 0x40001fff > + 0xc0000000 ... 0xc0001fff Note that KVM custom MSRs fall in the range 0x4b564d00-0x4b564dff. > +Events > +------ > + > +All vcpu events are sent using the KVMI_EVENT_VCPU message id. No event will > +be sent unless enabled with a KVMI_CONTROL_EVENTS command. > + > +For x86, the message data starts with a common structure:: > + > + struct kvmi_event_x86 { > + __u16 vcpu; > + __u8 mode; > + __u8 padding1; > + __u32 event; > + struct kvm_regs regs; > + struct kvm_sregs sregs; > + struct { > + __u64 sysenter_cs; > + __u64 sysenter_esp; > + __u64 sysenter_eip; > + __u64 efer; > + __u64 star; > + __u64 lstar; cstar too, for AMD CPUs. > + } msrs; > + }; > + > +In order to help the introspection tool with the event analysis while > +avoiding unnecessary introspection commands, the message data holds some > +registers (kvm_regs, kvm_sregs and a couple of MSR-s) beside > +the vCPU id, its mode (in bytes) and the event (one of the flags set > +with the KVMI_CONTROL_EVENTS command). > + > +The replies to events also start with a common structure, having the > +KVMI_EVENT_VCPU_REPLY message id:: > + > + struct kvmi_event_x86_reply { > + struct kvm_regs regs; Put regs last? > + __u32 actions; > + __u32 padding; > + }; > + > +The 'actions' member holds one or more flags. For example, if > +KVMI_EVENT_ACTION_SET_REGS is set, the general purpose registers will > +be overwritten with the new values (regs) from introspector. > + > +Specific data can follow these common structures. > + > +1. KVMI_EVENT_CR > +---------------- > + > +:Architectures: x86 > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_event_x86; > + struct kvmi_event_cr { > + __u16 cr; > + __u16 padding[3]; > + __u64 old_value; > + __u64 new_value; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_event_x86_reply; > + struct kvmi_event_cr_reply { > + __u64 new_val; > + }; > + > +This event is sent when a CR register was modified and the introspection > +has already been enabled for this kind of event (KVMI_CONTROL_EVENTS) > +and for this specific register (KVMI_CR_CONTROL). > + > +kvmi_event_x86, the CR number, the old value and the new value are > +sent to the introspector, which can respond with one or more action flags: > + > + KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers > + using the values from introspector (regs) I think the action should be KVMI_EVENT_ACTION_SKIP. Setting registers can be done separately with a command. > + KVMI_EVENT_ACTION_ALLOW - allow the register modification with the > + value from introspector (new_val), otherwise deny the modification > + but allow the guest to proceed as if the register has been loaded > + with the desired value. Maybe add an extra action for exit to userspace (which would presumably shutdown the guest)? This would apply to all events. Also, a retry action can be useful in case e.g. the write cookie becomes stale. > +5. KVMI_EVENT_USER_CALL > +----------------------- Please rename this to KVMI_EVENT_HCALL or HYPERCALL or VMCALL. > + > +:Architectures: x86 > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_event_x86; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_event_x86_reply; > + > +This event is sent on a user hypercall and the introspection has already > +already been enabled for this kind of event (KVMI_CONTROL_EVENTS). > + > +kvmi_event_x86 is sent to the introspector, which can respond with the > +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host > +kernel to override the general purpose registers using the values from > +introspector (regs). Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall? Why is KVMI_EVENT_ACTION_ALLOW not allowed? As before, I'd prefer SKIP/RETRY/ALLOW/CRASH as the actions. > +6. KVMI_EVENT_PAGE_FAULT > +------------------------ > + > +:Architectures: x86 > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_event_x86; > + struct kvmi_event_page_fault { > + __u64 gva; > + __u64 gpa; > + __u32 mode; > + __u32 padding; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_event_x86_reply; > + struct kvmi_event_page_fault_reply { > + __u32 ctx_size; > + __u8 ctx_data[256]; > + }; > + > +This event is sent if a hypervisor page fault was encountered, the > +introspection has already enabled the reports for this kind of event > +(KVMI_CONTROL_EVENTS), and it was generated for a page for which the > +introspector has shown interest (ie. has previously touched it by > +adjusting the permissions). If the introspector sets permissions to r-- and you get a read page fault because the page is swapped out, should the introspector get the event? > +kvmi_event_x86, guest virtual address, guest physical address and > +the exit qualification (mode) are sent to the introspector, which > +can respond with one or more action flags: > + > + KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers > + using the values from introspector (regs) > + > + (KVMI_EVENT_ALLOW | KVMI_EVENT_NOEMU) - let the guest re-trigger > + the page fault "re-execute the instruction"? Maybe call it KVMI_EVENT_RETRY? > + (KVMI_EVENT_ALLOW | KVMI_EVENT_SET_CTX) - allow the page fault > + via emulation but with custom input (ctx_data, ctx_size). This is > + used to trick the guest software into believing it has read > + certain data. In practice it is used to hide the contents of certain > + memory areas This is tricky. The instruction could be, for example, a string read. I think there should be a separate event for I/O from a trapping page (so perhaps return KVMI_EVENT_ALLOW | KVMI_EVENT_TRAP_ACCESS). > + KVMI_EVENT_ALLOW - allow the page fault via emulation > > +If KVMI_EVENT_ALLOW is not set, it will fall back to the page fault handler > +which usually implies overwriting any spte page access changes made before. > +An introspection tool will always set this flag and prevent unwanted changes > +to memory by skipping the instruction. It is up to the tool to adjust the > +program counter in order to achieve this result. I don't like this. It should be what KVMI_EVENT_RETRY is for. But maybe I'm missing the difference between "0" and KVMI_EVENT_RETRY. Here, the actions would be SKIP/RETRY/ALLOW/CRASH, with an optional KVMI_EVENT_TRAP_ACCESS flag for "allow". > +7. KVMI_EVENT_TRAP > +------------------ > + > +:Architectures: x86 > +:Versions: >= 1 > +:Parameters: ↴ > + > +:: > + > + struct kvmi_event_x86; > + struct kvmi_event_trap { > + __u32 vector; > + __u32 type; > + __u32 err; error_code? > + __u32 padding; > + __u64 cr2; > + }; > + > +:Returns: ↴ > + > +:: > + > + struct kvmi_event_x86_reply; > + > +This event is sent if a trap will be delivered to the guest (page fault, > +breakpoint, etc.) and the introspection has already enabled the reports > +for this kind of event (KVMI_CONTROL_EVENTS). > + > +This is used to inform the introspector of all pending traps giving it > +a chance to determine if it should try again later in case a previous > +KVMI_INJECT_PAGE_FAULT/KVMI_INJECT_BREAKPOINT command has been overwritten > +by an interrupt picked up during guest reentry. > + > +kvmi_event_x86, exception/interrupt number (vector), exception/interrupt > +type, exception code (err) and CR2 are sent to the introspector, which can > +respond with the KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing > +the host kernel to override the general purpose registers using the values > +from introspector (regs). Here I think the actions could be RETRY/ALLOW/CRASH only, with "set regs" done as a separate command. Some x86 exceptions are faults, others are traps. Traps should not allow RETRY as an action. There should be an input telling the introspector if retrying is allowed. How are dr6/dr7 handled around breakpoints? Thanks, Paolo > diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h > new file mode 100644 > index 000000000000..54a2d8ebf649 > --- /dev/null > +++ b/include/uapi/linux/kvmi.h > @@ -0,0 +1,310 @@ > +/* > + * Copyright (C) 2017 Bitdefender S.R.L. > + * > + * The KVMI Library is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public License > + * as published by the Free Software Foundation; either version 2 > + * of the named License, or any later version. > + * > + * The KVMI Library is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License > + * along with the KVMI Library; if not, see <http://www.gnu.org/licenses/> > + */ > +#ifndef __KVMI_H_INCLUDED__ > +#define __KVMI_H_INCLUDED__ > + > +#include "asm/kvm.h" > +#include <linux/types.h> > + > +#define KVMI_VERSION 0x00000001 > + > +#define KVMI_EVENT_CR (1 << 1) /* control register was modified */ > +#define KVMI_EVENT_MSR (1 << 2) /* model specific reg. was modified */ > +#define KVMI_EVENT_XSETBV (1 << 3) /* ext. control register was modified */ > +#define KVMI_EVENT_BREAKPOINT (1 << 4) /* breakpoint was reached */ > +#define KVMI_EVENT_USER_CALL (1 << 5) /* user hypercall */ > +#define KVMI_EVENT_PAGE_FAULT (1 << 6) /* hyp. page fault was encountered */ > +#define KVMI_EVENT_TRAP (1 << 7) /* trap was injected */ > + > +#define KVMI_KNOWN_EVENTS (KVMI_EVENT_CR | \ > + KVMI_EVENT_MSR | \ > + KVMI_EVENT_XSETBV | \ > + KVMI_EVENT_BREAKPOINT | \ > + KVMI_EVENT_USER_CALL | \ > + KVMI_EVENT_PAGE_FAULT | \ > + KVMI_EVENT_TRAP) > + > +#define KVMI_EVENT_ACTION_ALLOW (1 << 0) /* used in replies */ > +#define KVMI_EVENT_ACTION_SET_REGS (1 << 1) /* registers need to be written back */ > +#define KVMI_EVENT_ACTION_SET_CTX (1 << 2) /* set the emulation context */ > +#define KVMI_EVENT_ACTION_NOEMU (1 << 3) /* return to guest without emulation */ > + > +#define KVMI_GET_VERSION 1 > +#define KVMI_GET_GUESTS 2 /* TODO: remove me */ > +#define KVMI_GET_GUEST_INFO 3 > +#define KVMI_PAUSE_GUEST 4 > +#define KVMI_UNPAUSE_GUEST 5 > +#define KVMI_GET_REGISTERS 6 > +#define KVMI_SET_REGISTERS 7 > +#define KVMI_SHUTDOWN_GUEST 8 > +#define KVMI_GET_MTRR_TYPE 9 > +#define KVMI_GET_MTRRS 10 > +#define KVMI_GET_XSAVE_INFO 11 > +#define KVMI_GET_PAGE_ACCESS 12 > +#define KVMI_SET_PAGE_ACCESS 13 > +#define KVMI_INJECT_PAGE_FAULT 14 > +#define KVMI_READ_PHYSICAL 15 /* TODO: remove me */ > +#define KVMI_WRITE_PHYSICAL 16 /* TODO: remove me */ > +#define KVMI_MAP_PHYSICAL_PAGE_TO_GUEST 17 > +#define KVMI_UNMAP_PHYSICAL_PAGE_FROM_GUEST 18 > +#define KVMI_CONTROL_EVENTS 19 > +#define KVMI_CR_CONTROL 20 > +#define KVMI_MSR_CONTROL 21 > +#define KVMI_INJECT_BREAKPOINT 22 > +#define KVMI_EVENT_GUEST_ON 23 /* TODO: remove me */ > +#define KVMI_EVENT_GUEST_OFF 24 /* TODO: remove me */ > +#define KVMI_EVENT_VCPU 25 > +#define KVMI_EVENT_VCPU_REPLY 26 Errnos values are not portable, I'd prefer to have them defined explicitly in the header. Paolo > +/* TODO: remove me */ > +struct kvmi_guest { > + __u8 uuid[16]; > +}; > + > +/* TODO: remove me */ > +struct kvmi_guests { > + __u32 size; /* in: the size of the entire structure */ > + struct kvmi_guest guests[1]; > +}; > + > +/* TODO: remove me */ > +struct kvmi_read_physical { > + __u64 gpa; > + __u64 size; > +}; > + > +/* TODO: remove me */ > +struct kvmi_read_physical_reply { > + __s32 err; > + __u8 bytes[0]; > +}; > + > +/* TODO: remove me */ > +struct kvmi_write_physical { > + __u64 gpa; > + __u64 size; > + __u8 bytes[0]; > +}; > + > + > +struct kvmi_socket_hdr { > + __u16 msg_id; > + __u16 size; > + __u32 seq; > +}; > + > +struct kvmi_error_code { > + __s32 err; > + __u32 padding; > +}; > + > +struct kvmi_get_version_reply { > + __s32 err; > + __u32 version; > +}; > + > +struct kvmi_get_guest_info_reply { > + __s32 err; > + __u16 vcpu_count; > + __u16 padding; > + __u64 tsc_speed; > +}; > + > +struct kvmi_get_registers_x86 { > + __u16 vcpu; > + __u16 nmsrs; > + __u32 msrs_idx[0]; > +}; > + > +struct kvmi_get_registers_x86_reply { > + __s32 err; > + __u32 mode; > + struct kvm_regs regs; > + struct kvm_sregs sregs; > + struct kvm_msrs msrs; > +}; > + > +struct kvmi_set_registers_x86 { > + __u16 vcpu; > + __u16 padding[3]; > + struct kvm_regs regs; > +}; > + > +struct kvmi_mtrr_type { > + __u64 gpa; > +}; > + > +struct kvmi_mtrr_type_reply { > + __s32 err; > + __u32 padding; > + __u64 type; > +}; > + > +struct kvmi_mtrrs { > + __u16 vcpu; > + __u16 padding[3]; > +}; > + > +struct kvmi_mtrrs_reply { > + __s32 err; > + __u32 padding; > + __u64 pat; > + __u64 cap; > + __u64 type; > +}; > + > +struct kvmi_xsave_info { > + __u16 vcpu; > + __u16 padding[3]; > +}; > + > +struct kvmi_xsave_info_reply { > + __s32 err; > + __u32 size; > +}; > + > +struct kvmi_get_page_access { > + __u16 vcpu; > + __u16 padding[3]; > + __u64 gpa; > +}; > + > +struct kvmi_get_page_access_reply { > + __s32 err; > + __u32 access; > +}; > + > +struct kvmi_set_page_access { > + __u16 vcpu; > + __u16 padding; > + __u32 access; > + __u64 gpa; > +}; > + > +struct kvmi_page_fault { > + __u16 vcpu; > + __u16 padding; > + __u32 error; > + __u64 gva; > +}; > + > +struct kvmi_inject_breakpoint { > + __u16 vcpu; > + __u16 padding[3]; > +}; > + > +struct kvmi_map_physical_page_to_guest { > + __u64 gpa_src; > + __u64 gfn_dest; > +}; > + > +struct kvmi_unmap_physical_page_from_guest { > + __u64 gfn_dest; > +}; > + > +struct kvmi_control_events { > + __u16 vcpu; > + __u16 padding; > + __u32 events; > +}; > + > +struct kvmi_cr_control { > + __u8 enable; > + __u8 padding[3]; > + __u32 cr; > +}; > + > +struct kvmi_msr_control { > + __u8 enable; > + __u8 padding[3]; > + __u32 msr; > +}; > + > +struct kvmi_event_x86 { > + __u16 vcpu; > + __u8 mode; > + __u8 padding1; > + __u32 event; > + struct kvm_regs regs; > + struct kvm_sregs sregs; > + struct { > + __u64 sysenter_cs; > + __u64 sysenter_esp; > + __u64 sysenter_eip; > + __u64 efer; > + __u64 star; > + __u64 lstar; > + } msrs; > +}; > + > +struct kvmi_event_x86_reply { > + struct kvm_regs regs; > + __u32 actions; > + __u32 padding; > +}; > + > +struct kvmi_event_cr { > + __u16 cr; > + __u16 padding[3]; > + __u64 old_value; > + __u64 new_value; > +}; > + > +struct kvmi_event_cr_reply { > + __u64 new_val; > +}; > + > +struct kvmi_event_msr { > + __u32 msr; > + __u32 padding; > + __u64 old_value; > + __u64 new_value; > +}; > + > +struct kvmi_event_msr_reply { > + __u64 new_val; > +}; > + > +struct kvmi_event_xsetbv { > + __u64 xcr0; > +}; > + > +struct kvmi_event_breakpoint { > + __u64 gpa; > +}; > + > +struct kvmi_event_page_fault { > + __u64 gva; > + __u64 gpa; > + __u32 mode; > + __u32 padding; > +}; > + > +struct kvmi_event_page_fault_reply { > + __u32 ctx_size; > + __u8 ctx_data[256]; > +}; > + > +struct kvmi_event_trap { > + __u32 vector; > + __u32 type; > + __u32 err; > + __u32 padding; > + __u64 cr2; > +}; > + > +#endif /* __KVMI_H_INCLUDED__ */ ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-07 16:52 ` Paolo Bonzini @ 2017-07-10 15:32 ` alazar 2017-07-10 17:03 ` Paolo Bonzini 2017-07-13 8:36 ` Mihai Donțu 2017-07-27 17:06 ` Mihai Donțu 2 siblings, 1 reply; 50+ messages in thread From: alazar @ 2017-07-10 15:32 UTC (permalink / raw) To: Paolo Bonzini, kvm Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu Thanks for your review, Paolo. Below are some answers. Will have to chew on the others. On Fri, 7 Jul 2017 18:52:45 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote: > On 07/07/2017 16:34, Adalbert Lazar wrote: > > +:Returns: ↴ > > What is this character? :) I thought using a nice right-down arrow instead of a boring/inexpressive ASCII char in order to avoid trailing spaces will be cool. At least it was fun :) > > + struct kvmi_get_version_reply { > > + __s32 err; > > + __u32 version; > > + }; > > + > > +Returns the introspection API version (the KVMI_VERSION constant) and the > > +error code (zero). In case of an unlikely error, the version will have an > > +undefined value. > > Would it make sense to return a few more information fields, for example > the set of supported KVMI_CONTROL_EVENTS events? Sure. > > + struct kvmi_get_guest_info_reply { > > + __s32 err; > > + __u16 vcpu_count; > > + __u16 padding; > > + __u64 tsc_speed; > > + }; > > + > > +Returns the number of online vcpus, and the TSC frequency in HZ, if supported > > +by the architecture (otherwise is 0). > > Can the TSC frequency be specified only if the guest is using TSC > scaling? Defining the TSC frequency on older hosts is a bit tricky. 0 > would be the host. > > Maybe define the second padding to be > > __u16 arch; > > (0 = x86) followed by an arch-specific payload. Sure. > > +5. KVMI_SHUTDOWN_GUEST > > +---------------------- > > + > > +Ungracefully shutdown the guest. > > Note that all KVM can do here is pass down a request asynchronously to > userspace, failing further KVM_RUN ioctls. I'm suggesting below an > alternative action. We were thinking about sending an "ungraceful" TERM signal :D Adding another reponse (exit-to-userspace-with-shutdown) on events, as you suggested below, makes sense. > > +12. KVMI_SET_PAGE_ACCESS > > +------------------------ > > + > > +Sets the spte flags (rwx - present, write & user) - access - for the specified > > +vCPU and guest physical address. > > rwx or pwu? I suppose RWX. Maybe #define the constants in the > documentation. RWX. Will do. > > +17. KVMI_CONTROL_EVENTS > > +----------------------- > > + > > +:Architectures: all > > +:Versions: >= 1 > > +:Parameters: ↴ > > + > > +:: > > + > > + struct kvmi_control_events { > > + __u16 vcpu; > > + __u16 padding; > > + __u32 events; > > + }; > > + > > +:Returns: ↴ > > + > > +:: > > + > > + struct kvmi_error_code { > > + __s32 err; > > + __u32 padding; > > + }; > > + > > +Enables/disables vCPU introspection events, by setting/clearing one or more > > +of the following bits (see 'Events' below) : > > + > > + KVMI_EVENT_CR > > + KVMI_EVENT_MSR > > + KVMI_EVENT_XSETBV > > + KVMI_EVENT_BREAKPOINT > > + KVMI_EVENT_USER_CALL > > + KVMI_EVENT_PAGE_FAULT > > + KVMI_EVENT_TRAP > > + > > +Trying to enable unsupported events (~KVMI_KNOWN_EVENTS) by the current > > +architecture would fail and -EINVAL will be returned. > > I would prefer the interface to allow enable (set bits), disable (clear > bits) in addition to set (what you have here) the events. Calling KVMI_CONTROL_EVENTS with event=KVMI_EVENT_CR|KVMI_EVENT_MSR will enable CR and MSR events and disable all the other events. It functions like a set and clear, in the same time. It will be better to have KVMI_ENABLE_CONTROL_EVENTS and KVMI_DISABLE_CONTROL_EVENTS instead? > The return value could indicate the set of enabled events. Indeed. > > +Events > > +------ > > + > > +All vcpu events are sent using the KVMI_EVENT_VCPU message id. No event will > > +be sent unless enabled with a KVMI_CONTROL_EVENTS command. > > + > > +For x86, the message data starts with a common structure:: > > + > > + struct kvmi_event_x86 { > > + __u16 vcpu; > > + __u8 mode; > > + __u8 padding1; > > + __u32 event; > > + struct kvm_regs regs; > > + struct kvm_sregs sregs; > > + struct { > > + __u64 sysenter_cs; > > + __u64 sysenter_esp; > > + __u64 sysenter_eip; > > + __u64 efer; > > + __u64 star; > > + __u64 lstar; > > cstar too, for AMD CPUs. Thanks. Will do. > > + } msrs; > > + }; > > + > > +In order to help the introspection tool with the event analysis while > > +avoiding unnecessary introspection commands, the message data holds some > > +registers (kvm_regs, kvm_sregs and a couple of MSR-s) beside > > +the vCPU id, its mode (in bytes) and the event (one of the flags set > > +with the KVMI_CONTROL_EVENTS command). > > + > > +The replies to events also start with a common structure, having the > > +KVMI_EVENT_VCPU_REPLY message id:: > > + > > + struct kvmi_event_x86_reply { > > + struct kvm_regs regs; > > Put regs last? > > > + __u32 actions; > > + __u32 padding; > > + }; Right. > > +5. KVMI_EVENT_USER_CALL > > +----------------------- > > Please rename this to KVMI_EVENT_HCALL or HYPERCALL or VMCALL. Sure. > > + > > +:Architectures: x86 > > +:Versions: >= 1 > > +:Parameters: ↴ > > + > > +:: > > + > > + struct kvmi_event_x86; > > + > > +:Returns: ↴ > > + > > +:: > > + > > + struct kvmi_event_x86_reply; > > + > > +This event is sent on a user hypercall and the introspection has already > > +already been enabled for this kind of event (KVMI_CONTROL_EVENTS). > > + > > +kvmi_event_x86 is sent to the introspector, which can respond with the > > +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host > > +kernel to override the general purpose registers using the values from > > +introspector (regs). > > Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall? Why is > KVMI_EVENT_ACTION_ALLOW not allowed? As before, I'd prefer > SKIP/RETRY/ALLOW/CRASH as the actions. We've used this as way for guest code to communicate with the introspector. KVMI_EVENT_ACTION_ALLOW is implied here. > > --- /dev/null > > +++ b/include/uapi/linux/kvmi.h ... > > Errnos values are not portable, I'd prefer to have them defined > explicitly in the header. We were thinking that the introspection tool will take care of this, as with the endianness. Surely, it will be much more clear to have the errnos defined here. Thanks again, Adalbert ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-10 15:32 ` alazar @ 2017-07-10 17:03 ` Paolo Bonzini 2017-07-11 16:48 ` Adalbert Lazar 0 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-07-10 17:03 UTC (permalink / raw) To: alazar, kvm Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu On 10/07/2017 17:32, alazar@bitdefender.com wrote: > Thanks for your review, Paolo. > Below are some answers. > Will have to chew on the others. I'm not sure what you think of removing KVMI_EVENT_ACTION_SET_REGS and more or less standardizing on actions SKIP/RETRY/ALLOW/CRASH. The main remaining issue seems to be map/unmap. Paolo > On Fri, 7 Jul 2017 18:52:45 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote: >> On 07/07/2017 16:34, Adalbert Lazar wrote: >>> +:Returns: ↴ >> >> What is this character? :) > > I thought using a nice right-down arrow instead of a boring/inexpressive > ASCII char in order to avoid trailing spaces will be cool. At least it > was fun :) > >>> + struct kvmi_get_version_reply { >>> + __s32 err; >>> + __u32 version; >>> + }; >>> + >>> +Returns the introspection API version (the KVMI_VERSION constant) and the >>> +error code (zero). In case of an unlikely error, the version will have an >>> +undefined value. >> >> Would it make sense to return a few more information fields, for example >> the set of supported KVMI_CONTROL_EVENTS events? > > Sure. > >>> + struct kvmi_get_guest_info_reply { >>> + __s32 err; >>> + __u16 vcpu_count; >>> + __u16 padding; >>> + __u64 tsc_speed; >>> + }; >>> + >>> +Returns the number of online vcpus, and the TSC frequency in HZ, if supported >>> +by the architecture (otherwise is 0). >> >> Can the TSC frequency be specified only if the guest is using TSC >> scaling? Defining the TSC frequency on older hosts is a bit tricky. 0 >> would be the host. >> >> Maybe define the second padding to be >> >> __u16 arch; >> >> (0 = x86) followed by an arch-specific payload. > > Sure. > >>> +5. KVMI_SHUTDOWN_GUEST >>> +---------------------- >>> + >>> +Ungracefully shutdown the guest. >> >> Note that all KVM can do here is pass down a request asynchronously to >> userspace, failing further KVM_RUN ioctls. I'm suggesting below an >> alternative action. > > We were thinking about sending an "ungraceful" TERM signal :D > > Adding another reponse (exit-to-userspace-with-shutdown) on events, > as you suggested below, makes sense. > >>> +12. KVMI_SET_PAGE_ACCESS >>> +------------------------ >>> + >>> +Sets the spte flags (rwx - present, write & user) - access - for the specified >>> +vCPU and guest physical address. >> >> rwx or pwu? I suppose RWX. Maybe #define the constants in the >> documentation. > > RWX. > Will do. > >>> +17. KVMI_CONTROL_EVENTS >>> +----------------------- >>> + >>> +:Architectures: all >>> +:Versions: >= 1 >>> +:Parameters: ↴ >>> + >>> +:: >>> + >>> + struct kvmi_control_events { >>> + __u16 vcpu; >>> + __u16 padding; >>> + __u32 events; >>> + }; >>> + >>> +:Returns: ↴ >>> + >>> +:: >>> + >>> + struct kvmi_error_code { >>> + __s32 err; >>> + __u32 padding; >>> + }; >>> + >>> +Enables/disables vCPU introspection events, by setting/clearing one or more >>> +of the following bits (see 'Events' below) : >>> + >>> + KVMI_EVENT_CR >>> + KVMI_EVENT_MSR >>> + KVMI_EVENT_XSETBV >>> + KVMI_EVENT_BREAKPOINT >>> + KVMI_EVENT_USER_CALL >>> + KVMI_EVENT_PAGE_FAULT >>> + KVMI_EVENT_TRAP >>> + >>> +Trying to enable unsupported events (~KVMI_KNOWN_EVENTS) by the current >>> +architecture would fail and -EINVAL will be returned. >> >> I would prefer the interface to allow enable (set bits), disable (clear >> bits) in addition to set (what you have here) the events. > > Calling KVMI_CONTROL_EVENTS with event=KVMI_EVENT_CR|KVMI_EVENT_MSR will > enable CR and MSR events and disable all the other events. It functions like > a set and clear, in the same time. > > It will be better to have KVMI_ENABLE_CONTROL_EVENTS and > KVMI_DISABLE_CONTROL_EVENTS instead? > >> The return value could indicate the set of enabled events. > > Indeed. > >>> +Events >>> +------ >>> + >>> +All vcpu events are sent using the KVMI_EVENT_VCPU message id. No event will >>> +be sent unless enabled with a KVMI_CONTROL_EVENTS command. >>> + >>> +For x86, the message data starts with a common structure:: >>> + >>> + struct kvmi_event_x86 { >>> + __u16 vcpu; >>> + __u8 mode; >>> + __u8 padding1; >>> + __u32 event; >>> + struct kvm_regs regs; >>> + struct kvm_sregs sregs; >>> + struct { >>> + __u64 sysenter_cs; >>> + __u64 sysenter_esp; >>> + __u64 sysenter_eip; >>> + __u64 efer; >>> + __u64 star; >>> + __u64 lstar; >> >> cstar too, for AMD CPUs. > > Thanks. Will do. > >>> + } msrs; >>> + }; >>> + >>> +In order to help the introspection tool with the event analysis while >>> +avoiding unnecessary introspection commands, the message data holds some >>> +registers (kvm_regs, kvm_sregs and a couple of MSR-s) beside >>> +the vCPU id, its mode (in bytes) and the event (one of the flags set >>> +with the KVMI_CONTROL_EVENTS command). >>> + >>> +The replies to events also start with a common structure, having the >>> +KVMI_EVENT_VCPU_REPLY message id:: >>> + >>> + struct kvmi_event_x86_reply { >>> + struct kvm_regs regs; >> >> Put regs last? >> >>> + __u32 actions; >>> + __u32 padding; >>> + }; > > Right. > >>> +5. KVMI_EVENT_USER_CALL >>> +----------------------- >> >> Please rename this to KVMI_EVENT_HCALL or HYPERCALL or VMCALL. > > Sure. > >>> + >>> +:Architectures: x86 >>> +:Versions: >= 1 >>> +:Parameters: ↴ >>> + >>> +:: >>> + >>> + struct kvmi_event_x86; >>> + >>> +:Returns: ↴ >>> + >>> +:: >>> + >>> + struct kvmi_event_x86_reply; >>> + >>> +This event is sent on a user hypercall and the introspection has already >>> +already been enabled for this kind of event (KVMI_CONTROL_EVENTS). >>> + >>> +kvmi_event_x86 is sent to the introspector, which can respond with the >>> +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host >>> +kernel to override the general purpose registers using the values from >>> +introspector (regs). >> >> Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall? Why is >> KVMI_EVENT_ACTION_ALLOW not allowed? As before, I'd prefer >> SKIP/RETRY/ALLOW/CRASH as the actions. > > We've used this as way for guest code to communicate with the introspector. > KVMI_EVENT_ACTION_ALLOW is implied here. > >>> --- /dev/null >>> +++ b/include/uapi/linux/kvmi.h > ... >> >> Errnos values are not portable, I'd prefer to have them defined >> explicitly in the header. > > We were thinking that the introspection tool will take care of this, > as with the endianness. > > Surely, it will be much more clear to have the errnos defined here. > > Thanks again, > > Adalbert > ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-10 17:03 ` Paolo Bonzini @ 2017-07-11 16:48 ` Adalbert Lazar 2017-07-11 16:51 ` Paolo Bonzini 0 siblings, 1 reply; 50+ messages in thread From: Adalbert Lazar @ 2017-07-11 16:48 UTC (permalink / raw) To: Paolo Bonzini, kvm Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu On Mon, 10 Jul 2017 19:03:06 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote: > I'm not sure what you think of removing KVMI_EVENT_ACTION_SET_REGS and > more or less standardizing on actions SKIP/RETRY/ALLOW/CRASH. > > The main remaining issue seems to be map/unmap. Definitely, SKIP/RETRY/ALLOW/CRASH looks better, but SET_REGS helps performance wise. Maybe we could have it as an optional flag for ALLOW? Or at least for the hot paths? Summarily, on events, besides CRASH (vs SHUTDOWN cmd) and any other additional flag: * CR, MSR - ALLOW with untouched new_value will let the guest continue, but with "new_value = old_value" is a "deny" * xsetbv - ALLOW is implied * breakpoint - SKIP means the BP is processed by the introspector, ALLOW means let the guest handle it * hypercall - ALLOW is implied * page_fault - ALLOW means emulate, RETRY means let guest re-trigger the PF, ALLOW with adjusted PC is a "skip" (done by the tool for the moment). Adalbert ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-11 16:48 ` Adalbert Lazar @ 2017-07-11 16:51 ` Paolo Bonzini 2017-07-13 5:57 ` Mihai Donțu 0 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-07-11 16:51 UTC (permalink / raw) To: Adalbert Lazar, kvm Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu On 11/07/2017 18:48, Adalbert Lazar wrote: > On Mon, 10 Jul 2017 19:03:06 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote: >> I'm not sure what you think of removing KVMI_EVENT_ACTION_SET_REGS and >> more or less standardizing on actions SKIP/RETRY/ALLOW/CRASH. >> >> The main remaining issue seems to be map/unmap. > > Definitely, SKIP/RETRY/ALLOW/CRASH looks better, but SET_REGS helps > performance wise. Maybe we could have it as an optional flag for ALLOW? Actually it makes more sense for SKIP, I think, where the introspector is actually doing emulation? But why is KVMI_SET_REGS slower than a set regs command followed by an action? > Or at least for the hot paths? > > Summarily, on events, besides CRASH (vs SHUTDOWN cmd) and any other > additional flag: > > * CR, MSR - ALLOW with untouched new_value will let the guest continue, > but with "new_value = old_value" is a "deny" > > * xsetbv - ALLOW is implied > > * breakpoint - SKIP means the BP is processed by the introspector, > ALLOW means let the guest handle it > > * hypercall - ALLOW is implied > > * page_fault - ALLOW means emulate, RETRY means let guest re-trigger the PF, > ALLOW with adjusted PC is a "skip" (done by the tool > for the moment). This is more complicated than necessary. I would just make it simple: - SKIP, adjust RIP to point to the next instruction and enter the guest - RETRY, enter the guest - ALLOW, emulate the instruction with information coming from the response packet - CRASH, self-explanatory Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-11 16:51 ` Paolo Bonzini @ 2017-07-13 5:57 ` Mihai Donțu 2017-07-13 7:32 ` Paolo Bonzini 0 siblings, 1 reply; 50+ messages in thread From: Mihai Donțu @ 2017-07-13 5:57 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On Tue, 2017-07-11 at 18:51 +0200, Paolo Bonzini wrote: > On 11/07/2017 18:48, Adalbert Lazar wrote: > > On Mon, 10 Jul 2017 19:03:06 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote: > > > I'm not sure what you think of removing KVMI_EVENT_ACTION_SET_REGS and > > > more or less standardizing on actions SKIP/RETRY/ALLOW/CRASH. > > > > > > The main remaining issue seems to be map/unmap. > > > > Definitely, SKIP/RETRY/ALLOW/CRASH looks better, but SET_REGS helps > > performance wise. Maybe we could have it as an optional flag for ALLOW? > > Actually it makes more sense for SKIP, I think, where the introspector > is actually doing emulation? I'm afraid I don't undestand your question, however we were looking at using KVM's x86 emulator rather than putting together our own, as such software might be fun to write but they take a very long time to get right. I'd argue that KVM's emulator has already seen a lot of coverage. In the future we are looking at maybe moving away from it on Intel-s, by way of VMFUNC and #VE. > But why is KVMI_SET_REGS slower than a set regs command followed by an > action? To be honest, we just looked at the Xen implementation which gates writing back the registers to VMCS on them actually having been changed. I just figured the less VMWRITE-s, the better. I'm also looking over some older stats that show the registers being written back quite often. Allow me 1-2 days to gather new ones and followup on this thread. > > Or at least for the hot paths? > > > > Summarily, on events, besides CRASH (vs SHUTDOWN cmd) and any other > > additional flag: > > > > * CR, MSR - ALLOW with untouched new_value will let the guest continue, > > but with "new_value = old_value" is a "deny" > > > > * xsetbv - ALLOW is implied > > > > * breakpoint - SKIP means the BP is processed by the introspector, > > ALLOW means let the guest handle it > > > > * hypercall - ALLOW is implied > > > > * page_fault - ALLOW means emulate, RETRY means let guest re-trigger the PF, > > ALLOW with adjusted PC is a "skip" (done by the tool > > for the moment). > > This is more complicated than necessary. I would just make it simple: > > - SKIP, adjust RIP to point to the next instruction and enter the guest > > - RETRY, enter the guest > > - ALLOW, emulate the instruction with information coming from the > response packet > > - CRASH, self-explanatory I see no major problem with your version. The reason we removed SKIP from the #PF description is that the RIP adjustment is done by the introspection tool after it decodes the instruction and computes its opcode length. So when the event response reaches KVM, it all looks like ALLOW was requested as the host-side code would be oblivous to the RIP change. -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-13 5:57 ` Mihai Donțu @ 2017-07-13 7:32 ` Paolo Bonzini 2017-07-18 11:51 ` Mihai Donțu 0 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-07-13 7:32 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On 13/07/2017 07:57, Mihai Donțu wrote: >> Actually it makes more sense for SKIP, I think, where the introspector >> is actually doing emulation? > > I'm afraid I don't undestand your question, however we were looking at > using KVM's x86 emulator rather than putting together our own, as such > software might be fun to write but they take a very long time to get > right. I'd argue that KVM's emulator has already seen a lot of > coverage. Of course! But there could be some special cases (e.g. hypercalls) where you do emulation on your own. In that case, KVMI_SET_REGS + SKIP is the right thing to do. > In the future we are looking at maybe moving away from it on Intel-s, > by way of VMFUNC and #VE. > >> But why is KVMI_SET_REGS slower than a set regs command followed by an >> action? > To be honest, we just looked at the Xen implementation which gates > writing back the registers to VMCS on them actually having been > changed. That would be possible on KVMI too. Just don't do the KVMI_SET_REGS unless the registers have changed. Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-13 7:32 ` Paolo Bonzini @ 2017-07-18 11:51 ` Mihai Donțu 2017-07-18 12:02 ` Mihai Donțu 2017-07-23 13:02 ` Paolo Bonzini 0 siblings, 2 replies; 50+ messages in thread From: Mihai Donțu @ 2017-07-18 11:51 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On Thu, 2017-07-13 at 09:32 +0200, Paolo Bonzini wrote: > On 13/07/2017 07:57, Mihai Donțu wrote: > > > Actually it makes more sense for SKIP, I think, where the introspector > > > is actually doing emulation? > > > > I'm afraid I don't undestand your question, however we were looking at > > using KVM's x86 emulator rather than putting together our own, as such > > software might be fun to write but they take a very long time to get > > right. I'd argue that KVM's emulator has already seen a lot of > > coverage. > > Of course! But there could be some special cases (e.g. hypercalls) > where you do emulation on your own. In that case, KVMI_SET_REGS + SKIP > is the right thing to do. I think I finally understand what you're saying. That SKIP would tell the introspection subsystem to just write back the registers and enter the guest, no in-host emulation needed. So, to reiterate, the possible actions would be: * SKIP - re-enter the guest (the introspection tool has adjusted all registers) * RETRY - re-enter the guest * ALLOW - use the emulator * CRASH - kill the guest process It seems that SKIP requires a variant of KVMI_SET_REGS (_FULL?) that sets all registers that might have been affected by the emulation (control, MSR-s, MMX/SSE/AVX). I guess there can be an usecase for that. It also looks like its identical with KVMI_SET_REGS_FULL + RETRY. > > In the future we are looking at maybe moving away from it on Intel-s, > > by way of VMFUNC and #VE. > > > > > But why is KVMI_SET_REGS slower than a set regs command followed by an > > > action? > > > > To be honest, we just looked at the Xen implementation which gates > > writing back the registers to VMCS on them actually having been > > changed. > > That would be possible on KVMI too. Just don't do the KVMI_SET_REGS > unless the registers have changed. -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-18 11:51 ` Mihai Donțu @ 2017-07-18 12:02 ` Mihai Donțu 2017-07-23 13:02 ` Paolo Bonzini 1 sibling, 0 replies; 50+ messages in thread From: Mihai Donțu @ 2017-07-18 12:02 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On Tue, 2017-07-18 at 14:51 +0300, Mihai Donțu wrote: > On Thu, 2017-07-13 at 09:32 +0200, Paolo Bonzini wrote: > > On 13/07/2017 07:57, Mihai Donțu wrote: > > > > Actually it makes more sense for SKIP, I think, where the introspector > > > > is actually doing emulation? > > > > > > I'm afraid I don't undestand your question, however we were looking at > > > using KVM's x86 emulator rather than putting together our own, as such > > > software might be fun to write but they take a very long time to get > > > right. I'd argue that KVM's emulator has already seen a lot of > > > coverage. > > > > Of course! But there could be some special cases (e.g. hypercalls) > > where you do emulation on your own. In that case, KVMI_SET_REGS + SKIP > > is the right thing to do. > > I think I finally understand what you're saying. That SKIP would tell > the introspection subsystem to just write back the registers and enter > the guest, no in-host emulation needed. So, to reiterate, the possible > actions would be: > > * SKIP - re-enter the guest (the introspection tool has adjusted all > registers) > * RETRY - re-enter the guest > * ALLOW - use the emulator > * CRASH - kill the guest process > > It seems that SKIP requires a variant of KVMI_SET_REGS (_FULL?) that > sets all registers that might have been affected by the emulation > (control, MSR-s, MMX/SSE/AVX). I guess there can be an usecase for > that. It also looks like its identical with KVMI_SET_REGS_FULL + RETRY. I mentioned in a previous email something about statistics. The following are a sort of that: eventCount: 13.2 eventsMemAccess: 12.4 eventsWriteCtrlReg: 0.8 partialContext: 0.2 partialCpu: 0.2 xcGetMemAccess: 116 xcMapPage: 191.4 xcSetMemAccessMulti: 1.6 eventCount: 4261.8 eventsBreakPoint: 4170.2 eventsMemAccess: 84.4 eventsWriteCtrlReg: 7.2 setRegisters: 4170.2 xcMapPage: 3.4 xcSetMemAccessMulti: 0.2 eventCount: 8098.6 eventsBreakPoint: 4783.4 eventsMemAccess: 3302.4 eventsWriteCtrlReg: 12.8 setRegisters: 4783.4 xcMapPage: 2.6 eventCount: 3495 eventsBreakPoint: 2631.6 eventsMemAccess: 846.6 eventsWriteCtrlReg: 16.8 setRegisters: 2631.6 xcMapPage: 2.6 xcSetMemAccessMulti: 0.4 eventCount: 538.4 eventsBreakPoint: 248 eventsMemAccess: 279.2 eventsWriteCtrlReg: 11.2 setRegisters: 248 xcMapPage: 1.2 eventCount: 4876.2 eventsBreakPoint: 3683.4 eventsMemAccess: 1172.8 eventsWriteCtrlReg: 20 setRegisters: 3683.4 xcMapPage: 4.8 eventCount: 5937.4 eventsBreakPoint: 4403.8 eventsMemAccess: 1507.2 eventsWriteCtrlReg: 26.4 setRegisters: 4403.8 xcMapPage: 4.8 eventCount: 9992.4 eventsBreakPoint: 7948.6 eventsMemAccess: 2019.8 eventsWriteCtrlReg: 24 setRegisters: 7948.6 xcMapPage: 1.4 xcSetMemAccessMulti: 5 eventCount: 5150.6 eventsBreakPoint: 2175 eventsMemAccess: 2902.8 eventsWriteCtrlReg: 72.8 setRegisters: 2175 xcGetMemAccess: 0.4 xcMapPage: 8 xcSetMemAccessMulti: 1 eventCount: 5422.2 eventsBreakPoint: 4362.2 eventsMemAccess: 1012.8 eventsWriteCtrlReg: 47.2 setRegisters: 4362.2 xcGetMemAccess: 2.8 xcMapPage: 10.2 xcSetMemAccessMulti: 3.4 eventCount: 1910.2 eventsBreakPoint: 1665.6 eventsMemAccess: 231.8 eventsWriteCtrlReg: 12.8 setRegisters: 1665.6 xcGetMemAccess: 0.2 xcMapPage: 2.2 xcSetMemAccessMulti: 0.6 eventCount: 1834.4 eventsBreakPoint: 1357.6 eventsMemAccess: 462.4 eventsWriteCtrlReg: 14.4 setRegisters: 1357.6 xcGetMemAccess: 0.2 xcMapPage: 4.8 xcSetMemAccessMulti: 0.4 eventCount: 6081.2 eventsBreakPoint: 4855.6 eventsMemAccess: 1208.8 eventsWriteCtrlReg: 16.8 setRegisters: 4855.6 xcMapPage: 4 eventCount: 1105.4 eventsBreakPoint: 855 eventsMemAccess: 226.4 eventsWriteCtrlReg: 24 setRegisters: 855 xcMapPage: 1.6 eventCount: 8362.8 eventsBreakPoint: 4409.2 eventsMemAccess: 3917.6 eventsWriteCtrlReg: 36 setRegisters: 4409.2 xcGetMemAccess: 117.4 xcMapPage: 9 xcSetMemAccessMulti: 254.6 eventCount: 2222.2 eventsBreakPoint: 32.2 eventsMemAccess: 2169.2 eventsWriteCtrlReg: 20.8 setRegisters: 32.2 xcGetMemAccess: 2.8 xcMapPage: 5 xcSetMemAccessMulti: 104.4 eventCount: 2889.2 eventsBreakPoint: 1447.8 eventsMemAccess: 1419.8 eventsWriteCtrlReg: 21.6 partialContext: 0.8 partialCpu: 0.8 setRegisters: 1447.8 xcGetMemAccess: 1.4 xcMapPage: 16.6 xcSetMemAccessMulti: 2.2 eventCount: 1698.8 eventsBreakPoint: 1031.8 eventsMemAccess: 655 eventsWriteCtrlReg: 12 setRegisters: 1031.8 xcGetMemAccess: 0.6 xcMapPage: 5 xcSetMemAccessMulti: 0.8 eventCount: 691.8 eventsBreakPoint: 250.4 eventsMemAccess: 435.8 eventsWriteCtrlReg: 5.6 setRegisters: 250.4 xcGetMemAccess: 0.4 xcMapPage: 4 xcSetMemAccessMulti: 0.4 eventCount: 756.8 eventsBreakPoint: 293.2 eventsMemAccess: 454.8 eventsWriteCtrlReg: 8.8 setRegisters: 293.2 xcGetMemAccess: 1.4 xcMapPage: 3.4 xcSetMemAccessMulti: 1.4 These represent the number events/calls per second during a normal introspection session (start Windows 10 x64, open Edge). The breakpoint events correspond to the various API calls invoked and, as it can be seen, for each of them we do a 'setRegisters'. We were hoping we can reduce the overhead by a bit by bundling KVMI_SET_REGISTERS with the event response. If I have not managed to convince you, I think we can go ahead and keep them separate, have an initial implementation and see some actual performance numbers. Should be no hassle. :-) > > > In the future we are looking at maybe moving away from it on Intel-s, > > > by way of VMFUNC and #VE. > > > > > > > But why is KVMI_SET_REGS slower than a set regs command followed by an > > > > action? > > > > > > To be honest, we just looked at the Xen implementation which gates > > > writing back the registers to VMCS on them actually having been > > > changed. > > > > That would be possible on KVMI too. Just don't do the KVMI_SET_REGS > > unless the registers have changed. -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-18 11:51 ` Mihai Donțu 2017-07-18 12:02 ` Mihai Donțu @ 2017-07-23 13:02 ` Paolo Bonzini 2017-07-26 17:04 ` Mihai Donțu 1 sibling, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-07-23 13:02 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On 18/07/2017 13:51, Mihai Donțu wrote: > On Thu, 2017-07-13 at 09:32 +0200, Paolo Bonzini wrote: >> On 13/07/2017 07:57, Mihai Donțu wrote: >>>> Actually it makes more sense for SKIP, I think, where the introspector >>>> is actually doing emulation? >>> >>> I'm afraid I don't undestand your question, however we were looking at >>> using KVM's x86 emulator rather than putting together our own, as such >>> software might be fun to write but they take a very long time to get >>> right. I'd argue that KVM's emulator has already seen a lot of >>> coverage. >> >> Of course! But there could be some special cases (e.g. hypercalls) >> where you do emulation on your own. In that case, KVMI_SET_REGS + SKIP >> is the right thing to do. > > I think I finally understand what you're saying. That SKIP would tell > the introspection subsystem to just write back the registers and enter > the guest, no in-host emulation needed. So, to reiterate, the possible > actions would be: > > * SKIP - re-enter the guest (the introspection tool has adjusted all > registers) > * RETRY - re-enter the guest > * ALLOW - use the emulator > * CRASH - kill the guest process > > It seems that SKIP requires a variant of KVMI_SET_REGS (_FULL?) that > sets all registers that might have been affected by the emulation > (control, MSR-s, MMX/SSE/AVX). I guess there can be an usecase for > that. It also looks like its identical with KVMI_SET_REGS_FULL + RETRY. One difference that comes to mind between SKIP and RETRY is that SKIP would inject a singlestep exception if TF was 1 on entry, and clear the interrupt shadow. RETRY would not do either of those. (The name for SKIP comes from the KVM function kvm_skip_emulated_instruction). > We were hoping we can > reduce the overhead by a bit by bundling KVMI_SET_REGISTERS with the > event response. > > If I have not managed to convince you, I think we can go ahead and keep > them separate, have an initial implementation and see some actual > performance numbers. Should be no hassle. I think you should implement transactions in the protocol, so effectively KVMI_SET_REGISTERS would be bundled with the event response anyway. Paolo >>> In the future we are looking at maybe moving away from it on Intel-s, >>> by way of VMFUNC and #VE. >>> >>>> But why is KVMI_SET_REGS slower than a set regs command followed by an >>>> action? >>> >>> To be honest, we just looked at the Xen implementation which gates >>> writing back the registers to VMCS on them actually having been >>> changed. >> >> That would be possible on KVMI too. Just don't do the KVMI_SET_REGS >> unless the registers have changed. > ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-23 13:02 ` Paolo Bonzini @ 2017-07-26 17:04 ` Mihai Donțu 2017-07-26 17:25 ` Tamas K Lengyel 2017-07-27 13:33 ` Paolo Bonzini 0 siblings, 2 replies; 50+ messages in thread From: Mihai Donțu @ 2017-07-26 17:04 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On Sun, 2017-07-23 at 15:02 +0200, Paolo Bonzini wrote: > On 18/07/2017 13:51, Mihai Donțu wrote: > > On Thu, 2017-07-13 at 09:32 +0200, Paolo Bonzini wrote: > > > On 13/07/2017 07:57, Mihai Donțu wrote: > > > > > Actually it makes more sense for SKIP, I think, where the introspector > > > > > is actually doing emulation? > > > > > > > > I'm afraid I don't undestand your question, however we were looking at > > > > using KVM's x86 emulator rather than putting together our own, as such > > > > software might be fun to write but they take a very long time to get > > > > right. I'd argue that KVM's emulator has already seen a lot of > > > > coverage. > > > > > > Of course! But there could be some special cases (e.g. hypercalls) > > > where you do emulation on your own. In that case, KVMI_SET_REGS + SKIP > > > is the right thing to do. > > > > I think I finally understand what you're saying. That SKIP would tell > > the introspection subsystem to just write back the registers and enter > > the guest, no in-host emulation needed. So, to reiterate, the possible > > actions would be: > > > > * SKIP - re-enter the guest (the introspection tool has adjusted all > > registers) > > * RETRY - re-enter the guest > > * ALLOW - use the emulator > > * CRASH - kill the guest process > > > > It seems that SKIP requires a variant of KVMI_SET_REGS (_FULL?) that > > sets all registers that might have been affected by the emulation > > (control, MSR-s, MMX/SSE/AVX). I guess there can be an usecase for > > that. It also looks like its identical with KVMI_SET_REGS_FULL + RETRY. > > One difference that comes to mind between SKIP and RETRY is that SKIP > would inject a singlestep exception if TF was 1 on entry, and clear the > interrupt shadow. RETRY would not do either of those. > > (The name for SKIP comes from the KVM function > kvm_skip_emulated_instruction). OK. > > We were hoping we can > > reduce the overhead by a bit by bundling KVMI_SET_REGISTERS with the > > event response. > > > > If I have not managed to convince you, I think we can go ahead and keep > > them separate, have an initial implementation and see some actual > > performance numbers. Should be no hassle. > > I think you should implement transactions in the protocol, so > effectively KVMI_SET_REGISTERS would be bundled with the event response > anyway. I see. Then maybe we should provide a way for commands to specify an event ID. If zero, then the command is satisfied using data straight from the vCPU (when making changes), otherwise a structure associated with the event will be used as cache for all get-s/set-s and apply them all in one go when the event reply arrives. This should work nicely since we read a good deal of the register set anyway when sending the event. > > > > In the future we are looking at maybe moving away from it on Intel-s, > > > > by way of VMFUNC and #VE. > > > > > > > > > But why is KVMI_SET_REGS slower than a set regs command followed by an > > > > > action? > > > > > > > > To be honest, we just looked at the Xen implementation which gates > > > > writing back the registers to VMCS on them actually having been > > > > changed. > > > > > > That would be possible on KVMI too. Just don't do the KVMI_SET_REGS > > > unless the registers have changed. -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-26 17:04 ` Mihai Donțu @ 2017-07-26 17:25 ` Tamas K Lengyel 2017-07-27 14:41 ` Mihai Donțu 2017-07-27 13:33 ` Paolo Bonzini 1 sibling, 1 reply; 50+ messages in thread From: Tamas K Lengyel @ 2017-07-26 17:25 UTC (permalink / raw) To: Mihai Donțu; +Cc: kvm Hi Mihai, this series was just brought to my attention (and this is the first message in my inbox I can reply to) and I'm very happy to see you guys working on this! I'm still reviewing the series and the only question I have at this point is whether you will be also adding support for setting the MTF through KVMI? Thanks, Tamas ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-26 17:25 ` Tamas K Lengyel @ 2017-07-27 14:41 ` Mihai Donțu 0 siblings, 0 replies; 50+ messages in thread From: Mihai Donțu @ 2017-07-27 14:41 UTC (permalink / raw) To: Tamas K Lengyel; +Cc: kvm Hi Tamas, On Wed, 2017-07-26 at 11:25 -0600, Tamas K Lengyel wrote: > this series was just brought to my attention (and this is the first > message in my inbox I can reply to) and I'm very happy to see you guys > working on this! I'm still reviewing the series and the only question > I have at this point is whether you will be also adding support for > setting the MTF through KVMI? Thanks for taking a look over this. Currently we have no code in place for guest single stepping, but since we're looking into making use of Intel's VMFUNC and #VE, I'm sure a need for a feature like this will arrise. Or _maybe_ we can make it part of the #PF handling code and have the introspection tool respond with ALLOW and an EPT view ID and have KVM automatically switch, single step, switch back and resume, if that makes sense for everybody. Regards, -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-26 17:04 ` Mihai Donțu 2017-07-26 17:25 ` Tamas K Lengyel @ 2017-07-27 13:33 ` Paolo Bonzini 2017-07-27 14:46 ` Mihai Donțu 1 sibling, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-07-27 13:33 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On 26/07/2017 19:04, Mihai Donțu wrote: >> I think you should implement transactions in the protocol, so >> effectively KVMI_SET_REGISTERS would be bundled with the event response >> anyway. > > I see. Then maybe we should provide a way for commands to specify an > event ID. If zero, then the command is satisfied using data straight > from the vCPU (when making changes), otherwise a structure associated > with the event will be used as cache for all get-s/set-s and apply them > all in one go when the event reply arrives. This should work nicely > since we read a good deal of the register set anyway when sending the > event. Yes, that is okay. Just a question, why would the event ID provide more information than just the vCPU id? How can there be more than one event active on the same vCPU? Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-27 13:33 ` Paolo Bonzini @ 2017-07-27 14:46 ` Mihai Donțu 0 siblings, 0 replies; 50+ messages in thread From: Mihai Donțu @ 2017-07-27 14:46 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On Thu, 2017-07-27 at 15:33 +0200, Paolo Bonzini wrote: > On 26/07/2017 19:04, Mihai Donțu wrote: > > > I think you should implement transactions in the protocol, so > > > effectively KVMI_SET_REGISTERS would be bundled with the event response > > > anyway. > > > > I see. Then maybe we should provide a way for commands to specify an > > event ID. If zero, then the command is satisfied using data straight > > from the vCPU (when making changes), otherwise a structure associated > > with the event will be used as cache for all get-s/set-s and apply them > > all in one go when the event reply arrives. This should work nicely > > since we read a good deal of the register set anyway when sending the > > event. > > Yes, that is okay. Just a question, why would the event ID provide more > information than just the vCPU id? How can there be more than one event > active on the same vCPU? You're right. There can't. We'll just add some logic that says if the vCPU has an event pending (or currently being handled), to use that cache to satisfy all commands. Thanks, -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-07 16:52 ` Paolo Bonzini 2017-07-10 15:32 ` alazar @ 2017-07-13 8:36 ` Mihai Donțu 2017-07-13 9:15 ` Paolo Bonzini 2017-07-27 17:06 ` Mihai Donțu 2 siblings, 1 reply; 50+ messages in thread From: Mihai Donțu @ 2017-07-13 8:36 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote: > > +3. KVMI_PAUSE_GUEST > > +------------------- > > + > > +:Architectures: all > > +:Versions: >= 1 > > +:Parameters: {} > > +:Returns: ↴ > > + > > +:: > > + > > + struct kvmi_error_code { > > + __s32 err; > > + __u32 padding; > > + }; > > + > > +This command will pause all vcpus threads, by getting them out of guest mode > > +and put them in the "waiting introspection commands" state. > > Pausing all threads synchronously is a tricky concept. The main issue > is that the guest could be paused at the userspace level. Then KVM > would not be running (userspace is stuck in pthread_cond_wait, > typically) and could not acknowledge the pause. > > Worse, KVM is not able to distinguish userspace that has paused the VM > from userspace that is doing MMIO or userspace that has a bug and hung > somewhere. And even worse, there are cases where userspace wants to > modify registers while doing port I/O (the awful VMware RPCI port). So > I'd rather avoid this. I should give more details here: we don't need to pause the vCPU-s in the sense widely understood but just prevent them from entering the guest for a short period of time. In our particular case, we need this when we start introspecting a VM that's already running. For this we kick the vCPU-s out of the guest so that our scan of the memory does not race with the guest kernel/applications. Another use case is when we inject applications into a running guest. We also kick the vCPU-s out while we atomically make changes to kernel specific structures. > > +8. KVMI_GET_MTRR_TYPE > > +--------------------- > > + > > +:Architectures: x86 > > +:Versions: >= 1 > > +:Parameters: ↴ > > + > > +:: > > + > > + struct kvmi_mtrr_type { > > + __u64 gpa; > > + }; > > + > > +:Returns: ↴ > > + > > +:: > > + > > + struct kvmi_mtrr_type_reply { > > + __s32 err; > > + __u32 type; > > + }; > > + > > +Returns the guest memory type for a specific physical address. > > What is this used for? KVM ignores the guest MTRRs, so if possible I'd > rather avoid it. We use it do identify cacheable memory which usually indicates device memory, something we don't want to touch. We are also looking into making use of the page attributes (PAT) or other PTE-bits instead of MTRR, but for the time being MTRR-s are still being relied upon. > > +9. KVMI_GET_MTRRS > > +----------------- > > + > > +:Architectures: x86 > > +:Versions: >= 1 > > +:Parameters: ↴ > > + > > +:: > > + > > + struct kvmi_mtrrs { > > + __u16 vcpu; > > + __u16 padding[3]; > > + }; > > + > > +:Returns: ↴ > > + > > +:: > > + > > + struct kvmi_mtrrs_reply { > > + __s32 err; > > + __u32 padding; > > + __u64 pat; > > + __u64 cap; > > + __u64 type; > > + }; > > + > > +Returns MSR_IA32_CR_PAT, MSR_MTRRcap and MSR_MTRRdefType for the specified > > +vCPU. > > This could use KVM_GET_REGISTERS, couldn't it? Yes, I belive it can be folded into that command. > > +14. KVMI_INJECT_BREAKPOINT > > +-------------------------- > > + > > +:Architectures: all > > +:Versions: >= 1 > > +:Parameters: ↴ > > + > > +:: > > + > > + struct kvmi_inject_breakpoint { > > + __u16 vcpu; > > + __u16 padding[3]; > > + }; > > + > > +:Returns: ↴ > > + > > +:: > > + > > + struct kvmi_error_code { > > + __s32 err; > > + __u32 padding; > > + }; > > + > > +Injects a breakpoint for the specified vCPU. This command is usually sent in > > +response to an event and as such the proper GPR-s will be set with the reply. > > What is a "breakpoint" in this context? A simple INT3. It's what most of our patches consist of. We keep track of them and handle the ones we own while pass (reinject) the ones used by the guest (debuggers or whatnot). -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-13 8:36 ` Mihai Donțu @ 2017-07-13 9:15 ` Paolo Bonzini 2017-07-27 16:23 ` Mihai Donțu 0 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-07-13 9:15 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On 13/07/2017 10:36, Mihai Donțu wrote: > On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote: >> Worse, KVM is not able to distinguish userspace that has paused the VM >> from userspace that is doing MMIO or userspace that has a bug and hung >> somewhere. And even worse, there are cases where userspace wants to >> modify registers while doing port I/O (the awful VMware RPCI port). So >> I'd rather avoid this. > > I should give more details here: we don't need to pause the vCPU-s in > the sense widely understood but just prevent them from entering the > guest for a short period of time. In our particular case, we need this > when we start introspecting a VM that's already running. For this we > kick the vCPU-s out of the guest so that our scan of the memory does > not race with the guest kernel/applications. > > Another use case is when we inject applications into a running guest. > We also kick the vCPU-s out while we atomically make changes to kernel > specific structures. This is not possible to do in KVM, because KVM doesn't control what happens to the memory outside KVM_RUN (and of course it doesn't control devices doing DMA). You need to talk to QEMU in order to do this. To do atomic changes to kernel specific structures, I would change the page tables to inaccessible instead, but that also doesn't protect them from devices doing DMA into them. Another issue: say a VM is waiting for a reply from the introspector, and the reply is delayed so the VM gets a signal and needs to get out to QEMU with EINTR. I don't think it is always possible to retry the instruction on the next KVM_RUN, because the introspector might be making partial changes. Add live migration to the mix if you want to make things even more complicated. :) I think we need a way to mark a set of commands for atomic application. That is, the structure of the command stream needs to be command 1 command 2 event reply 1 transaction end marker command 3 transaction end marker command 4 event reply 2 transaction end marker >>> +8. KVMI_GET_MTRR_TYPE >>> +--------------------- >> >> What is this used for? KVM ignores the guest MTRRs, so if possible I'd >> rather avoid it. > > We use it do identify cacheable memory which usually indicates device > memory, something we don't want to touch. We are also looking into > making use of the page attributes (PAT) or other PTE-bits instead of > MTRR, but for the time being MTRR-s are still being relied upon. Fair enough. But you can compute it yourself from the MTRRs, can't you? A separate command is just adding attack surface in the hypervisor. >>> +14. KVMI_INJECT_BREAKPOINT >>> +-------------------------- >>> + >>> +:Architectures: all >>> +:Versions: >= 1 >>> +:Parameters: ↴ >>> + >>> +:: >>> + >>> + struct kvmi_inject_breakpoint { >>> + __u16 vcpu; >>> + __u16 padding[3]; >>> + }; >>> + >>> +:Returns: ↴ >>> + >>> +:: >>> + >>> + struct kvmi_error_code { >>> + __s32 err; >>> + __u32 padding; >>> + }; >>> + >>> +Injects a breakpoint for the specified vCPU. This command is usually sent in >>> +response to an event and as such the proper GPR-s will be set with the reply. >> >> What is a "breakpoint" in this context? > > A simple INT3. It's what most of our patches consist of. We keep track > of them and handle the ones we own while pass (reinject) the ones used > by the guest (debuggers or whatnot). Why can't they be written with KVMI_READ/WRITE_PHYSICAL? (I would keep those two as they provide a more direct interface than map/unmap, and they work even with introspectors that are not sibling guests of the introspected VM). Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-13 9:15 ` Paolo Bonzini @ 2017-07-27 16:23 ` Mihai Donțu 2017-07-27 16:52 ` Paolo Bonzini 0 siblings, 1 reply; 50+ messages in thread From: Mihai Donțu @ 2017-07-27 16:23 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote: > On 13/07/2017 10:36, Mihai Donțu wrote: > > On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote: > > > Worse, KVM is not able to distinguish userspace that has paused the VM > > > from userspace that is doing MMIO or userspace that has a bug and hung > > > somewhere. And even worse, there are cases where userspace wants to > > > modify registers while doing port I/O (the awful VMware RPCI port). So > > > I'd rather avoid this. > > > > I should give more details here: we don't need to pause the vCPU-s in > > the sense widely understood but just prevent them from entering the > > guest for a short period of time. In our particular case, we need this > > when we start introspecting a VM that's already running. For this we > > kick the vCPU-s out of the guest so that our scan of the memory does > > not race with the guest kernel/applications. > > > > Another use case is when we inject applications into a running guest. > > We also kick the vCPU-s out while we atomically make changes to kernel > > specific structures. > > This is not possible to do in KVM, because KVM doesn't control what > happens to the memory outside KVM_RUN (and of course it doesn't control > devices doing DMA). You need to talk to QEMU in order to do this. Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on the already opened file descriptor to /dev/kvm for an event? > To do atomic changes to kernel specific structures, I would change the > page tables to inaccessible instead, but that also doesn't protect them > from devices doing DMA into them. If we have qemu pull out of the guest all vCPU-s and wait for a sign from the KVMI subsystem, then that looks sufficient. Devices acessing the memory (passedthrough devices, I assume) should be no problem as we're never interested in device memory. > Another issue: say a VM is waiting for a reply from the introspector, > and the reply is delayed so the VM gets a signal and needs to get out to > QEMU with EINTR. I don't think it is always possible to retry the > instruction on the next KVM_RUN, because the introspector might be > making partial changes. Add live migration to the mix if you want to > make things even more complicated. :) > > I think we need a way to mark a set of commands for atomic application. > That is, the structure of the command stream needs to be > > command 1 > command 2 > event reply 1 > transaction end marker > command 3 > transaction end marker > command 4 > event reply 2 > transaction end marker This should be covered by a previous email exchange. Commands targeting vCPU-s for which events are pending or are currently being handled should be done via a structure that acts like a cache, so that when the event reply reaches KVM, all potential modifications are applied in one go. This affects all register manipulations. Commands targeting memory are unaffected by the state of the vCPU, though this might change when we factor EPT views. But, alas, we have no clear view on this last topic as of yet. > > > > +8. KVMI_GET_MTRR_TYPE > > > > +--------------------- > > > > > > What is this used for? KVM ignores the guest MTRRs, so if possible I'd > > > rather avoid it. > > > > We use it do identify cacheable memory which usually indicates device > > memory, something we don't want to touch. We are also looking into > > making use of the page attributes (PAT) or other PTE-bits instead of > > MTRR, but for the time being MTRR-s are still being relied upon. > > Fair enough. But you can compute it yourself from the MTRRs, can't you? > A separate command is just adding attack surface in the hypervisor. I think we can make some basic MTRR info available via GET_REGISTERS and do the rest in the introspection tool. > > > > +14. KVMI_INJECT_BREAKPOINT > > > > +-------------------------- > > > > + > > > > +:Architectures: all > > > > +:Versions: >= 1 > > > > +:Parameters: ↴ > > > > + > > > > +:: > > > > + > > > > + struct kvmi_inject_breakpoint { > > > > + __u16 vcpu; > > > > + __u16 padding[3]; > > > > + }; > > > > + > > > > +:Returns: ↴ > > > > + > > > > +:: > > > > + > > > > + struct kvmi_error_code { > > > > + __s32 err; > > > > + __u32 padding; > > > > + }; > > > > + > > > > +Injects a breakpoint for the specified vCPU. This command is usually sent in > > > > +response to an event and as such the proper GPR-s will be set with the reply. > > > > > > What is a "breakpoint" in this context? > > > > A simple INT3. It's what most of our patches consist of. We keep track > > of them and handle the ones we own while pass (reinject) the ones used > > by the guest (debuggers or whatnot). > > Why can't they be written with KVMI_READ/WRITE_PHYSICAL? (I would keep > those two as they provide a more direct interface than map/unmap, and > they work even with introspectors that are not sibling guests of the > introspected VM). They can, nothing is stopping that. Also, we can keep the plain read/write interfaces around. It just seemed easier to implement them on top of an eventual mmap/munmap interface. -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-27 16:23 ` Mihai Donțu @ 2017-07-27 16:52 ` Paolo Bonzini 2017-07-27 17:19 ` Mihai Donțu 0 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-07-27 16:52 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On 27/07/2017 18:23, Mihai Donțu wrote: > On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote: >> On 13/07/2017 10:36, Mihai Donțu wrote: >>> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote: >>>> Worse, KVM is not able to distinguish userspace that has paused the VM >>>> from userspace that is doing MMIO or userspace that has a bug and hung >>>> somewhere. And even worse, there are cases where userspace wants to >>>> modify registers while doing port I/O (the awful VMware RPCI port). So >>>> I'd rather avoid this. >>> >>> I should give more details here: we don't need to pause the vCPU-s in >>> the sense widely understood but just prevent them from entering the >>> guest for a short period of time. In our particular case, we need this >>> when we start introspecting a VM that's already running. For this we >>> kick the vCPU-s out of the guest so that our scan of the memory does >>> not race with the guest kernel/applications. >>> >>> Another use case is when we inject applications into a running guest. >>> We also kick the vCPU-s out while we atomically make changes to kernel >>> specific structures. >> >> This is not possible to do in KVM, because KVM doesn't control what >> happens to the memory outside KVM_RUN (and of course it doesn't control >> devices doing DMA). You need to talk to QEMU in order to do this. > > Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on > the already opened file descriptor to /dev/kvm for an event? Nope. QEMU might be running and writing to memory in another thread. I don't see how this can be reliable on other hypervisors too, actually. >> To do atomic changes to kernel specific structures, I would change the >> page tables to inaccessible instead, but that also doesn't protect them >> from devices doing DMA into them. > > If we have qemu pull out of the guest all vCPU-s and wait for a sign > from the KVMI subsystem, then that looks sufficient. Devices acessing > the memory (passedthrough devices, I assume) should be no problem as > we're never interested in device memory. You're certainly interested in bus-master DMA from those devices though. >> Another issue: say a VM is waiting for a reply from the introspector, >> and the reply is delayed so the VM gets a signal and needs to get out to >> QEMU with EINTR. I don't think it is always possible to retry the >> instruction on the next KVM_RUN, because the introspector might be >> making partial changes. Add live migration to the mix if you want to >> make things even more complicated. :) >> >> I think we need a way to mark a set of commands for atomic application. >> That is, the structure of the command stream needs to be >> >> command 1 >> command 2 >> event reply 1 >> transaction end marker >> command 3 >> transaction end marker >> command 4 >> event reply 2 >> transaction end marker > > This should be covered by a previous email exchange. Correct. >>>>> +8. KVMI_GET_MTRR_TYPE >>>>> +--------------------- >>>> >>>> What is this used for? KVM ignores the guest MTRRs, so if possible I'd >>>> rather avoid it. >>> >>> We use it do identify cacheable memory which usually indicates device >>> memory, something we don't want to touch. We are also looking into >>> making use of the page attributes (PAT) or other PTE-bits instead of >>> MTRR, but for the time being MTRR-s are still being relied upon. >> >> Fair enough. But you can compute it yourself from the MTRRs, can't you? >> A separate command is just adding attack surface in the hypervisor. > > I think we can make some basic MTRR info available via GET_REGISTERS > and do the rest in the introspection tool. Ok. >>>>> +Injects a breakpoint for the specified vCPU. This command is usually sent in >>>>> +response to an event and as such the proper GPR-s will be set with the reply. >>>> >>>> What is a "breakpoint" in this context? >>> >>> A simple INT3. It's what most of our patches consist of. We keep track >>> of them and handle the ones we own while pass (reinject) the ones used >>> by the guest (debuggers or whatnot). >> >> Why can't they be written with KVMI_READ/WRITE_PHYSICAL? (I would keep >> those two as they provide a more direct interface than map/unmap, and >> they work even with introspectors that are not sibling guests of the >> introspected VM). > > They can, nothing is stopping that. Also, we can keep the plain > read/write interfaces around. It just seemed easier to implement them > on top of an eventual mmap/munmap interface. I prefer to keep the simple interface and drop the breakpoint one. Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-27 16:52 ` Paolo Bonzini @ 2017-07-27 17:19 ` Mihai Donțu 2017-08-01 10:40 ` Paolo Bonzini 0 siblings, 1 reply; 50+ messages in thread From: Mihai Donțu @ 2017-07-27 17:19 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On Thu, 2017-07-27 at 18:52 +0200, Paolo Bonzini wrote: > On 27/07/2017 18:23, Mihai Donțu wrote: > > On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote: > > > On 13/07/2017 10:36, Mihai Donțu wrote: > > > > On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote: > > > > > Worse, KVM is not able to distinguish userspace that has paused the VM > > > > > from userspace that is doing MMIO or userspace that has a bug and hung > > > > > somewhere. And even worse, there are cases where userspace wants to > > > > > modify registers while doing port I/O (the awful VMware RPCI port). So > > > > > I'd rather avoid this. > > > > > > > > I should give more details here: we don't need to pause the vCPU-s in > > > > the sense widely understood but just prevent them from entering the > > > > guest for a short period of time. In our particular case, we need this > > > > when we start introspecting a VM that's already running. For this we > > > > kick the vCPU-s out of the guest so that our scan of the memory does > > > > not race with the guest kernel/applications. > > > > > > > > Another use case is when we inject applications into a running guest. > > > > We also kick the vCPU-s out while we atomically make changes to kernel > > > > specific structures. > > > > > > This is not possible to do in KVM, because KVM doesn't control what > > > happens to the memory outside KVM_RUN (and of course it doesn't control > > > devices doing DMA). You need to talk to QEMU in order to do this. > > > > Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on > > the already opened file descriptor to /dev/kvm for an event? > > Nope. QEMU might be running and writing to memory in another thread. I > don't see how this can be reliable on other hypervisors too, actually. I assume it largely depends on knowing what's possible to do and what not with the guest memory even while the vCPU-s are suspended. The price of breaking this rule will be something any KVMI user will have to be very aware of. We did quite a bit of fine-tunning to our application to avoid interferring with the device manager (or passedthrough devices) and also it helps when the hypervisor refuses to grant access to certain memory areas that it knows are not safe to touch (even read) from remote code. -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-27 17:19 ` Mihai Donțu @ 2017-08-01 10:40 ` Paolo Bonzini 2017-08-01 16:33 ` Tamas K Lengyel 0 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-08-01 10:40 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On 27/07/2017 19:19, Mihai Donțu wrote: > On Thu, 2017-07-27 at 18:52 +0200, Paolo Bonzini wrote: >> On 27/07/2017 18:23, Mihai Donțu wrote: >>> On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote: >>>> On 13/07/2017 10:36, Mihai Donțu wrote: >>>>> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote: >>>>>> Worse, KVM is not able to distinguish userspace that has paused the VM >>>>>> from userspace that is doing MMIO or userspace that has a bug and hung >>>>>> somewhere. And even worse, there are cases where userspace wants to >>>>>> modify registers while doing port I/O (the awful VMware RPCI port). So >>>>>> I'd rather avoid this. >>>>> >>>>> I should give more details here: we don't need to pause the vCPU-s in >>>>> the sense widely understood but just prevent them from entering the >>>>> guest for a short period of time. In our particular case, we need this >>>>> when we start introspecting a VM that's already running. For this we >>>>> kick the vCPU-s out of the guest so that our scan of the memory does >>>>> not race with the guest kernel/applications. >>>>> >>>>> Another use case is when we inject applications into a running guest. >>>>> We also kick the vCPU-s out while we atomically make changes to kernel >>>>> specific structures. >>>> >>>> This is not possible to do in KVM, because KVM doesn't control what >>>> happens to the memory outside KVM_RUN (and of course it doesn't control >>>> devices doing DMA). You need to talk to QEMU in order to do this. >>> >>> Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on >>> the already opened file descriptor to /dev/kvm for an event? >> >> Nope. QEMU might be running and writing to memory in another thread. I >> don't see how this can be reliable on other hypervisors too, actually. > > I assume it largely depends on knowing what's possible to do and what > not with the guest memory even while the vCPU-s are suspended. The > price of breaking this rule will be something any KVMI user will have > to be very aware of. If you actually pause the whole VM (through QEMU's monitor commands "stop" and "cont") everything should be safe. Of course there can be bugs and PCI passthrough devices should be problematic, but in general the device emulation is quiescent. This however is not the case when only the VCPUs are paused. Paolo > We did quite a bit of fine-tunning to our application to avoid > interferring with the device manager (or passedthrough devices) and > also it helps when the hypervisor refuses to grant access to certain > memory areas that it knows are not safe to touch (even read) from > remote code. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-01 10:40 ` Paolo Bonzini @ 2017-08-01 16:33 ` Tamas K Lengyel 2017-08-01 20:47 ` Paolo Bonzini 0 siblings, 1 reply; 50+ messages in thread From: Tamas K Lengyel @ 2017-08-01 16:33 UTC (permalink / raw) To: Paolo Bonzini Cc: Mihai Donțu, Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On Tue, Aug 1, 2017 at 4:40 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > On 27/07/2017 19:19, Mihai Donțu wrote: >> On Thu, 2017-07-27 at 18:52 +0200, Paolo Bonzini wrote: >>> On 27/07/2017 18:23, Mihai Donțu wrote: >>>> On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote: >>>>> On 13/07/2017 10:36, Mihai Donțu wrote: >>>>>> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote: >>>>>>> Worse, KVM is not able to distinguish userspace that has paused the VM >>>>>>> from userspace that is doing MMIO or userspace that has a bug and hung >>>>>>> somewhere. And even worse, there are cases where userspace wants to >>>>>>> modify registers while doing port I/O (the awful VMware RPCI port). So >>>>>>> I'd rather avoid this. >>>>>> >>>>>> I should give more details here: we don't need to pause the vCPU-s in >>>>>> the sense widely understood but just prevent them from entering the >>>>>> guest for a short period of time. In our particular case, we need this >>>>>> when we start introspecting a VM that's already running. For this we >>>>>> kick the vCPU-s out of the guest so that our scan of the memory does >>>>>> not race with the guest kernel/applications. >>>>>> >>>>>> Another use case is when we inject applications into a running guest. >>>>>> We also kick the vCPU-s out while we atomically make changes to kernel >>>>>> specific structures. >>>>> >>>>> This is not possible to do in KVM, because KVM doesn't control what >>>>> happens to the memory outside KVM_RUN (and of course it doesn't control >>>>> devices doing DMA). You need to talk to QEMU in order to do this. >>>> >>>> Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on >>>> the already opened file descriptor to /dev/kvm for an event? >>> >>> Nope. QEMU might be running and writing to memory in another thread. I >>> don't see how this can be reliable on other hypervisors too, actually. >> >> I assume it largely depends on knowing what's possible to do and what >> not with the guest memory even while the vCPU-s are suspended. The >> price of breaking this rule will be something any KVMI user will have >> to be very aware of. > > If you actually pause the whole VM (through QEMU's monitor commands > "stop" and "cont") everything should be safe. Of course there can be > bugs and PCI passthrough devices should be problematic, but in general > the device emulation is quiescent. This however is not the case when > only the VCPUs are paused. IMHO for some use-cases it is sufficient to have the guest itself be limited in the modifications it makes to memory. So for example if just a vCPU is paused there are areas of memory that you can interact with without having to worry about it changing underneath the introspecting application (ie. thread-specific datastructures like the KPCR, etc..). If the introspecting application needs access to areas that non-paused vCPUs may touch, or QEMU, or a pass-through device, then it should be a decision for the introspecting app whether to pause the VM completely. It may still choose to instead do some error-detection on reads/writes to detect inconsistent accesses and perhaps just re-try the operation till it succeeds. This may have less of an impact on the performance of the VM as no full VM pause had to be performed. It is all very application specific, so having options is always a good thing. Tamas ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-01 16:33 ` Tamas K Lengyel @ 2017-08-01 20:47 ` Paolo Bonzini 2017-08-02 11:52 ` Mihai Donțu 0 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-08-01 20:47 UTC (permalink / raw) To: Tamas K Lengyel Cc: Mihai Donțu, Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On 01/08/2017 18:33, Tamas K Lengyel wrote: >> If you actually pause the whole VM (through QEMU's monitor commands >> "stop" and "cont") everything should be safe. Of course there can be >> bugs and PCI passthrough devices should be problematic, but in general >> the device emulation is quiescent. This however is not the case when >> only the VCPUs are paused. > IMHO for some use-cases it is sufficient to have the guest itself be > limited in the modifications it makes to memory. So for example if > just a vCPU is paused there are areas of memory that you can interact > with without having to worry about it changing underneath the > introspecting application (ie. thread-specific datastructures like the > KPCR, etc..). If the introspecting application needs access to areas > that non-paused vCPUs may touch, or QEMU, or a pass-through device, > then it should be a decision for the introspecting app whether to > pause the VM completely. It may still choose to instead do some > error-detection on reads/writes to detect inconsistent accesses and > perhaps just re-try the operation till it succeeds. This may have less > of an impact on the performance of the VM as no full VM pause had to > be performed. It is all very application specific, so having options > is always a good thing. Fair enough. There is another issue however. If a guest is runnnig in the kernel, it can be easily paused while KVMI processes events and the like. While a guest is outside the kernel, it could be running or paused. If running, the value of a register might change before the VM reenters execution (due to a reset, or due to ugly features such as the VMware magic I/O port 0x5658). So the introspector would probably prefer anyway to do any changes while the guest is in the kernel: one idea I had was a KVMI_PAUSE_VCPU command that replies with a KVMI_VCPU_PAUSED event---then the introspector can send commands that do the required patching and then restart the guest by replying to the event. But if the guest is paused, KVMI_PAUSE_VCPU would never be processed. So how could the introspector distinguish the two cases and avoid the KVMI_PAUSE_VCPU if the guest is paused? (There is another complication: the guest could be running with the APIC emulated in userspace. In that case, a VCPU doing "cli;hlt" spends infinite time in userspace even though it's running, and KVM has no idea why. This is less common, but it's worth mentioning too). Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-01 20:47 ` Paolo Bonzini @ 2017-08-02 11:52 ` Mihai Donțu 2017-08-02 12:27 ` Paolo Bonzini 0 siblings, 1 reply; 50+ messages in thread From: Mihai Donțu @ 2017-08-02 11:52 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel On Tue, 2017-08-01 at 22:47 +0200, Paolo Bonzini wrote: > On 01/08/2017 18:33, Tamas K Lengyel wrote: > > > If you actually pause the whole VM (through QEMU's monitor commands > > > "stop" and "cont") everything should be safe. Of course there can be > > > bugs and PCI passthrough devices should be problematic, but in general > > > the device emulation is quiescent. This however is not the case when > > > only the VCPUs are paused. > > > > IMHO for some use-cases it is sufficient to have the guest itself be > > limited in the modifications it makes to memory. So for example if > > just a vCPU is paused there are areas of memory that you can interact > > with without having to worry about it changing underneath the > > introspecting application (ie. thread-specific datastructures like the > > KPCR, etc..). If the introspecting application needs access to areas > > that non-paused vCPUs may touch, or QEMU, or a pass-through device, > > then it should be a decision for the introspecting app whether to > > pause the VM completely. It may still choose to instead do some > > error-detection on reads/writes to detect inconsistent accesses and > > perhaps just re-try the operation till it succeeds. This may have less > > of an impact on the performance of the VM as no full VM pause had to > > be performed. It is all very application specific, so having options > > is always a good thing. > > Fair enough. There is another issue however. > > If a guest is runnnig in the kernel, it can be easily paused while KVMI > processes events and the like. > > While a guest is outside the kernel, it could be running or paused. > > If running, the value of a register might change before the VM reenters > execution (due to a reset, or due to ugly features such as the VMware > magic I/O port 0x5658). So the introspector would probably prefer > anyway to do any changes while the guest is in the kernel: one idea I > had was a KVMI_PAUSE_VCPU command that replies with a KVMI_VCPU_PAUSED > event---then the introspector can send commands that do the required > patching and then restart the guest by replying to the event. > > But if the guest is paused, KVMI_PAUSE_VCPU would never be processed. > So how could the introspector distinguish the two cases and avoid the > KVMI_PAUSE_VCPU if the guest is paused? > > (There is another complication: the guest could be running with the APIC > emulated in userspace. In that case, a VCPU doing "cli;hlt" spends > infinite time in userspace even though it's running, and KVM has no idea > why. This is less common, but it's worth mentioning too). I think it might help to distinguish two situations in which we require the guest _or_ a single vCPU to be paused. Our initial KVMI_PAUSE_GUEST command can be translated into a qemu pause. In our particular usecase we made special arrangements to call it as few times as possible assuming it's very costly. The other is needed only by the internal KVM code for situations similar to: kvm_pause_vcpu(vcpu); vcpu_load(vcpu); kvm_arch_vcpu_ioctl_get_regs(vcpu, regs); vcpu_put(vcpu); kvm_unpause_vcpu(vcpu); or more generally put, for accesses that involve the vCPU state (registers, MSR-s, exceptions etc.), no guest memory involved. Here kvm_pause_vcpu() will only pull the vCPU out of the guest and, if so, make it somehow available for quick re-entry with kvm_unpause_vcpu(). If said vCPU is already out, then the function will be a no-op. Obviously, kvm_{pause,unpause}_vcpu() will do nothing if we're currently handling an event or one is pending. I hope this narrows down further the exact requirements. One exception that might have a better solution is: kvm_pause_all_vcpus(kvm); kvm_set_page_access(kvm, gfn); /* pause for get too? */ kvm_unpause_all_vcpus(kvm); There might be a way to make the change and then IPI all vCPU-s without pulling them out of the guest. Regards, -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-02 11:52 ` Mihai Donțu @ 2017-08-02 12:27 ` Paolo Bonzini 2017-08-02 13:32 ` Mihai Donțu 0 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-08-02 12:27 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel On 02/08/2017 13:52, Mihai Donțu wrote: > I think it might help to distinguish two situations in which we require > the guest _or_ a single vCPU to be paused. Our initial KVMI_PAUSE_GUEST > command can be translated into a qemu pause. In our particular usecase > we made special arrangements to call it as few times as possible > assuming it's very costly. The other is needed only by the internal KVM > code for situations similar to: > > kvm_pause_vcpu(vcpu); > vcpu_load(vcpu); > kvm_arch_vcpu_ioctl_get_regs(vcpu, regs); > vcpu_put(vcpu); > kvm_unpause_vcpu(vcpu); > > or more generally put, for accesses that involve the vCPU state > (registers, MSR-s, exceptions etc.), no guest memory involved. > > Here kvm_pause_vcpu() will only pull the vCPU out of the guest and, if > so, make it somehow available for quick re-entry with > kvm_unpause_vcpu(). If said vCPU is already out, then the function will > be a no-op. Obviously, kvm_{pause,unpause}_vcpu() will do nothing if > we're currently handling an event or one is pending. Understood. The issue is that there is an inherent race between anything userspace is doing and get_regs. What are the cases where you need to get regs or similar outside an event? > One exception that might have a better solution is: > > kvm_pause_all_vcpus(kvm); > kvm_set_page_access(kvm, gfn); /* pause for get too? */ > kvm_unpause_all_vcpus(kvm); > > There might be a way to make the change and then IPI all vCPU-s without > pulling them out of the guest. For that I think KVMI should define a VM-wide "mask" layered over the actual memory map permissions. Such a command can be implemented relatively easily by hooking into the callers of __gfn_to_pfn_memslot and kvm_vcpu_gfn_to_hva_prot. Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-02 12:27 ` Paolo Bonzini @ 2017-08-02 13:32 ` Mihai Donțu 2017-08-02 13:51 ` Paolo Bonzini 2017-08-02 13:53 ` Mihai Donțu 0 siblings, 2 replies; 50+ messages in thread From: Mihai Donțu @ 2017-08-02 13:32 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel On Wed, 2017-08-02 at 14:27 +0200, Paolo Bonzini wrote: > On 02/08/2017 13:52, Mihai Donțu wrote: > > I think it might help to distinguish two situations in which we require > > the guest _or_ a single vCPU to be paused. Our initial KVMI_PAUSE_GUEST > > command can be translated into a qemu pause. In our particular usecase > > we made special arrangements to call it as few times as possible > > assuming it's very costly. The other is needed only by the internal KVM > > code for situations similar to: > > > > kvm_pause_vcpu(vcpu); > > vcpu_load(vcpu); > > kvm_arch_vcpu_ioctl_get_regs(vcpu, regs); > > vcpu_put(vcpu); > > kvm_unpause_vcpu(vcpu); > > > > or more generally put, for accesses that involve the vCPU state > > (registers, MSR-s, exceptions etc.), no guest memory involved. > > > > Here kvm_pause_vcpu() will only pull the vCPU out of the guest and, if > > so, make it somehow available for quick re-entry with > > kvm_unpause_vcpu(). If said vCPU is already out, then the function will > > be a no-op. Obviously, kvm_{pause,unpause}_vcpu() will do nothing if > > we're currently handling an event or one is pending. > > Understood. The issue is that there is an inherent race between > anything userspace is doing and get_regs. What are the cases where you > need to get regs or similar outside an event? We have currently identified three cases: * initial hooking of a guest * periodically checking the integrity of data that is not properly placed into a page and thus cannot be efficiently tracked via SPT * injecting processes > > One exception that might have a better solution is: > > > > kvm_pause_all_vcpus(kvm); > > kvm_set_page_access(kvm, gfn); /* pause for get too? */ > > kvm_unpause_all_vcpus(kvm); > > > > There might be a way to make the change and then IPI all vCPU-s without > > pulling them out of the guest. > > For that I think KVMI should define a VM-wide "mask" layered over the > actual memory map permissions. Such a command can be implemented > relatively easily by hooking into the callers of __gfn_to_pfn_memslot > and kvm_vcpu_gfn_to_hva_prot. -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-02 13:32 ` Mihai Donțu @ 2017-08-02 13:51 ` Paolo Bonzini 2017-08-02 14:17 ` Mihai Donțu 2017-08-02 13:53 ` Mihai Donțu 1 sibling, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-08-02 13:51 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel On 02/08/2017 15:32, Mihai Donțu wrote: > We have currently identified three cases: > > * initial hooking of a guest What triggers the initial hooking, and how is it done? > * periodically checking the integrity of data that is not properly > placed into a page and thus cannot be efficiently tracked via SPT This only needs read memory (and it's okay for it to race against DMA because it's periodic). > * injecting processes This also doesn't need pause. IIRC you put a breakpoint somewhere, or make a page non-executable, to ensure the guest doesn't get in the way. DMA can still get in the way, but that can happen anyway right after process injection so it's not an issue. Have you thought about monitoring hardware registers, for example in order to check that IOMMU page tables protect from overwriting the kernel? Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-02 13:51 ` Paolo Bonzini @ 2017-08-02 14:17 ` Mihai Donțu 2017-08-04 8:35 ` Paolo Bonzini 0 siblings, 1 reply; 50+ messages in thread From: Mihai Donțu @ 2017-08-02 14:17 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel, Andrei LUTAS On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote: > On 02/08/2017 15:32, Mihai Donțu wrote: > > We have currently identified three cases: > > > > * initial hooking of a guest > > What triggers the initial hooking, and how is it done? There are two types o hooks: dynamic (the guest is hooked as it boots) and static (a fully booted guest is being hooked). They both start with a notification from qemu or some other application that a guest is available for introspection. After that we poke its vCPU-s a few times to determine what type of hooking to continue with. I belive the syscall entry point MSR-s are key here. > > * periodically checking the integrity of data that is not properly > > placed into a page and thus cannot be efficiently tracked via SPT > > This only needs read memory (and it's okay for it to race against DMA > because it's periodic). I just looked through some traces (the logic changed quite a bit since I last checked) and looks entirely based on memory reads right now. > > * injecting processes > > This also doesn't need pause. IIRC you put a breakpoint somewhere, or > make a page non-executable, to ensure the guest doesn't get in the way. > DMA can still get in the way, but that can happen anyway right after > process injection so it's not an issue. That might be a very possible approach. The code we have in place now pauses the guest entirely, though. I have added in CC a colleague of mine, Andrei Luțaș. He leads the development of the introspection technology, irrespective of the hypervisor. Adalbert and I only work on bridging it with KVM. :-) I'll kindly ask him to help with more details wherever you feel my explanations were not sufficient. > Have you thought about monitoring hardware registers, for example in > order to check that IOMMU page tables protect from overwriting the kernel? Sorry, but I'm not sure I understand: are you thinking at a way to make sure none of the IOMMU grups are configured with a "too generous" DMA window? Regards, -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-02 14:17 ` Mihai Donțu @ 2017-08-04 8:35 ` Paolo Bonzini 2017-08-04 15:29 ` Mihai Donțu 0 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-08-04 8:35 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel, Andrei LUTAS On 02/08/2017 16:17, Mihai Donțu wrote: > On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote: >> On 02/08/2017 15:32, Mihai Donțu wrote: >>> We have currently identified three cases: >>> >>> * initial hooking of a guest >> >> What triggers the initial hooking, and how is it done? > > There are two types o hooks: dynamic (the guest is hooked as it boots) > and static (a fully booted guest is being hooked). They both start with > a notification from qemu or some other application that a guest is > available for introspection. After that we poke its vCPU-s a few times > to determine what type of hooking to continue with. I belive the > syscall entry point MSR-s are key here. Reads need not be transactional here, and the syscall entry point MSRs are generally immutable so I think it is fine not to pause. I'm curious how the introspector decides that the guest is ready to be hooked in, that is, that it's far enough in the boot process. I think a command to "pause" KVM_RUN is okay to add. That is, if in KVM_RUN it would e.g. return 1, trigger a 'paused' event immediately and halt the vCPU. If not in KVM_RUN, the command would return 0 and trigger the 'paused' event as soon as the hypervisor invokes KVM_RUN. The introspector then can decide whether to wait if the commands return 0. There is no need for an "unpause" command, as that is achieved simply by completing the 'paused' event. >> Have you thought about monitoring hardware registers, for example in >> order to check that IOMMU page tables protect from overwriting the kernel? > > Sorry, but I'm not sure I understand: are you thinking at a way to make > sure none of the IOMMU grups are configured with a "too generous" DMA > window? Yes, exactly. Optional, of course. Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-04 8:35 ` Paolo Bonzini @ 2017-08-04 15:29 ` Mihai Donțu 2017-08-04 15:37 ` Paolo Bonzini 2017-08-05 8:00 ` Andrei Vlad LUTAS 0 siblings, 2 replies; 50+ messages in thread From: Mihai Donțu @ 2017-08-04 15:29 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel, Andrei LUTAS On Fri, 2017-08-04 at 10:35 +0200, Paolo Bonzini wrote: > On 02/08/2017 16:17, Mihai Donțu wrote: > > On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote: > > > On 02/08/2017 15:32, Mihai Donțu wrote: > > > > We have currently identified three cases: > > > > > > > > * initial hooking of a guest > > > > > > What triggers the initial hooking, and how is it done? > > > > There are two types o hooks: dynamic (the guest is hooked as it boots) > > and static (a fully booted guest is being hooked). They both start with > > a notification from qemu or some other application that a guest is > > available for introspection. After that we poke its vCPU-s a few times > > to determine what type of hooking to continue with. I belive the > > syscall entry point MSR-s are key here. > > Reads need not be transactional here, and the syscall entry point MSRs > are generally immutable so I think it is fine not to pause. I might be misunderstanding. Entry point MSR-s (and maybe others) are generally immutable in well behaving guests. We are, however, looking for signs that something is breaking this pattern. > I'm curious how the introspector decides that the guest is ready to be > hooked in, that is, that it's far enough in the boot process. I will let Andrei add some details here. > I think a command to "pause" KVM_RUN is okay to add. That is, if in > KVM_RUN it would e.g. return 1, trigger a 'paused' event immediately and > halt the vCPU. If not in KVM_RUN, the command would return 0 and > trigger the 'paused' event as soon as the hypervisor invokes KVM_RUN. > > The introspector then can decide whether to wait if the commands return > 0. There is no need for an "unpause" command, as that is achieved > simply by completing the 'paused' event. This mechanism will allow exposing a KVMI_PAUSE_VCPU to introspectors, something that maybe some future users can leverage. I, however, was trying to justify a "slow" KVMI_PAUSE_VM and "fast" kvm_pause_vcpu(), the latter existing only in KVM (kernel). The event-based mecanism for pausing a vCPU feels like it has too much overhead for what one usually needs it (read some registers). > > > Have you thought about monitoring hardware registers, for example in > > > order to check that IOMMU page tables protect from overwriting the kernel? > > > > Sorry, but I'm not sure I understand: are you thinking at a way to make > > sure none of the IOMMU grups are configured with a "too generous" DMA > > window? > > Yes, exactly. Optional, of course. We could add a command that returns the list of DMA ranges which can then be used by an introspector to check if the OS has made a mistake and placed sensible data in that rage. Also, I belive we have refined a number of ideas on this thread and more remain to clarify. I would like to update the design document with what we have agreed on so far, add some more details to what felt 'under explained' and continue again from there. Is it OK for you? Regards, -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-04 15:29 ` Mihai Donțu @ 2017-08-04 15:37 ` Paolo Bonzini 2017-08-05 8:00 ` Andrei Vlad LUTAS 1 sibling, 0 replies; 50+ messages in thread From: Paolo Bonzini @ 2017-08-04 15:37 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel, Andrei LUTAS On 04/08/2017 17:29, Mihai Donțu wrote: > On Fri, 2017-08-04 at 10:35 +0200, Paolo Bonzini wrote: >> On 02/08/2017 16:17, Mihai Donțu wrote: >>> On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote: >>>> On 02/08/2017 15:32, Mihai Donțu wrote: >>>>> We have currently identified three cases: >>>>> >>>>> * initial hooking of a guest >>>> >>>> What triggers the initial hooking, and how is it done? >>> >>> There are two types o hooks: dynamic (the guest is hooked as it boots) >>> and static (a fully booted guest is being hooked). They both start with >>> a notification from qemu or some other application that a guest is >>> available for introspection. After that we poke its vCPU-s a few times >>> to determine what type of hooking to continue with. I belive the >>> syscall entry point MSR-s are key here. >> >> Reads need not be transactional here, and the syscall entry point MSRs >> are generally immutable so I think it is fine not to pause. > > I might be misunderstanding. Entry point MSR-s (and maybe others) are > generally immutable in well behaving guests. We are, however, looking > for signs that something is breaking this pattern. Right but you can always set up an MSR write intercept even before the guest starts. The MSRs are written rarely-enough that the intercept doesn't add any noticeable overhead. Reading the MSR while the guest runs is okay. If races are important, the introspectors can detect them through the intercepts it sets up. In many cases races are not an issue even (for example, in MTRRs). >> I think a command to "pause" KVM_RUN is okay to add. That is, if in >> KVM_RUN it would e.g. return 1, trigger a 'paused' event immediately and >> halt the vCPU. If not in KVM_RUN, the command would return 0 and >> trigger the 'paused' event as soon as the hypervisor invokes KVM_RUN. >> >> The introspector then can decide whether to wait if the commands return >> 0. There is no need for an "unpause" command, as that is achieved >> simply by completing the 'paused' event. > > This mechanism will allow exposing a KVMI_PAUSE_VCPU to introspectors, > something that maybe some future users can leverage. I, however, was > trying to justify a "slow" KVMI_PAUSE_VM and "fast" kvm_pause_vcpu(), > the latter existing only in KVM (kernel). The event-based mecanism for > pausing a vCPU feels like it has too much overhead for what one usually > needs it (read some registers). My point is that usually even register reads (actually MSRs) do not require a pause. The event-based mechanism would be used only for the initial hooking. KVMI_PAUSE_VM would require QEMU communication, wouldn't it? > Also, I belive we have refined a number of ideas on this thread and > more remain to clarify. I would like to update the design document with > what we have agreed on so far, add some more details to what felt > 'under explained' and continue again from there. Is it OK for you? Yes. Pausing is the last issue that needs to be clarified. Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* RE: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-04 15:29 ` Mihai Donțu 2017-08-04 15:37 ` Paolo Bonzini @ 2017-08-05 8:00 ` Andrei Vlad LUTAS 2017-08-07 12:18 ` Paolo Bonzini 1 sibling, 1 reply; 50+ messages in thread From: Andrei Vlad LUTAS @ 2017-08-05 8:00 UTC (permalink / raw) To: Mihai Donțu, Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel > -----Original Message----- > From: Mihai Donțu [mailto:mdontu@bitdefender.com] > Sent: 4 August, 2017 18:30 > To: Paolo Bonzini <pbonzini@redhat.com> > Cc: Radim Krčmář <rkrcmar@redhat.com>; Jan Kiszka > <jan.kiszka@siemens.com>; Stefan Hajnoczi <stefanha@redhat.com>; > Adalbert Lazar <alazar@bitdefender.com>; kvm@vger.kernel.org; Tamas K > Lengyel <tamas.k.lengyel@gmail.com>; Andrei Vlad LUTAS > <vlutas@bitdefender.com> > Subject: Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API > header for VM introspection > > On Fri, 2017-08-04 at 10:35 +0200, Paolo Bonzini wrote: > > On 02/08/2017 16:17, Mihai Donțu wrote: > > > On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote: > > > > On 02/08/2017 15:32, Mihai Donțu wrote: > > > > > We have currently identified three cases: > > > > > > > > > > * initial hooking of a guest > > > > > > > > What triggers the initial hooking, and how is it done? > > > > > > There are two types o hooks: dynamic (the guest is hooked as it > > > boots) and static (a fully booted guest is being hooked). They both > > > start with a notification from qemu or some other application that a > > > guest is available for introspection. After that we poke its vCPU-s > > > a few times to determine what type of hooking to continue with. I > > > belive the syscall entry point MSR-s are key here. > > > > Reads need not be transactional here, and the syscall entry point MSRs > > are generally immutable so I think it is fine not to pause. > > I might be misunderstanding. Entry point MSR-s (and maybe others) are > generally immutable in well behaving guests. We are, however, looking for > signs that something is breaking this pattern. > > > I'm curious how the introspector decides that the guest is ready to be > > hooked in, that is, that it's far enough in the boot process. > > I will let Andrei add some details here. > The first steps of enabling the protection for a guest depends on hardware events (like Mihai said, MSR writes, to give one example). Depending on what MSRs are written & the values they are written with, HVI will be able to identify, based on analyzing the guest memory, the type of the running OS (Windows x86, Windows x64, Linux, etc.) - of course, how exactly the protection is enabled from that point on (analyzing the SYSCALL/interrupt handlers, other events such as XSETBV, etc.) is OS dependent, and there isn't a single correct answer. The necessity to pause the running VCPUs for short intervals of time comes in several cases: - when receiving the first relevant events, such as the SYSCALL MSR writes, we want to ensure no other VCPU could be running that might interfere with memory that we analyse (that might include not-yet-known writable pages) - when establishing certain intercepts inside the guest memory, we need to briefly modify it, and to do so safely, we pause all the VCPUs, to make sure we don't race with the guest - when ensuring the integrity of certain structures that already lie inside writable pages (so intercepting writes via an EPT/NPT is not doable, because the induced performance overhead would be huge), so that no other VCPU modifies the said pages while we make sure they haven't been tampered with Of course, just how Paolo suggested, we can place finer-grained intercepts (such as execute-protect a page in order to ensure no VCPU runs code from it while we modify it), but this is a more complicated solution and we've never had to think for something other than simply pausing the VCPUs, since that was always available so far. Hope this piece of info helps. > > I think a command to "pause" KVM_RUN is okay to add. That is, if in > > KVM_RUN it would e.g. return 1, trigger a 'paused' event immediately > > and halt the vCPU. If not in KVM_RUN, the command would return 0 and > > trigger the 'paused' event as soon as the hypervisor invokes KVM_RUN. > > > > The introspector then can decide whether to wait if the commands > > return 0. There is no need for an "unpause" command, as that is > > achieved simply by completing the 'paused' event. > > This mechanism will allow exposing a KVMI_PAUSE_VCPU to introspectors, > something that maybe some future users can leverage. I, however, was > trying to justify a "slow" KVMI_PAUSE_VM and "fast" kvm_pause_vcpu(), > the latter existing only in KVM (kernel). The event-based mecanism for > pausing a vCPU feels like it has too much overhead for what one usually > needs it (read some registers). > > > > > Have you thought about monitoring hardware registers, for example > > > > in order to check that IOMMU page tables protect from overwriting the > kernel? > > > > > > Sorry, but I'm not sure I understand: are you thinking at a way to > > > make sure none of the IOMMU grups are configured with a "too > > > generous" DMA window? > > > > Yes, exactly. Optional, of course. > > We could add a command that returns the list of DMA ranges which can then > be used by an introspector to check if the OS has made a mistake and placed > sensible data in that rage. > > Also, I belive we have refined a number of ideas on this thread and more > remain to clarify. I would like to update the design document with what we > have agreed on so far, add some more details to what felt 'under explained' > and continue again from there. Is it OK for you? > > Regards, > > -- > Mihai Donțu > > > ________________________ > This email was scanned by Bitdefender Best regards, Andrei. ________________________ This email was scanned by Bitdefender ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-05 8:00 ` Andrei Vlad LUTAS @ 2017-08-07 12:18 ` Paolo Bonzini 2017-08-07 13:25 ` Mihai Donțu 0 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-08-07 12:18 UTC (permalink / raw) To: Andrei Vlad LUTAS, Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel On 05/08/2017 10:00, Andrei Vlad LUTAS wrote: > Of course, just how Paolo suggested, we can place finer-grained > intercepts (such as execute-protect a page in order to ensure no VCPU > runs code from it while we modify it), but this is a more complicated > solution and we've never had to think for something other than simply > pausing the VCPUs, since that was always available so far. > > Hope this piece of info helps. We can certainly add a "pause the VCPU with a given id" command. The command reports its success with an event, and replying to the event restarts the VCPU. If the VCPU is currently in userspace, the event would be delayed until the next time KVM is re-entered, but this should not be an issue in general. The introspector can operate as if the VCPU was paused. "Pause all VCPUs and stop all DMA" would definitely be a layering violation, so it cannot be added. "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with a given id" commands. I lean towards omitting it. However, now that I'm thinking of it, we need a new event for "new VCPU created". When the event is enabled, newly-created VCPUs should be in paused mode. Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-07 12:18 ` Paolo Bonzini @ 2017-08-07 13:25 ` Mihai Donțu 2017-08-07 13:49 ` Paolo Bonzini 0 siblings, 1 reply; 50+ messages in thread From: Mihai Donțu @ 2017-08-07 13:25 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel, Andrei Vlad LUTAS On Mon, 2017-08-07 at 14:18 +0200, Paolo Bonzini wrote: > On 05/08/2017 10:00, Andrei Vlad LUTAS wrote: > > Of course, just how Paolo suggested, we can place finer-grained > > intercepts (such as execute-protect a page in order to ensure no VCPU > > runs code from it while we modify it), but this is a more complicated > > solution and we've never had to think for something other than simply > > pausing the VCPUs, since that was always available so far. > > > > Hope this piece of info helps. > > We can certainly add a "pause the VCPU with a given id" command. The > command reports its success with an event, and replying to the event > restarts the VCPU. If the VCPU is currently in userspace, the event > would be delayed until the next time KVM is re-entered, but this should > not be an issue in general. The introspector can operate as if the VCPU > was paused. I have a plan to modify our application a bit and see how often we query a vCPU outside an event handler. If it's seldom enough, a command that pauses the vCPU and triggers an event should be just fine. > "Pause all VCPUs and stop all DMA" would definitely be a layering > violation, so it cannot be added. > > "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with > a given id" commands. I lean towards omitting it. The case where the introspector wants to scan the guest memory needs a KVMI_PAUSE_VM, which as discussed in a previous email, can be the actual qemu 'pause' command. However, we would like to limit the communication channels we have with the host and not use qmp (or libvirt/etc. if qmp is not exposed). Instead, have a command that triggers a KVM_RUN exit to qemu which in turn will call the underlying pause function used by qmp. Would that be OK with you? > However, now that I'm thinking of it, we need a new event for "new VCPU > created". When the event is enabled, newly-created VCPUs should be in > paused mode. I assume you are thinking about vCPU hotplug here. If so, yes, an event that gives the introspector the chance to update its internal bookkeeping would be useful. -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-07 13:25 ` Mihai Donțu @ 2017-08-07 13:49 ` Paolo Bonzini 2017-08-07 14:12 ` Mihai Donțu 0 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-08-07 13:49 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel, Andrei Vlad LUTAS On 07/08/2017 15:25, Mihai Donțu wrote: >> "Pause all VCPUs and stop all DMA" would definitely be a layering >> violation, so it cannot be added. >> >> "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with >> a given id" commands. I lean towards omitting it. > > The case where the introspector wants to scan the guest memory needs a > KVMI_PAUSE_VM, which as discussed in a previous email, can be the > actual qemu 'pause' command. Do you mean it needs to stop DMA as well? > However, we would like to limit the > communication channels we have with the host and not use qmp (or > libvirt/etc. if qmp is not exposed). Instead, have a command that > triggers a KVM_RUN exit to qemu which in turn will call the underlying > pause function used by qmp. Would that be OK with you? You would have to send back something on completion, and then I am worried of races and deadlocks. Plus, pausing a VM at the QEMU level is a really expensive operation, so I don't think it's a good idea to let the introspector do this. You can pause all VCPUs, or use memory page permissions. >> However, now that I'm thinking of it, we need a new event for "new VCPU >> created". When the event is enabled, newly-created VCPUs should be in >> paused mode. > > I assume you are thinking about vCPU hotplug here. If so, yes, an event > that gives the introspector the chance to update its internal > bookkeeping would be useful. Yes, exactly. Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-07 13:49 ` Paolo Bonzini @ 2017-08-07 14:12 ` Mihai Donțu 2017-08-07 15:56 ` Paolo Bonzini 0 siblings, 1 reply; 50+ messages in thread From: Mihai Donțu @ 2017-08-07 14:12 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel, Andrei Vlad LUTAS On Mon, 2017-08-07 at 15:49 +0200, Paolo Bonzini wrote: > On 07/08/2017 15:25, Mihai Donțu wrote: > > > "Pause all VCPUs and stop all DMA" would definitely be a layering > > > violation, so it cannot be added. > > > > > > "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with > > > a given id" commands. I lean towards omitting it. > > > > The case where the introspector wants to scan the guest memory needs a > > KVMI_PAUSE_VM, which as discussed in a previous email, can be the > > actual qemu 'pause' command. > > Do you mean it needs to stop DMA as well? No, DMA can proceed normally. I remain of the opinion that KVMI users must know what guest memory ranges are OK to access by looking at MTRR- s, PAT or guest kernel structures, or a combination of all three. > > However, we would like to limit the > > communication channels we have with the host and not use qmp (or > > libvirt/etc. if qmp is not exposed). Instead, have a command that > > triggers a KVM_RUN exit to qemu which in turn will call the underlying > > pause function used by qmp. Would that be OK with you? > > You would have to send back something on completion, and then I am > worried of races and deadlocks. Plus, pausing a VM at the QEMU level is > a really expensive operation, so I don't think it's a good idea to let > the introspector do this. You can pause all VCPUs, or use memory page > permissions. Pausing all vCPU-s was my first thought, I was just trying to follow your statement: "I lean towards omitting it". :-) It will take a bit of user-space-fu, in that after issuing N vCPU pause commands in a row we will have to wait for N events, which might race with other events (MSR, CRx etc.) which need handling otherwise the pause ones will not arrive ... I wonder if there's a way to do this cleanly in kernel (piggyback on the code for pausing a single vCPU and then somehow 'coalesce' all pause events into a single KVMI_EVENT_VCPUS_PAUSED). > > > However, now that I'm thinking of it, we need a new event for "new VCPU > > > created". When the event is enabled, newly-created VCPUs should be in > > > paused mode. > > > > I assume you are thinking about vCPU hotplug here. If so, yes, an event > > that gives the introspector the chance to update its internal > > bookkeeping would be useful. > > Yes, exactly. OK. We will update the design document. -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-07 14:12 ` Mihai Donțu @ 2017-08-07 15:56 ` Paolo Bonzini 2017-08-07 16:44 ` Mihai Donțu 0 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-08-07 15:56 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel, Andrei Vlad LUTAS On 07/08/2017 16:12, Mihai Donțu wrote: > On Mon, 2017-08-07 at 15:49 +0200, Paolo Bonzini wrote: >> On 07/08/2017 15:25, Mihai Donțu wrote: >>>> "Pause all VCPUs and stop all DMA" would definitely be a layering >>>> violation, so it cannot be added. >>>> >>>> "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with >>>> a given id" commands. I lean towards omitting it. >>> >>> The case where the introspector wants to scan the guest memory needs a >>> KVMI_PAUSE_VM, which as discussed in a previous email, can be the >>> actual qemu 'pause' command. >> >> Do you mean it needs to stop DMA as well? > > No, DMA can proceed normally. I remain of the opinion that KVMI users > must know what guest memory ranges are OK to access by looking at MTRR- > s, PAT or guest kernel structures, or a combination of all three. Ok, good. Sorry if I am dense on the DMA/no-DMA cases. (But, I don't understand your remark about guest memory ranges; the point of bus-master DMA is that it works on any memory, and cache snooping makes it even easier for hypothetical malware to do memory writes via bus-master DMA). >>> However, we would like to limit the >>> communication channels we have with the host and not use qmp (or >>> libvirt/etc. if qmp is not exposed). Instead, have a command that >>> triggers a KVM_RUN exit to qemu which in turn will call the underlying >>> pause function used by qmp. Would that be OK with you? >> >> You would have to send back something on completion, and then I am >> worried of races and deadlocks. Plus, pausing a VM at the QEMU level is >> a really expensive operation, so I don't think it's a good idea to let >> the introspector do this. You can pause all VCPUs, or use memory page >> permissions. > > Pausing all vCPU-s was my first thought, I was just trying to follow > your statement: "I lean towards omitting it". :-) Yes, and I still do because a hypothetical "pause all VCPUs" command still has the issue that you could get other events before the command completes. So I am not convinced that a specialized command actually makes the introspector code much simpler. I hope you understand that I want to keep the trusted base (not just the code I maintain, though that is a secondary benefit ;)) as simple as possible. > It will take a bit of user-space-fu, in that after issuing N vCPU pause > commands in a row we will have to wait for N events, which might race > with other events (MSR, CRx etc.) which need handling otherwise the > pause ones will not arrive The same issue would be there in QEMU or KVM though. If you can always request "pause all vCPUs" from an event handler, avoiding deadlocks is relatively easy. If you cannot ensure that, for example because of work that is scheduled periodically, you can send a KVM_PAUSE command to ensure the work is done in a safe condition. Then you get the following pseudocode algorithm: // a vCPU is not executing guest code, and it's going to check // num_pause_vm_requests before going back to guest code vcpu_not_running(id) { unmark vCPU "id" as running if (num vcpus running == 0) cond_broadcast(no_running_cpus) } pause_vcpu(id) { mark vCPU "id" as being-paused send KVMI_PAUSED for the vcpu } // return only when no vCPU is in KVM_RUN pause_vm() { if this vCPU is running if not in an event handler // caller should do pause_vcpu and defer the work return // we know this vCPU is not KVM_RUN vcpu_not_running() num_pause_vm_requests++ if (num vcpus running > 0) for each vCPU that is running and not being-paused pause_vcpu(id) while (num vcpus running > 0) cond_wait(no_running_vcpus) } // tell paused vCPUs that they can resume resume_vm() { num_pause_vm_requests-- if (num_pause_all_requests == 0) cond_broadcast(no_pending_pause_vm_requests) // either we're in an event handler, or a "pause" command was // sent for this vCPU. in any case we're guaranteed to do an // event_reply sooner or later, which will again mark the vCPU // as running } // after an event reply, the vCPU goes back to KVM_RUN. therefore // an event reply can act as a synchronization point for pause-vm // requests: delay the reply if there's such a request event_reply(id, data) { if (num_pause_vm_requests > 0) { if vCPU "id" is running vcpu_not_running(id) while (num_pause_vm_requests > 0) cond_wait(no_pending_pause_vm_requests) } mark vCPU "id" as running send event reply on KVMI socket } // this is what you do when KVM tells you that the guest is either // in userspace, or waiting to be woken up ("paused" event). from // the introspector POV the two are the same. vcpu_ack_pause(id) { vcpu_not_running(id) unmark vCPU "id" as being-paused // deferred work presumably calls pause_vm/resumve_vm, and this // vCPU is not running now, so this is a nice point to flush it if any deferred work exists, do it now } and on the KVMI read handler: on reply to "pause" command: if reply says the vCPU is currently in userspace // we'll get a KVMI_PAUSED event as soon as the host // reenters KVM with KVM_RUN, but we can already say the // CPU is not running vcpu_ack_pause() on "paused" event: vcpu_ack_pause() event_reply() Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-07 15:56 ` Paolo Bonzini @ 2017-08-07 16:44 ` Mihai Donțu 0 siblings, 0 replies; 50+ messages in thread From: Mihai Donțu @ 2017-08-07 16:44 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel, Andrei Vlad LUTAS On Mon, 2017-08-07 at 17:56 +0200, Paolo Bonzini wrote: > On 07/08/2017 16:12, Mihai Donțu wrote: > > On Mon, 2017-08-07 at 15:49 +0200, Paolo Bonzini wrote: > > > On 07/08/2017 15:25, Mihai Donțu wrote: > > > > > "Pause all VCPUs and stop all DMA" would definitely be a layering > > > > > violation, so it cannot be added. > > > > > > > > > > "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with > > > > > a given id" commands. I lean towards omitting it. > > > > > > > > The case where the introspector wants to scan the guest memory needs a > > > > KVMI_PAUSE_VM, which as discussed in a previous email, can be the > > > > actual qemu 'pause' command. > > > > > > Do you mean it needs to stop DMA as well? > > > > No, DMA can proceed normally. I remain of the opinion that KVMI users > > must know what guest memory ranges are OK to access by looking at MTRR- > > s, PAT or guest kernel structures, or a combination of all three. > > Ok, good. Sorry if I am dense on the DMA/no-DMA cases. I think it's OK to restate things, especially since my (our) view on these matters might not match the KVM reality that you know far better. > (But, I don't understand your remark about guest memory ranges; the point of > bus-master DMA is that it works on any memory, and cache snooping makes > it even easier for hypothetical malware to do memory writes via > bus-master DMA). This is where I separated things in my head: I merely limited myself to accessing memory, while leaving the reality of DMA-based attacks a problem to be solved separately. There is some reasearch work being tested internally on that, but I have yet to get in touch with the people involved in it. As soon as I get some details maybe we can connect something in KVM. > > > > However, we would like to limit the > > > > communication channels we have with the host and not use qmp (or > > > > libvirt/etc. if qmp is not exposed). Instead, have a command that > > > > triggers a KVM_RUN exit to qemu which in turn will call the underlying > > > > pause function used by qmp. Would that be OK with you? > > > > > > You would have to send back something on completion, and then I am > > > worried of races and deadlocks. Plus, pausing a VM at the QEMU level is > > > a really expensive operation, so I don't think it's a good idea to let > > > the introspector do this. You can pause all VCPUs, or use memory page > > > permissions. > > > > Pausing all vCPU-s was my first thought, I was just trying to follow > > your statement: "I lean towards omitting it". :-) > > Yes, and I still do because a hypothetical "pause all VCPUs" command > still has the issue that you could get other events before the command > completes. So I am not convinced that a specialized command actually > makes the introspector code much simpler. > > I hope you understand that I want to keep the trusted base (not just the > code I maintain, though that is a secondary benefit ;)) as simple as > possible. > > > It will take a bit of user-space-fu, in that after issuing N vCPU pause > > commands in a row we will have to wait for N events, which might race > > with other events (MSR, CRx etc.) which need handling otherwise the > > pause ones will not arrive > > The same issue would be there in QEMU or KVM though. > > If you can always request "pause all vCPUs" from an event handler, > avoiding deadlocks is relatively easy. If you cannot ensure that, for > example because of work that is scheduled periodically, you can send a > KVM_PAUSE command to ensure the work is done in a safe condition. > > Then you get the following pseudocode algorithm: > > // a vCPU is not executing guest code, and it's going to check > // num_pause_vm_requests before going back to guest code > vcpu_not_running(id) { > unmark vCPU "id" as running > if (num vcpus running == 0) > cond_broadcast(no_running_cpus) > } > > pause_vcpu(id) { > mark vCPU "id" as being-paused > send KVMI_PAUSED for the vcpu > } > > // return only when no vCPU is in KVM_RUN > pause_vm() { > if this vCPU is running > if not in an event handler > // caller should do pause_vcpu and defer the work > return > > // we know this vCPU is not KVM_RUN > vcpu_not_running() > > num_pause_vm_requests++ > if (num vcpus running > 0) > for each vCPU that is running and not being-paused > pause_vcpu(id) > while (num vcpus running > 0) > cond_wait(no_running_vcpus) > } > > // tell paused vCPUs that they can resume > resume_vm() { > num_pause_vm_requests-- > if (num_pause_all_requests == 0) > cond_broadcast(no_pending_pause_vm_requests) > // either we're in an event handler, or a "pause" command was > // sent for this vCPU. in any case we're guaranteed to do an > // event_reply sooner or later, which will again mark the vCPU > // as running > } > > // after an event reply, the vCPU goes back to KVM_RUN. therefore > // an event reply can act as a synchronization point for pause-vm > // requests: delay the reply if there's such a request > event_reply(id, data) { > if (num_pause_vm_requests > 0) { > if vCPU "id" is running > vcpu_not_running(id) > while (num_pause_vm_requests > 0) > cond_wait(no_pending_pause_vm_requests) > } > mark vCPU "id" as running > send event reply on KVMI socket > } > > // this is what you do when KVM tells you that the guest is either > // in userspace, or waiting to be woken up ("paused" event). from > // the introspector POV the two are the same. > vcpu_ack_pause(id) { > vcpu_not_running(id) > unmark vCPU "id" as being-paused > > // deferred work presumably calls pause_vm/resumve_vm, and this > // vCPU is not running now, so this is a nice point to flush it > if any deferred work exists, do it now > } > > and on the KVMI read handler: > > on reply to "pause" command: > if reply says the vCPU is currently in userspace > // we'll get a KVMI_PAUSED event as soon as the host > // reenters KVM with KVM_RUN, but we can already say the > // CPU is not running > vcpu_ack_pause() > > on "paused" event: > vcpu_ack_pause() > event_reply() Thank you for this! -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-08-02 13:32 ` Mihai Donțu 2017-08-02 13:51 ` Paolo Bonzini @ 2017-08-02 13:53 ` Mihai Donțu 1 sibling, 0 replies; 50+ messages in thread From: Mihai Donțu @ 2017-08-02 13:53 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm, Tamas K Lengyel On Wed, 2017-08-02 at 16:32 +0300, Mihai Donțu wrote: > On Wed, 2017-08-02 at 14:27 +0200, Paolo Bonzini wrote: > > On 02/08/2017 13:52, Mihai Donțu wrote: > > > I think it might help to distinguish two situations in which we require > > > the guest _or_ a single vCPU to be paused. Our initial KVMI_PAUSE_GUEST > > > command can be translated into a qemu pause. In our particular usecase > > > we made special arrangements to call it as few times as possible > > > assuming it's very costly. The other is needed only by the internal KVM > > > code for situations similar to: > > > > > > kvm_pause_vcpu(vcpu); > > > vcpu_load(vcpu); > > > kvm_arch_vcpu_ioctl_get_regs(vcpu, regs); > > > vcpu_put(vcpu); > > > kvm_unpause_vcpu(vcpu); > > > > > > or more generally put, for accesses that involve the vCPU state > > > (registers, MSR-s, exceptions etc.), no guest memory involved. > > > > > > Here kvm_pause_vcpu() will only pull the vCPU out of the guest and, if > > > so, make it somehow available for quick re-entry with > > > kvm_unpause_vcpu(). If said vCPU is already out, then the function will > > > be a no-op. Obviously, kvm_{pause,unpause}_vcpu() will do nothing if > > > we're currently handling an event or one is pending. > > > > Understood. The issue is that there is an inherent race between > > anything userspace is doing and get_regs. What are the cases where you > > need to get regs or similar outside an event? > > We have currently identified three cases: > > * initial hooking of a guest > * periodically checking the integrity of data that is not properly > placed into a page and thus cannot be efficiently tracked via SPT > * injecting processes A few more details are required here, taken from our current application: cases (a) and (c) involve poking the vCPU-s a couple of times before doing KVMI_PAUSE_GUEST() after which everything should be OK, though we should make sure qemu sync-s the states before entering the paused state. Case (b), as far as I can see, is implemented only via memory maps (eg. the IDT is mapped and kept around for quick checking), but a need _might_ arrise to poke the vCPU a few times before concluding that an area has really been corrupted. > > > One exception that might have a better solution is: > > > > > > kvm_pause_all_vcpus(kvm); > > > kvm_set_page_access(kvm, gfn); /* pause for get too? */ > > > kvm_unpause_all_vcpus(kvm); > > > > > > There might be a way to make the change and then IPI all vCPU-s without > > > pulling them out of the guest. > > > > For that I think KVMI should define a VM-wide "mask" layered over the > > actual memory map permissions. Such a command can be implemented > > relatively easily by hooking into the callers of __gfn_to_pfn_memslot > > and kvm_vcpu_gfn_to_hva_prot. -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-07 16:52 ` Paolo Bonzini 2017-07-10 15:32 ` alazar 2017-07-13 8:36 ` Mihai Donțu @ 2017-07-27 17:06 ` Mihai Donțu 2017-07-27 17:18 ` Paolo Bonzini 2 siblings, 1 reply; 50+ messages in thread From: Mihai Donțu @ 2017-07-27 17:06 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote: > > +2. KVMI_GET_GUEST_INFO > > +---------------------- > > + > > +:Architectures: all > > +:Versions: >= 1 > > +:Parameters: {} > > +:Returns: ↴ > > + > > +:: > > + > > > > + struct kvmi_get_guest_info_reply { > > > > + __s32 err; > > > > + __u16 vcpu_count; > > > > + __u16 padding; > > > > + __u64 tsc_speed; > > > > + }; > > + > > +Returns the number of online vcpus, and the TSC frequency in HZ, if supported > > +by the architecture (otherwise is 0). > > Can the TSC frequency be specified only if the guest is using TSC > scaling? Defining the TSC frequency on older hosts is a bit tricky. 0 > would be the host. > > Maybe define the second padding to be > > __u16 arch; > > (0 = x86) followed by an arch-specific payload. We use the TSC to compute some performance numbers, but only when we're debugging reported issues. Should be OK if the TSC information is not available. We'll manage in some other way. > > +10. KVMI_GET_XSAVE_INFO > > +----------------------- > > + > > +:Architectures: x86 > > +:Versions: >= 1 > > +:Parameters: ↴ > > + > > +:: > > + > > + struct kvmi_xsave_info { > > + __u16 vcpu; > > + __u16 padding[3]; > > + }; > > + > > +:Returns: ↴ > > + > > +:: > > + > > + struct kvmi_xsave_info_reply { > > + __s32 err; > > + __u32 size; > > + }; > > + > > +Returns the xstate size for the specified vCPU. > > Could this be replaced by a generic CPUID accessor? Absolutely. I wonder if we should only made certain leafs acessible, part of the whole "make the least amount of information available" security philosophy, though I'm not aware of any attacks that can be mounted on the host just by knowing too many things about a guest. > > +11. KVMI_GET_PAGE_ACCESS > > +------------------------ > > + > > +:Architectures: all > > +:Versions: >= 1 > > +:Parameters: ↴ > > + > > +:: > > + > > + struct kvmi_get_page_access { > > + __u16 vcpu; > > + __u16 padding[3]; > > + __u64 gpa; > > + }; > > + > > +:Returns: ↴ > > + > > +:: > > + > > + struct kvmi_get_page_access_reply { > > + __s32 err; > > + __u32 access; > > + }; > > + > > +Returns the spte flags (rwx - present, write & user) for the specified > > +vCPU and guest physical address. > > + > > +12. KVMI_SET_PAGE_ACCESS > > +------------------------ > > + > > +:Architectures: all > > +:Versions: >= 1 > > +:Parameters: ↴ > > + > > +:: > > + > > + struct kvmi_set_page_access { > > + __u16 vcpu; > > + __u16 padding; > > + __u32 access; > > + __u64 gpa; > > + }; > > + > > +:Returns: ↴ > > + > > +:: > > + > > + struct kvmi_error_code { > > + __s32 err; > > + __u32 padding; > > + }; > > + > > +Sets the spte flags (rwx - present, write & user) - access - for the specified > > +vCPU and guest physical address. > > rwx or pwu? I suppose RWX. Maybe #define the constants in the > documentation. > > Also, it should be noted here that the spte flags are ANDed with > whatever is provided by userspace, because there could be readonly > memslots, and that some access combinations could fail (--x) or will > surely fail (-wx for example). We are closely looking into how KVM's MMU works and see how we'd go about extending it so as to make everyone happy (qemu and a possible introspection tool). As for the permitted page access flags, we have (thus far) kept away from invalid combinations on older arch-es (--x on pre-SandyBridge, for example). However, it would be very nice to have arch-specific code that does a validation ahead of time (ie. before the guest is re- entered) so that an introspection tool can use an alternative approach. > > +13. KVMI_INJECT_PAGE_FAULT > > +-------------------------- > > + > > +:Architectures: x86 > > +:Versions: >= 1 > > +:Parameters: ↴ > > + > > +:: > > + > > + struct kvmi_page_fault { > > + __u16 vcpu; > > + __u16 padding; > > + __u32 error; > > error_code, I guess? Why not a generic inject exception message? OK, we will rework the interface to be used for generic exception injection. Maybe we can fold the breakpoint injection into it too. > > + __u64 gva; > > + }; > > + > > +:Returns: ↴ > > + > > +:: > > + > > + struct kvmi_error_code { > > + __s32 err; > > + __u32 padding; > > + }; > > + > > +Injects a vCPU page fault with the specified guest virtual address and > > +error code. > > + > > +5. KVMI_EVENT_USER_CALL > > +----------------------- > > Please rename this to KVMI_EVENT_HCALL or HYPERCALL or VMCALL. > > > + > > +:Architectures: x86 > > +:Versions: >= 1 > > +:Parameters: ↴ > > + > > +:: > > + > > + struct kvmi_event_x86; > > + > > +:Returns: ↴ > > + > > +:: > > + > > + struct kvmi_event_x86_reply; > > + > > +This event is sent on a user hypercall and the introspection has already > > +already been enabled for this kind of event (KVMI_CONTROL_EVENTS). > > + > > +kvmi_event_x86 is sent to the introspector, which can respond with the > > +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host > > +kernel to override the general purpose registers using the values from > > +introspector (regs). > > Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall? Why is > KVMI_EVENT_ACTION_ALLOW not allowed? As before, I'd prefer > SKIP/RETRY/ALLOW/CRASH as the actions. No, we use SET_REGS only as a means to communicate with the guest code that did the hypercall. We don't specify an action because, in our specific case, saw no use for it. Apropos: if possible, we'd like to keep the Xen convention for this hypercall. It doesn't appear to interfere with either KVM or Hyper-V (rax = 34 [__HYPERVISOR_hvm_op], rbx/rdi = 24 [HVMOP_guest_request_vm_event] - depends on wether long mode is enabled). > > +7. KVMI_EVENT_TRAP > > +------------------ > > + > > +:Architectures: x86 > > +:Versions: >= 1 > > +:Parameters: ↴ > > + > > +:: > > + > > + struct kvmi_event_x86; > > + struct kvmi_event_trap { > > + __u32 vector; > > + __u32 type; > > + __u32 err; > > error_code? > > > + __u32 padding; > > + __u64 cr2; > > + }; > > + > > +:Returns: ↴ > > + > > +:: > > + > > + struct kvmi_event_x86_reply; > > + > > +This event is sent if a trap will be delivered to the guest (page fault, > > +breakpoint, etc.) and the introspection has already enabled the reports > > +for this kind of event (KVMI_CONTROL_EVENTS). > > + > > +This is used to inform the introspector of all pending traps giving it > > +a chance to determine if it should try again later in case a previous > > +KVMI_INJECT_PAGE_FAULT/KVMI_INJECT_BREAKPOINT command has been overwritten > > +by an interrupt picked up during guest reentry. > > + > > +kvmi_event_x86, exception/interrupt number (vector), exception/interrupt > > +type, exception code (err) and CR2 are sent to the introspector, which can > > +respond with the KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing > > +the host kernel to override the general purpose registers using the values > > +from introspector (regs). > > Here I think the actions could be RETRY/ALLOW/CRASH only, with "set > regs" done as a separate command. > > Some x86 exceptions are faults, others are traps. Traps should not > allow RETRY as an action. There should be an input telling the > introspector if retrying is allowed. > > How are dr6/dr7 handled around breakpoints? Afaik, the debug registers should remain untouched, but I will ask my colleagues with better knowledge of this. Thanks, -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for VM introspection 2017-07-27 17:06 ` Mihai Donțu @ 2017-07-27 17:18 ` Paolo Bonzini 0 siblings, 0 replies; 50+ messages in thread From: Paolo Bonzini @ 2017-07-27 17:18 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar, kvm On 27/07/2017 19:06, Mihai Donțu wrote: >>> + >>> + struct kvmi_xsave_info_reply { >>> + __s32 err; >>> + __u32 size; >>> + }; >>> + >>> +Returns the xstate size for the specified vCPU. >> Could this be replaced by a generic CPUID accessor? > > Absolutely. I wonder if we should only made certain leafs acessible, > part of the whole "make the least amount of information available" > security philosophy, though I'm not aware of any attacks that can be > mounted on the host just by knowing too many things about a guest. That should be safe. >> Also, it should be noted here that the spte flags are ANDed with >> whatever is provided by userspace, because there could be readonly >> memslots, and that some access combinations could fail (--x) or will >> surely fail (-wx for example). > We are closely looking into how KVM's MMU works and see how we'd go > about extending it so as to make everyone happy (qemu and a possible > introspection tool). > > As for the permitted page access flags, we have (thus far) kept away > from invalid combinations on older arch-es (--x on pre-SandyBridge, for > example). However, it would be very nice to have arch-specific code > that does a validation ahead of time (ie. before the guest is re- > entered) so that an introspection tool can use an alternative approach. We can make the command fail if you end up with --x (and it is not supported) or -wx. >> Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall? Why is >> KVMI_EVENT_ACTION_ALLOW not allowed? As before, I'd prefer >> SKIP/RETRY/ALLOW/CRASH as the actions. > > No, we use SET_REGS only as a means to communicate with the guest code > that did the hypercall. We don't specify an action because, in our > specific case, saw no use for it. Ok, I'd prefer the usual set of actions just for consistency. ALLOW would fall through to KVM's handling of the hypercall. > Apropos: if possible, we'd like to keep the Xen convention for this > hypercall. It doesn't appear to interfere with either KVM or Hyper-V > (rax = 34 [__HYPERVISOR_hvm_op], rbx/rdi = 24 > [HVMOP_guest_request_vm_event] - depends on wether long mode is > enabled). I was actually thinking that you'd trap all hypercalls to the introspector, hence SKIP/RETRY/ALLOW/CRASH. Reserving RAX = 34 for you is fine (but please document the ABI and subfunctions). Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 0/1] VM introspection 2017-07-07 14:34 [RFC PATCH v2 0/1] VM introspection Adalbert Lazar 2017-07-07 14:34 ` [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for " Adalbert Lazar @ 2017-07-07 17:29 ` Paolo Bonzini 2017-08-07 15:28 ` Mihai Donțu 2017-07-12 14:09 ` Konrad Rzeszutek Wilk 2 siblings, 1 reply; 50+ messages in thread From: Paolo Bonzini @ 2017-07-07 17:29 UTC (permalink / raw) To: Adalbert Lazar Cc: Radim Krčmář, Mihai Dontu, Jan Kiszka, Stefan Hajnoczi, KVM list, Andy Lutomirski On 07/07/2017 16:34, Adalbert Lazar wrote: > One bit of code that has passed (maybe) unnoticed in the RFC is a new > function added to Linux' mm called vm_replace_page() which, much like KSM's > replace_page(), gets two processes to share a page (read-write, no-COW): > > https://marc.info/?l=kvm&m=149762056518799&w=2 > > This is used to quickly scan and patch the guest software. Thanks for pointing this out. In my review of patch 1 I suggested using only read/write, but it's slow. I think we need to figure out a safe way to map foreign memory, as I'm worried of TOC/TOU races for obvious reasons. One thing I was thinking about (but didn't have much time to completely think through) is a special /dev/kvmmem device, where you could do kvmmem_fd = open("/dev/kvmmem", O_RDWR); ptr = ioctl(kvmmem_fd, KVMMEM_MAP_MEMORY, { token, size }); ioctl(kvmmem_fd, KVMMEM_UNMAP_MEMORY, { ptr, size }); The map/unmap memory operation would be a hypercall, not a socket command, but the random "token" would be returned on the socket via some KVMI_MAP_PHYSICAL_PAGE_TO_GUEST command (or more accurately, a replacement accepting {gpa, size} instead of {gpa, gfn_dest}). Handles can be short lived, e.g. you could have at most a small number tokens per host created (and passed back via KVMI) but not yet used by the hypercall. Once it's used by the hypercall, the token is not needed anymore, so this is not a strong limitation. After KVMMEM_MAP_MEMORY, you'd get a SIGSEGV if the guest memory layout changes (userfaultfd can be used by the introspector to simplify the handling and retry). You'd have to re-map the memory explicitly. Alas I have no idea how to verify the handle securely on the host, since the host is not supposed to know which guests are introspectors and which host got which token. But maybe if the token namespace is big enough (256 bits?) and random, it's okay to ignore the possibility that a guest tries to guess. (This idea is roughly based on how SCSI offloaded copies work). Andy, does it look like utter BS or could it have some merit? Paolo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 0/1] VM introspection 2017-07-07 17:29 ` [RFC PATCH v2 0/1] " Paolo Bonzini @ 2017-08-07 15:28 ` Mihai Donțu 2017-08-07 15:44 ` Paolo Bonzini 0 siblings, 1 reply; 50+ messages in thread From: Mihai Donțu @ 2017-08-07 15:28 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, KVM list, Andy Lutomirski, Adalbert Lazar On Fri, 2017-07-07 at 19:29 +0200, Paolo Bonzini wrote: > On 07/07/2017 16:34, Adalbert Lazar wrote: > > One bit of code that has passed (maybe) unnoticed in the RFC is a new > > function added to Linux' mm called vm_replace_page() which, much like KSM's > > replace_page(), gets two processes to share a page (read-write, no-COW): > > > > https://marc.info/?l=kvm&m=149762056518799&w=2 > > > > This is used to quickly scan and patch the guest software. > > Thanks for pointing this out. > > In my review of patch 1 I suggested using only read/write, but it's slow. > > I think we need to figure out a safe way to map foreign memory, as I'm > worried of TOC/TOU races for obvious reasons. Would it be possible to describe the race you are referring to? I would imagine that grabbing a hold of the task and/or mm descriptor would be enough to ensure the gfn is available for mapping (and locking), though it's true that qemu might (does it ever?) shuffle things around so the introspector would end up with a mapping pointing to the wrong thing. > One thing I was thinking about (but didn't have much time to completely > think through) is a special /dev/kvmmem device, where you could do > > kvmmem_fd = open("/dev/kvmmem", O_RDWR); > ptr = ioctl(kvmmem_fd, KVMMEM_MAP_MEMORY, { token, size }); > ioctl(kvmmem_fd, KVMMEM_UNMAP_MEMORY, { ptr, size }); The idea with a device solves one of our problems with KVMI_MAP_PHYSICAL_PAGE_TO_GUEST: the introspector has to undo the mapping, otherwise the VM will crash when the kernel reuses the gfn. This does not happen if the introspector crashes ... We can make it so that when kvmmem_fd is closed, all mappings are automatically undone. > The map/unmap memory operation would be a hypercall, not a socket > command, but the random "token" would be returned on the socket via some > KVMI_MAP_PHYSICAL_PAGE_TO_GUEST command (or more accurately, a > replacement accepting {gpa, size} instead of {gpa, gfn_dest}). Handles > can be short lived, e.g. you could have at most a small number tokens > per host created (and passed back via KVMI) but not yet used by the > hypercall. Once it's used by the hypercall, the token is not needed > anymore, so this is not a strong limitation. I understand that the token is used to make sure the mapping hypercall is only used by VM-s and applications doing introspection. > After KVMMEM_MAP_MEMORY, you'd get a SIGSEGV if the guest memory layout > changes (userfaultfd can be used by the introspector to simplify the > handling and retry). You'd have to re-map the memory explicitly. "guest memory layout changes" - is this the guest _being_ introspected? If so, how would an introspector running in a separate VM get a SIGSEGV? Have KVM inject a page fault if the host encounters a #PF for the VMA representing the foreign mapping (presumably it knows the gva as well)? > Alas I have no idea how to verify the handle securely on the host, since > the host is not supposed to know which guests are introspectors and > which host got which token. But maybe if the token namespace is big > enough (256 bits?) and random, it's okay to ignore the possibility that > a guest tries to guess. (This idea is roughly based on how SCSI > offloaded copies work). > > Andy, does it look like utter BS or could it have some merit? -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 0/1] VM introspection 2017-08-07 15:28 ` Mihai Donțu @ 2017-08-07 15:44 ` Paolo Bonzini 0 siblings, 0 replies; 50+ messages in thread From: Paolo Bonzini @ 2017-08-07 15:44 UTC (permalink / raw) To: Mihai Donțu Cc: Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, KVM list, Andy Lutomirski, Adalbert Lazar On 07/08/2017 17:28, Mihai Donțu wrote: > On Fri, 2017-07-07 at 19:29 +0200, Paolo Bonzini wrote: >> On 07/07/2017 16:34, Adalbert Lazar wrote: >>> One bit of code that has passed (maybe) unnoticed in the RFC is a new >>> function added to Linux' mm called vm_replace_page() which, much like KSM's >>> replace_page(), gets two processes to share a page (read-write, no-COW): >>> >>> https://marc.info/?l=kvm&m=149762056518799&w=2 >>> >>> This is used to quickly scan and patch the guest software. >> >> Thanks for pointing this out. >> >> In my review of patch 1 I suggested using only read/write, but it's slow. >> >> I think we need to figure out a safe way to map foreign memory, as I'm >> worried of TOC/TOU races for obvious reasons. > > Would it be possible to describe the race you are referring to? I would > imagine that grabbing a hold of the task and/or mm descriptor would be > enough to ensure the gfn is available for mapping (and locking), though > it's true that qemu might (does it ever?) shuffle things around so the > introspector would end up with a mapping pointing to the wrong thing. Yes, QEMU can shuffle things around. Usually it's BARs that are shuffled around, but there can be special cases where RAM changes in the memory map (for example SMRAM). >> One thing I was thinking about (but didn't have much time to completely >> think through) is a special /dev/kvmmem device, where you could do >> >> kvmmem_fd = open("/dev/kvmmem", O_RDWR); >> ptr = ioctl(kvmmem_fd, KVMMEM_MAP_MEMORY, { token, size }); >> ioctl(kvmmem_fd, KVMMEM_UNMAP_MEMORY, { ptr, size }); > > The idea with a device solves one of our problems with > KVMI_MAP_PHYSICAL_PAGE_TO_GUEST: the introspector has to undo the > mapping, otherwise the VM will crash when the kernel reuses the gfn. > This does not happen if the introspector crashes ... > > We can make it so that when kvmmem_fd is closed, all mappings are > automatically undone. Yes, that's a nice side effect. >> The map/unmap memory operation would be a hypercall, not a socket >> command, but the random "token" would be returned on the socket via some >> KVMI_MAP_PHYSICAL_PAGE_TO_GUEST command (or more accurately, a >> replacement accepting {gpa, size} instead of {gpa, gfn_dest}). Handles >> can be short lived, e.g. you could have at most a small number tokens >> per host created (and passed back via KVMI) but not yet used by the >> hypercall. Once it's used by the hypercall, the token is not needed >> anymore, so this is not a strong limitation. > > I understand that the token is used to make sure the mapping hypercall > is only used by VM-s and applications doing introspection. Yes. >> After KVMMEM_MAP_MEMORY, you'd get a SIGSEGV if the guest memory layout >> changes (userfaultfd can be used by the introspector to simplify the >> handling and retry). You'd have to re-map the memory explicitly. > > "guest memory layout changes" - is this the guest _being_ introspected? Yes. > If so, how would an introspector running in a separate VM get a > SIGSEGV? Have KVM inject a page fault if the host encounters a #PF for > the VMA representing the foreign mapping (presumably it knows the gva > as well)? When the foreign mapping is invalidated by the actions of the guest being introspected, the pages become inaccessible in the EPT page tables of the introspector VM. Then, an access in the introspector VM will result in an EPT violation, which KVM can translate to a page fault---or actually a #VE would be even better. (Even if you don't modify KVM to use a hardware #VE, KVM's MMU can choose to inject the exception as a #VE rather than #PF). That said, I think this should come later, and for the beginning only read/write should be included. Paolo >> Alas I have no idea how to verify the handle securely on the host, since >> the host is not supposed to know which guests are introspectors and >> which host got which token. But maybe if the token namespace is big >> enough (256 bits?) and random, it's okay to ignore the possibility that >> a guest tries to guess. (This idea is roughly based on how SCSI >> offloaded copies work). ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 0/1] VM introspection 2017-07-07 14:34 [RFC PATCH v2 0/1] VM introspection Adalbert Lazar 2017-07-07 14:34 ` [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for " Adalbert Lazar 2017-07-07 17:29 ` [RFC PATCH v2 0/1] " Paolo Bonzini @ 2017-07-12 14:09 ` Konrad Rzeszutek Wilk 2017-07-13 5:37 ` Mihai Donțu 2 siblings, 1 reply; 50+ messages in thread From: Konrad Rzeszutek Wilk @ 2017-07-12 14:09 UTC (permalink / raw) To: Adalbert Lazar Cc: kvm, Paolo Bonzini, Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Mihai Dontu On Fri, Jul 07, 2017 at 05:34:15PM +0300, Adalbert Lazar wrote: > The following patch adds the documentation for an introspection subsystem for > KVM (KVMi). It details the purpose and the use case that has shaped the > proposed API/ABI, as well as the wire protocol. > > During the discussion that has developed around our previous RFC patchset a > number of TODO-s have been highlighted: > > * the integration in qemu: a socket-based protocol used to initiate the > connection with an introspection tool and then passes control to KVM, the > in-kernel mechanism taking over from there; > > * the integration of the SPT-handling mechanism into the KVM MMU; > > * the elaboration of a set of policies and a mechanism to better control > this feature; > > One bit of code that has passed (maybe) unnoticed in the RFC is a new > function added to Linux' mm called vm_replace_page() which, much like KSM's > replace_page(), gets two processes to share a page (read-write, no-COW): > > https://marc.info/?l=kvm&m=149762056518799&w=2 > > This is used to quickly scan and patch the guest software. > > The patch following this cover letter does not yet address the points above > but aims to clear with the community the overall ABI/API, with a focus on > x86. Are there thoughts on making this work with libvmi? Or would this interface be used by a paid product? Thanks. > > Thank you, > > v2: > - add documentation and ABI [Paolo, Jan] > - drop all the other patches for now [Paolo] > - remove KVMI_GET_GUESTS, KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, > and let libvirt/qemu handle this [Stefan, Paolo] > - change the license from LGPL to GPL [Jan] > - remove KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL (not used anymore) > - make the interface a little more consistent > > Adalbert Lazar (1): > kvm: Add documentation and ABI/API header for VM introspection > > Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++ > include/uapi/linux/kvmi.h | 310 ++++++++++++ > 2 files changed, 1295 insertions(+) > create mode 100644 Documentation/virtual/kvm/kvmi.rst > create mode 100644 include/uapi/linux/kvmi.h > ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 0/1] VM introspection 2017-07-12 14:09 ` Konrad Rzeszutek Wilk @ 2017-07-13 5:37 ` Mihai Donțu 2017-07-14 16:13 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 50+ messages in thread From: Mihai Donțu @ 2017-07-13 5:37 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: kvm, Paolo Bonzini, Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar Hi Konrad, On Wed, 2017-07-12 at 10:09 -0400, Konrad Rzeszutek Wilk wrote: > On Fri, Jul 07, 2017 at 05:34:15PM +0300, Adalbert Lazar wrote: > > The following patch adds the documentation for an introspection subsystem for > > KVM (KVMi). It details the purpose and the use case that has shaped the > > proposed API/ABI, as well as the wire protocol. > > > > During the discussion that has developed around our previous RFC patchset a > > number of TODO-s have been highlighted: > > > > * the integration in qemu: a socket-based protocol used to initiate the > > connection with an introspection tool and then passes control to KVM, the > > in-kernel mechanism taking over from there; > > > > * the integration of the SPT-handling mechanism into the KVM MMU; > > > > * the elaboration of a set of policies and a mechanism to better control > > this feature; > > > > One bit of code that has passed (maybe) unnoticed in the RFC is a new > > function added to Linux' mm called vm_replace_page() which, much like KSM's > > replace_page(), gets two processes to share a page (read-write, no-COW): > > > > https://marc.info/?l=kvm&m=149762056518799&w=2 > > > > This is used to quickly scan and patch the guest software. > > > > The patch following this cover letter does not yet address the points above > > but aims to clear with the community the overall ABI/API, with a focus on > > x86. > > Are there thoughts on making this work with libvmi? Or would this > interface be used by a paid product? We have not looked at how we would go about adding support for this API in libvmi, but a quick look tells me a driver shouldn't be that hard to write. We are, however, looking at publishing the sources for a small library called libkvmi that would expose an easy-to-use interface and on top of which we will add support for KVM VMI in: https://github.com/razvan-cojocaru/libbdvmi Other than that, the major user will indeed be a commercial product that, right now, uses Xen's VMI infrastructure: https://citrixready.citrix.com/bitdefender/bitdefender-hypervisor-introspection.html Thanks, > > v2: > > - add documentation and ABI [Paolo, Jan] > > - drop all the other patches for now [Paolo] > > - remove KVMI_GET_GUESTS, KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, > > and let libvirt/qemu handle this [Stefan, Paolo] > > - change the license from LGPL to GPL [Jan] > > - remove KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL (not used anymore) > > - make the interface a little more consistent > > > > Adalbert Lazar (1): > > kvm: Add documentation and ABI/API header for VM introspection > > > > Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++ > > include/uapi/linux/kvmi.h | 310 ++++++++++++ > > 2 files changed, 1295 insertions(+) > > create mode 100644 Documentation/virtual/kvm/kvmi.rst > > create mode 100644 include/uapi/linux/kvmi.h > > -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 0/1] VM introspection 2017-07-13 5:37 ` Mihai Donțu @ 2017-07-14 16:13 ` Konrad Rzeszutek Wilk 2017-07-18 8:55 ` Mihai Donțu 0 siblings, 1 reply; 50+ messages in thread From: Konrad Rzeszutek Wilk @ 2017-07-14 16:13 UTC (permalink / raw) To: Mihai Donțu Cc: Konrad Rzeszutek Wilk, kvm, Paolo Bonzini, Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar On Thu, Jul 13, 2017 at 08:37:07AM +0300, Mihai Donțu wrote: > Hi Konrad, > > On Wed, 2017-07-12 at 10:09 -0400, Konrad Rzeszutek Wilk wrote: > > On Fri, Jul 07, 2017 at 05:34:15PM +0300, Adalbert Lazar wrote: > > > The following patch adds the documentation for an introspection subsystem for > > > KVM (KVMi). It details the purpose and the use case that has shaped the > > > proposed API/ABI, as well as the wire protocol. > > > > > > During the discussion that has developed around our previous RFC patchset a > > > number of TODO-s have been highlighted: > > > > > > * the integration in qemu: a socket-based protocol used to initiate the > > > connection with an introspection tool and then passes control to KVM, the > > > in-kernel mechanism taking over from there; > > > > > > * the integration of the SPT-handling mechanism into the KVM MMU; > > > > > > * the elaboration of a set of policies and a mechanism to better control > > > this feature; > > > > > > One bit of code that has passed (maybe) unnoticed in the RFC is a new > > > function added to Linux' mm called vm_replace_page() which, much like KSM's > > > replace_page(), gets two processes to share a page (read-write, no-COW): > > > > > > https://marc.info/?l=kvm&m=149762056518799&w=2 > > > > > > This is used to quickly scan and patch the guest software. > > > > > > The patch following this cover letter does not yet address the points above > > > but aims to clear with the community the overall ABI/API, with a focus on > > > x86. > > > > Are there thoughts on making this work with libvmi? Or would this > > interface be used by a paid product? > > We have not looked at how we would go about adding support for this API > in libvmi, but a quick look tells me a driver shouldn't be that hard to > write. OK. Asking as the libvmi interface is quite fantastic. It has the drivers for VMWare, Xen, and others - and a pretty cool library to do most of the introspection through simple application linking to it. And more importantly (to me at least), it does allow academia to test different ideas. > > We are, however, looking at publishing the sources for a small library > called libkvmi that would expose an easy-to-use interface and on top of > which we will add support for KVM VMI in: > > https://github.com/razvan-cojocaru/libbdvmi And thoughts of testing API that will be part of KVM test unit? > > Other than that, the major user will indeed be a commercial product > that, right now, uses Xen's VMI infrastructure: > > https://citrixready.citrix.com/bitdefender/bitdefender-hypervisor-introspection.html > > Thanks, > > > > v2: > > > - add documentation and ABI [Paolo, Jan] > > > - drop all the other patches for now [Paolo] > > > - remove KVMI_GET_GUESTS, KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, > > > and let libvirt/qemu handle this [Stefan, Paolo] > > > - change the license from LGPL to GPL [Jan] > > > - remove KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL (not used anymore) > > > - make the interface a little more consistent > > > > > > Adalbert Lazar (1): > > > kvm: Add documentation and ABI/API header for VM introspection > > > > > > Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++ > > > include/uapi/linux/kvmi.h | 310 ++++++++++++ > > > 2 files changed, 1295 insertions(+) > > > create mode 100644 Documentation/virtual/kvm/kvmi.rst > > > create mode 100644 include/uapi/linux/kvmi.h > > > > > -- > Mihai Donțu > ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC PATCH v2 0/1] VM introspection 2017-07-14 16:13 ` Konrad Rzeszutek Wilk @ 2017-07-18 8:55 ` Mihai Donțu 0 siblings, 0 replies; 50+ messages in thread From: Mihai Donțu @ 2017-07-18 8:55 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Konrad Rzeszutek Wilk, kvm, Paolo Bonzini, Radim Krčmář, Jan Kiszka, Stefan Hajnoczi, Adalbert Lazar On Fri, 2017-07-14 at 12:13 -0400, Konrad Rzeszutek Wilk wrote: > On Thu, Jul 13, 2017 at 08:37:07AM +0300, Mihai Donțu wrote: > > Hi Konrad, > > > > On Wed, 2017-07-12 at 10:09 -0400, Konrad Rzeszutek Wilk wrote: > > > On Fri, Jul 07, 2017 at 05:34:15PM +0300, Adalbert Lazar wrote: > > > > The following patch adds the documentation for an introspection subsystem for > > > > KVM (KVMi). It details the purpose and the use case that has shaped the > > > > proposed API/ABI, as well as the wire protocol. > > > > > > > > During the discussion that has developed around our previous RFC patchset a > > > > number of TODO-s have been highlighted: > > > > > > > > * the integration in qemu: a socket-based protocol used to initiate the > > > > connection with an introspection tool and then passes control to KVM, the > > > > in-kernel mechanism taking over from there; > > > > > > > > * the integration of the SPT-handling mechanism into the KVM MMU; > > > > > > > > * the elaboration of a set of policies and a mechanism to better control > > > > this feature; > > > > > > > > One bit of code that has passed (maybe) unnoticed in the RFC is a new > > > > function added to Linux' mm called vm_replace_page() which, much like KSM's > > > > replace_page(), gets two processes to share a page (read-write, no-COW): > > > > > > > > https://marc.info/?l=kvm&m=149762056518799&w=2 > > > > > > > > This is used to quickly scan and patch the guest software. > > > > > > > > The patch following this cover letter does not yet address the points above > > > > but aims to clear with the community the overall ABI/API, with a focus on > > > > x86. > > > > > > Are there thoughts on making this work with libvmi? Or would this > > > interface be used by a paid product? > > > > We have not looked at how we would go about adding support for this API > > in libvmi, but a quick look tells me a driver shouldn't be that hard to > > write. > > OK. Asking as the libvmi interface is quite fantastic. It has the drivers > for VMWare, Xen, and others - and a pretty cool library to do most of > the introspection through simple application linking to it. > > And more importantly (to me at least), it does allow academia > to test different ideas. > > > > We are, however, looking at publishing the sources for a small library > > called libkvmi that would expose an easy-to-use interface and on top of > > which we will add support for KVM VMI in: > > > > https://github.com/razvan-cojocaru/libbdvmi > > And thoughts of testing API that will be part of KVM test unit? We have not yet put enough though into how we will integrate with KVM's unit tests, partly because we are still making adjustments to the design. All we have on the board right now is that we must create unit tests. :-) Regards, > > Other than that, the major user will indeed be a commercial product > > that, right now, uses Xen's VMI infrastructure: > > > > https://citrixready.citrix.com/bitdefender/bitdefender-hypervisor-introspection.html > > > > > > v2: > > > > - add documentation and ABI [Paolo, Jan] > > > > - drop all the other patches for now [Paolo] > > > > - remove KVMI_GET_GUESTS, KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, > > > > and let libvirt/qemu handle this [Stefan, Paolo] > > > > - change the license from LGPL to GPL [Jan] > > > > - remove KVMI_READ_PHYSICAL and KVMI_WRITE_PHYSICAL (not used anymore) > > > > - make the interface a little more consistent > > > > > > > > Adalbert Lazar (1): > > > > kvm: Add documentation and ABI/API header for VM introspection > > > > > > > > Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++ > > > > include/uapi/linux/kvmi.h | 310 ++++++++++++ > > > > 2 files changed, 1295 insertions(+) > > > > create mode 100644 Documentation/virtual/kvm/kvmi.rst > > > > create mode 100644 include/uapi/linux/kvmi.h -- Mihai Donțu ^ permalink raw reply [flat|nested] 50+ messages in thread
end of thread, other threads:[~2017-08-07 16:44 UTC | newest] Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-07-07 14:34 [RFC PATCH v2 0/1] VM introspection Adalbert Lazar 2017-07-07 14:34 ` [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API header for " Adalbert Lazar 2017-07-07 16:52 ` Paolo Bonzini 2017-07-10 15:32 ` alazar 2017-07-10 17:03 ` Paolo Bonzini 2017-07-11 16:48 ` Adalbert Lazar 2017-07-11 16:51 ` Paolo Bonzini 2017-07-13 5:57 ` Mihai Donțu 2017-07-13 7:32 ` Paolo Bonzini 2017-07-18 11:51 ` Mihai Donțu 2017-07-18 12:02 ` Mihai Donțu 2017-07-23 13:02 ` Paolo Bonzini 2017-07-26 17:04 ` Mihai Donțu 2017-07-26 17:25 ` Tamas K Lengyel 2017-07-27 14:41 ` Mihai Donțu 2017-07-27 13:33 ` Paolo Bonzini 2017-07-27 14:46 ` Mihai Donțu 2017-07-13 8:36 ` Mihai Donțu 2017-07-13 9:15 ` Paolo Bonzini 2017-07-27 16:23 ` Mihai Donțu 2017-07-27 16:52 ` Paolo Bonzini 2017-07-27 17:19 ` Mihai Donțu 2017-08-01 10:40 ` Paolo Bonzini 2017-08-01 16:33 ` Tamas K Lengyel 2017-08-01 20:47 ` Paolo Bonzini 2017-08-02 11:52 ` Mihai Donțu 2017-08-02 12:27 ` Paolo Bonzini 2017-08-02 13:32 ` Mihai Donțu 2017-08-02 13:51 ` Paolo Bonzini 2017-08-02 14:17 ` Mihai Donțu 2017-08-04 8:35 ` Paolo Bonzini 2017-08-04 15:29 ` Mihai Donțu 2017-08-04 15:37 ` Paolo Bonzini 2017-08-05 8:00 ` Andrei Vlad LUTAS 2017-08-07 12:18 ` Paolo Bonzini 2017-08-07 13:25 ` Mihai Donțu 2017-08-07 13:49 ` Paolo Bonzini 2017-08-07 14:12 ` Mihai Donțu 2017-08-07 15:56 ` Paolo Bonzini 2017-08-07 16:44 ` Mihai Donțu 2017-08-02 13:53 ` Mihai Donțu 2017-07-27 17:06 ` Mihai Donțu 2017-07-27 17:18 ` Paolo Bonzini 2017-07-07 17:29 ` [RFC PATCH v2 0/1] " Paolo Bonzini 2017-08-07 15:28 ` Mihai Donțu 2017-08-07 15:44 ` Paolo Bonzini 2017-07-12 14:09 ` Konrad Rzeszutek Wilk 2017-07-13 5:37 ` Mihai Donțu 2017-07-14 16:13 ` Konrad Rzeszutek Wilk 2017-07-18 8:55 ` Mihai Donțu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).