[Qemu-devel] [multiprocess RFC PATCH 36/37] multi-process: add the concept description to docs/devel/qemu-multiprocess

* [Qemu-devel] [multiprocess RFC PATCH 36/37] multi-process: add the concept description to docs/devel/qemu-multiprocess
@ 2019-03-07  7:22 elena.ufimtseva
  2019-03-07  8:14 ` Thomas Huth
  2019-03-07 14:26 ` Stefan Hajnoczi
  0 siblings, 2 replies; 31+ messages in thread
From: elena.ufimtseva @ 2019-03-07  7:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: elena.ufimtseva, ross.lagerwall, stefanha, liran.alon,
	kanth.ghatraju, john.g.johnson, jag.raman, konrad.wilk,
	sstabellini

From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

TODO: Make relevant changes to the doc.

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 docs/devel/qemu-multiprocess.txt | 1109 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 1109 insertions(+)
 create mode 100644 docs/devel/qemu-multiprocess.txt

diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
new file mode 100644
index 0000000..e29c6c8
--- /dev/null
+++ b/docs/devel/qemu-multiprocess.txt
@@ -0,0 +1,1109 @@
+/*
+ * Copyright 2019, Oracle and/or its affiliates. All rights reserved.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+Disaggregating QEMU
+
+This document describes implementation details of multi-process
+qemu.
+
+1. QEMU services
+
+        QEMU can be broadly described as providing three main
+services.  One is a VM control point, where VMs can be created,
+migrated, re-configured, and destroyed.  A second is to emulate the
+CPU instructions within the VM, often accelerated by HW virtualization
+features such as Intel's VT extensions.  Finally, it provides IO
+services to the VM by emulating HW IO devices, such as disk and
+network devices.
+
+1.1 A disaggregated QEMU
+
+        A disaggregated QEMU involves separating QEMU services into
+separate host processes.  Each of these processes can be given only
+the privileges it needs to provide its service, e.g., a disk service
+could be given access only the the disk images it provides, and not be
+allowed to access other files, or any network devices.  An attacker
+who compromised this service would not be able to use this exploit to
+access files or devices beyond what the disk service was given access
+to.
+
+        A control QEMU process would remain, but in disaggregated
+mode, it would be a control point that exec()s the processes needed to
+support the VM being created, but have no direct interface to the VM.
+During VM execution, it would still provide the user interface to
+hot-plug devices or live migrate the VM.
+
+        A first step in creating a disaggregated QEMU is to separate
+IO services from the main QEMU program, which would continue to
+provide CPU emulation. i.e., the control process would also be the CPU
+emulation process.  In a later phase, CPU emulation could be separated
+from the QEMU control process.
+
+
+2. Disaggregating IO services
+
+        Disaggregating IO services is a good place to begin QEMU
+disaggregating for a couple of reasons.  One is the sheer number of IO
+devices QEMU can emulate provides a large surface of interfaces which
+could potentially be exploited, and, indeed, have been a source of
+exploits in the past.  Another is the modular nature of QEMU device
+emulation code provides interface points where the QEMU functions that
+perform device emulation can be separated from the QEMU functions that
+manage the emulation of guest CPU instructions.
+
+2.1 QEMU device emulation
+
+        QEMU uses a object oriented SW architecture for device
+emulation code.  Configured objects are all compiled into the QEMU
+binary, then objects are instantiated by name when used by the guest
+VM.  For example, the code to emulate a device named "foo" is always
+present in QEMU, but its instantiation code is only run when a device
+named "foo" is included in the target VM (such as via the QEMU command
+line as -device "foo".)
+
+        The object model is hierarchical, so device emulation code can
+name its parent object (such as "pci-device" for a PCI device) and
+QEMU will instantiate a parent object before calling the device's
+instantiation code.
+
+2.2 Current separation models
+
+        In order to separate the device emulation code from the CPU
+emulation code, the device object code must run in a different
+process.  There are a couple of existing QEMU features that can run
+emulation code separately from the main QEMU process.  These are
+examined below.
+
+2.2.1 vhost user model
+
+        Virtio guest device drivers can be connected to vhost user
+applications in order to perform their IO operations.  This model uses
+special virtio device drivers in the guest and vhost user device
+objects in QEMU, but once the QEMU vhost user code has configured the
+vhost user application, mission-mode IO is performed by the
+application.  The vhost user application is a daemon process that can
+be contacted via a known UNIX domain socket.
+
+2.2.1.1 vhost socket
+
+        As mentioned above, one of the tasks of the vhost device
+object within QEMU is to contact the vhost application and send it
+configuration information about this device instance.  As part of the
+configuration process, the application can also be sent other file
+descriptors over the socket, which then can be used by the vhost user
+application in various ways, some of which are described below.
+
+2.2.1.2 vhost MMIO store acceleration
+
+        VMs are often run using HW virtualization features via the KVM
+kernel driver.  This driver allows QEMU to accelerate the emulation of
+guest CPU instructions by running the guest in a virtual HW mode.
+When the guest executes instructions that cannot be executed by
+virtual HW mode, execution return to the KVM driver so it can inform
+QEMU to emulate the instructions in SW.
+
+        One of the events that can cause a return to QEMU is when a
+guest device driver accesses an IO location. QEMU then dispatches the
+memory operation to the corresponding QEMU device object.  In the case
+of a vhost user device, the memory operation would need to be sent
+over a socket to the vhost application.  This path is accelerated by
+the QEMU virtio code by setting up an eventfd file descriptor that the
+vhost application can directly receive MMIO store notifications from
+the KVM driver, instead of needing them to be sent to the QEMU process
+first.
+
+2.2.1.3 vhost interrupt acceleration
+
+        Another optimization used by the vhost application is the
+ability to directly inject interrupts into the VM via the KVM driver,
+again, bypassing the need to send the interrupt back to the QEMU
+process first.  The QEMU virtio setup code configures the KVM driver
+with an eventfd that triggers the device interrupt in the guest when
+the eventfd is written. This irqfd file descriptor is then passed to
+the vhost user application program.
+
+2.2.1.4 vhost access to guest memory
+
+        The vhost application is also allowed to directly access guest
+memory, instead of needing to send the data as messages to QEMU.  This
+is also done with file descriptors sent to the vhost user application
+by QEMU.  These descriptors can be mmap()d by the vhost application to
+map the guest address space into the vhost application.
+
+        IOMMUs introduce another level of complexity, since the
+address given to the guest virtio device to DMA to or from is not a
+guest physical address.  This case is handled by having vhost code
+within QEMU register as a listener for IOMMU mapping changes.  The
+vhost application maintains a cache of IOMMMU translations: sending
+translation requests back to QEMU on cache misses, and in turn
+receiving flush requests from QEMU when mappings are purged.
+
+2.2.1.5 applicability to device separation
+
+        Much of the vhost model can be re-used by separated device
+emulation.  In particular, the ideas of using a socket between QEMU
+and the device emulation application, using a file descriptor to
+inject interrupts into the VM via KVM, and allowing the application to
+mmap() the guest should be re-used.
+
+        There are, however, some notable differences between how a
+vhost application works and the needs of separated device emulation.
+The most basic is that vhost uses custom virtio device drivers which
+always trigger IO with MMIO stores.  A separated device emulation model
+must work with existing IO device models and guest device drivers.
+MMIO loads break vhost store acceleration since they are synchronous -
+guest progress cannot continue until the load has been emulated.  By
+contrast, stores are asynchronous, the guest can continue after the
+store event has been sent to the vhost application.
+
+        Another difference is that in the vhost user model, a single
+daemon can support multiple QEMU instances.  This is contrary to the
+security regime desired, in which the emulation application should
+only be allowed to access the files or devices the VM it's running on
+behalf of can access.
+
+2.2.2 qemu-io model
+
+        Qemu-io is a test harness used to test changes to the QEMU
+block backend object code. (e.g., the code that implements disk images
+for disk driver emulation) Qemu-io is not a device emulation
+application per se, but it does compile the QEMU block objects into a
+separate binary from the main QEMU one.  This could be useful for disk
+device emulation, since its emulation applications will need to
+include the QEMU block objects.
+
+2.3 New separation model based on proxy objects
+
+        A different model based on proxy objects in the QEMU program
+communicating with proxy objects the separated emulation programs
+could provide separation while minimizing the changes needed to the
+device emulation code.  The rest of this section is a discussion of
+how a proxy object model would work.
+
+2.3.1 command line specification
+
+        The QEMU command line options will need to be modified to
+indicate which items are emulated by a separate program, and which
+remain emulated by QEMU itself.
+
+2.3.1.1 devices
+
+        Devices that are to be emulated in a separate process will be
+identified by using "-rdevice" on the QEMU command line in lieu of
+"-device".  The device's other options will also be included in the
+command line, with the addition of a "command" option that specifies
+the remote program to execute to emulate the device.  e.g., an LSI
+SCSI controller and disk can be specified as:
+
+-device lsi53c895a,id=scsi0 device
+-device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0
+
+        If these devices are emulated with program "lsi-scsi," the
+QEMU command line would be:
+
+-rdevice lsi53c895a,id=scsi0,command="lsi-scsi"
+-rdevice scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0
+
+        Some devices are implicitly created by the machine object.
+e.g., the "q35" machine object will create its PCI bus, and attach a
+"ich9-ahci" IDE controller to it.  In this case, options will need to
+be added to the "-machine" command line.  e.g.,
+
+-machine pc-q35,ide-command="ahci-ide"
+
+        will use the "ahci-ide" program to emulate the IDE controller
+and its disks.  The disks themselves still need to be specified with
+"-rdevice", e.g.,
+
+-rdevice ide-hd,drive=drive0,bus=ide.0,unit=0
+
+        The "-rdevice" devices will be parsed into a separate
+QemuOptsList from "-device" ones, but will still have "driver"
+as the implied name of the initial option.
+
+2.3.1.2 backends
+
+        The device's backend would similarly have a changed command
+line specification.  e.g., a qcow2 block backend specified as:
+
+-blockdev driver=file,node-name=file0,filename=disk-file0
+-blockdev driver=qcow2,node-name=drive0,file=file0
+
+becomes
+
+-rblockdev driver=file,node-name=file0,filename=disk-file0
+-rblockdev driver=qcow2,node-name=drive0,file=file0
+
+        As is the case with devices, "-rblockdev" backends will
+be parsed into their own BlockdevOptions_queue.
+
+2.3.2 device proxy objects
+
+        QEMU has an object model based on sub-classes inherited from
+the "object" super-class.  The sub-classes that are of interest here
+are the "device" and "bus" sub-classes whose child sub-classes make up
+the device tree of a QEMU emulated system.
+
+        The proxy object model will use device proxy objects to
+replace the device emulation code within the QEMU process.  These
+objects will live in the same place in the object and bus hierarchies
+as the objects they replace.  i.e., the proxy object for an LSI SCSI
+controller will be a sub-class of the "pci-device" class, and will
+have the same PCI bus parent and the same SCSI bus child objects as
+the LSI controller object it replaces.
+
+        After the QEMU command line has been parsed, the "-rdevice"
+devices will be instantiated in the same manner as "-device" devices
+are. (i.e., qdev_device_add()).  In order to distinguish them from
+regular "-device" device objects, their class name will be the name of
+the class it replaces, with "-proxy" appended.  e.g., the "scsi-hd"
+proxy class will be "scsi-hd-proxy"
+
+2.3.2.1 object initialization
+
+        QEMU object initialization occurs in two phases.  The first
+initialization happens once per object class. (i.e., there can be many
+SCSI disks in an emulated system, but the "scsi-hd" class has its
+class_init() function called only once) The second phase happens when
+each object's instance_init() function is called to initialize each
+instance of the object.
+
+        All device objects are sub-classes of the "device" class, so
+they also have a realize() function that is called after
+instance_init() is called and after the object's static properties
+have been initialized.  Many device objects don't even provide an
+instance_init() function, and do all their per-instance work in
+realize().
+
+2.3.2.1.1 class_init
+
+        The class_init() method of a proxy object will, in general
+behave similarly to the object it replaces, including setting any
+static properties and methods needed by the proxy.
+
+2.3.2.1.2 instance_init / realize
+
+        The instance_init() and realize() functions would only need to
+perform tasks related to being a proxy, such are registering its own
+MMIO handlers, or creating a child bus that other proxy devices can be
+attached to later.  They also need to add a "json_device" string
+property that contains the JSON representation of the command line
+options used to create the object.
+
+        This JSON representation is used to create the corresponding
+object in an emulation process.  e.g., for an LSI SCSI controller
+invoked as:
+
+ -rdevice lsi53c895a,id=scsi0,command="lsi-scsi"
+
+the proxy object would create a
+
+{ "driver" : "lsi53c895a", "id" : "scsi0" }
+
+JSON description.  The "driver" option is assigned to the device name
+when the command line is parsed, so the "-proxy" appended by the
+command line parsing code must be removed.  The "command" option isn't
+needed in the JSON description since it only applies to the proxy
+object in the QEMU process.
+
+        Other tasks will are device-specific.  PCI device objects will
+initialize the PCI config space in order to make a valid PCI device
+tree within the QEMU process.  Disk devices will probe their backend
+object to get its JSON description, and publish this description as a
+"json_backend" string property (see the backend discussion below.)
+
+2.3.2.2 address space registration
+
+        Most devices are driven by guest device driver accesses to IO
+addresses or ports.  The QEMU device emulation code uses QEMU's memory
+region function calls (such as memory_region_init_io()) to add
+callback functions that QEMU will invoke when the guest accesses the
+device's areas of the IO address space.  When a guest driver does
+access the device, the VM will exit HW virtualization mode and return
+to QEMU, which will then lookup and execute the corresponding callback
+function.
+
+        A proxy object would need to mirror the memory region calls
+the actual device emulator would perform in its initialization code,
+but with its own callbacks.  When invoked by QEMU as a result of a
+guest IO operation, they will forward the operation to the device
+emulation process via a proxy_proc_send() call.  Any response will
+be read via proxy_proc_recv().
+
+        Note that the callbacks are called with an address space lock,
+so it would not be a appropriate to synchronously wait for any
+response.  Instead the QEMU code must be changed to check if the
+thread needs to sleep after the address_space_rw() call (in
+kvm_cpu_exec().)
+
+2.3.2.3 PCI config space
+
+        PCI devices also have a configuration space that can be
+accessed by the guest driver.  Guest accesses to this space is not
+handled by the device emulation object, but by it's PCI parent object.
+Much of this space is read-only, but certain registers (especially BAR
+and MSI-related ones) need to be propagated to the emulation process.
+
+2.3.2.3.1 PCI parent proxy
+
+        One way to propagate guest PCI config accesses is to create a
+"pci-device-proxy" class that can serve as the parent of a PCI device
+proxy object.  This class's parent would be "pci-device" and it would
+override the PCI parent's config_read and config_write methods with
+ones that forward these operations to the emulation program.
+
+2.3.2.4 interrupt receipt
+
+        A proxy for a device that generates interrupts will receive
+the interrupt indication via the read callback it provided to
+proxy_ctx_alloc().  The interrupt indication would then be sent up to
+its bus parent to be injected into the guest.  For example, a PCI
+device object may use pci_set_irq().
+
+2.3.3 device backends
+
+        Each type of device has backends which perform IO operations
+in the host system.  For example, block backend objects emulate the
+disk images configured into the VM.  While block backends are
+implemented as objects, not all backends are.  For example, display
+backends (e.g., vnc) are not objects, they register a set of virtual
+functions that are called by QEMU's display emulation.
+
+        These device backends also need run in the device emulation
+processes, and the emulation process must be given access the the
+corresponding host files or devices.
+
+2.3.3.1 block backends
+
+        Block backends are objects that can implement file protocols
+(such as a local file or an iSCSI volume), implement disk image
+formats (such as qcow2), or serve as a request filter (such as IO
+throttling).  They are often stacked on each other (such as a qcow2
+format on a local file protocol.)  They're are named by "node-name"
+properties that are then matched to "drive" properties of the
+corresponding disk devices.
+
+        Block backend objects are not part of the QEMU object model
+(i.e., they're not sub-classes of "object").  They are instantiated
+when the bdrv_file_open() method is invoked with a Qdict dictionary
+of the backend's command line options.
+
+2.3.3.1.1 initialization
+
+        When a "-rblockdev" backend is initialized, it will not open
+the underlying backend object, as is done for "-blockdev" backends.
+Instead, it will create a BlockDriverState node that has a proxy name
+and the original options Qdict.  The proxy name will consist of the
+backend's node-name with "-peer" appended to it (i.e., a "drive0"
+node-name would have a "drive0-peer" peer.)
+
+        A proxy backend object with then be opened, using an
+initialization Qdict containing the "node-name" of the underlying
+backend, so that disk device objects and QMP commands can find it.
+The proxy's Qdict will also be given the proxy name as a "peer"
+property so it can lookup its underlying backend object and its
+associated Qdict.
+
+2.3.3.1.2 bdrv_probe_json
+
+        This API returns the JSON description of the peer of a given
+backend proxy.  It will be used by disk device proxy objects to get
+the JSON descriptions of the block backend (and any backends layered
+below) needed to emulate the disk image.
+
+2.3.3.1.3 bdrv_get_json
+
+        This is a new block backend object method that returns the
+JSON description this object, and all of its underlying objects.  It
+will recursively descend any layered backend objects (e.g., a format
+object will call its underlying protocol object) This method can be
+invoked on an object that has not been opened.  It will mainly be used
+by bdrv_probe_json().
+
+2.3.3.1.4 bdrv_assign_proxy_name
+
+        The API creates the node with a proxy name, and enters it on a
+list of peer nodes.  This list can be searched by proxy backends to
+find their associated peers.
+
+2.3.3.1.5 QMP commands
+
+        Various QMP command operate on blockdevs.  These will need
+to work on rblockevs in separated processes as well.  There are
+several cases that need to be handled.
+
+2.3.3.1.5.1 adding rblockdevs
+
+        QMP allows users to add blockdevs to a running QEMU instance.
+This is done not just to hot-plug a disk device into a guest, but also
+for advanced blockdev features such as changing quorum devices.
+Likewise, QMP needs be able to add an rblockdev to the guest, so
+similar operations can be performed on devices being emulated in a
+separate process.
+
+        This operation doesn't need to be performed differently from
+adding an rblockdev from the command line.  Blockdevs are added with a
+qmp_blockdev_add() routine that can be called from either the command
+line parser or from QMP.  Note the name of the C routine called from
+QMP is generated by a python script, so a "rblockdev-add" command must
+be implemented by qmp_rblockdev_add().
+
+2.3.3.1.5.2 targeted commands
+
+        Many QMP commands operate on specified blockdevs.  These
+commands will find the proxy node when they lookup the targeted name,
+which will then forward the request to the emulation process managing
+the peer node.
+
+2.3.3.1.5.3 blockdev lists
+
+        Several QMP query commands (such as query-block or
+query-block-jobs) operate on all blockdevs.   These will function
+much like targeted commands, with the proxy nodes forwarding the
+request to its peer emulation process.
+
+2.3.4 proxy APIs
+
+        There will be a set of APIs provided by a process execution
+service for proxy objects to use to manage the separate emulation
+program.
+
+2.3.4.1 proxy_register
+
+        A proxy device object must register itself with the
+proxy_register() API.  The registration call will include validation
+and execution callbacks that will be invoked after the emulated
+machine has been setup in QEMU.
+
+2.3.4.1.1 validation callback
+
+        This callback will invoked after all devices in the emulated
+system have been initialized.  Its purpose is to validate the device
+configuration by checking that its parent and child bus objects are
+compatible with being proxied.  For example, a disk controller can
+check that all the devices on its bus are all proxy objects, or a disk
+object can check that its backend object is a proxy.  If any of the
+validation callbacks return an error, QEMU will exit.  If there are no
+errors, the execution callbacks will be invoked.
+
+2.3.4.1.2 execution callback
+
+        A device proxy object that manages an emulation process will
+provide an execution callback in its proxy_register() call.  This
+callback will allocate an execution context with proxy_ctx_alloc(),
+marshal the arguments needed for the emulation program, and invoke
+proxy_execute() to execute it.
+
+2.3.4.2 proxy_ctx_alloc
+
+        Before the emulation program can be executed, the proxy object
+must call proxy_ctx_alloc() to create an execution context for the
+process.  The execution context will serve as a handle on which the
+other proxy APIs operate.
+
+2.3.4.3 proxy_ctx_callbacks
+
+        This API registers two callback functions: get_reply() and
+get_request(), on the context.  get_reply() is invoked to handle
+replies to requests sent to the emulation process.  get_request() is
+invoked to handle requests from the emulation process.  This API can
+be called multiple times on the same context; a class field within an
+incoming message indicates which callbacks will be invoked.
+
+2.3.4.4 proxy_execute
+
+        This function executes an emulation program.  It needs to be
+provided with an execution context, the file to execute, and any
+arguments needed by the program.  Before executing the given program,
+it will setup the communications channels for the new process.
+
+2.3.5 communication with emulation process
+
+        The execution service will setup two communication channels
+between the main QEMU process and the emulation process.  The channels
+will be created using socketpair() so that file descriptors can be
+passed from QEMU to the process.
+
+2.3.5.1 requests to emulation process
+
+        The stdin file descriptor of the emulation process will be
+used for requests from QEMU to the emulation process.  The execution
+service provides APIs to send and receive messages from the emulation
+process.
+
+2.3.5.1.1 proxy_proc_send
+
+        This API is for the proxy object in QEMU to send messages to
+the emulation process.  Its arguments will include an execution
+context in addition to the actual message.
+
+2.3.5.1.2 proxy_proc_recv
+
+        This API receives replies from the emulation process.  It
+requires the execution context of the target process, and will usually
+be called from the get_reply() callback specified in proxy_ctx_alloc.
+
+2.3.5.2 requests to QEMU process
+
+        The stdout file descriptor to the emulation process will be
+used for requests from the emulation process to QEMU.  As with
+requests to the emulation process, APIs will be provided to facilitate
+communication.
+
+2.3.5.2.1 proxy_qemu_recv
+
+        This API receives requests from the emulation process.  It
+requires the execution context of the target process, and will usually
+be called from the get_request() callback specified in
+proxy_ctx_alloc.
+
+2.3.5.2.2 proxy_qemu_send
+
+        This API is for the proxy object in QEMU to send replies to
+the emulation process.  Its arguments will include an execution
+context in addition to the actual reply.
+
+2.3.5.3 JSON descriptions
+
+        The initial messages sent to the emulation process will
+describe the devices its will be tasked to emulate.  The will be
+described as JSON arrays of backend and device objects that need to be
+instantiated by the emulation process.
+
+2.3.5.3.1 backend JSON
+
+        The device proxy object will aggregate the "json_backend"
+properties from the disk devices on the bus it controls, and send
+them as a JSON array of objects. e.g., this command line:
+
+-rblockdev driver=file,node-name=file0,filename=disk-file0
+-rblockdev driver=qcow2,node-name=drive0,file=file0
+
+would generate
+
+[
+  { "driver" : "file", "node-name" : "file0", "filename" : "disk-file0" }.
+  { "driver" : "qcow2", "node-name" : "drive0", "file" : "file0" }
+]
+
+2.3.5.3.2 device JSON
+
+        The device proxy object will aggregate a JSON description of
+itself and devices on the bus it controls (via their "json_device"
+properties), and send them to the emulation process as a JSON array of
+objects.
+
+2.3.5.4 DMA operations
+
+        DMA operations would be handled much like vhost applications
+do.  One of the initial messages sent to the emulation process is a
+guest memory table.  Each entry in this table consists of a file
+descriptor and size that the emulation process can mmap() to directly
+access guest memory, similar to vhost_user_set_mem_table().  Note
+guest memory must be backed by file descriptors, such as when QEMU is
+given the "-mem-path" command line option.
+
+2.3.5.5 IOMMU operations
+
+        When the emulated system includes an IOMMU, the proxy
+execution service will need to handle IOMMU requests from the
+emulation process using an address_space_get_iotlb_entry() call.  In
+order to handle IOMMU unmaps, the proxy execution service will also
+register as a listener on the device's DMA address space.  When an
+IOMMU memory region is created within the DMA address space, an IOMMU
+notifier for unmaps will be added to the memory region that will
+forward unmaps to the emulation process.
+
+        This also will require a proxy_ctx_callbacks() call to
+register an IOMMU handler for incoming IOMMU requests from the
+emulation program.
+
+2.3.6 device emulation process
+
+        The device emulation process will run the object hierarchy of
+the device, hopefully unmodified.  It will be based on the QEMU source
+code, because for anything but the simplest device, it would not be a
+tractable problem to re-implement both the object model and the many
+device backends that QEMU has.
+
+        The parts of QEMU that the emulation program will need include
+the object model; the memory emulation objects; the device emulation
+objects of the targeted device, and any dependent devices; and, the
+device's backends.  It will also need code to setup the machine
+environment, handle requests from the QEMU process, and route
+machine-level requests (such as interrupts or IOMMU mappings) back to
+the QEMU process.
+
+2.3.6.1 initialization
+
+        The process initialization sequence will follow the same
+sequence followed by QEMU.  It will first initialize the backend
+objects, then device emulation objects.  The JSON arrays sent by the
+QEMU process will drive which objects need to be created.
+
+2.3.6.1.1 address spaces
+
+        Before the device objects are created, the initial address
+spaces and memory regions must be configured with memory_map_init().
+This creates a RAM memory region object (system_memory) and an IO
+memory region object (system_io).
+
+2.3.6.1.2 RAM
+
+        RAM memory region creation will follow how pc_memory_init()
+creates them, but must use memory_region_init_ram_from_fd() instead of
+memory_region_allocate_system_memory().  The file descriptors needed
+will be supplied by the guest memory table from above.  Those RAM
+regions would then be added to the system_memory memory region with
+memory_region_add_subregion().
+
+2.3.6.1.3 PCI
+
+        IO initialization will be driven by the JSON description sent
+from the QEMU process.  For a PCI device, a PCI bus will need to be
+created with pci_root_bus_new(), and a PCI memory region will need to
+be created and added to the system_memory memory region with
+memory_region_add_subregion_overlap().  The overlap version is
+required for architectures where PCI memory overlaps with RAM memory.
+
+2.3.6.2 MMIO handling
+
+        The device emulation objects will use memory_region_init_io()
+to install their MMIO handlers, and pci_register_bar() to associate
+those handlers with a PCI BAR, as they do withing QEMU currently.
+
+        In order to use address_space_rw() in the emulation process to
+handle MMIO requests from QEMU, the PCI physical addresses must be the
+same in the QEMU process and the device emulation process.  In order
+to accomplish that, guest BAR programming must also be forwarded from
+QEMU to the emulation process.
+
+2.3.6.3 interrupt injection
+
+        When device emulation wants to inject an interrupt into the
+VM, the request climbs the device's bus object hierarchy until the
+point where a bus object knows how to signal the interrupt to the
+guest.  The details depend on the type of interrupt being raised.
+
+2.3.6.3.1 PCI pin interrupts
+
+        On x86 systems, there is an emulated IOAPIC object attached to
+the root PCI bus object, and the root PCI object forwards interrupt
+requests to it.  The IOAPIC object, in turn, calls the KVM driver to
+inject the corresponding interrupt into the VM.  The simplest way to
+handle this in an emulation process would be to setup the root PCI bus
+driver (via pci_bus_irqs()) to send a interrupt request back to the
+QEMU process, and have the device proxy object reflect it up the PCI
+tree there.
+
+2.3.6.3.2 PCI MSI/X interrupts
+
+        PCI MSI/X interrupts are implemented in HW as DMA writes to a
+CPU-specific PCI address.  In QEMU on x86, a KVM APIC object receives
+these DMA writes, then calls into the KVM driver to inject the
+interrupt into the VM.  A simple emulation process implementation
+would be to send the MSI DMA address from QEMU as a message at
+initialization, then install an address space handler at that address
+which forwards the MSI message back to QEMU.
+
+2.3.6.4 DMA operations
+
+        When a emulation object wants to DMA into or out of guest
+memory, it first must use dma_memory_map() to convert the DMA address
+to a local virtual address.  The emulation process memory region
+objects setup above will be used to translate the DMA address to a
+local virtual address the device emulation code can access.
+
+2.3.6.5 IOMMU
+
+        When an IOMMU is in use in QEMU, DMA translation uses IOMMU
+memory regions to translate the DMA address to a guest physical
+address before that physical address can be translated to a local
+virtual address.  The emulation process will need similar
+functionality.
+
+2.3.6.5.1 IOTLB cache
+
+        The emulation process will maintain a cache of recent IOMMU
+translations (the IOTLB).  When the translate() callback of an IOMMU
+memory region is invoked, the IOTLB cache will be searched for an
+entry that will map the DMA address to a guest PA.  On a cache miss, a
+message will be sent back to QEMU requesting the corresponding
+translation entry, which be both be used to return a guest address and
+be added to the cache.
+
+2.3.6.5.2 IOTLB purge
+
+        The IOMMU emulation will also need to act on unmap requests
+from QEMU.  These happen when the guest IOMMU driver purges an entry
+from the guest's translation table.
+
+2.4 Accelerating device emulation
+
+        The messages that are required to be sent between QEMU and the
+emulation process can add considerable latency to IO operations.  The
+optimizations described below attempt to ameliorate this effect by
+allowing the emulation process to communicate directly with the kernel
+KVM driver.  The KVM file descriptors created wold be passed to the
+emulation process via initialization messages, much like the guest
+memory table is done.
+
+2.4.1 MMIO acceleration
+
+        Vhost user applications can receive guest virtio driver stores
+directly from KVM.  The issue with the eventfd mechanism used by vhost
+user is that it does not pass any data with the event indication, so
+it cannot handle guest loads or guest stores that carry store data.
+This concept could, however, be expanded to cover more cases.
+
+        The expanded idea would require a new type of KVM device:
+KVM_DEV_TYPE_USER.  This device has two file descriptors: a master
+descriptor that QEMU can use for configuration, and a slave descriptor
+that the emulation process can use to receive MMIO notifications.
+QEMU would create both descriptors using the KVM driver, and pass the
+slave descriptor to the emulation process via an initialization
+message.
+
+2.4.1.1 data structures
+
+2.4.1.1.1 guest physical range
+
+        The guest physical range structure describes the address range
+that a device will respond to.  It includes the base and length of the
+range, as well as which bus the range resides on (e.g., on an x86
+machine, it can specify whether the range refers to memory or IO
+addresses).
+
+        A device can have multiple physical address ranges it responds
+to (e.g., a PCI device can have multiple BARs), so the structure will
+also include an enumeration value to specify which of the device's
+ranges is being referred to.
+
+2.4.1.1.2 MMIO request structure
+
+        This structure describes an MMIO operation.  It includes which
+guest physical range the MMIO was within, the offset within that
+range, the MMIO type (e.g., load or store), and its length and data.
+It also includes a sequence number that can be used to reply to the
+MMIO, and the CPU that issued the MMIO.
+
+2.4.1.1.3 MMIO request queues
+
+        MMIO request queues are FIFO arrays of MMIO request
+structures.  There are two queues: pending queue is for MMIOs that
+haven't been read by the emulation program, and the sent queue is for
+MMIOs that haven't been acknowledged.  The main use of the second
+queue is to validate MMIO replies from the emulation program.
+
+2.4.1.1.4 scoreboard
+
+        Each CPU in the VM is emulated in QEMU by a separate thread,
+so multiple MMIOs may be waiting to be consumed by an emulation
+program and multiple threads may be waiting for MMIO replies.  The
+scoreboard would contain a wait queue and sequence number for the
+per-CPU threads, allowing them to be individually woken when the MMIO
+reply is received from the emulation program.  It also tracks the
+number of posted MMIO stores to the device that haven't been replied
+to, in order to satisfy the PCI constraint that a load to a device
+will not complete until all previous stores to that device have been
+completed.
+
+2.4.1.1.5 device shadow memory
+
+        Some MMIO loads do not have device side-effects.  These MMIOs
+can be completed without sending a MMIO request to the emulation
+program if the emulation program shares a shadow image of the device's
+memory image with the KVM driver.
+
+        The emulation program will ask the KVM driver to allocate
+memory for the shadow image, and will then use mmap() to directly
+access it.  The emulation program can control KVM access to the shadow
+image by sending KVM an access map telling it which areas of the image
+have no side-effects (and can be completed immediately), and which
+require a MMIO request to the emulation program.  The access map can
+also inform the KVM drive which size accesses are allowed to the
+image.
+
+2.4.1.2 master descriptor
+
+        The master descriptor is used by QEMU to configure the new KVM
+device.  The descriptor would be returned by the KVM driver when QEMU
+issues a KVM_CREATE_DEVICE ioctl() with a KVM_DEV_TYPE_USER type.
+
+2.4.1.2.1 KVM_DEV_TYPE_USER device ops
+
+        The KVM_DEV_TYPE_USER operations vector will be registered by
+a kvm_register_device_ops() call when the KVM system in initialized by
+kvm_init().  These device ops are called by the KVM driver when QEMU
+executes certain ioctls() on its KVM file descriptor.  They include:
+
+2.4.1.1.2.1 create
+
+        This routine is called when QEMU issues a KVM_CREATE_DEVICE
+ioctl() on its per-VM file descriptor.  It will allocate and
+initialize a KVM user device specific data structure, and assign the
+kvm_device private field to it.
+
+2.4.1.1.2.2 ioctl
+
+        This routine is invoked when QEMU issues an ioctl() on the
+master descriptor.  The ioctl() commands supported are defined by the
+KVM device type.  KVM_DEV_TYPE_USER ones will need several commands:
+
+        KVM_DEV_USER_SLAVE_FD creates the slave file descriptor that
+will be passed to the device emulation program.  Only one slave can be
+created by each master descriptor.  The file operations performed by
+this descriptor are described below.
+
+        The KVM_DEV_USER_PA_RANGE command configures a guest physical
+address range that the slave descriptor will receive MMIO
+notifications for.  The range is specified by a guest physical range
+structure argument.  For buses that assign addresses to devices
+dynamically, this command can be executed while the guest is running,
+such as the case when a guest changes a device's PCI BAR registers.
+
+        KVM_DEV_USER_PA_RANGE will use kvm_io_bus_register_dev() to
+register kvm_io_device_ops callbacks to be invoked when the guest
+performs a MMIO operation within the range.  When a range is changed,
+kvm_io_bus_unregister_dev() is used to remove the previous
+instantiation.
+
+        KVM_DEV_USER_TIMEOUT will configure a timeout value that
+specifies how long KVM will wait for the emulation process to respond
+to a MMIO indication.
+
+2.4.1.1.2.3 destroy
+
+        This routine is called when the VM instance is destroyed.  It
+will need to destroy the slave descriptor; and free any memory
+allocated by the driver, as well as the kvm_device structure itself.
+
+2.4.1.3 slave descriptor
+
+        The slave descriptor will have its own file operations vector,
+which responds to system calls on the descriptor performed by the
+device emulation program.
+
+2.4.1.3.1 read
+
+        A read returns any pending MMIO requests from the KVM driver
+as MMIO request structures.  Multiple structures can be returned if
+there are multiple MMIO operations pending.  The MMIO requests are
+moved from the pending queue to the sent queue, and if there are
+threads waiting for space in the pending to add new MMIO operations,
+they will be woken here.
+
+2.4.1.3.2 write
+
+        A write also consists of a set of MMIO requests.  They are
+compared to the MMIO requests in the sent queue.  Matches are removed
+from the sent queue, and any threads waiting for the reply are woken.
+If a store is removed, then the number of posted stores in the per-CPU
+scoreboard is decremented.  When the number is zero, and a non
+side-effect load was waiting for posted stores to complete, the load
+is continued.
+
+2.4.1.3.3 ioctl
+
+        There are several ioctl()s that can be performed on the
+slave descriptor.
+
+        A KVM_DEV_USER_SHADOW_SIZE ioctl() causes the KVM driver to
+allocate memory for the shadow image.  This memory can later be
+mmap()ed by the emulation process to share the emulation's view of
+device memory with the KVM driver.
+
+        A KVM_DEV_USER_SHADOW_CTRL ioctl() controls access to the
+shadow image.  It will send the KVM driver a shadow control map, which
+specifies which areas of the image can complete guest loads without
+sending the load request to the emulation program.  It will also
+specify the size of load operations that are allowed.
+
+2.4.1.3.4 poll
+
+        An emulation program will use the poll() call with a POLLIN
+flag to determine if there are MMIO requests waiting to be read.  It
+will return if the pending MMIO request queue is not empty.
+
+2.4.1.3.5 mmap
+
+        This call allows the emulation program to directly access the
+shadow image allocated by the KVM driver.  As device emulation updates
+device memory, changes with no side-effects will be reflected in the
+shadow, and the KVM driver can satisfy guest loads from the shadow
+image without needing to wait for the emulation program.
+
+2.4.1.4 kvm_io_device ops
+
+        Each KVM per-CPU thread can handle MMIO operation on behalf of
+the guest VM.  KVM will use the MMIO's guest physical address to
+search for a matching kvm_io_devce to see if the MMIO can be handled
+by the KVM driver instead of exiting back to QEMU.  If a match is
+found, the corresponding callback will be invoked.
+
+2.4.1.4.1 read
+
+        This callback is invoked when the guest performs a load to the
+device.  Loads with side-effects must be handled synchronously, with
+the KVM driver putting the QEMU thread to sleep waiting for the
+emulation process reply before re-starting the guest.  Loads that do
+not have side-effects may be optimized by satisfying them from the
+shadow image, if there are no outstanding stores to the device by this
+CPU.  PCI memory ordering demands that a load cannot complete before
+all older stores to the same device have been completed.
+
+2.4.1.4.2 write
+
+        Stores can be handled asynchronously unless the pending MMIO
+request queue is full.  In this case, the QEMU thread must sleep
+waiting for space in the queue.  Stores will increment the number of
+posted stores in the per-CPU scoreboard, in order to implement the PCI
+ordering constraint above.
+
+2.4.2 interrupt acceleration
+
+        This performance optimization would work much like a vhost
+user application does, where the QEMU process sets up eventfds that
+cause the device's corresponding interrupt to be triggered by the KVM
+driver.  These irq file descriptors are sent to the emulation process
+at initialization, and are used when the emulation code raises a
+device interrupt.
+
+2.4.2.1 intx acceleration
+
+        Traditional PCI pin interrupts are level based, so, in
+addition to an irq file descriptor, a re-sampling file descriptor
+needs to be sent to the emulation program.  This second file
+descriptor allows multiple devices sharing an irq to be notified when
+the interrupt has been acknowledged by the guest, so they can
+re-trigger the interrupt if their device has not de-asserted it.
+
+2.4.2.1.1 intx irq descriptor
+
+        The irq descriptors are created by the proxy object using
+event_notifier_init() to create the irq and re-sampling eventds, and
+kvm_vm_ioctl(KVM_IRQFD) to bind them to an interrupt.  The interrupt
+route can be found with pci_device_route_intx_to_irq().
+
+2.4.2.1.2 intx routing changes
+
+        Intx routing can be changed when the guest programs the APIC
+the device pin is connected to.  The proxy object in QEMU will use
+pci_device_set_intx_routing_notifier() to be informed of any guest
+changes to the route.  This handler will broadly follow the VFIO
+interrupt logic to change the route: de-assigning the existing irq
+descriptor from its route, then assigning it the new route. (see
+vfio_intx_update())
+
+2.4.2.2 MSI/X acceleration
+
+        MSI/X interrupts are sent as DMA transactions to the host.
+The interrupt data contains a vector that is programed by the guest, A
+device may have multiple MSI interrupts associated with it, so
+multiple irq descriptors may need to be sent to the emulation program.
+
+2.4.2.2.1 MSI/X irq descriptor
+
+        This case will also follow the VFIO example.  For each MSI/X
+interrupt, an eventfd is created, a virtual interrupt is allocated by
+kvm_irqchip_add_msi_route(), and the virtual interrupt is bound to the
+eventfd with kvm_irqchip_add_irqfd_notifier().
+
+2.4.2.2.2 MSI/X config space changes
+
+        The guest may dynamically update several MSI-related tables in
+the device's PCI config space.  These include per-MSI interrupt
+enables and vector data.  Additionally, MSIX tables exist in device
+memory space, not config space.  Much like the BAR case above, the
+proxy object must look at guest config space programming to keep the
+MSI interrupt state consistent between QEMU and the emulation program.
+
+
+3. Disaggregated CPU emulation
+
+        After IO services have been disaggregated, a second phase
+would be to separate a process to handle CPU instruction emulation
+from the main QEMU control function.  There are no object separation
+points for this code, so the first task would be to create one.
+
+
+4. Host access controls
+
+        Separating QEMU relies on the host OS's access restriction
+mechanisms to enforce that the differing processes can only access the
+objects they are entitled to.  There are a couple types of mechanisms
+usually provided by general purpose OSs.
+
+4.1 Discretionary access control
+
+        Discretionary access control allows each user to control who
+can access their files. In Linux, this type of control is usually too
+coarse for QEMU separation, since it only provides three separate
+access controls: one for the same user ID, the second for users IDs
+with the same group ID, and the third for all other user IDs.  Each
+device instance would need a separate user ID to provide access
+control, which is likely to be unwieldy for dynamically created VMs.
+
+4.2 Mandatory access control
+
+        Mandatory access control allows the OS to add an additional
+set of controls on top of discretionary access for the OS to control.
+It also adds other attributes to processes and files such as types,
+roles, and categories, and can establish rules for how processes and
+files can interact.
+
+4.2.1 Type enforcement
+
+        Type enforcement assigns a 'type' attribute to processes and
+files, and allows rules to be written on what operations a process
+with a given type can perform on a file with a given type.  QEMU
+separation could take advantage of type enforcement by running the
+emulation processes with different types, both from the main QEMU
+process, and from the emulation processes of different classes of
+devices.
+
+        For example, guest disk images and disk emulation processes
+could have types separate from the main QEMU process and non-disk
+emulation processes, and the type rules could prevent processes other
+than disk emulation ones from accessing guest disk images.  Similarly,
+network emulation processes can have a type separate from the main
+QEMU process and non-network emulation process, and only that type can
+access the host tun/tap device used to provide guest networking.
+
+4.2.2 Category enforcement
+
+        Category enforcement assigns a set of numbers within a given
+ range to the process or file.  The process is granted access to the
+ file if the process's set is a superset of the file's set.  This
+ enforcement can be used to separate multiple instances of devices in
+ the same class.
+
+        For example, if there are multiple disk devices provides to a
+guest, each device emulation process could be provisioned with a
+separate category.  The different device emulation processes would not
+be able to access each other's backing disk images.
+
+        Alternatively, categories could be used in lieu of the type
+enforcement scheme described above.  In this scenario, different
+categories would be used to prevent device emulation processes in
+different classes from accessing resources assigned to other classes.
+
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread