qemu-devel.nongnu.org archive mirror
* Re: Requirements for out-of-process device emulation
       [not found] <CAJSP0QUUR72Rr_deeckz+RHpZMEBv692V4XupWy9ai3i2QD8bw@mail.gmail.com>
@ 2020-01-14 14:06 ` Stefan Hajnoczi
  0 siblings, 0 replies; 5+ messages in thread
From: Stefan Hajnoczi @ 2020-01-14 14:06 UTC
  To: Elena Ufimtseva, John G Johnson, Felipe Franciosi, Jag Raman,
	Michael S. Tsirkin, Gerd Hoffmann, Marc-André Lureau,
	Konrad Rzeszutek Wilk, Daniel P. Berrange, Paolo Bonzini,
	qemu-devel

The call is starting now!  Sorry, I forgot to send this to qemu-devel.

Stefan

On Tue, Jan 14, 2020 at 11:50 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> Hi,
> In today's KVM Community Call we will discuss multi-process QEMU and
> related topics (muser and VFIO).
>
> I wanted to share requirements that I've gathered from our previous discussions:
>  * Multiple bus types - new bus types can be added in the future.
>  * Security - VMM does not trust the device emulation process and vice versa.
>  * Unprivileged operation - QEMU and the device emulation process can
> be launched without root privileges.
>  * Live migration - saving device state and restoring it.
>  * Recovery - the device emulation process can be restarted after a
> crash without the guest's knowledge.
>  * vIOMMU - address translation and the ability to expose only a
> subset of guest RAM to the device emulation process.
>  * Portability - works across host OSes.
>
> Following the VFIO API closely seems attractive to avoid reinventing the wheel.
>
> Stefan



* Re: Requirements for out-of-process device emulation
  2020-10-09 16:18 Stefan Hajnoczi
  2020-10-09 19:44 ` Alex Williamson
@ 2020-10-12 17:16 ` Alex Bennée
  1 sibling, 0 replies; 5+ messages in thread
From: Alex Bennée @ 2020-10-12 17:16 UTC
  To: Stefan Hajnoczi
  Cc: John G Johnson, Daniele Buono, slp, Michael S. Tsirkin,
	qemu-devel, marcandre.lureau, Hubertus Franke, rust-vmm,
	Thanos Makatos


Stefan Hajnoczi <stefanha@redhat.com> writes:

> I just posted the following on my blog to outline the requirements that
> have been discussed over the past few months around out-of-process
> device emulation (vhost-user, vfio-user, etc). I hope it's helpful for
> covering various angles of out-of-process device emulation.
>
> It's long, so no worries if you don't want to join the discussion.
>

Nice post.

> Security
> --------
> The trust model
> ```````````````
> The VMM must not trust the device emulation program. This is key to
> implementing privilege separation and the principle of least privilege.
> If a compromised device emulation program is able to gain control of the
> VMM then out-of-process device emulation has failed to provide isolation
> between devices.
>
> The device emulation program must not trust the VMM to the extent that
> this is possible. For example, it must validate inputs so that the VMM
> cannot gain control of the device emulation process through memory
> corruptions or other bugs. This makes it so that even if the VMM has
> been compromised, access to device resources and associated system calls
> still requires further compromising the device emulation process.

However in this model the guest intrinsically trusts device emulation
because it currently has full access to the guest's address space. It
would probably be worth making that explicit.

There are security models where the guest doesn't need to trust the VMM
or particular device emulations.


> Conclusion
> ----------
> This was largely a brain dump but I hope it is useful food for thought
> as out-of-process device emulation interfaces are designed and
> developed. There is a lot more to it than simply implementing a protocol
> for device register accesses and guest RAM DMA. Developing open source
> libraries in Rust and C that can be used as needed will ensure that
> out-of-process devices are high-quality and easy for users to deploy.

A useful exercise ;-)

-- 
Alex Bennée



* Re: Requirements for out-of-process device emulation
  2020-10-09 19:44 ` Alex Williamson
@ 2020-10-12 15:39   ` Stefan Hajnoczi
  0 siblings, 0 replies; 5+ messages in thread
From: Stefan Hajnoczi @ 2020-10-12 15:39 UTC
  To: Alex Williamson
  Cc: John G Johnson, Daniele Buono, slp, Michael S. Tsirkin,
	qemu-devel, marcandre.lureau, Hubertus Franke, rust-vmm,
	Thanos Makatos

On Fri, Oct 09, 2020 at 01:44:49PM -0600, Alex Williamson wrote:
> On Fri, 9 Oct 2020 17:18:15 +0100
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > Extensibility for new bus types
> > ```````````````````````````````
> > It should be possible to support multiple bus types. vhost-user only
> > supports vhost devices. VFIO is more extensible but currently focussed
> > on PCI devices.
> 
> Wait a sec, the vfio API essentially deconstructs devices into exactly
> the resources you've outlined above.  We not only have a vfio-pci
> device convention within vfio, but we've defined vfio-platform,
> vfio-amba, vfio-ccw, vfio-ap, and we'll likely be adding vfio-fsl-mc in
> the next kernel.  The core device, group, and container model within
> vfio is completely device/bus agnostic.  So while it's true that
> vfio-pci is the most mature and featureful convention, that's largely a
> reflection that PCI is the most ubiquitous device interface currently
> available.  Thanks,

Hi Alex,
Yes, I don't mean to say that VFIO cannot support new bus types.

The most likely new bus type I can foresee is QEMU's SysBus, which would
allow moving ISA, System-on-Chip, etc devices into a separate process.

We'll need to figure out whether vfio-user evolves independently from
the kernel VFIO ioctl interface or whether efforts are made to keep the
two in sync. The kernel may not need SysBus, but as the vfio-user
protocol diverges from the kernel VFIO ioctl interface it becomes harder
to share the commands and avoid duplication.

Stefan


* Re: Requirements for out-of-process device emulation
  2020-10-09 16:18 Stefan Hajnoczi
@ 2020-10-09 19:44 ` Alex Williamson
  2020-10-12 15:39   ` Stefan Hajnoczi
  2020-10-12 17:16 ` Alex Bennée
  1 sibling, 1 reply; 5+ messages in thread
From: Alex Williamson @ 2020-10-09 19:44 UTC
  To: Stefan Hajnoczi
  Cc: John G Johnson, Daniele Buono, slp, Michael S. Tsirkin,
	qemu-devel, marcandre.lureau, Hubertus Franke, rust-vmm,
	Thanos Makatos

On Fri, 9 Oct 2020 17:18:15 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> Device emulation
> ----------------
> Device resources
> ````````````````
> Devices provide resources that drivers interact with such as hardware
> registers, memory, or interrupts. The fundamental requirement of
> out-of-process device emulation is exposing device resources.
> 
> The following types of device resources are needed:
> 
> Synchronous MMIO/PIO accesses
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> The most basic device emulation operation is the hardware register
> access. This is a memory-mapped I/O (MMIO) or programmed I/O (PIO)
> access to the device. A read loads a value from a device register. A
> write stores a value to a device register. These operations are
> synchronous because the vCPU is paused until completion.
> Asynchronous doorbells
> 
> Devices often have doorbell registers, allowing the driver to inform the
> device that new requests are ready for processing. The vCPU does not
> need to wait since the access is a posted write.
> 
> The kvm.ko ioeventfd mechanism can be used to implement asynchronous
> doorbells.
> 
> Shared device memory
> ~~~~~~~~~~~~~~~~~~~~
> Devices may have memory-like regions that the CPU can access (such as
> PCI Memory BARs). The device emulation process therefore needs to share
> a region of its memory space with the VMM so the guest can access it.
> This mechanism also allows device emulation to busy wait (poll) instead
> of using synchronous MMIO/PIO accesses or asynchronous doorbells for
> notifications.
> 
> Direct Memory Access (DMA)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> Devices often require read and write access to a memory address space
> belonging to the CPU. This allows network cards to transmit packet
> payloads that are located in guest RAM, for example.
> 
> Early out-of-process device emulation interfaces simply shared guest
> RAM. This allowed DMA to any guest physical memory address. More advanced
> IOMMU and address space identifier mechanisms are now becoming
> ubiquitous. Therefore, new out-of-process device emulation interfaces
> should incorporate IOMMU functionality.
> 
> The key requirement for IOMMU mechanisms is allowing the VMM to grant
> access to a region of memory so the device emulation process can read
> from and/or write to it.
> 
> Interrupts
> ~~~~~~~~~~
> Devices notify the CPU using interrupts. An interrupt is simply a
> message sent by the device emulation process to the VMM. Interrupt
> configuration is flexible on modern devices, meaning the driver may be
> able to select the number of interrupts and a mapping (using one
> interrupt with multiple event sources). This can be implemented using
> the Linux eventfd mechanism or via in-band device emulation protocol
> messages, for example.
> 
> Extensibility for new bus types
> ```````````````````````````````
> It should be possible to support multiple bus types. vhost-user only
> supports vhost devices. VFIO is more extensible but currently focussed
> on PCI devices.

Wait a sec, the vfio API essentially deconstructs devices into exactly
the resources you've outlined above.  We not only have a vfio-pci
device convention within vfio, but we've defined vfio-platform,
vfio-amba, vfio-ccw, vfio-ap, and we'll likely be adding vfio-fsl-mc in
the next kernel.  The core device, group, and container model within
vfio is completely device/bus agnostic.  So while it's true that
vfio-pci is the most mature and featureful convention, that's largely a
reflection that PCI is the most ubiquitous device interface currently
available.  Thanks,

Alex




* Requirements for out-of-process device emulation
@ 2020-10-09 16:18 Stefan Hajnoczi
  2020-10-09 19:44 ` Alex Williamson
  2020-10-12 17:16 ` Alex Bennée
  0 siblings, 2 replies; 5+ messages in thread
From: Stefan Hajnoczi @ 2020-10-09 16:18 UTC
  To: qemu-devel
  Cc: John G Johnson, Daniele Buono, slp, Michael S. Tsirkin, rust-vmm,
	Hubertus Franke, Thanos Makatos, marcandre.lureau

I just posted the following on my blog to outline the requirements that
have been discussed over the past few months around out-of-process
device emulation (vhost-user, vfio-user, etc). I hope it's helpful for
covering various angles of out-of-process device emulation.

It's long, so no worries if you don't want to join the discussion.

Stefan
---
Requirements for out-of-process device emulation
================================================
Over the past months I have participated in discussions about
out-of-process device emulation. This post describes the requirements
that have become apparent. I hope this will be a useful guide to
understanding the big picture about out-of-process device emulation.

What is out-of-process device emulation?
----------------------------------------
Device emulation is traditionally implemented in the program that
executes guest code. This approach is natural because accesses to device
registers are trapped as part of the CPU run loop that sits at the core
of an emulator or virtual machine monitor (VMM).

In some use cases it is advantageous to perform device emulation in
separate processes. For example, software-defined network switches can
minimize data copies by emulating network cards directly in the switch
process. Out-of-process device emulation also enables privilege
separation and tighter sandboxing for security.

Why are these requirements important?
-------------------------------------
When emulated devices are implemented in the VMM they use common VMM
APIs. Adding new devices is relatively easy because the APIs are already
there and the developer can focus on the device specifics.
Out-of-process device emulation potentially leaves developers without
APIs since the device emulation program is a separate program that
literally starts from main(). Developers want to focus on implementing
their specific device, not on solving general problems related to
out-of-process device emulation infrastructure.

It is not only a lot of work to implement an out-of-process device
completely from scratch, but there is also a risk of developing the
wrong solution because some subtleties of device emulation are not
obvious at first glance.

I hope sharing these requirements will help in the creation of common
infrastructure so it's easy to implement high-quality out-of-process
devices.

Not all use cases have the full set of requirements. Therefore it's best
if requirements are addressed in separate, reusable libraries so that
device implementors can pick the ones that are relevant to them.

Device emulation
----------------
Device resources
````````````````
Devices provide resources that drivers interact with such as hardware
registers, memory, or interrupts. The fundamental requirement of
out-of-process device emulation is exposing device resources.

The following types of device resources are needed:

Synchronous MMIO/PIO accesses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The most basic device emulation operation is the hardware register
access. This is a memory-mapped I/O (MMIO) or programmed I/O (PIO)
access to the device. A read loads a value from a device register. A
write stores a value to a device register. These operations are
synchronous because the vCPU is paused until completion.
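
To make the blocking behavior concrete, here is a minimal sketch of a
register access carried over a socket pair. The wire format (opcode,
offset, value) and the function names are made up for illustration and
do not correspond to vhost-user, vfio-user, or any other real protocol.

```python
import socket
import struct
import threading

# Hypothetical wire format: opcode (0 = read, 1 = write), offset, value.
REQ = struct.Struct("<BIQ")
RESP = struct.Struct("<Q")

def recv_exact(sock, n):
    """Read exactly n bytes, or fewer if the peer closed the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            break
        buf += chunk
    return buf

def device_loop(sock, regs):
    """Device emulation side: serve register accesses until the peer closes."""
    while True:
        data = recv_exact(sock, REQ.size)
        if len(data) < REQ.size:
            return
        op, offset, value = REQ.unpack(data)
        if op == 1:
            regs[offset] = value
        # Every access gets a reply, so the sender can block on it.
        sock.sendall(RESP.pack(regs.get(offset, 0)))

def mmio_access(sock, op, offset, value=0):
    """VMM side: blocks until the reply arrives, as a paused vCPU would."""
    sock.sendall(REQ.pack(op, offset, value))
    return RESP.unpack(recv_exact(sock, RESP.size))[0]

vmm_sock, dev_sock = socket.socketpair()
regs = {}
t = threading.Thread(target=device_loop, args=(dev_sock, regs))
t.start()
mmio_access(vmm_sock, 1, 0x10, 0xABCD)      # register write
readback = mmio_access(vmm_sock, 0, 0x10)   # register read
vmm_sock.close()
t.join()
dev_sock.close()
```

The round trip on every access is what makes this path expensive.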
Asynchronous doorbells

Devices often have doorbell registers, allowing the driver to inform the
device that new requests are ready for processing. The vCPU does not
need to wait since the access is a posted write.

The kvm.ko ioeventfd mechanism can be used to implement asynchronous
doorbells.
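
The posted-write idea can be sketched as follows. A pipe stands in for
an eventfd here so that the sketch does not depend on KVM; the queue
and request names are hypothetical.

```python
import os
import threading

# A pipe stands in for an eventfd so the sketch runs without KVM.
doorbell_r, doorbell_w = os.pipe()
queue = []        # hypothetical request queue shared with the device
processed = []

def device_thread():
    # Block until the doorbell is rung, then drain whatever is queued.
    os.read(doorbell_r, 1)
    while queue:
        processed.append(queue.pop(0))

t = threading.Thread(target=device_thread)
t.start()

queue.extend(["req-0", "req-1"])   # driver publishes requests first
os.write(doorbell_w, b"\x01")      # posted write: no reply is awaited
t.join()
os.close(doorbell_r)
os.close(doorbell_w)
```

Unlike a synchronous register access, the writer never waits for the
device to act on the notification.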

Shared device memory
~~~~~~~~~~~~~~~~~~~~
Devices may have memory-like regions that the CPU can access (such as
PCI Memory BARs). The device emulation process therefore needs to share
a region of its memory space with the VMM so the guest can access it.
This mechanism also allows device emulation to busy wait (poll) instead
of using synchronous MMIO/PIO accesses or asynchronous doorbells for
notifications.

Direct Memory Access (DMA)
~~~~~~~~~~~~~~~~~~~~~~~~~~
Devices often require read and write access to a memory address space
belonging to the CPU. This allows network cards to transmit packet
payloads that are located in guest RAM, for example.

Early out-of-process device emulation interfaces simply shared guest
RAM. This allowed DMA to any guest physical memory address. More advanced
IOMMU and address space identifier mechanisms are now becoming
ubiquitous. Therefore, new out-of-process device emulation interfaces
should incorporate IOMMU functionality.

The key requirement for IOMMU mechanisms is allowing the VMM to grant
access to a region of memory so the device emulation process can read
from and/or write to it.
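
A minimal sketch of this grant model, assuming an identity mapping
between I/O addresses and offsets into a bytearray that stands in for
shared guest RAM. The class and method names are hypothetical.

```python
class DmaWindow:
    """A region of guest memory the VMM has granted to the device process."""
    def __init__(self, iova, size, readable, writable):
        self.iova, self.size = iova, size
        self.readable, self.writable = readable, writable

class DmaMap:
    def __init__(self, guest_ram):
        self.ram = guest_ram      # bytearray standing in for guest RAM
        self.windows = []

    def grant(self, iova, size, readable=True, writable=False):
        self.windows.append(DmaWindow(iova, size, readable, writable))

    def revoke(self, iova):
        self.windows = [w for w in self.windows if w.iova != iova]

    def _check(self, addr, length, write):
        # The access must fall entirely inside one granted window
        # that carries the required permission.
        for w in self.windows:
            inside = w.iova <= addr and addr + length <= w.iova + w.size
            allowed = w.writable if write else w.readable
            if inside and allowed:
                return
        raise PermissionError(f"DMA to {addr:#x}+{length} not granted")

    def read(self, addr, length):
        self._check(addr, length, write=False)
        return bytes(self.ram[addr:addr + length])

    def write(self, addr, data):
        self._check(addr, len(data), write=True)
        self.ram[addr:addr + len(data)] = data
```

In a real interface the check would be enforced by what the VMM
actually shares, not by the device emulation process policing itself.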

Interrupts
~~~~~~~~~~
Devices notify the CPU using interrupts. An interrupt is simply a
message sent by the device emulation process to the VMM. Interrupt
configuration is flexible on modern devices, meaning the driver may be
able to select the number of interrupts and a mapping (using one
interrupt with multiple event sources). This can be implemented using
the Linux eventfd mechanism or via in-band device emulation protocol
messages, for example.

Extensibility for new bus types
```````````````````````````````
It should be possible to support multiple bus types. vhost-user only
supports vhost devices. VFIO is more extensible but currently focussed
on PCI devices. It is likely that QEMU SysBus devices will be desirable
for implementing ad-hoc out-of-process devices (especially for
System-on-Chip target platforms).

Bus-level APIs, not protocol bindings
`````````````````````````````````````
Developers should not need to learn the out-of-process device emulation
protocol (vfio-user, etc). APIs should focus on bus-level concepts such
as defining VIRTIO or PCI devices rather than protocol bindings for
dealing with protocol messages, file descriptor passing, and shared
memory.

In other words, developers should be thinking in terms of the problem
domain, not worrying about how out-of-process device emulation is
implemented. The protocol should be hidden behind bus-level APIs.

Multi-threading support from the beginning
``````````````````````````````````````````
Threading issues arise often in device emulation because asynchronous
requests or multi-queue devices can be implemented using threads.
Therefore it is necessary to clearly document what threading models are
supported and how device lifecycle operations like reset interact with
in-flight requests.

Live migration, live upgrade, and crash recovery
------------------------------------------------
There are several related issues around device state and restarting the
device emulation program without disrupting the guest.

Live migration
``````````````
Live migration transfers the state of a device from one device emulation
process to another (typically running on another host). This requires
the following functionality:

Quiescing the device
~~~~~~~~~~~~~~~~~~~~
Some devices can be live migrated at any point in time without any
preparation, while others must be put into a quiescent state to avoid
issues. An example is a storage controller that has a write request in
flight. It is not safe to live migrate until the write request has
completed or been canceled. Failure to wait might result in data
corruption if the write takes effect after the destination has resumed
execution.

Therefore it is necessary to quiesce a device. After this point there is
no further device activity and no guest-visible changes will be made by
the device.
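
One way to picture quiescing is a device that counts in-flight requests
and refuses new ones once asked to stop. This is only a sketch with
hypothetical names; a real device would also need to handle request
cancellation.

```python
import threading

class Device:
    def __init__(self):
        self.cond = threading.Condition()
        self.in_flight = 0
        self.quiesced = False

    def submit(self):
        """Start a request; refused once the device is quiesced."""
        with self.cond:
            if self.quiesced:
                return False
            self.in_flight += 1
            return True

    def complete(self):
        """Finish a request and wake anyone waiting for the drain."""
        with self.cond:
            self.in_flight -= 1
            self.cond.notify_all()

    def quiesce(self):
        """Stop accepting requests and wait for in-flight ones to drain."""
        with self.cond:
            self.quiesced = True
            self.cond.wait_for(lambda: self.in_flight == 0)
```

After quiesce() returns, no request is outstanding and none can be
started, so the device state can be saved safely.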

Saving/loading device state
~~~~~~~~~~~~~~~~~~~~~~~~~~~
It must be possible to save and load device state. Device state includes
the contents of hardware registers as well as device-internal state
necessary for resuming operation.

It is typically necessary to determine whether the device emulation
processes on the migration source and destination are compatible before
attempting migration. This avoids migration failure when the destination
tries to load the device state and discovers it doesn't support it. It
may be desirable to support loading device state that was generated by a
different implementation of the same device type (for example, two
virtio-net implementations).
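
As a rough illustration, device state could carry a small header that
the destination inspects before accepting it. The JSON encoding, the
field names, and the version numbers below are assumptions made for
this sketch only.

```python
import json

DEVICE_TYPE = "virtio-net"      # device type taken from the example above
SUPPORTED_VERSIONS = {1, 2}     # hypothetical state format versions

def save_state(regs, version=2):
    """Serialize device state together with a type and version header."""
    blob = {"type": DEVICE_TYPE, "version": version, "regs": regs}
    return json.dumps(blob).encode()

def is_compatible(blob):
    """Check the header without applying any of the state."""
    header = json.loads(blob)
    return (header.get("type") == DEVICE_TYPE
            and header.get("version") in SUPPORTED_VERSIONS)

def load_state(blob):
    if not is_compatible(blob):
        raise ValueError("incompatible device state")
    return json.loads(blob)["regs"]
```

Keying compatibility on the device type and state version, rather than
on the program that produced the state, is what would let two
implementations of the same device exchange state.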

Dirty memory logging
~~~~~~~~~~~~~~~~~~~~
Pre-copy live migration starts with an iterative phase where dirty
memory pages are copied from the migration source to the destination
host. Devices need to participate in dirty memory logging so that all
written pages are transferred to the destination and no pages are
"missed".

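A dirty log can be as simple as one bit per guest page that the device
sets whenever it writes to guest RAM. The sketch below assumes 4 KiB
pages and hypothetical method names.

```python
PAGE_SIZE = 4096

class DirtyLog:
    def __init__(self, ram_size):
        # One bit per page, rounded up to whole bytes.
        self.bits = bytearray((ram_size // PAGE_SIZE + 7) // 8)

    def mark(self, addr, length):
        """Record every page touched by a device write of `length` bytes."""
        if length <= 0:
            return
        first = addr // PAGE_SIZE
        last = (addr + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self.bits[page // 8] |= 1 << (page % 8)

    def fetch_and_clear(self):
        """Return the dirty page numbers and reset the log for the next pass."""
        dirty = [i for i in range(len(self.bits) * 8)
                 if (self.bits[i // 8] >> (i % 8)) & 1]
        self.bits = bytearray(len(self.bits))
        return dirty
```

A write that straddles a page boundary must mark both pages, which is
the kind of detail that is easy to get wrong.
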
Crash recovery
``````````````
If the device emulation process crashes it should be possible to restart
it and resume device emulation without disrupting the guest (aside from
a possible pause during reconnection).

Doing this requires maintaining device state (contents of hardware
registers, etc) outside the device emulation process. This way the state
remains even if the process crashes and it can be resumed when a new
process starts.
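
One possible way to keep state outside the process is a shared mapping
of a file that a restarted process can map again. The register layout
and class name below are hypothetical, and a real design might use
memory owned by the VMM instead of a file.

```python
import mmap
import os
import struct
import tempfile

REG_LAYOUT = struct.Struct("<II")   # hypothetical: status, queue_index

class PersistentRegs:
    """Register block backed by a file so it outlives the process."""
    def __init__(self, path):
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        if os.fstat(fd).st_size < REG_LAYOUT.size:
            os.ftruncate(fd, REG_LAYOUT.size)   # new file reads as zeros
        self.map = mmap.mmap(fd, REG_LAYOUT.size)
        os.close(fd)                            # the mapping stays valid

    def store(self, status, queue_index):
        self.map[:REG_LAYOUT.size] = REG_LAYOUT.pack(status, queue_index)

    def load(self):
        return REG_LAYOUT.unpack(self.map[:REG_LAYOUT.size])

    def close(self):
        self.map.close()

path = os.path.join(tempfile.mkdtemp(), "regs")
first = PersistentRegs(path)
first.store(1, 42)
first.close()                       # stands in for the process crashing
second = PersistentRegs(path)       # the restarted process
recovered = second.load()
second.close()
```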

Live upgrade
````````````
It must be possible to upgrade the device emulation process and the VMM
without disrupting the guest. Upgrading the device emulation process is
similar to crash recovery in that the process terminates and a new one
resumes with the previous state.

Device versioning
`````````````````
The guest-visible aspects of the device must be versioned. In the
simplest case the device emulation program would have a
--compat-version=N command-line option that controls which version of
the device the guest sees. When guest-visible changes are made to the
program the version number must be increased.

By controlling the guest-visible device behavior it is possible to
save/load and live migrate reliably. Otherwise loading device state in a
newer device emulation program could affect the running guest. Guest
drivers typically are not prepared for the device to change underneath
them and doing so could result in guest crashes or data corruption.
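
A sketch of how the --compat-version=N option might gate guest-visible
features. The feature names and version numbers are invented for the
example.

```python
import argparse

LATEST_VERSION = 2
# Hypothetical guest-visible feature sets, one per device version.
FEATURES = {1: {"csum"}, 2: {"csum", "multiqueue"}}

def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--compat-version", type=int,
                        default=LATEST_VERSION, choices=sorted(FEATURES))
    return parser.parse_args(argv)

def guest_visible_features(argv):
    """Return the features the guest sees for the requested version."""
    return FEATURES[parse_args(argv).compat_version]
```

A newer program launched with --compat-version=1 then presents the same
device the guest saw before the upgrade.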

Security
--------
The trust model
```````````````
The VMM must not trust the device emulation program. This is key to
implementing privilege separation and the principle of least privilege.
If a compromised device emulation program is able to gain control of the
VMM then out-of-process device emulation has failed to provide isolation
between devices.

The device emulation program must not trust the VMM to the extent that
this is possible. For example, it must validate inputs so that the VMM
cannot gain control of the device emulation process through memory
corruptions or other bugs. This makes it so that even if the VMM has
been compromised, access to device resources and associated system calls
still requires further compromising the device emulation process.

Unprivileged operation
``````````````````````
The device emulation program should run unprivileged to the extent that
this is possible. If special permissions are required to access hardware
resources then these resources can sometimes be provided via file
descriptor passing by a more privileged parent process.
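
File descriptor passing over a UNIX domain socket might look like the
sketch below. Both ends live in one process here purely for
illustration, and a temporary file stands in for the privileged
resource. It assumes Python 3.9 or later on a UNIX-like host.

```python
import os
import socket
import tempfile

parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Privileged side: open the resource and hand over only the descriptor.
with tempfile.TemporaryFile() as resource:
    resource.write(b"disk image bytes")
    resource.flush()
    socket.send_fds(parent, [b"fd"], [resource.fileno()])

# Unprivileged side: receive the descriptor and use it without ever
# opening a path itself.
msg, fds, _flags, _addr = socket.recv_fds(child, 16, 1)
received = os.pread(fds[0], 4, 0)
os.close(fds[0])
parent.close()
child.close()
```

The receiver never needs permission to open the underlying path, only
the descriptor it was handed.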

Sandboxing
``````````
Operating system sandboxing mechanisms can be applied to device
emulation processes more effectively than monolithic VMMs. Seccomp can
limit the Linux system calls that may be invoked. SELinux can restrict
access to system resources.

Sandboxing is a common task that most device emulation programs need.
Therefore it is a good candidate for a library or launcher tool that is
shared by device emulation programs.

Management
----------
Command-line interface
``````````````````````
A common command-line interface should be defined where possible. For
example, vhost-user's standard --socket-path=PATH argument makes it easy
to launch any vhost-user device backend. Protocol-specific options (e.g.
socket path) and device type-specific options (e.g. virtio-net) can be
standardized.

Some options are necessarily specific to the device emulation program
and therefore cannot be standardized.

The advantage of standard options is that management tools like libvirt
can launch the device emulation programs without further user
configuration.

RPC interface
`````````````
It may be necessary to issue commands at runtime. Examples include
adjusting throttling limits, enabling/disabling logging, etc. These
operations can be performed over an RPC interface.

Various RPC interfaces are used throughout open source virtualization
software. Adopting a widely-used RPC protocol and standardizing commands
is beneficial because it makes it easy to communicate with the software
and management tools can support them relatively easily.
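
For illustration, a runtime command handler could dispatch JSON
requests as sketched below. The request shape is loosely modelled on
QMP's "execute"/"arguments" convention, while the command names are
hypothetical.

```python
import json

class DeviceControl:
    def __init__(self):
        self.throttle_iops = 0      # 0 means unlimited in this sketch
        self.logging = False

    def set_throttle(self, iops):
        self.throttle_iops = int(iops)

    def set_logging(self, enabled):
        self.logging = bool(enabled)

    def handle(self, line):
        """Dispatch one JSON request and return a JSON reply."""
        req = json.loads(line)
        handlers = {"set-throttle": self.set_throttle,
                    "set-logging": self.set_logging}
        handler = handlers.get(req.get("execute"))
        if handler is None:
            return json.dumps({"error": "unknown command"})
        try:
            handler(**req.get("arguments", {}))
        except (TypeError, ValueError):
            return json.dumps({"error": "invalid arguments"})
        return json.dumps({"return": {}})
```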

Conclusion
----------
This was largely a brain dump but I hope it is useful food for thought
as out-of-process device emulation interfaces are designed and
developed. There is a lot more to it than simply implementing a protocol
for device register accesses and guest RAM DMA. Developing open source
libraries in Rust and C that can be used as needed will ensure that
out-of-process devices are high-quality and easy for users to deploy.
