* [virtio-dev] Constraining where a guest may allocate virtio accessible resources
@ 2020-06-17 17:31 Alex Bennée
  2020-06-17 18:01 ` [virtio-dev] " Jan Kiszka
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Alex Bennée @ 2020-06-17 17:31 UTC (permalink / raw)
  To: virtio-dev
  Cc: David Hildenbrand, jan.kiszka, Srivatsa Vaddagiri,
	Azzedine Touzni, François Ozog, Ilias Apalodimas, Soni,
	Trilok, Dr. David Alan Gilbert, Stefan Hajnoczi,
	Michael S. Tsirkin, Jean-Philippe Brucker


Hi,

This follows on from the discussion in the last thread I raised:

  Subject: Backend libraries for VirtIO device emulation
  Date: Fri, 06 Mar 2020 18:33:57 +0000
  Message-ID: <874kv15o4q.fsf@linaro.org>

To support the concept of a VirtIO backend having limited visibility of
a guest's memory space there needs to be some mechanism to limit where
that guest may place things. A simple VirtIO device can be expressed
purely in virtual resources, for example:

   * status, feature and config fields
   * notification/doorbell
   * one or more virtqueues

Using a PCI backend, the location of everything but the virtqueues is
controlled by the mapping of the PCI device, so it is something that is
controllable by the host/hypervisor. However the guest is free to
allocate the virtqueues anywhere in its address space that is backed by
system RAM.
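
For reference, the virtqueues are just plain data structures in guest
memory. A rough sketch of the split virtqueue layout the guest allocates
(simplified from the virtio 1.1 spec; the real definitions use explicit
little-endian types):

    /* Simplified split virtqueue structures. The descriptor table,
     * available ring and used ring are ordinary guest memory, so
     * nothing stops the guest placing them anywhere in system RAM. */
    #include <stdint.h>

    struct vring_desc {
            uint64_t addr;   /* guest physical address of the buffer */
            uint32_t len;
            uint16_t flags;
            uint16_t next;
    };

    struct vring_avail {
            uint16_t flags;
            uint16_t idx;
            uint16_t ring[];
    };

    struct vring_used_elem {
            uint32_t id;
            uint32_t len;
    };

    struct vring_used {
            uint16_t flags;
            uint16_t idx;
            struct vring_used_elem ring[];
    };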

In theory this shouldn't matter because sharing virtual pages is just a
matter of putting the appropriate translations in place. However there
are multiple ways the host and guest may interact:

* QEMU TCG

QEMU sees a block of system memory in its virtual address space that
has a one-to-one mapping with the guest's physical address space. If
QEMU wants to share a subset of that address space it can only
realistically do so for a contiguous region of its address space, which
implies the guest must use a contiguous region of its physical address
space.

* QEMU KVM

The situation here is broadly the same - although both QEMU and the
guest are seeing their own virtual views of a linear address space
which may well actually be a fragmented set of physical pages on the
host.

KVM based guests have additional constraints if they ever want to access
real hardware in the host, as you need to ensure any address accessed by
the guest can eventually be translated into an address that can
physically reach the bus the device sits on (for device pass-through).
The area also has to be DMA coherent so updates from the bus are
reliably visible to software accessing the same address space.

* Xen (and other type-1's?)

Here the situation is a little different because the guest explicitly
makes its pages visible to other domains by way of grant tables. The
guest is still free to use whatever parts of its address space it wishes
to. Other domains then request access to those pages via the hypervisor.

In theory the requester is free to map the granted pages anywhere in
its own address space. However there are differences between the
architectures on how well this is supported.

So I think this makes a case for having a mechanism by which the guest
can restrict its allocations to a specific area of the guest physical
address space. The question is then: what is the best way to inform the
guest kernel of the limitation?

Option 1 - Kernel Command Line
==============================

This isn't without precedent - the kernel supports options like "memmap"
which, with the appropriate amount of crafting, can be used to carve out
sections of bad RAM from the physical address space. Other formulations
can be used to mark specific areas of the address space as particular
types of memory.
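
For example (the '$' form marks a range as reserved so the kernel keeps
its hands off it; the '$' usually needs escaping when it passes through
a bootloader config):

    memmap=16M$0x40000000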

However there are cons to this approach, as it then becomes a job for
whatever builds the VMM command lines to ensure that both the backend
and the kernel know where things are. It is also very Linux centric and
doesn't solve the problem for other guest OSes. Considering the rest of
VirtIO can be made discoverable this seems like it would be a backward
step.

Option 2 - Additional Platform Data
===================================

This would mean extending something like the device tree or ACPI tables
to define regions of memory that inform the low level memory allocation
routines where they may allocate from. There is already the concept of
"dma-ranges" in the device tree, which can be a per-device property
defining the region of space that is DMA coherent for a device.
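
As a rough sketch of plumbing that already exists in Linux - assuming a
device tree "reserved-memory" node with a "shared-dma-pool" compatible
and a "memory-region" phandle on the device node - a driver can be
steered to do its coherent allocations (e.g. for virtqueues) from that
carve-out:

    /* Sketch only; assumes a DT fragment along the lines of:
     *
     *   reserved-memory {
     *       virtio_shared: virtio@80000000 {
     *           compatible = "shared-dma-pool";
     *           reg = <0x80000000 0x400000>;
     *           no-map;
     *       };
     *   };
     *
     * with "memory-region = <&virtio_shared>;" on the device node.
     */
    #include <linux/dma-mapping.h>
    #include <linux/of_reserved_mem.h>

    static int probe_with_carveout(struct device *dev)
    {
            dma_addr_t dma;
            void *vaddr;
            int ret;

            /* Attach the device to its reserved-memory region so the
             * DMA API allocates from the carve-out, not general RAM. */
            ret = of_reserved_mem_device_init(dev);
            if (ret)
                    return ret;

            /* Coherent allocations now land in the shared region that
             * the backend has been given access to. */
            vaddr = dma_alloc_coherent(dev, PAGE_SIZE, &dma, GFP_KERNEL);
            if (!vaddr) {
                    of_reserved_mem_device_release(dev);
                    return -ENOMEM;
            }

            /* ... lay out the rings at vaddr/dma ... */
            return 0;
    }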

There is the question of how you tie the regions declared here to the
eventual instantiation of the VirtIO devices.

For a fully distributed set of backends (one backend per device per
worker VM) you would need several different regions. Would each region
be tied to a particular device, or would there just be a set of areas
the guest allocates from in sequence?

Option 3 - Abusing PCI Regions
==============================

One of the reasons to use the VirtIO PCI backend is to help with
automatic probing and setup. Could we define a new PCI region which on
the backend side just maps to RAM but which, from the front-end's point
of view, is a region in which it can allocate its virtqueues? Could we
go one step further and just let the host define and allocate the
virtqueues in the reserved PCI space and pass their base to the guest
somehow?
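
To make the front-end side concrete, here is a very rough and entirely
hypothetical sketch - the BAR number and the idea of a "queue memory"
region are invented, and it glosses over the fact that you would be
treating __iomem space as ordinary memory (which is part of why this is
"abusing" PCI regions):

    #include <linux/pci.h>
    #include <linux/virtio_ring.h>

    /* Hypothetical: the device exposes a queue-memory window in BAR 4
     * and the front-end lays its vring out inside it instead of in
     * system RAM. */
    static int map_queue_region(struct pci_dev *pdev, struct vring *vr,
                                unsigned int num)
    {
            void __iomem *base;

            if (vring_size(num, PAGE_SIZE) > pci_resource_len(pdev, 4))
                    return -ENOSPC;

            base = pci_iomap(pdev, 4, 0);
            if (!base)
                    return -ENOMEM;

            /* Descriptor table / avail / used rings at the start of
             * the device-provided window. */
            vring_init(vr, num, (void __force *)base, PAGE_SIZE);
            return 0;
    }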

Option 4 - Extend VirtIO Config
===============================

Another approach would be to extend the VirtIO configuration and
start-up handshake to supply these limitations to the guest. This could
be handled by the addition of a feature bit (VIRTIO_F_HOST_QUEUE?) and
additional configuration information.
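
Purely as a strawman - the feature bit number and layout below are
invented, not part of the spec - the negotiated config space could carry
the window:

    #include <linux/types.h>

    /* Hypothetical extension, not in the virtio specification. */
    #define VIRTIO_F_HOST_QUEUE 40

    struct virtio_host_queue_config {
            __le64 region_base;  /* guest physical base the driver must use */
            __le64 region_size;  /* size of the window for virtqueue memory */
    };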

One problem I can foresee is that device initialisation is usually done
fairly late in the start-up of a kernel, by which time any memory zoning
restrictions will likely need to have informed the kernel's low level
memory management. Does that mean we would have to combine such a
feature with another method anyway?

Option 5 - Additional Device
============================

The final approach would be to tie the allocation of virtqueues to
memory regions as defined by additional devices. For example the
proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
a fixed non-mappable region of the address space. Other proposals like
virtio-mem allow for hot plugging of "physical" memory into the guest
(conveniently treatable as separate shareable memory objects for QEMU
;-).


Closing Thoughts and Open Questions
===================================

Currently all of this considers just the virtqueues themselves, but of
course only a subset of devices interact purely by virtqueue messages.
Network and block devices often end up filling in additional structures
that can be spread across the whole of system memory. To achieve better
isolation you either need to ensure that specific bits of kernel
allocation are done in certain regions (e.g. the block cache in the
"shared" region) or implement some sort of bounce buffer [1] that allows
you to bring data from backend to frontend (which is more like the
channel concept of Xen's PV).
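
By bounce buffer I mean something in the spirit of swiotlb: data is
copied into a pre-shared window and only window addresses ever end up in
the descriptors. A minimal hand-rolled sketch of the idea (a real
implementation would want to reuse swiotlb rather than invent this):

    #include <linux/dma-mapping.h>
    #include <linux/string.h>

    /* 'window' is the only memory the backend can see. */
    struct shared_window {
            void *va;          /* guest mapping of the window */
            dma_addr_t base;   /* address the backend knows it by */
            size_t used;
            size_t size;
    };

    /* Copy an arbitrary kernel buffer into the window and return the
     * address to put in the virtqueue descriptor (0 if full - a real
     * implementation needs proper allocation and recycling). */
    static dma_addr_t bounce_out(struct shared_window *w,
                                 const void *buf, size_t len)
    {
            size_t off = w->used;

            if (len > w->size - off)
                    return 0;

            memcpy(w->va + off, buf, len);
            w->used += len;
            return w->base + off;
    }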

I suspect the solution will end up being a combination of all of these
approaches. The setup of different systems might mean we need a
plethora of ways to carve out and define regions in ways a kernel can
understand and make decisions about.

I think there will always have to be an element of VirtIO config
involved as that is *the* mechanism by which front/back end negotiate if
they can get up and running in a way they are both happy with.

One potential approach would be to introduce the concept of a region id
at the VirtIO config level - simply a reasonably unique magic number
that the virtio driver passes down into the kernel when requesting
memory for its virtqueues. It could then be left to the kernel to use
that id when identifying the physical address range to allocate from.
This is a fairly loose binding between the driver level and the kernel
level, but perhaps that is preferable to allow for flexibility about how
such regions are discovered by kernels?
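
To make that concrete, the binding might look something like this (names
invented, purely illustrative):

    /* Hypothetical: the transport hands the kernel an opaque region id
     * and the kernel decides which physical range - discovered via
     * DT/ACPI/an extra device - that id maps to. */
    void *virtio_alloc_queue_mem(struct virtio_device *vdev, u32 region_id,
                                 size_t size, dma_addr_t *dma);

    /* driver side:
     *   ring = virtio_alloc_queue_mem(vdev, cfg->region_id,
     *                                 vring_size(num, PAGE_SIZE), &dma);
     */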

I hope this message hasn't rambled on too much. I feel this is a complex
topic and I want to be sure I've thought through all the potential
options before starting to prototype a solution. For those that have
made it this far the final questions are:

  - is constraining guest allocation of virtqueues a reasonable requirement?

  - could virtqueues ever be directly host/hypervisor assigned?

  - should there be a tight or loose coupling between front-end driver
    and kernel/hypervisor support for allocating memory?

Of course if this is all solvable with existing code I'd be more than
happy but please let me know how ;-)

Regards,


-- 
Alex Bennée

[1] Example bounce buffer approach

Subject: [PATCH 0/5] virtio on Type-1 hypervisor
Message-Id: <1588073958-1793-1-git-send-email-vatsa@codeaurora.org>


* [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-17 17:31 [virtio-dev] Constraining where a guest may allocate virtio accessible resources Alex Bennée
@ 2020-06-17 18:01 ` Jan Kiszka
  2020-06-18 13:29   ` Stefan Hajnoczi
                     ` (2 more replies)
  2020-06-18  7:30 ` Michael S. Tsirkin
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 18+ messages in thread
From: Jan Kiszka @ 2020-06-17 18:01 UTC (permalink / raw)
  To: Alex Bennée, virtio-dev
  Cc: David Hildenbrand, Srivatsa Vaddagiri, Azzedine Touzni,
	François Ozog, Ilias Apalodimas, Soni, Trilok,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Michael S. Tsirkin,
	Jean-Philippe Brucker

On 17.06.20 19:31, Alex Bennée wrote:
> 
> Hi,
> 
> This follows on from the discussion in the last thread I raised:
> 
>   Subject: Backend libraries for VirtIO device emulation
>   Date: Fri, 06 Mar 2020 18:33:57 +0000
>   Message-ID: <874kv15o4q.fsf@linaro.org>
> 
> To support the concept of a VirtIO backend having limited visibility of
> a guests memory space there needs to be some mechanism to limit the
> where that guest may place things. A simple VirtIO device can be
> expressed purely in virt resources, for example:
> 
>    * status, feature and config fields
>    * notification/doorbell
>    * one or more virtqueues
> 
> Using a PCI backend the location of everything but the virtqueues it
> controlled by the mapping of the PCI device so something that is
> controllable by the host/hypervisor. However the guest is free to
> allocate the virtqueues anywhere in the virtual address space of system
> RAM.
> 
> In theory this shouldn't matter because sharing virtual pages is just a
> matter of putting the appropriate translations in place. However there
> are multiple ways the host and guest may interact:
> 
> * QEMU TCG
> 
> QEMU sees a block of system memory in it's virtual address space that
> has a one to one mapping with the guests physical address space. If QEMU
> want to share a subset of that address space it can only realistically
> do it for a contiguous region of it's address space which implies the
> guest must use a contiguous region of it's physical address space.
> 
> * QEMU KVM
> 
> The situation here is broadly the same - although both QEMU and the
> guest are seeing a their own virtual views of a linear address space
> which may well actually be a fragmented set of physical pages on the
> host.
> 
> KVM based guests have additional constraints if they ever want to access
> real hardware in the host as you need to ensure any address accessed by
> the guest can be eventually translated into an address that can
> physically access the bus which a device in one (for device
> pass-through). The area also has to be DMA coherent so updates from a
> bus are reliably visible to software accessing the same address space.
> 
> * Xen (and other type-1's?)
> 
> Here the situation is a little different because the guest explicitly
> makes it's pages visible to other domains by way of grant tables. The
> guest is still free to use whatever parts of its address space it wishes
> to. Other domains then request access to those pages via the hypervisor.
> 
> In theory the requester is free to map the granted pages anywhere in
> its own address space. However there are differences between the
> architectures on how well this is supported.
> 
> So I think this makes a case for having a mechanism by which the guest
> can restrict it's allocation to a specific area of the guest physical
> address space. The question is then what is the best way to inform the
> guest kernel of the limitation?
> 
> Option 1 - Kernel Command Line
> ==============================
> 
> This isn't without precedent - the kernel supports options like "memmap"
> which can with the appropriate amount of crafting be used to carve out
> sections of bad ram from the physical address space. Other formulations
> can be used to mark specific areas of the address space as particular
> types of memory.  
> 
> However there are cons to this approach as it then becomes a job for
> whatever builds the VMM command lines to ensure the both the backend and
> the kernel know where things are. It is also very Linux centric and
> doesn't solve the problem for other guest OSes. Considering the rest of
> VirtIO can be made discover-able this seems like it would be a backward
> step.
> 
> Option 2 - Additional Platform Data
> ===================================
> 
> This would be extending using something like device tree or ACPI tables
> which could define regions of memory that would inform the low level
> memory allocation routines where they could allocate from. There is
> already of the concept of "dma-ranges" in device tree which can be a
> per-device property which defines the region of space that is DMA
> coherent for a device.
> 
> There is the question of how you tie regions declared here with the
> eventual instantiating of the VirtIO devices?
> 
> For a fully distributed set of backends (one backend per device per
> worker VM) you would need several different regions. Would each region
> be tied to each device or just a set of areas the guest would allocate
> from in sequence?
> 
> Option 3 - Abusing PCI Regions
> ==============================
> 
> One of the reasons to use the VirtIO PCI backend it to help with
> automatic probing and setup. Could we define a new PCI region which on
> backend just maps to RAM but from the front-ends point of view is a
> region it can allocate it's virtqueues? Could we go one step further and
> just let the host to define and allocate the virtqueue in the reserved
> PCI space and pass the base of it somehow?
> 
> Options 4 - Extend VirtIO Config
> ================================
> 
> Another approach would be to extend the VirtIO configuration and
> start-up handshake to supply these limitations to the guest. This could
> be handled by the addition of a feature bit (VIRTIO_F_HOST_QUEUE?) and
> additional configuration information.
> 
> One problem I can foresee is device initialisation is usually done
> fairly late in the start-up of a kernel by which time any memory zoning
> restrictions will likely need to have informed the kernels low level
> memory management. Does that mean we would have to combine such a
> feature behaviour with a another method anyway?
> 
> Option 5 - Additional Device
> ============================
> 
> The final approach would be to tie the allocation of virtqueues to
> memory regions as defined by additional devices. For example the
> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
> a fixed non-mappable region of the address space. Other proposals like
> virtio-mem allow for hot plugging of "physical" memory into the guest
> (conveniently treatable as separate shareable memory objects for QEMU
> ;-).
> 

I think you forgot one approach: virtual IOMMU. That is the advanced
form of the grant table approach. The backend still "sees" the full
address space of the frontend, but it will not be able to access all of
it and there might even be a translation going on. Well, like IOMMUs work.

However, this implies dynamics that are under guest control, namely of
the frontend guest. And such dynamics can be counterproductive for
certain scenarios. That's where the idea of static windows of shared
memory came up.
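
Seen from the frontend, this is just the normal DMA API: only what the
guest maps gets a translation programmed into the virtual IOMMU, and
only for as long as it stays mapped. Roughly (standard Linux calls, just
as a sketch):

    #include <linux/dma-mapping.h>

    /* Frontend side: mapping creates the vIOMMU translation that lets
     * the backend reach this buffer; dma_unmap_single() revokes it. */
    static dma_addr_t expose_buffer(struct device *dev, void *buf,
                                    size_t len)
    {
            /* The returned IOVA is what goes into the descriptor. */
            return dma_map_single(dev, buf, len, DMA_BIDIRECTIONAL);
    }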

> 
> Closing Thoughts and Open Questions
> ===================================
> 
> Currently all of this is considering just virtqueues themselves but of
> course only a subset of devices interact purely by virtqueue messages.
> Network and Block devices often end up filling up additional structures
> in memory that are usually across the whole of system memory. To achieve
> better isolation you either need to ensure that specific bits of kernel
> allocation are done in certain regions (i.e. block cache in "shared"
> region) or implement some sort of bounce buffer [1] that allows you to bring
> data from backend to frontend (which is more like the channel concept of
> Xen's PV).

For [1], look at https://lkml.org/lkml/2020/3/26/700 or at
http://git.kiszka.org/?p=linux.git;a=blob;f=drivers/virtio/virtio_ivshmem.c;hb=refs/heads/queues/jailhouse
(which should be using swiotlb one day).

> 
> I suspect the solution will end up being a combination of all of these
> approaches. There setup of different systems might mean we need a
> plethora of ways to carve out and define regions in ways a kernel can
> understand and make decisions about.
> 
> I think there will always have to be an element of VirtIO config
> involved as that is *the* mechanism by which front/back end negotiate if
> they can get up and running in a way they are both happy with.
> 
> One potential approach would be to introduce the concept of a region id
> at the VirtIO config level which is simply a reasonably unique magic
> number that virtio driver passes down into the kernel when requesting
> memory for it's virtqueues. It could then be left to the kernel to
> associate use that id when identifying the physical address range to
> allocate from. This seems a bit of a loose binding between the driver
> level and the kernel level but perhaps that is preferable to allow for
> flexibility about how such regions are discovered by kernels?
> 
> I hope this message hasn't rambled on to much. I feel this is a complex
> topic and I'm want to be sure I've thought through all the potential
> options before starting to prototype a solution. For those that have
> made it this far the final questions are:
> 
>   - is constraining guest allocation of virtqueues a reasonable requirement?
> 
>   - could virtqueues ever be directly host/hypervisor assigned?
> 
>   - should there be a tight or loose coupling between front-end driver
>     and kernel/hypervisor support for allocating memory?
> 
> Of course if this is all solvable with existing code I'd be more than
> happy but please let me know how ;-)
> 

Queues are a central element of virtio, but there is a (maintainability
& security) benefit if you can keep them away from the hosting
hypervisor, limiting their interpretation and negotiation to the backend
driver in a host process or in a backend guest VM. So I would be careful
about coupling things too tightly.

One of the issues I see with virtio for use in minimalistic hypervisors
is the need to be aware of the different virtio devices when using the
PCI or MMIO transports. That's where a shared memory transport comes
into play.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


* [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-17 17:31 [virtio-dev] Constraining where a guest may allocate virtio accessible resources Alex Bennée
  2020-06-17 18:01 ` [virtio-dev] " Jan Kiszka
@ 2020-06-18  7:30 ` Michael S. Tsirkin
  2020-06-19 18:20   ` Alex Bennée
  2020-06-18 13:25 ` Stefan Hajnoczi
  2020-06-19  8:02 ` Jean-Philippe Brucker
  3 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2020-06-18  7:30 UTC (permalink / raw)
  To: Alex Bennée
  Cc: virtio-dev, David Hildenbrand, jan.kiszka, Srivatsa Vaddagiri,
	Azzedine Touzni, François Ozog, Ilias Apalodimas, Soni,
	Trilok, Dr. David Alan Gilbert, Stefan Hajnoczi,
	Jean-Philippe Brucker

On Wed, Jun 17, 2020 at 06:31:15PM +0100, Alex Bennée wrote:
> 
> Hi,
> 
> This follows on from the discussion in the last thread I raised:
> 
>   Subject: Backend libraries for VirtIO device emulation
>   Date: Fri, 06 Mar 2020 18:33:57 +0000
>   Message-ID: <874kv15o4q.fsf@linaro.org>
> 
> To support the concept of a VirtIO backend having limited visibility of
> a guests memory space there needs to be some mechanism to limit the
> where that guest may place things. A simple VirtIO device can be
> expressed purely in virt resources, for example:
> 
>    * status, feature and config fields
>    * notification/doorbell
>    * one or more virtqueues
> 
> Using a PCI backend the location of everything but the virtqueues it
> controlled by the mapping of the PCI device so something that is
> controllable by the host/hypervisor. However the guest is free to
> allocate the virtqueues anywhere in the virtual address space of system
> RAM.
> 
> In theory this shouldn't matter because sharing virtual pages is just a
> matter of putting the appropriate translations in place. However there
> are multiple ways the host and guest may interact:
> 
> * QEMU TCG
> 
> QEMU sees a block of system memory in it's virtual address space that
> has a one to one mapping with the guests physical address space. If QEMU
> want to share a subset of that address space it can only realistically
> do it for a contiguous region of it's address space which implies the
> guest must use a contiguous region of it's physical address space.
> 
> * QEMU KVM
> 
> The situation here is broadly the same - although both QEMU and the
> guest are seeing a their own virtual views of a linear address space
> which may well actually be a fragmented set of physical pages on the
> host.
> 
> KVM based guests have additional constraints if they ever want to access
> real hardware in the host as you need to ensure any address accessed by
> the guest can be eventually translated into an address that can
> physically access the bus which a device in one (for device
> pass-through). The area also has to be DMA coherent so updates from a
> bus are reliably visible to software accessing the same address space.
> 
> * Xen (and other type-1's?)
> 
> Here the situation is a little different because the guest explicitly
> makes it's pages visible to other domains by way of grant tables. The
> guest is still free to use whatever parts of its address space it wishes
> to. Other domains then request access to those pages via the hypervisor.
> 
> In theory the requester is free to map the granted pages anywhere in
> its own address space. However there are differences between the
> architectures on how well this is supported.
> 
> So I think this makes a case for having a mechanism by which the guest
> can restrict it's allocation to a specific area of the guest physical
> address space. The question is then what is the best way to inform the
> guest kernel of the limitation?

Something that's unclear to me is whether you envision each device
having its own dedicated memory it can access, or more broadly having a
couple of groups of devices - kind of like how there are 32-bit and
64-bit DMA capable PCI devices, or how we have devices with and without
VIRTIO_F_ACCESS_PLATFORM?


> Option 1 - Kernel Command Line
> ==============================
> 
> This isn't without precedent - the kernel supports options like "memmap"
> which can with the appropriate amount of crafting be used to carve out
> sections of bad ram from the physical address space. Other formulations
> can be used to mark specific areas of the address space as particular
> types of memory.  
> 
> However there are cons to this approach as it then becomes a job for
> whatever builds the VMM command lines to ensure the both the backend and
> the kernel know where things are. It is also very Linux centric and
> doesn't solve the problem for other guest OSes. Considering the rest of
> VirtIO can be made discover-able this seems like it would be a backward
> step.
> 
> Option 2 - Additional Platform Data
> ===================================
> 
> This would be extending using something like device tree or ACPI tables
> which could define regions of memory that would inform the low level
> memory allocation routines where they could allocate from. There is
> already of the concept of "dma-ranges" in device tree which can be a
> per-device property which defines the region of space that is DMA
> coherent for a device.
> 
> There is the question of how you tie regions declared here with the
> eventual instantiating of the VirtIO devices?
> 
> For a fully distributed set of backends (one backend per device per
> worker VM) you would need several different regions. Would each region
> be tied to each device or just a set of areas the guest would allocate
> from in sequence?
> 
> Option 3 - Abusing PCI Regions
> ==============================
> 
> One of the reasons to use the VirtIO PCI backend it to help with
> automatic probing and setup. Could we define a new PCI region which on
> backend just maps to RAM but from the front-ends point of view is a
> region it can allocate it's virtqueues? Could we go one step further and
> just let the host to define and allocate the virtqueue in the reserved
> PCI space and pass the base of it somehow?
> 
> Options 4 - Extend VirtIO Config
> ================================
> 
> Another approach would be to extend the VirtIO configuration and
> start-up handshake to supply these limitations to the guest. This could
> be handled by the addition of a feature bit (VIRTIO_F_HOST_QUEUE?) and
> additional configuration information.
> 
> One problem I can foresee is device initialisation is usually done
> fairly late in the start-up of a kernel by which time any memory zoning
> restrictions will likely need to have informed the kernels low level
> memory management. Does that mean we would have to combine such a
> feature behaviour with a another method anyway?
> 
> Option 5 - Additional Device
> ============================
> 
> The final approach would be to tie the allocation of virtqueues to
> memory regions as defined by additional devices. For example the
> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
> a fixed non-mappable region of the address space. Other proposals like
> virtio-mem allow for hot plugging of "physical" memory into the guest
> (conveniently treatable as separate shareable memory objects for QEMU
> ;-).

Another approach would be supplying this information through virtio-iommu.
That already has topology information, and can be used together with
VIRTIO_F_ACCESS_PLATFORM to limit device access to memory.
As virtio-iommu is fairly new I kind of like this approach myself -
not a lot of legacy to contend with.

> 
> Closing Thoughts and Open Questions
> ===================================
> 
> Currently all of this is considering just virtqueues themselves but of
> course only a subset of devices interact purely by virtqueue messages.
> Network and Block devices often end up filling up additional structures
> in memory that are usually across the whole of system memory. To achieve
> better isolation you either need to ensure that specific bits of kernel
> allocation are done in certain regions (i.e. block cache in "shared"
> region) or implement some sort of bounce buffer [1] that allows you to bring
> data from backend to frontend (which is more like the channel concept of
> Xen's PV).
> 
> I suspect the solution will end up being a combination of all of these
> approaches. There setup of different systems might mean we need a
> plethora of ways to carve out and define regions in ways a kernel can
> understand and make decisions about.
> 
> I think there will always have to be an element of VirtIO config
> involved as that is *the* mechanism by which front/back end negotiate if
> they can get up and running in a way they are both happy with.
> 
> One potential approach would be to introduce the concept of a region id
> at the VirtIO config level which is simply a reasonably unique magic
> number that virtio driver passes down into the kernel when requesting
> memory for it's virtqueues. It could then be left to the kernel to
> associate use that id when identifying the physical address range to
> allocate from. This seems a bit of a loose binding between the driver
> level and the kernel level but perhaps that is preferable to allow for
> flexibility about how such regions are discovered by kernels?
> 
> I hope this message hasn't rambled on to much. I feel this is a complex
> topic and I'm want to be sure I've thought through all the potential
> options before starting to prototype a solution. For those that have
> made it this far the final questions are:
> 
>   - is constraining guest allocation of virtqueues a reasonable requirement?
> 
>   - could virtqueues ever be directly host/hypervisor assigned?
> 
>   - should there be a tight or loose coupling between front-end driver
>     and kernel/hypervisor support for allocating memory?
> 
> Of course if this is all solvable with existing code I'd be more than
> happy but please let me know how ;-)
> 
> Regards,
> 
> 
> -- 
> Alex Bennée
> 
> [1] Example bounce buffer approach
> 
> Subject: [PATCH 0/5] virtio on Type-1 hypervisor
> Message-Id: <1588073958-1793-1-git-send-email-vatsa@codeaurora.org>



* [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-17 17:31 [virtio-dev] Constraining where a guest may allocate virtio accessible resources Alex Bennée
  2020-06-17 18:01 ` [virtio-dev] " Jan Kiszka
  2020-06-18  7:30 ` Michael S. Tsirkin
@ 2020-06-18 13:25 ` Stefan Hajnoczi
  2020-06-19 17:35   ` Alex Bennée
  2020-06-19  8:02 ` Jean-Philippe Brucker
  3 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2020-06-18 13:25 UTC (permalink / raw)
  To: Alex Bennée
  Cc: virtio-dev, David Hildenbrand, jan.kiszka, Srivatsa Vaddagiri,
	Azzedine Touzni, François Ozog, Ilias Apalodimas, Soni,
	Trilok, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Jean-Philippe Brucker

On Wed, Jun 17, 2020 at 06:31:15PM +0100, Alex Bennée wrote:
> This follows on from the discussion in the last thread I raised:
> 
>   Subject: Backend libraries for VirtIO device emulation
>   Date: Fri, 06 Mar 2020 18:33:57 +0000
>   Message-ID: <874kv15o4q.fsf@linaro.org>
> 
> To support the concept of a VirtIO backend having limited visibility of

It's unclear what we're discussing. Does "VirtIO backend" mean
vhost-user devices?

Can you describe what you are trying to do?

> a guests memory space there needs to be some mechanism to limit the
> where that guest may place things.

Or an enforcing IOMMU? In other words, an IOMMU that only gives access
to memory that has been put forth for DMA.

This was discussed recently in the context of the ongoing
vfio-over-socket work ("RFC: use VFIO over a UNIX domain socket to
implement device offloading" on qemu-devel). The idea is to use the VFIO
protocol but over UNIX domain sockets to another host userspace process
instead of over ioctls to the kernel VFIO drivers. This would allow
arbitary devices to be emulated in a separate process from QEMU. As a
first step I suggested DMA_READ/DMA_WRITE protocol messages, even though
this will have poor performance.

I think finding a solution for an enforcing IOMMU is preferable to
guest cooperation. The problem with guest cooperation is that you may be
able to get new VIRTIO guest drivers to restrict where the virtqueues
are placed, but what about applications (e.g. O_DIRECT disk I/O, network
packets) with memory buffers at arbitrary addresses?

Modifying guest applications to honor buffer memory restrictions is too
disruptive for most use cases.
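
To spell out the O_DIRECT case: the pages that end up in the virtqueue
descriptors are whatever the application happened to allocate, e.g.:

    /* Userspace: an O_DIRECT read DMAs straight into a buffer the
     * application allocated wherever it liked - no driver gets a
     * chance to constrain its placement. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int read_direct(const char *path, size_t len)
    {
            void *buf;
            ssize_t n;
            int fd = open(path, O_RDONLY | O_DIRECT);

            if (fd < 0)
                    return -1;
            /* O_DIRECT wants alignment and a length that is a multiple
             * of the logical block size. */
            if (posix_memalign(&buf, 4096, len)) {
                    close(fd);
                    return -1;
            }
            n = read(fd, buf, len);  /* the backend must reach 'buf' */
            free(buf);
            close(fd);
            return n < 0 ? -1 : 0;
    }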

> A simple VirtIO device can be
> expressed purely in virt resources, for example:
> 
>    * status, feature and config fields
>    * notification/doorbell
>    * one or more virtqueues
> 
> Using a PCI backend the location of everything but the virtqueues it
> controlled by the mapping of the PCI device so something that is
> controllable by the host/hypervisor. However the guest is free to
> allocate the virtqueues anywhere in the virtual address space of system
> RAM.
> 
> In theory this shouldn't matter because sharing virtual pages is just a
> matter of putting the appropriate translations in place. However there
> are multiple ways the host and guest may interact:
> 
> * QEMU TCG
> 
> QEMU sees a block of system memory in it's virtual address space that
> has a one to one mapping with the guests physical address space. If QEMU
> want to share a subset of that address space it can only realistically
> do it for a contiguous region of it's address space which implies the
> guest must use a contiguous region of it's physical address space.

This paragraph doesn't reflect my understanding. There can be multiple
RAMBlocks. There isn't necessarily just 1 contiguous piece of RAM.

> 
> * QEMU KVM
> 
> The situation here is broadly the same - although both QEMU and the
> guest are seeing a their own virtual views of a linear address space
> which may well actually be a fragmented set of physical pages on the
> host.

I don't understand the "although" part. Isn't the situation the same as
with TCG, where guest physical memory ranges can cross RAMBlock
boundaries?

> 
> KVM based guests have additional constraints if they ever want to access
> real hardware in the host as you need to ensure any address accessed by
> the guest can be eventually translated into an address that can
> physically access the bus which a device in one (for device
> pass-through). The area also has to be DMA coherent so updates from a
> bus are reliably visible to software accessing the same address space.

I'm surprised about the DMA coherency sentence. Don't VFIO and other
userspace I/O APIs provide the DMA APIs allowing applications to deal
with caches/coherency?

> 
> * Xen (and other type-1's?)
> 
> Here the situation is a little different because the guest explicitly
> makes it's pages visible to other domains by way of grant tables. The
> guest is still free to use whatever parts of its address space it wishes
> to. Other domains then request access to those pages via the hypervisor.
> 
> In theory the requester is free to map the granted pages anywhere in
> its own address space. However there are differences between the
> architectures on how well this is supported.
> 
> So I think this makes a case for having a mechanism by which the guest
> can restrict it's allocation to a specific area of the guest physical
> address space. The question is then what is the best way to inform the
> guest kernel of the limitation?

As mentioned above, I don't think it's possible to do this without
modifying applications - which is not possible in many use cases.
Instead we could improve IOMMU support so that this works transparently.

Stefan


* Re: [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-17 18:01 ` [virtio-dev] " Jan Kiszka
@ 2020-06-18 13:29   ` Stefan Hajnoczi
  2020-06-18 13:59     ` Jan Kiszka
  2020-06-18 13:53   ` Laszlo Ersek
  2020-06-19 15:16   ` Alex Bennée
  2 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2020-06-18 13:29 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Alex Bennée, virtio-dev, David Hildenbrand,
	Srivatsa Vaddagiri, Azzedine Touzni, François Ozog,
	Ilias Apalodimas, Soni, Trilok, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Jean-Philippe Brucker

On Wed, Jun 17, 2020 at 08:01:14PM +0200, Jan Kiszka wrote:
> On 17.06.20 19:31, Alex Bennée wrote:
> > 
> > Hi,
> > 
> > This follows on from the discussion in the last thread I raised:
> > 
> >   Subject: Backend libraries for VirtIO device emulation
> >   Date: Fri, 06 Mar 2020 18:33:57 +0000
> >   Message-ID: <874kv15o4q.fsf@linaro.org>
> > 
> > To support the concept of a VirtIO backend having limited visibility of
> > a guests memory space there needs to be some mechanism to limit the
> > where that guest may place things. A simple VirtIO device can be
> > expressed purely in virt resources, for example:
> > 
> >    * status, feature and config fields
> >    * notification/doorbell
> >    * one or more virtqueues
> > 
> > Using a PCI backend the location of everything but the virtqueues it
> > controlled by the mapping of the PCI device so something that is
> > controllable by the host/hypervisor. However the guest is free to
> > allocate the virtqueues anywhere in the virtual address space of system
> > RAM.
> > 
> > In theory this shouldn't matter because sharing virtual pages is just a
> > matter of putting the appropriate translations in place. However there
> > are multiple ways the host and guest may interact:
> > 
> > * QEMU TCG
> > 
> > QEMU sees a block of system memory in it's virtual address space that
> > has a one to one mapping with the guests physical address space. If QEMU
> > want to share a subset of that address space it can only realistically
> > do it for a contiguous region of it's address space which implies the
> > guest must use a contiguous region of it's physical address space.
> > 
> > * QEMU KVM
> > 
> > The situation here is broadly the same - although both QEMU and the
> > guest are seeing a their own virtual views of a linear address space
> > which may well actually be a fragmented set of physical pages on the
> > host.
> > 
> > KVM based guests have additional constraints if they ever want to access
> > real hardware in the host as you need to ensure any address accessed by
> > the guest can be eventually translated into an address that can
> > physically access the bus which a device in one (for device
> > pass-through). The area also has to be DMA coherent so updates from a
> > bus are reliably visible to software accessing the same address space.
> > 
> > * Xen (and other type-1's?)
> > 
> > Here the situation is a little different because the guest explicitly
> > makes it's pages visible to other domains by way of grant tables. The
> > guest is still free to use whatever parts of its address space it wishes
> > to. Other domains then request access to those pages via the hypervisor.
> > 
> > In theory the requester is free to map the granted pages anywhere in
> > its own address space. However there are differences between the
> > architectures on how well this is supported.
> > 
> > So I think this makes a case for having a mechanism by which the guest
> > can restrict it's allocation to a specific area of the guest physical
> > address space. The question is then what is the best way to inform the
> > guest kernel of the limitation?
> > 
> > Option 1 - Kernel Command Line
> > ==============================
> > 
> > This isn't without precedent - the kernel supports options like "memmap"
> > which can with the appropriate amount of crafting be used to carve out
> > sections of bad ram from the physical address space. Other formulations
> > can be used to mark specific areas of the address space as particular
> > types of memory.  
> > 
> > However there are cons to this approach as it then becomes a job for
> > whatever builds the VMM command lines to ensure the both the backend and
> > the kernel know where things are. It is also very Linux centric and
> > doesn't solve the problem for other guest OSes. Considering the rest of
> > VirtIO can be made discover-able this seems like it would be a backward
> > step.
> > 
> > Option 2 - Additional Platform Data
> > ===================================
> > 
> > This would be extending using something like device tree or ACPI tables
> > which could define regions of memory that would inform the low level
> > memory allocation routines where they could allocate from. There is
> > already of the concept of "dma-ranges" in device tree which can be a
> > per-device property which defines the region of space that is DMA
> > coherent for a device.
> > 
> > There is the question of how you tie regions declared here with the
> > eventual instantiating of the VirtIO devices?
> > 
> > For a fully distributed set of backends (one backend per device per
> > worker VM) you would need several different regions. Would each region
> > be tied to each device or just a set of areas the guest would allocate
> > from in sequence?
> > 
> > Option 3 - Abusing PCI Regions
> > ==============================
> > 
> > One of the reasons to use the VirtIO PCI backend it to help with
> > automatic probing and setup. Could we define a new PCI region which on
> > backend just maps to RAM but from the front-ends point of view is a
> > region it can allocate it's virtqueues? Could we go one step further and
> > just let the host to define and allocate the virtqueue in the reserved
> > PCI space and pass the base of it somehow?
> > 
> > Options 4 - Extend VirtIO Config
> > ================================
> > 
> > Another approach would be to extend the VirtIO configuration and
> > start-up handshake to supply these limitations to the guest. This could
> > be handled by the addition of a feature bit (VIRTIO_F_HOST_QUEUE?) and
> > additional configuration information.
> > 
> > One problem I can foresee is device initialisation is usually done
> > fairly late in the start-up of a kernel by which time any memory zoning
> > restrictions will likely need to have informed the kernels low level
> > memory management. Does that mean we would have to combine such a
> > feature behaviour with a another method anyway?
> > 
> > Option 5 - Additional Device
> > ============================
> > 
> > The final approach would be to tie the allocation of virtqueues to
> > memory regions as defined by additional devices. For example the
> > proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
> > a fixed non-mappable region of the address space. Other proposals like
> > virtio-mem allow for hot plugging of "physical" memory into the guest
> > (conveniently treatable as separate shareable memory objects for QEMU
> > ;-).
> > 
> 
> I think you forgot one approach: virtual IOMMU. That is the advanced
> form of the grant table approach. The backend still "sees" the full
> address space of the frontend, but it will not be able to access all of
> it and there might even be a translation going on. Well, like IOMMUs work.
> 
> However, this implies dynamics that are under guest control, namely of
> the frontend guest. And such dynamics can be counterproductive for
> certain scenarios. That's where this static windows of shared memory
> came up.

Yes, I think IOMMU interfaces are worth investigating more too. IOMMUs
are now widely implemented in Linux and virtualization software. That
means guest modifications aren't necessary and unmodified guest
applications will run.

Applications that need the best performance can use a static mapping
while applications that want the strongest isolation can map/unmap DMA
buffers dynamically.

Stefan


* Re: [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-17 18:01 ` [virtio-dev] " Jan Kiszka
  2020-06-18 13:29   ` Stefan Hajnoczi
@ 2020-06-18 13:53   ` Laszlo Ersek
  2020-06-19 15:16   ` Alex Bennée
  2 siblings, 0 replies; 18+ messages in thread
From: Laszlo Ersek @ 2020-06-18 13:53 UTC (permalink / raw)
  To: Jan Kiszka, Alex Bennée, virtio-dev
  Cc: David Hildenbrand, Srivatsa Vaddagiri, Azzedine Touzni,
	François Ozog, Ilias Apalodimas, Soni, Trilok,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Michael S. Tsirkin,
	Jean-Philippe Brucker

On 06/17/20 20:01, Jan Kiszka wrote:
> On 17.06.20 19:31, Alex Bennée wrote:
>>
>> Hi,
>>
>> This follows on from the discussion in the last thread I raised:
>>
>>   Subject: Backend libraries for VirtIO device emulation
>>   Date: Fri, 06 Mar 2020 18:33:57 +0000
>>   Message-ID: <874kv15o4q.fsf@linaro.org>
>>
>> To support the concept of a VirtIO backend having limited visibility of
>> a guests memory space there needs to be some mechanism to limit the
>> where that guest may place things. A simple VirtIO device can be
>> expressed purely in virt resources, for example:
>>
>>    * status, feature and config fields
>>    * notification/doorbell
>>    * one or more virtqueues
>>
>> Using a PCI backend the location of everything but the virtqueues it
>> controlled by the mapping of the PCI device so something that is
>> controllable by the host/hypervisor. However the guest is free to
>> allocate the virtqueues anywhere in the virtual address space of system
>> RAM.
>>
>> In theory this shouldn't matter because sharing virtual pages is just a
>> matter of putting the appropriate translations in place. However there
>> are multiple ways the host and guest may interact:
>>
>> * QEMU TCG
>>
>> QEMU sees a block of system memory in it's virtual address space that
>> has a one to one mapping with the guests physical address space. If QEMU
>> want to share a subset of that address space it can only realistically
>> do it for a contiguous region of it's address space which implies the
>> guest must use a contiguous region of it's physical address space.
>>
>> * QEMU KVM
>>
>> The situation here is broadly the same - although both QEMU and the
>> guest are seeing a their own virtual views of a linear address space
>> which may well actually be a fragmented set of physical pages on the
>> host.
>>
>> KVM based guests have additional constraints if they ever want to access
>> real hardware in the host as you need to ensure any address accessed by
>> the guest can be eventually translated into an address that can
>> physically access the bus which a device in one (for device
>> pass-through). The area also has to be DMA coherent so updates from a
>> bus are reliably visible to software accessing the same address space.
>>
>> * Xen (and other type-1's?)
>>
>> Here the situation is a little different because the guest explicitly
>> makes it's pages visible to other domains by way of grant tables. The
>> guest is still free to use whatever parts of its address space it wishes
>> to. Other domains then request access to those pages via the hypervisor.
>>
>> In theory the requester is free to map the granted pages anywhere in
>> its own address space. However there are differences between the
>> architectures on how well this is supported.
>>
>> So I think this makes a case for having a mechanism by which the guest
>> can restrict it's allocation to a specific area of the guest physical
>> address space. The question is then what is the best way to inform the
>> guest kernel of the limitation?
>>
>> Option 1 - Kernel Command Line
>> ==============================
>>
>> This isn't without precedent - the kernel supports options like "memmap"
>> which can with the appropriate amount of crafting be used to carve out
>> sections of bad ram from the physical address space. Other formulations
>> can be used to mark specific areas of the address space as particular
>> types of memory.  
>>
>> However there are cons to this approach as it then becomes a job for
>> whatever builds the VMM command lines to ensure the both the backend and
>> the kernel know where things are. It is also very Linux centric and
>> doesn't solve the problem for other guest OSes. Considering the rest of
>> VirtIO can be made discover-able this seems like it would be a backward
>> step.
>>
>> Option 2 - Additional Platform Data
>> ===================================
>>
>> This would be extending using something like device tree or ACPI tables
>> which could define regions of memory that would inform the low level
>> memory allocation routines where they could allocate from. There is
>> already of the concept of "dma-ranges" in device tree which can be a
>> per-device property which defines the region of space that is DMA
>> coherent for a device.
>>
>> There is the question of how you tie regions declared here with the
>> eventual instantiating of the VirtIO devices?
>>
>> For a fully distributed set of backends (one backend per device per
>> worker VM) you would need several different regions. Would each region
>> be tied to each device or just a set of areas the guest would allocate
>> from in sequence?
>>
>> Option 3 - Abusing PCI Regions
>> ==============================
>>
>> One of the reasons to use the VirtIO PCI backend it to help with
>> automatic probing and setup. Could we define a new PCI region which on
>> backend just maps to RAM but from the front-ends point of view is a
>> region it can allocate it's virtqueues? Could we go one step further and
>> just let the host to define and allocate the virtqueue in the reserved
>> PCI space and pass the base of it somehow?
>>
>> Options 4 - Extend VirtIO Config
>> ================================
>>
>> Another approach would be to extend the VirtIO configuration and
>> start-up handshake to supply these limitations to the guest. This could
>> be handled by the addition of a feature bit (VIRTIO_F_HOST_QUEUE?) and
>> additional configuration information.
>>
>> One problem I can foresee is device initialisation is usually done
>> fairly late in the start-up of a kernel by which time any memory zoning
>> restrictions will likely need to have informed the kernels low level
>> memory management. Does that mean we would have to combine such a
>> feature behaviour with a another method anyway?
>>
>> Option 5 - Additional Device
>> ============================
>>
>> The final approach would be to tie the allocation of virtqueues to
>> memory regions as defined by additional devices. For example the
>> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
>> a fixed non-mappable region of the address space. Other proposals like
>> virtio-mem allow for hot plugging of "physical" memory into the guest
>> (conveniently treatable as separate shareable memory objects for QEMU
>> ;-).
>>
> 
> I think you forgot one approach: virtual IOMMU. That is the advanced
> form of the grant table approach. The backend still "sees" the full
> address space of the frontend, but it will not be able to access all of
> it and there might even be a translation going on. Well, like IOMMUs work.

I don't claim to fully understand the scope, but I agree with the
virtual IOMMU reference. Front-end drivers that negotiate
VIRTIO_F_IOMMU_PLATFORM already express they use the "DMA API",
regardless of:

- what the DMA API in the particular guest OS or guest firmware looks like,

- how the DMA API is implemented by a separate driver in the guest OS or
guest firmware.

So I'd consider this a question of designing new (virtual) IOMMU
devices, and writing drivers for them in (front-end) guests, upon which
existent virtio drivers that negotiate VIRTIO_F_IOMMU_PLATFORM could
transparently stack themselves.

AMD SEV is one example where this works. In particular, Jan's words

    The backend still "sees" the full address space of the frontend, but
    it will not be able to access all of it

apply quite well to SEV -- while the host "sees" the full RAM content of
the guest, the host cannot *usefully* access any guest RAM area that's
not deliberately decrypted by the guest first.

> However, this implies dynamics that are under guest control, namely of
> the frontend guest. And such dynamics can be counterproductive for
> certain scenarios. That's where this static windows of shared memory
> came up.

I (vaguely) feel this still fits the vIOMMU approach -- configure the
(front-end) guest IOMMU driver (by whatever means) to allocate the
device-visible buffers in pre-determined / fixed areas.

I guess what I'm saying is that the VIRTIO_F_IOMMU_PLATFORM feature bit
should already cover this use case, as far as virtio (the spec) and
existent virtio front-end drivers are concerned. The specific IOMMU
design & configuration seems orthogonal to virtio (both spec and
existent drivers).

An existent driver that negotiates VIRTIO_F_IOMMU_PLATFORM should not
have to undergo any changes to conform to this use case. Instead, the
affected (front-end) guest OS or guest firmware should get a new IOMMU
driver, and a method to select and configure that driver.

(Sorry if I misunderstood -- I tend to follow this list from a distance,
and I'm only commenting because Alex's and Jan's emails reminded me of
how SEV support is implemented in OVMF's (and I believe Linux's) virtio
drivers. When you have a hammer...)

Thanks
Laszlo

> 
>>
>> Closing Thoughts and Open Questions
>> ===================================
>>
>> Currently all of this is considering just virtqueues themselves but of
>> course only a subset of devices interact purely by virtqueue messages.
>> Network and Block devices often end up filling up additional structures
>> in memory that are usually across the whole of system memory. To achieve
>> better isolation you either need to ensure that specific bits of kernel
>> allocation are done in certain regions (i.e. block cache in "shared"
>> region) or implement some sort of bounce buffer [1] that allows you to bring
>> data from backend to frontend (which is more like the channel concept of
>> Xen's PV).
> 
> For [1], look at https://lkml.org/lkml/2020/3/26/700 or at
> http://git.kiszka.org/?p=linux.git;a=blob;f=drivers/virtio/virtio_ivshmem.c;hb=refs/heads/queues/jailhouse
> (which should be using swiotlb one day).
> 
>>
>> I suspect the solution will end up being a combination of all of these
>> approaches. There setup of different systems might mean we need a
>> plethora of ways to carve out and define regions in ways a kernel can
>> understand and make decisions about.
>>
>> I think there will always have to be an element of VirtIO config
>> involved as that is *the* mechanism by which front/back end negotiate if
>> they can get up and running in a way they are both happy with.
>>
>> One potential approach would be to introduce the concept of a region id
>> at the VirtIO config level which is simply a reasonably unique magic
>> number that virtio driver passes down into the kernel when requesting
>> memory for it's virtqueues. It could then be left to the kernel to
>> associate use that id when identifying the physical address range to
>> allocate from. This seems a bit of a loose binding between the driver
>> level and the kernel level but perhaps that is preferable to allow for
>> flexibility about how such regions are discovered by kernels?
>>
>> I hope this message hasn't rambled on to much. I feel this is a complex
>> topic and I'm want to be sure I've thought through all the potential
>> options before starting to prototype a solution. For those that have
>> made it this far the final questions are:
>>
>>   - is constraining guest allocation of virtqueues a reasonable requirement?
>>
>>   - could virtqueues ever be directly host/hypervisor assigned?
>>
>>   - should there be a tight or loose coupling between front-end driver
>>     and kernel/hypervisor support for allocating memory?
>>
>> Of course if this is all solvable with existing code I'd be more than
>> happy but please let me know how ;-)
>>
> 
> Queues are a central element of virtio, but there is a (maintainability
> & security) benefit if you can keep them away from the hosting
> hypervisor, limit their interpretation and negotiation to the backend
> driver in a host process or in a backend guest VM. So I would be careful
> with coupling things too tightly.
> 
> One of the issues I see in virtio for use in minimalistic hypervisors is
> the need to be aware of the different virtio devices when using PCI or
> MMIO transports. That's where a shared memory transport come into play.
> 
> Jan
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-18 13:29   ` Stefan Hajnoczi
@ 2020-06-18 13:59     ` Jan Kiszka
  2020-06-18 14:52       ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kiszka @ 2020-06-18 13:59 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Alex Bennée, virtio-dev, David Hildenbrand,
	Srivatsa Vaddagiri, Azzedine Touzni, François Ozog,
	Ilias Apalodimas, Soni, Trilok, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Jean-Philippe Brucker

On 18.06.20 15:29, Stefan Hajnoczi wrote:
> On Wed, Jun 17, 2020 at 08:01:14PM +0200, Jan Kiszka wrote:
>> On 17.06.20 19:31, Alex Bennée wrote:
>>>
>>> Hi,
>>>
>>> This follows on from the discussion in the last thread I raised:
>>>
>>>   Subject: Backend libraries for VirtIO device emulation
>>>   Date: Fri, 06 Mar 2020 18:33:57 +0000
>>>   Message-ID: <874kv15o4q.fsf@linaro.org>
>>>
>>> To support the concept of a VirtIO backend having limited visibility of
>>> a guests memory space there needs to be some mechanism to limit the
>>> where that guest may place things. A simple VirtIO device can be
>>> expressed purely in virt resources, for example:
>>>
>>>    * status, feature and config fields
>>>    * notification/doorbell
>>>    * one or more virtqueues
>>>
>>> Using a PCI backend the location of everything but the virtqueues it
>>> controlled by the mapping of the PCI device so something that is
>>> controllable by the host/hypervisor. However the guest is free to
>>> allocate the virtqueues anywhere in the virtual address space of system
>>> RAM.
>>>
>>> In theory this shouldn't matter because sharing virtual pages is just a
>>> matter of putting the appropriate translations in place. However there
>>> are multiple ways the host and guest may interact:
>>>
>>> * QEMU TCG
>>>
>>> QEMU sees a block of system memory in it's virtual address space that
>>> has a one to one mapping with the guests physical address space. If QEMU
>>> want to share a subset of that address space it can only realistically
>>> do it for a contiguous region of it's address space which implies the
>>> guest must use a contiguous region of it's physical address space.
>>>
>>> * QEMU KVM
>>>
>>> The situation here is broadly the same - although both QEMU and the
>>> guest are seeing a their own virtual views of a linear address space
>>> which may well actually be a fragmented set of physical pages on the
>>> host.
>>>
>>> KVM based guests have additional constraints if they ever want to access
>>> real hardware in the host as you need to ensure any address accessed by
>>> the guest can be eventually translated into an address that can
>>> physically access the bus which a device in one (for device
>>> pass-through). The area also has to be DMA coherent so updates from a
>>> bus are reliably visible to software accessing the same address space.
>>>
>>> * Xen (and other type-1's?)
>>>
>>> Here the situation is a little different because the guest explicitly
>>> makes it's pages visible to other domains by way of grant tables. The
>>> guest is still free to use whatever parts of its address space it wishes
>>> to. Other domains then request access to those pages via the hypervisor.
>>>
>>> In theory the requester is free to map the granted pages anywhere in
>>> its own address space. However there are differences between the
>>> architectures on how well this is supported.
>>>
>>> So I think this makes a case for having a mechanism by which the guest
>>> can restrict it's allocation to a specific area of the guest physical
>>> address space. The question is then what is the best way to inform the
>>> guest kernel of the limitation?
>>>
>>> Option 1 - Kernel Command Line
>>> ==============================
>>>
>>> This isn't without precedent - the kernel supports options like "memmap"
>>> which can with the appropriate amount of crafting be used to carve out
>>> sections of bad ram from the physical address space. Other formulations
>>> can be used to mark specific areas of the address space as particular
>>> types of memory.  
>>>
>>> However there are cons to this approach as it then becomes a job for
>>> whatever builds the VMM command lines to ensure the both the backend and
>>> the kernel know where things are. It is also very Linux centric and
>>> doesn't solve the problem for other guest OSes. Considering the rest of
>>> VirtIO can be made discover-able this seems like it would be a backward
>>> step.
>>>
>>> Option 2 - Additional Platform Data
>>> ===================================
>>>
>>> This would be extending using something like device tree or ACPI tables
>>> which could define regions of memory that would inform the low level
>>> memory allocation routines where they could allocate from. There is
>>> already of the concept of "dma-ranges" in device tree which can be a
>>> per-device property which defines the region of space that is DMA
>>> coherent for a device.
>>>
>>> There is the question of how you tie regions declared here with the
>>> eventual instantiating of the VirtIO devices?
>>>
>>> For a fully distributed set of backends (one backend per device per
>>> worker VM) you would need several different regions. Would each region
>>> be tied to each device or just a set of areas the guest would allocate
>>> from in sequence?
>>>
>>> Option 3 - Abusing PCI Regions
>>> ==============================
>>>
>>> One of the reasons to use the VirtIO PCI backend it to help with
>>> automatic probing and setup. Could we define a new PCI region which on
>>> backend just maps to RAM but from the front-ends point of view is a
>>> region it can allocate it's virtqueues? Could we go one step further and
>>> just let the host to define and allocate the virtqueue in the reserved
>>> PCI space and pass the base of it somehow?
>>>
>>> Options 4 - Extend VirtIO Config
>>> ================================
>>>
>>> Another approach would be to extend the VirtIO configuration and
>>> start-up handshake to supply these limitations to the guest. This could
>>> be handled by the addition of a feature bit (VIRTIO_F_HOST_QUEUE?) and
>>> additional configuration information.
>>>
>>> One problem I can foresee is device initialisation is usually done
>>> fairly late in the start-up of a kernel by which time any memory zoning
>>> restrictions will likely need to have informed the kernels low level
>>> memory management. Does that mean we would have to combine such a
>>> feature behaviour with a another method anyway?
>>>
>>> Option 5 - Additional Device
>>> ============================
>>>
>>> The final approach would be to tie the allocation of virtqueues to
>>> memory regions as defined by additional devices. For example the
>>> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
>>> a fixed non-mappable region of the address space. Other proposals like
>>> virtio-mem allow for hot plugging of "physical" memory into the guest
>>> (conveniently treatable as separate shareable memory objects for QEMU
>>> ;-).
>>>
>>
>> I think you forgot one approach: virtual IOMMU. That is the advanced
>> form of the grant table approach. The backend still "sees" the full
>> address space of the frontend, but it will not be able to access all of
>> it and there might even be a translation going on. Well, like IOMMUs work.
>>
>> However, this implies dynamics that are under guest control, namely of
>> the frontend guest. And such dynamics can be counterproductive for
>> certain scenarios. That's where this static windows of shared memory
>> came up.
> 
> Yes, I think IOMMU interfaces are worth investigating more too. IOMMUs
> are now widely implemented in Linux and virtualization software. That
> means guest modifications aren't necessary and unmodified guest
> applications will run.
> 
> Applications that need the best performance can use a static mapping
> while applications that want the strongest isolation can map/unmap DMA
> buffers dynamically.

I do not yet see how you can model a static, non-guest-controlled window
with an IOMMU.

And an IOMMU implies guest modifications as well (you need its driver). It
just happens to be there already in newer guests. A virtio shared memory
transport could be introduced similarly.

But the biggest challenge is that only a static mode allows for a
trivial hypervisor-side model. Otherwise, we would merely be trying to
achieve a simpler, secure model by adding complexity elsewhere.

I'm not arguing against a vIOMMU per se. It exists, and it is and will be
widely used. It just doesn't solve all the issues.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-18 13:59     ` Jan Kiszka
@ 2020-06-18 14:52       ` Michael S. Tsirkin
  2020-06-18 14:58         ` Jan Kiszka
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2020-06-18 14:52 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Stefan Hajnoczi, Alex Bennée, virtio-dev, David Hildenbrand,
	Srivatsa Vaddagiri, Azzedine Touzni, François Ozog,
	Ilias Apalodimas, Soni, Trilok, Dr. David Alan Gilbert,
	Jean-Philippe Brucker

On Thu, Jun 18, 2020 at 03:59:54PM +0200, Jan Kiszka wrote:
> On 18.06.20 15:29, Stefan Hajnoczi wrote:
> > On Wed, Jun 17, 2020 at 08:01:14PM +0200, Jan Kiszka wrote:
> >> On 17.06.20 19:31, Alex Bennée wrote:
> >>>
> >>> Hi,
> >>>
> >>> This follows on from the discussion in the last thread I raised:
> >>>
> >>>   Subject: Backend libraries for VirtIO device emulation
> >>>   Date: Fri, 06 Mar 2020 18:33:57 +0000
> >>>   Message-ID: <874kv15o4q.fsf@linaro.org>
> >>>
> >>> To support the concept of a VirtIO backend having limited visibility of
> >>> a guests memory space there needs to be some mechanism to limit the
> >>> where that guest may place things. A simple VirtIO device can be
> >>> expressed purely in virt resources, for example:
> >>>
> >>>    * status, feature and config fields
> >>>    * notification/doorbell
> >>>    * one or more virtqueues
> >>>
> >>> Using a PCI backend the location of everything but the virtqueues it
> >>> controlled by the mapping of the PCI device so something that is
> >>> controllable by the host/hypervisor. However the guest is free to
> >>> allocate the virtqueues anywhere in the virtual address space of system
> >>> RAM.
> >>>
> >>> In theory this shouldn't matter because sharing virtual pages is just a
> >>> matter of putting the appropriate translations in place. However there
> >>> are multiple ways the host and guest may interact:
> >>>
> >>> * QEMU TCG
> >>>
> >>> QEMU sees a block of system memory in it's virtual address space that
> >>> has a one to one mapping with the guests physical address space. If QEMU
> >>> want to share a subset of that address space it can only realistically
> >>> do it for a contiguous region of it's address space which implies the
> >>> guest must use a contiguous region of it's physical address space.
> >>>
> >>> * QEMU KVM
> >>>
> >>> The situation here is broadly the same - although both QEMU and the
> >>> guest are seeing a their own virtual views of a linear address space
> >>> which may well actually be a fragmented set of physical pages on the
> >>> host.
> >>>
> >>> KVM based guests have additional constraints if they ever want to access
> >>> real hardware in the host as you need to ensure any address accessed by
> >>> the guest can be eventually translated into an address that can
> >>> physically access the bus which a device in one (for device
> >>> pass-through). The area also has to be DMA coherent so updates from a
> >>> bus are reliably visible to software accessing the same address space.
> >>>
> >>> * Xen (and other type-1's?)
> >>>
> >>> Here the situation is a little different because the guest explicitly
> >>> makes it's pages visible to other domains by way of grant tables. The
> >>> guest is still free to use whatever parts of its address space it wishes
> >>> to. Other domains then request access to those pages via the hypervisor.
> >>>
> >>> In theory the requester is free to map the granted pages anywhere in
> >>> its own address space. However there are differences between the
> >>> architectures on how well this is supported.
> >>>
> >>> So I think this makes a case for having a mechanism by which the guest
> >>> can restrict it's allocation to a specific area of the guest physical
> >>> address space. The question is then what is the best way to inform the
> >>> guest kernel of the limitation?
> >>>
> >>> Option 1 - Kernel Command Line
> >>> ==============================
> >>>
> >>> This isn't without precedent - the kernel supports options like "memmap"
> >>> which can with the appropriate amount of crafting be used to carve out
> >>> sections of bad ram from the physical address space. Other formulations
> >>> can be used to mark specific areas of the address space as particular
> >>> types of memory.  
> >>>
> >>> However there are cons to this approach as it then becomes a job for
> >>> whatever builds the VMM command lines to ensure the both the backend and
> >>> the kernel know where things are. It is also very Linux centric and
> >>> doesn't solve the problem for other guest OSes. Considering the rest of
> >>> VirtIO can be made discover-able this seems like it would be a backward
> >>> step.
> >>>
> >>> Option 2 - Additional Platform Data
> >>> ===================================
> >>>
> >>> This would be extending using something like device tree or ACPI tables
> >>> which could define regions of memory that would inform the low level
> >>> memory allocation routines where they could allocate from. There is
> >>> already of the concept of "dma-ranges" in device tree which can be a
> >>> per-device property which defines the region of space that is DMA
> >>> coherent for a device.
> >>>
> >>> There is the question of how you tie regions declared here with the
> >>> eventual instantiating of the VirtIO devices?
> >>>
> >>> For a fully distributed set of backends (one backend per device per
> >>> worker VM) you would need several different regions. Would each region
> >>> be tied to each device or just a set of areas the guest would allocate
> >>> from in sequence?
> >>>
> >>> Option 3 - Abusing PCI Regions
> >>> ==============================
> >>>
> >>> One of the reasons to use the VirtIO PCI backend it to help with
> >>> automatic probing and setup. Could we define a new PCI region which on
> >>> backend just maps to RAM but from the front-ends point of view is a
> >>> region it can allocate it's virtqueues? Could we go one step further and
> >>> just let the host to define and allocate the virtqueue in the reserved
> >>> PCI space and pass the base of it somehow?
> >>>
> >>> Options 4 - Extend VirtIO Config
> >>> ================================
> >>>
> >>> Another approach would be to extend the VirtIO configuration and
> >>> start-up handshake to supply these limitations to the guest. This could
> >>> be handled by the addition of a feature bit (VIRTIO_F_HOST_QUEUE?) and
> >>> additional configuration information.
> >>>
> >>> One problem I can foresee is device initialisation is usually done
> >>> fairly late in the start-up of a kernel by which time any memory zoning
> >>> restrictions will likely need to have informed the kernels low level
> >>> memory management. Does that mean we would have to combine such a
> >>> feature behaviour with a another method anyway?
> >>>
> >>> Option 5 - Additional Device
> >>> ============================
> >>>
> >>> The final approach would be to tie the allocation of virtqueues to
> >>> memory regions as defined by additional devices. For example the
> >>> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
> >>> a fixed non-mappable region of the address space. Other proposals like
> >>> virtio-mem allow for hot plugging of "physical" memory into the guest
> >>> (conveniently treatable as separate shareable memory objects for QEMU
> >>> ;-).
> >>>
> >>
> >> I think you forgot one approach: virtual IOMMU. That is the advanced
> >> form of the grant table approach. The backend still "sees" the full
> >> address space of the frontend, but it will not be able to access all of
> >> it and there might even be a translation going on. Well, like IOMMUs work.
> >>
> >> However, this implies dynamics that are under guest control, namely of
> >> the frontend guest. And such dynamics can be counterproductive for
> >> certain scenarios. That's where this static windows of shared memory
> >> came up.
> > 
> > Yes, I think IOMMU interfaces are worth investigating more too. IOMMUs
> > are now widely implemented in Linux and virtualization software. That
> > means guest modifications aren't necessary and unmodified guest
> > applications will run.
> > 
> > Applications that need the best performance can use a static mapping
> > while applications that want the strongest isolation can map/unmap DMA
> > buffers dynamically.
> 
> I do not see yet that you can model with an IOMMU a static, not guest
> controlled window.

Well, basically the IOMMU will have, as part of the topology
description, a range of addresses that devices behind it are allowed to
access. What's the problem with that?
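
Just as an illustration (the field names are made up here, loosely
modelled on how virtio-iommu exposes its input range in config space --
not a quote from the spec): the fixed window is simply part of what the
guest reads, it never gets to change it.

/* Illustrative only: a config-space style description of a fixed DMA
 * window. The hypervisor sets it, the guest can only read it. */
#include <stdint.h>

struct fixed_dma_window {
        uint64_t start;           /* lowest address devices may access */
        uint64_t end;             /* highest address devices may access */
        uint64_t page_size_mask;  /* supported mapping granularities */
};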


> And IOMMU implies guest modifications as well (you need its driver). It
> just happened to be there now in newer guests. A virtio shared memory
> transport could be introduced similarly.
> 
> But the biggest challenge would be that a static mode would allow for a
> trivial hypervisor side model. Otherwise, we would only try to achieve a
> simpler secure model by adding complexity elsewhere.
> 
> I'm not arguing against vIOMMU per se. It's there, it is and will be
> widely used. It's just not solving all issues.
> 
> Jan
> 
> -- 
> Siemens AG, Corporate Technology, CT RDA IOT SES-DE
> Corporate Competence Center Embedded Linux


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-18 14:52       ` Michael S. Tsirkin
@ 2020-06-18 14:58         ` Jan Kiszka
  2020-06-18 15:05           ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kiszka @ 2020-06-18 14:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Alex Bennée, virtio-dev, David Hildenbrand,
	Srivatsa Vaddagiri, Azzedine Touzni, François Ozog,
	Ilias Apalodimas, Soni, Trilok, Dr. David Alan Gilbert,
	Jean-Philippe Brucker

On 18.06.20 16:52, Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2020 at 03:59:54PM +0200, Jan Kiszka wrote:
>> On 18.06.20 15:29, Stefan Hajnoczi wrote:
>>> On Wed, Jun 17, 2020 at 08:01:14PM +0200, Jan Kiszka wrote:
>>>> On 17.06.20 19:31, Alex Bennée wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> This follows on from the discussion in the last thread I raised:
>>>>>
>>>>>   Subject: Backend libraries for VirtIO device emulation
>>>>>   Date: Fri, 06 Mar 2020 18:33:57 +0000
>>>>>   Message-ID: <874kv15o4q.fsf@linaro.org>
>>>>>
>>>>> To support the concept of a VirtIO backend having limited visibility of
>>>>> a guests memory space there needs to be some mechanism to limit the
>>>>> where that guest may place things. A simple VirtIO device can be
>>>>> expressed purely in virt resources, for example:
>>>>>
>>>>>    * status, feature and config fields
>>>>>    * notification/doorbell
>>>>>    * one or more virtqueues
>>>>>
>>>>> Using a PCI backend the location of everything but the virtqueues it
>>>>> controlled by the mapping of the PCI device so something that is
>>>>> controllable by the host/hypervisor. However the guest is free to
>>>>> allocate the virtqueues anywhere in the virtual address space of system
>>>>> RAM.
>>>>>
>>>>> In theory this shouldn't matter because sharing virtual pages is just a
>>>>> matter of putting the appropriate translations in place. However there
>>>>> are multiple ways the host and guest may interact:
>>>>>
>>>>> * QEMU TCG
>>>>>
>>>>> QEMU sees a block of system memory in it's virtual address space that
>>>>> has a one to one mapping with the guests physical address space. If QEMU
>>>>> want to share a subset of that address space it can only realistically
>>>>> do it for a contiguous region of it's address space which implies the
>>>>> guest must use a contiguous region of it's physical address space.
>>>>>
>>>>> * QEMU KVM
>>>>>
>>>>> The situation here is broadly the same - although both QEMU and the
>>>>> guest are seeing a their own virtual views of a linear address space
>>>>> which may well actually be a fragmented set of physical pages on the
>>>>> host.
>>>>>
>>>>> KVM based guests have additional constraints if they ever want to access
>>>>> real hardware in the host as you need to ensure any address accessed by
>>>>> the guest can be eventually translated into an address that can
>>>>> physically access the bus which a device in one (for device
>>>>> pass-through). The area also has to be DMA coherent so updates from a
>>>>> bus are reliably visible to software accessing the same address space.
>>>>>
>>>>> * Xen (and other type-1's?)
>>>>>
>>>>> Here the situation is a little different because the guest explicitly
>>>>> makes it's pages visible to other domains by way of grant tables. The
>>>>> guest is still free to use whatever parts of its address space it wishes
>>>>> to. Other domains then request access to those pages via the hypervisor.
>>>>>
>>>>> In theory the requester is free to map the granted pages anywhere in
>>>>> its own address space. However there are differences between the
>>>>> architectures on how well this is supported.
>>>>>
>>>>> So I think this makes a case for having a mechanism by which the guest
>>>>> can restrict it's allocation to a specific area of the guest physical
>>>>> address space. The question is then what is the best way to inform the
>>>>> guest kernel of the limitation?
>>>>>
>>>>> Option 1 - Kernel Command Line
>>>>> ==============================
>>>>>
>>>>> This isn't without precedent - the kernel supports options like "memmap"
>>>>> which can with the appropriate amount of crafting be used to carve out
>>>>> sections of bad ram from the physical address space. Other formulations
>>>>> can be used to mark specific areas of the address space as particular
>>>>> types of memory.  
>>>>>
>>>>> However there are cons to this approach as it then becomes a job for
>>>>> whatever builds the VMM command lines to ensure the both the backend and
>>>>> the kernel know where things are. It is also very Linux centric and
>>>>> doesn't solve the problem for other guest OSes. Considering the rest of
>>>>> VirtIO can be made discover-able this seems like it would be a backward
>>>>> step.
>>>>>
>>>>> Option 2 - Additional Platform Data
>>>>> ===================================
>>>>>
>>>>> This would be extending using something like device tree or ACPI tables
>>>>> which could define regions of memory that would inform the low level
>>>>> memory allocation routines where they could allocate from. There is
>>>>> already of the concept of "dma-ranges" in device tree which can be a
>>>>> per-device property which defines the region of space that is DMA
>>>>> coherent for a device.
>>>>>
>>>>> There is the question of how you tie regions declared here with the
>>>>> eventual instantiating of the VirtIO devices?
>>>>>
>>>>> For a fully distributed set of backends (one backend per device per
>>>>> worker VM) you would need several different regions. Would each region
>>>>> be tied to each device or just a set of areas the guest would allocate
>>>>> from in sequence?
>>>>>
>>>>> Option 3 - Abusing PCI Regions
>>>>> ==============================
>>>>>
>>>>> One of the reasons to use the VirtIO PCI backend it to help with
>>>>> automatic probing and setup. Could we define a new PCI region which on
>>>>> backend just maps to RAM but from the front-ends point of view is a
>>>>> region it can allocate it's virtqueues? Could we go one step further and
>>>>> just let the host to define and allocate the virtqueue in the reserved
>>>>> PCI space and pass the base of it somehow?
>>>>>
>>>>> Options 4 - Extend VirtIO Config
>>>>> ================================
>>>>>
>>>>> Another approach would be to extend the VirtIO configuration and
>>>>> start-up handshake to supply these limitations to the guest. This could
>>>>> be handled by the addition of a feature bit (VIRTIO_F_HOST_QUEUE?) and
>>>>> additional configuration information.
>>>>>
>>>>> One problem I can foresee is device initialisation is usually done
>>>>> fairly late in the start-up of a kernel by which time any memory zoning
>>>>> restrictions will likely need to have informed the kernels low level
>>>>> memory management. Does that mean we would have to combine such a
>>>>> feature behaviour with a another method anyway?
>>>>>
>>>>> Option 5 - Additional Device
>>>>> ============================
>>>>>
>>>>> The final approach would be to tie the allocation of virtqueues to
>>>>> memory regions as defined by additional devices. For example the
>>>>> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
>>>>> a fixed non-mappable region of the address space. Other proposals like
>>>>> virtio-mem allow for hot plugging of "physical" memory into the guest
>>>>> (conveniently treatable as separate shareable memory objects for QEMU
>>>>> ;-).
>>>>>
>>>>
>>>> I think you forgot one approach: virtual IOMMU. That is the advanced
>>>> form of the grant table approach. The backend still "sees" the full
>>>> address space of the frontend, but it will not be able to access all of
>>>> it and there might even be a translation going on. Well, like IOMMUs work.
>>>>
>>>> However, this implies dynamics that are under guest control, namely of
>>>> the frontend guest. And such dynamics can be counterproductive for
>>>> certain scenarios. That's where this static windows of shared memory
>>>> came up.
>>>
>>> Yes, I think IOMMU interfaces are worth investigating more too. IOMMUs
>>> are now widely implemented in Linux and virtualization software. That
>>> means guest modifications aren't necessary and unmodified guest
>>> applications will run.
>>>
>>> Applications that need the best performance can use a static mapping
>>> while applications that want the strongest isolation can map/unmap DMA
>>> buffers dynamically.
>>
>> I do not see yet that you can model with an IOMMU a static, not guest
>> controlled window.
> 
> Well basically the IOMMU will have as part of the
> topology description and range of addresses devices behind it
> are allowed to access. What's the problem with that?
> 

I didn't look at the details of the vIOMMU from that perspective, but our
requirement would be that it just statically communicates to the guest
where the DMA windows are, rather than allowing the guest to configure
them (which is the normal usage of an IOMMU).

In addition, it would only address the memory transfer topic. We would
still be left with the current issue with virtio that the hypervisor's
device model needs to understand all supported device types.

Jan

> 
>> And IOMMU implies guest modifications as well (you need its driver). It
>> just happened to be there now in newer guests. A virtio shared memory
>> transport could be introduced similarly.
>>
>> But the biggest challenge would be that a static mode would allow for a
>> trivial hypervisor side model. Otherwise, we would only try to achieve a
>> simpler secure model by adding complexity elsewhere.
>>
>> I'm not arguing against vIOMMU per se. It's there, it is and will be
>> widely used. It's just not solving all issues.
>>
>> Jan
>>
>> -- 
>> Siemens AG, Corporate Technology, CT RDA IOT SES-DE
>> Corporate Competence Center Embedded Linux
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 


-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-18 14:58         ` Jan Kiszka
@ 2020-06-18 15:05           ` Michael S. Tsirkin
  2020-06-18 15:22             ` Jan Kiszka
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2020-06-18 15:05 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Stefan Hajnoczi, Alex Bennée, virtio-dev, David Hildenbrand,
	Srivatsa Vaddagiri, Azzedine Touzni, François Ozog,
	Ilias Apalodimas, Soni, Trilok, Dr. David Alan Gilbert,
	Jean-Philippe Brucker

On Thu, Jun 18, 2020 at 04:58:40PM +0200, Jan Kiszka wrote:
> >>>>> Option 5 - Additional Device
> >>>>> ============================
> >>>>>
> >>>>> The final approach would be to tie the allocation of virtqueues to
> >>>>> memory regions as defined by additional devices. For example the
> >>>>> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
> >>>>> a fixed non-mappable region of the address space. Other proposals like
> >>>>> virtio-mem allow for hot plugging of "physical" memory into the guest
> >>>>> (conveniently treatable as separate shareable memory objects for QEMU
> >>>>> ;-).
> >>>>>
> >>>>
> >>>> I think you forgot one approach: virtual IOMMU. That is the advanced
> >>>> form of the grant table approach. The backend still "sees" the full
> >>>> address space of the frontend, but it will not be able to access all of
> >>>> it and there might even be a translation going on. Well, like IOMMUs work.
> >>>>
> >>>> However, this implies dynamics that are under guest control, namely of
> >>>> the frontend guest. And such dynamics can be counterproductive for
> >>>> certain scenarios. That's where this static windows of shared memory
> >>>> came up.
> >>>
> >>> Yes, I think IOMMU interfaces are worth investigating more too. IOMMUs
> >>> are now widely implemented in Linux and virtualization software. That
> >>> means guest modifications aren't necessary and unmodified guest
> >>> applications will run.
> >>>
> >>> Applications that need the best performance can use a static mapping
> >>> while applications that want the strongest isolation can map/unmap DMA
> >>> buffers dynamically.
> >>
> >> I do not see yet that you can model with an IOMMU a static, not guest
> >> controlled window.
> > 
> > Well basically the IOMMU will have as part of the
> > topology description and range of addresses devices behind it
> > are allowed to access. What's the problem with that?
> > 
> 
> I didn't look at the detail of the vIOMMU from that perspective, but our
> requirement would be that it would just statically communicate to the
> guest where DMA windows are, rather than allowing the guest to configure
> that (which is the normal usage of an IOMMU).

Right, I got that - IOMMUs aren't necessarily fully configurable though.
E.g. some IOMMUs are restricted in the # of bits they can address.


> In addition, it would only address the memory transfer topic. We would
> still be left with the current issue of virtio that the hypervisor's
> device model needs to understand all supported device types.
> 
> Jan

I'd expect the DMA API to try to paper over that, likely using bounce
buffering. If you want to avoid copies, that's generally a harder
problem.
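
Something along these lines on the guest side (a sketch only, error
paths trimmed; this is just the normal DMA API, which swiotlb already
hooks when a buffer is out of a device's reach):

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Sketch: the driver keeps using the usual DMA API. If 'buf' lies
 * outside the window the backend is allowed to access, the DMA layer
 * bounces it through a buffer inside the window -- one extra copy. */
static int send_to_device(struct device *dev, void *buf, size_t len)
{
        dma_addr_t addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

        if (dma_mapping_error(dev, addr))
                return -ENOMEM;
        /* ... hand 'addr' to the device, wait for completion ... */
        dma_unmap_single(dev, addr, len, DMA_TO_DEVICE);
        return 0;
}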

-- 
MST


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-18 15:05           ` Michael S. Tsirkin
@ 2020-06-18 15:22             ` Jan Kiszka
  2020-06-18 15:29               ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kiszka @ 2020-06-18 15:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Alex Bennée, virtio-dev, David Hildenbrand,
	Srivatsa Vaddagiri, Azzedine Touzni, François Ozog,
	Ilias Apalodimas, Soni, Trilok, Dr. David Alan Gilbert,
	Jean-Philippe Brucker

On 18.06.20 17:05, Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2020 at 04:58:40PM +0200, Jan Kiszka wrote:
>>>>>>> Option 5 - Additional Device
>>>>>>> ============================
>>>>>>>
>>>>>>> The final approach would be to tie the allocation of virtqueues to
>>>>>>> memory regions as defined by additional devices. For example the
>>>>>>> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
>>>>>>> a fixed non-mappable region of the address space. Other proposals like
>>>>>>> virtio-mem allow for hot plugging of "physical" memory into the guest
>>>>>>> (conveniently treatable as separate shareable memory objects for QEMU
>>>>>>> ;-).
>>>>>>>
>>>>>>
>>>>>> I think you forgot one approach: virtual IOMMU. That is the advanced
>>>>>> form of the grant table approach. The backend still "sees" the full
>>>>>> address space of the frontend, but it will not be able to access all of
>>>>>> it and there might even be a translation going on. Well, like IOMMUs work.
>>>>>>
>>>>>> However, this implies dynamics that are under guest control, namely of
>>>>>> the frontend guest. And such dynamics can be counterproductive for
>>>>>> certain scenarios. That's where this static windows of shared memory
>>>>>> came up.
>>>>>
>>>>> Yes, I think IOMMU interfaces are worth investigating more too. IOMMUs
>>>>> are now widely implemented in Linux and virtualization software. That
>>>>> means guest modifications aren't necessary and unmodified guest
>>>>> applications will run.
>>>>>
>>>>> Applications that need the best performance can use a static mapping
>>>>> while applications that want the strongest isolation can map/unmap DMA
>>>>> buffers dynamically.
>>>>
>>>> I do not see yet that you can model with an IOMMU a static, not guest
>>>> controlled window.
>>>
>>> Well basically the IOMMU will have as part of the
>>> topology description and range of addresses devices behind it
>>> are allowed to access. What's the problem with that?
>>>
>>
>> I didn't look at the detail of the vIOMMU from that perspective, but our
>> requirement would be that it would just statically communicate to the
>> guest where DMA windows are, rather than allowing the guest to configure
>> that (which is the normal usage of an IOMMU).
> 
> Right, I got that - IOMMUs aren't necessarily fully configurable though.
> E.g. some IOMMUs are restricted in the # of bits they can address.
> 
> 
>> In addition, it would only address the memory transfer topic. We would
>> still be left with the current issue of virtio that the hypervisor's
>> device model needs to understand all supported device types.
>>
>> Jan
> 
> I'd expect the DMA API would try to paper over that likely using
> bounce buffering. If you want to avoid copies, that's a harder
> problem generally.
> 

Here I was referring to the permutations of the control path in a device
model when switching from, say, a storage to a network virtio device.
With PCI and MMIO (I didn't check Channel I/O, but that's not portable
anyway), you need to patch the "first-level" hypervisor whenever you want
to add a brand-new virtio-sound device that the hypervisor is not yet
aware of. For minimized setups, I would prefer to only reconfigure it and
just add a new backend service app or VM. Naturally, that model also
shrinks the logic the core hypervisor needs to provide for virtio.
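
To make that concrete (a sketch only -- 'struct vdev' and its callback
are made up, the register offsets follow the virtio-mmio layout): the
exit handler of a hypervisor that terminates the transport inevitably
ends up containing per-device-type logic:

#include <stdint.h>

#define VIRTIO_MMIO_DEVICE_ID   0x008
#define VIRTIO_MMIO_CONFIG      0x100

struct vdev {
        uint32_t device_id;                         /* 1 = net, 2 = block, ... */
        uint64_t (*read_config)(struct vdev *d, uint64_t off);
};

static uint64_t mmio_read(struct vdev *d, uint64_t offset)
{
        if (offset == VIRTIO_MMIO_DEVICE_ID)
                return d->device_id;
        if (offset >= VIRTIO_MMIO_CONFIG)
                /* The config layout is device-specific: a brand-new device
                 * type means patching this hypervisor, not just adding a
                 * backend VM. */
                return d->read_config(d, offset - VIRTIO_MMIO_CONFIG);
        return 0;                                   /* status, queues, ... */
}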

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-18 15:22             ` Jan Kiszka
@ 2020-06-18 15:29               ` Michael S. Tsirkin
  2020-07-03 12:22                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2020-06-18 15:29 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Stefan Hajnoczi, Alex Bennée, virtio-dev, David Hildenbrand,
	Srivatsa Vaddagiri, Azzedine Touzni, François Ozog,
	Ilias Apalodimas, Soni, Trilok, Dr. David Alan Gilbert,
	Jean-Philippe Brucker

On Thu, Jun 18, 2020 at 05:22:40PM +0200, Jan Kiszka wrote:
> On 18.06.20 17:05, Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2020 at 04:58:40PM +0200, Jan Kiszka wrote:
> >>>>>>> Option 5 - Additional Device
> >>>>>>> ============================
> >>>>>>>
> >>>>>>> The final approach would be to tie the allocation of virtqueues to
> >>>>>>> memory regions as defined by additional devices. For example the
> >>>>>>> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
> >>>>>>> a fixed non-mappable region of the address space. Other proposals like
> >>>>>>> virtio-mem allow for hot plugging of "physical" memory into the guest
> >>>>>>> (conveniently treatable as separate shareable memory objects for QEMU
> >>>>>>> ;-).
> >>>>>>>
> >>>>>>
> >>>>>> I think you forgot one approach: virtual IOMMU. That is the advanced
> >>>>>> form of the grant table approach. The backend still "sees" the full
> >>>>>> address space of the frontend, but it will not be able to access all of
> >>>>>> it and there might even be a translation going on. Well, like IOMMUs work.
> >>>>>>
> >>>>>> However, this implies dynamics that are under guest control, namely of
> >>>>>> the frontend guest. And such dynamics can be counterproductive for
> >>>>>> certain scenarios. That's where this static windows of shared memory
> >>>>>> came up.
> >>>>>
> >>>>> Yes, I think IOMMU interfaces are worth investigating more too. IOMMUs
> >>>>> are now widely implemented in Linux and virtualization software. That
> >>>>> means guest modifications aren't necessary and unmodified guest
> >>>>> applications will run.
> >>>>>
> >>>>> Applications that need the best performance can use a static mapping
> >>>>> while applications that want the strongest isolation can map/unmap DMA
> >>>>> buffers dynamically.
> >>>>
> >>>> I do not see yet that you can model with an IOMMU a static, not guest
> >>>> controlled window.
> >>>
> >>> Well basically the IOMMU will have as part of the
> >>> topology description and range of addresses devices behind it
> >>> are allowed to access. What's the problem with that?
> >>>
> >>
> >> I didn't look at the detail of the vIOMMU from that perspective, but our
> >> requirement would be that it would just statically communicate to the
> >> guest where DMA windows are, rather than allowing the guest to configure
> >> that (which is the normal usage of an IOMMU).
> > 
> > Right, I got that - IOMMUs aren't necessarily fully configurable though.
> > E.g. some IOMMUs are restricted in the # of bits they can address.
> > 
> > 
> >> In addition, it would only address the memory transfer topic. We would
> >> still be left with the current issue of virtio that the hypervisor's
> >> device model needs to understand all supported device types.
> >>
> >> Jan
> > 
> > I'd expect the DMA API would try to paper over that likely using
> > bounce buffering. If you want to avoid copies, that's a harder
> > problem generally.
> > 
> 
> Here I was referring to the permutations of the control path in a device
> model when switching from, say, a storage to a network virtio device.
> With PCI and MMIO (didn't check Channel I/O, but that's not portable
> anyway), you need to patch the "first-level" hypervisor when you want to
> add a brand-new virtio-sound device and the hypervisor is not yet aware
> of it. For minimized setups, I would prefer to only reconfigure it and
> just add a new backend service app or VM. Naturally, that model also
> shrinks the logic the core hypervisor needs to provide for virtio.
> 
> Jan

Hmm that went woosh over my head a bit, sorry.
If it's important for this discussion, a diagram might help.

-- 
MST


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-17 17:31 [virtio-dev] Constraining where a guest may allocate virtio accessible resources Alex Bennée
                   ` (2 preceding siblings ...)
  2020-06-18 13:25 ` Stefan Hajnoczi
@ 2020-06-19  8:02 ` Jean-Philippe Brucker
  3 siblings, 0 replies; 18+ messages in thread
From: Jean-Philippe Brucker @ 2020-06-19  8:02 UTC (permalink / raw)
  To: Alex Bennée
  Cc: virtio-dev, David Hildenbrand, jan.kiszka, Srivatsa Vaddagiri,
	Azzedine Touzni, François Ozog, Ilias Apalodimas, Soni,
	Trilok, Dr. David Alan Gilbert, Stefan Hajnoczi,
	Michael S. Tsirkin

On Wed, Jun 17, 2020 at 06:31:15PM +0100, Alex Bennée wrote:
[...]
> Option 2 - Additional Platform Data
> ===================================
> 
> This would be extending using something like device tree or ACPI tables
> which could define regions of memory that would inform the low level
> memory allocation routines where they could allocate from. There is
> already of the concept of "dma-ranges" in device tree which can be a
> per-device property which defines the region of space that is DMA
> coherent for a device.

They are regions that are accessible to a device for DMA; coherency is
described through other methods.

Thinking more about this, dma-ranges (and ACPI _DMA) don't exactly
describe what you need. They describe addressing limitation from a
bridge's perspective, for example from the PCI root complex. So there are
at least two issues:

1. They apply to the whole downstream bus, so you can't define per-device
   DMA windows. Although with PCIe I suppose you could put one on each
   downstream port.

2. More importantly, they only describe addressing limitations locally.
   When the device directly accesses memory, it emits guest-physical
   addresses (GPA) so you can use DMA ranges to describe which memory it
   can access. However, if there is an IOMMU in between, the device emits
   I/O virtual addresses (IOVA), which are translated by the IOMMU into
   GPA. In this case the DMA ranges apply to the IOVA, and there doesn't
   exist a way to describe limitations on the GPA.

There are other mechanisms describing addressing limitations such as
Intel's RMRR, but those also apply to IOVAs as far as I know.

Thanks,
Jean

> 
> There is the question of how you tie regions declared here with the
> eventual instantiating of the VirtIO devices?
> 
> For a fully distributed set of backends (one backend per device per
> worker VM) you would need several different regions. Would each region
> be tied to each device or just a set of areas the guest would allocate
> from in sequence?

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-17 18:01 ` [virtio-dev] " Jan Kiszka
  2020-06-18 13:29   ` Stefan Hajnoczi
  2020-06-18 13:53   ` Laszlo Ersek
@ 2020-06-19 15:16   ` Alex Bennée
  2 siblings, 0 replies; 18+ messages in thread
From: Alex Bennée @ 2020-06-19 15:16 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: virtio-dev, David Hildenbrand, Srivatsa Vaddagiri,
	Azzedine Touzni, François Ozog, Ilias Apalodimas, Soni,
	Trilok, Dr. David Alan Gilbert, Stefan Hajnoczi,
	Michael S. Tsirkin, Jean-Philippe Brucker


Jan Kiszka <jan.kiszka@siemens.com> writes:

> On 17.06.20 19:31, Alex Bennée wrote:
>> 
>> Hi,
>> 
>> This follows on from the discussion in the last thread I raised:
>> 
>>   Subject: Backend libraries for VirtIO device emulation
>>   Date: Fri, 06 Mar 2020 18:33:57 +0000
>>   Message-ID: <874kv15o4q.fsf@linaro.org>
>> 
>> To support the concept of a VirtIO backend having limited visibility of
>> a guests memory space there needs to be some mechanism to limit the
>> where that guest may place things. A simple VirtIO device can be
>> expressed purely in virt resources, for example:
>> 
>>    * status, feature and config fields
>>    * notification/doorbell
>>    * one or more virtqueues
>> 
>> Using a PCI backend the location of everything but the virtqueues it
>> controlled by the mapping of the PCI device so something that is
>> controllable by the host/hypervisor. However the guest is free to
>> allocate the virtqueues anywhere in the virtual address space of system
>> RAM.

Dave has helpfully reminded me that the guest still has control, via the
BARs, over where in the guest's physical address space these PCI regions
exist. Although there is, I believe, a mechanism which allows for fixed
PCI regions.

<snip>
>> 
>> Option 5 - Additional Device
>> ============================
>> 
>> The final approach would be to tie the allocation of virtqueues to
>> memory regions as defined by additional devices. For example the
>> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
>> a fixed non-mappable region of the address space. Other proposals like
>> virtio-mem allow for hot plugging of "physical" memory into the guest
>> (conveniently treatable as separate shareable memory objects for QEMU
>> ;-).
>> 
>
> I think you forgot one approach: virtual IOMMU. That is the advanced
> form of the grant table approach. The backend still "sees" the full
> address space of the frontend, but it will not be able to access all of
> it and there might even be a translation going on. Well, like IOMMUs
> work.

I can see how this works in the type-1 case with strict control of which
pages are visible to which domains of a system. In the QEMU KVM/TCG case,
however, the main process will always see the whole address space unless
there is something else (like SEV encryption) that stops it peeking into
the guest. Maybe that can't be helped, but then the question is how it
hands off a portion of the address space to either:

  - another userspace process in QEMU's domain
  - another userspace process in another VM

Maybe sharing memory between two processes in the same domain and
sharing it across domains should be treated as different problems? The
APIs available for userspace<->userspace are different from those for
userspace<->hypervisor.
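
For the userspace<->userspace case at least the mechanics are well
understood (a sketch below, error handling mostly elided): the region is
created as a file descriptor and it is the fd - not an address - that
crosses the process boundary, e.g. as SCM_RIGHTS ancillary data on the
vhost-user control socket.

/* Sketch: sharing a region between two host userspace processes. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void *make_shared_region(size_t size, int *fd_out)
{
        int fd = memfd_create("virtio-shared", MFD_CLOEXEC);

        if (fd < 0)
                return NULL;
        if (ftruncate(fd, size) < 0) {
                close(fd);
                return NULL;
        }
        *fd_out = fd;  /* pass this fd over the control socket */
        /* returns MAP_FAILED on error */
        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}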

> However, this implies dynamics that are under guest control, namely of
> the frontend guest. And such dynamics can be counterproductive for
> certain scenarios. That's where this static windows of shared memory
> came up.
>
>> 
>> Closing Thoughts and Open Questions
>> ===================================
>> 
>> Currently all of this is considering just virtqueues themselves but of
>> course only a subset of devices interact purely by virtqueue messages.
>> Network and Block devices often end up filling up additional structures
>> in memory that are usually across the whole of system memory. To achieve
>> better isolation you either need to ensure that specific bits of kernel
>> allocation are done in certain regions (i.e. block cache in "shared"
>> region) or implement some sort of bounce buffer [1] that allows you to bring
>> data from backend to frontend (which is more like the channel concept of
>> Xen's PV).
>
> For [1], look at https://lkml.org/lkml/2020/3/26/700 or at
> http://git.kiszka.org/?p=linux.git;a=blob;f=drivers/virtio/virtio_ivshmem.c;hb=refs/heads/queues/jailhouse
> (which should be using swiotlb one day).

So I guess this depends on dev_memremap being able to successfully allocate
the memory when the driver is initialised?

In the use cases I'm looking at I guess there will always be a trade-off
between performance and security. I suspect for file-systems there is
too much benefit in being able to map pages directly into the primary
guest's address space, compared to a device which may be interacting with
an un-trusted component.

>> I suspect the solution will end up being a combination of all of these
>> approaches. There setup of different systems might mean we need a
>> plethora of ways to carve out and define regions in ways a kernel can
>> understand and make decisions about.
>> 
>> I think there will always have to be an element of VirtIO config
>> involved as that is *the* mechanism by which front/back end negotiate if
>> they can get up and running in a way they are both happy with.
>> 
>> One potential approach would be to introduce the concept of a region id
>> at the VirtIO config level which is simply a reasonably unique magic
>> number that virtio driver passes down into the kernel when requesting
>> memory for it's virtqueues. It could then be left to the kernel to
>> associate use that id when identifying the physical address range to
>> allocate from. This seems a bit of a loose binding between the driver
>> level and the kernel level but perhaps that is preferable to allow for
>> flexibility about how such regions are discovered by kernels?
>> 
>> I hope this message hasn't rambled on to much. I feel this is a complex
>> topic and I'm want to be sure I've thought through all the potential
>> options before starting to prototype a solution. For those that have
>> made it this far the final questions are:
>> 
>>   - is constraining guest allocation of virtqueues a reasonable requirement?
>> 
>>   - could virtqueues ever be directly host/hypervisor assigned?
>> 
>>   - should there be a tight or loose coupling between front-end driver
>>     and kernel/hypervisor support for allocating memory?
>> 
>> Of course if this is all solvable with existing code I'd be more than
>> happy but please let me know how ;-)
>> 
>
> Queues are a central element of virtio, but there is a (maintainability
> & security) benefit if you can keep them away from the hosting
> hypervisor, limit their interpretation and negotiation to the backend
> driver in a host process or in a backend guest VM. So I would be careful
> with coupling things too tightly.
>
> One of the issues I see in virtio for use in minimalistic hypervisors is
> the need to be aware of the different virtio devices when using PCI or
> MMIO transports. That's where a shared memory transport come into
> play.

Yes the majority of the use cases are for security isolation. For
example having a secure but un-trusted component provide some sort of
service to the main OS via a virtio-device. The ARM hypervisor model now
allows for both secure and non-secure hypervisors each with their
attendant kernel and user-mode layers. Hypervisors all the way down ;-)

-- 
Alex Bennée

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-18 13:25 ` Stefan Hajnoczi
@ 2020-06-19 17:35   ` Alex Bennée
  2020-07-03 13:14     ` Stefan Hajnoczi
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Bennée @ 2020-06-19 17:35 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, David Hildenbrand, jan.kiszka, Srivatsa Vaddagiri,
	Azzedine Touzni, François Ozog, Ilias Apalodimas, Soni,
	Trilok, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Jean-Philippe Brucker


Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Wed, Jun 17, 2020 at 06:31:15PM +0100, Alex Bennée wrote:
>> This follows on from the discussion in the last thread I raised:
>> 
>>   Subject: Backend libraries for VirtIO device emulation
>>   Date: Fri, 06 Mar 2020 18:33:57 +0000
>>   Message-ID: <874kv15o4q.fsf@linaro.org>
>> 
>> To support the concept of a VirtIO backend having limited visibility of
>
> It's unclear what we're discussing. Does "VirtIO backend" mean
> vhost-user devices?
>
> Can you describe what you are trying to do?
>

Yes - although eventually the vhost-user device might be hosted in a
separate VM. See this contrived architecture diagram:

                                                   |                                                     
                Secure World                       |          Non-secure world                 
                                                   |   +--------------------+  +---------------+  
                                                   |   |c1AB                |  |cGRE           |  
                                                   |   |                    |  |               |  
                                                   |   |     Primary OS     |  |   Secondary   |  
                                                   |   |      (android)     |  |      VM       |  
         +--------------+                          |   |                    |  |               |  
         |cYEL          |                          |   |                    |  |   (Backend)   |  
         |              |                          |   |                    |  +---------------+  
         |              |                          |   |                    |                   
         |  Untrusted   |                          |   |                    |                   
         |              |                          |   |                    |  +---------------+  
   EL0   |   Service    |                          |   |                    |  |cGRE           |
    .    |              |                          |   |                    |  |               |
    .    |              |                          :   | +----------------+ |  |   Secondary   |
    .    |              |                          |   | |{io} VirtIO     | |  |      VM       |
   EL1   |              |                          |   | |                | |  |               |
         |  (Backend)   |                          |   | +----------------+ |  |   (Backend)   |
         +--------------+                          |   +----------------+---+  +---------------+
                                                   |                                        
         +-------------------------------------+   |   +---------------------------------------+
         |cPNK                                 |   |   |cGRE                                   |
   EL2   |        Secure Hypervisor            |   |   |          Non-secure Hypervisor        |
         |                                     |   |   |                                       |
         +-------------------------------------+   |   +---------------------------------------+
                                                   +-----------------------------------------------
         +-------------------------------------------------------------------------------------+
         |cRED                                                                                 |
   EL3   |                                  Secure Firmware                                    |
         |                                                                                     |
         +-------------------------------------------------------------------------------------+
  ----=-----------------------------------------------------------------------------------------   
         +------------------------+ +-------------------------+ +------------------------------+
         | c444                   | | {s}                c444 | | {io}                    c444 |
   HW    |        Compute         | |         Storage         | |             I/O              |
         |  (CPUs, GPUs, Accel)   | |  (Flash, Secure Flash)  | |   Network, USB, Peripherals  |
         |                        | |                         | |                              |
         +------------------------+ +-------------------------+ +------------------------------+

Here the primary OS is connected to the world through VirtIO devices
(acting as a common HAL). Each individual device might have a secondary
VM associated with it. Some devices might be virtual - for example a
3rd party DRM module. It would be untrusted, so it doesn't run as part
of the secure firmware, but it might still need to access secure
resources like a key store or a video port.

All of these backends should only have access to the minimum amount of
the primary OS's memory space that they need to do their job. While the
non-secure hypervisor could be something like KVM, the secure one is
likely to be a much more lightweight type-1 hypervisor.

My interest in including TCG in this mix is for early prototyping and
ease of debugging when working with this towering array of layers ;-)

>> a guests memory space there needs to be some mechanism to limit the
>> where that guest may place things.
>
> Or an enforcing IOMMU? In other words, an IOMMU that only gives access
> to memory that has been put forth for DMA.
>
> This was discussed recently in the context of the ongoing
> vfio-over-socket work ("RFC: use VFIO over a UNIX domain socket to
> implement device offloading" on qemu-devel). The idea is to use the VFIO
> protocol but over UNIX domain sockets to another host userspace process
> instead of over ioctls to the kernel VFIO drivers. This would allow
> arbitary devices to be emulated in a separate process from QEMU. As a
> first step I suggested DMA_READ/DMA_WRITE protocol messages, even though
> this will have poor performance.

This is still mediated by a kernel though, right?

> I think finding a solution for an enforcing IOMMU is preferrable to
> guest cooperation. The problem with guest cooperation is that you may be
> able to get new VIRTIO guest drivers to restrict where the virtqueues
> are placed, but what about applications (e.g. O_DIRECT disk I/O, network
> packets) with memory buffers at arbitrary addresses?

The virtqueues are the simple case, but yes, it gets complex for the
rest of the data - the simple approach is a bounce buffer which the
guest then copies from into its own secure address space.
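
To sketch what I mean (the window layout and helper names here are made
up for illustration - the real thing would presumably sit behind the
kernel's DMA API):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * The only region the backend can see; everything else in the guest
     * stays private. How the window is negotiated is a separate problem.
     */
    struct bounce_window {
        uint8_t *base;   /* guest mapping of the shared region */
        size_t   size;
        size_t   next;   /* trivial bump allocator, reset per request */
    };

    /*
     * Outgoing data: copy a private buffer into the window and return
     * the offset to publish in the descriptor, or (size_t)-1 if full.
     */
    static size_t bounce_out(struct bounce_window *w, const void *priv,
                             size_t len)
    {
        if (len > w->size - w->next)
            return (size_t)-1;
        size_t off = w->next;
        memcpy(w->base + off, priv, len);
        w->next += len;
        return off;
    }

    /*
     * Incoming data: once the backend signals completion the guest
     * copies out of the window into its own (secure) buffer.
     */
    static void bounce_in(struct bounce_window *w, size_t off,
                          void *priv, size_t len)
    {
        memcpy(priv, w->base + off, len);
    }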

> Modifying guest applications to honor buffer memory restrictions is too
> disruptive for most use cases.
>
>> A simple VirtIO device can be
>> expressed purely in virt resources, for example:
>> 
>>    * status, feature and config fields
>>    * notification/doorbell
>>    * one or more virtqueues
>> 
>> Using a PCI backend the location of everything but the virtqueues it
>> controlled by the mapping of the PCI device so something that is
>> controllable by the host/hypervisor. However the guest is free to
>> allocate the virtqueues anywhere in the virtual address space of system
>> RAM.
>> 
>> In theory this shouldn't matter because sharing virtual pages is just a
>> matter of putting the appropriate translations in place. However there
>> are multiple ways the host and guest may interact:
>> 
>> * QEMU TCG
>> 
>> QEMU sees a block of system memory in it's virtual address space that
>> has a one to one mapping with the guests physical address space. If QEMU
>> want to share a subset of that address space it can only realistically
>> do it for a contiguous region of it's address space which implies the
>> guest must use a contiguous region of it's physical address space.
>
> This paragraph doesn't reflect my understanding. There can be multiple
> RAMBlocks. There isn't necessarily just 1 contiguous piece of RAM.
>
>> 
>> * QEMU KVM
>> 
>> The situation here is broadly the same - although both QEMU and the
>> guest are seeing a their own virtual views of a linear address space
>> which may well actually be a fragmented set of physical pages on the
>> host.
>
> I don't understand the "although" part. Isn't the situation the same as
> with TCG, where guest physical memory ranges can cross RAMBlock
> boundaries?

You are correct - I was oversimplifying. This is why I was thinking
about the virtio-mem device. That would have its own RAMBlock which
could be the only one with an associated shared memory object.
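
Roughly what I have in mind on the QEMU command line (a sketch - the
exact property names for the still-in-flight virtio-mem device may
differ):

    -object memory-backend-memfd,id=vmem0,size=1G,share=on \
    -device virtio-mem-pci,memdev=vmem0,requested-size=512M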

>> KVM based guests have additional constraints if they ever want to access
>> real hardware in the host as you need to ensure any address accessed by
>> the guest can be eventually translated into an address that can
>> physically access the bus which a device in one (for device
>> pass-through). The area also has to be DMA coherent so updates from a
>> bus are reliably visible to software accessing the same address space.
>
> I'm surprised about the DMA coherency sentence. Dont't VFIO and other
> userspace I/O APIs provide the DMA APIs allowing applications to deal
> with caches/coherency?

Yes - but the kernel has to ensure the buffers used by these APIs are
allocated in regions that meet the requirements.
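
For reference, this is the sort of call I mean on the VFIO side - when
userspace registers a buffer the kernel pins the pages and programs the
IOMMU for it (a minimal sketch, error handling and alignment checks
omitted):

    #include <linux/vfio.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Map a userspace buffer for device DMA at the given IOVA via the
     * VFIO container. The kernel takes care of pinning the pages and
     * programming the IOMMU. */
    static int map_for_dma(int container_fd, void *buf, size_t len,
                           uint64_t iova)
    {
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (uintptr_t)buf;
        map.iova  = iova;
        map.size  = len;

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }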

>
>> 
>> * Xen (and other type-1's?)
>> 
>> Here the situation is a little different because the guest explicitly
>> makes it's pages visible to other domains by way of grant tables. The
>> guest is still free to use whatever parts of its address space it wishes
>> to. Other domains then request access to those pages via the hypervisor.
>> 
>> In theory the requester is free to map the granted pages anywhere in
>> its own address space. However there are differences between the
>> architectures on how well this is supported.
>> 
>> So I think this makes a case for having a mechanism by which the guest
>> can restrict it's allocation to a specific area of the guest physical
>> address space. The question is then what is the best way to inform the
>> guest kernel of the limitation?
>
> As mentioned above, I don't think it's possible to do this without
> modifying applications - which is not possible in many use cases.
> Instead we could improve IOMMU support so that this works transparently.

Yes. So the IOMMU allows the guest to mark all the pages associated
with a particular device and its transactions, but how do we map that
to the userspace view, which is controlled in software?

-- 
Alex Bennée




* [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-18  7:30 ` Michael S. Tsirkin
@ 2020-06-19 18:20   ` Alex Bennée
  0 siblings, 0 replies; 18+ messages in thread
From: Alex Bennée @ 2020-06-19 18:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, David Hildenbrand, jan.kiszka, Srivatsa Vaddagiri,
	Azzedine Touzni, François Ozog, Ilias Apalodimas, Soni,
	Trilok, Dr. David Alan Gilbert, Stefan Hajnoczi,
	Jean-Philippe Brucker


Michael S. Tsirkin <mst@redhat.com> writes:

> On Wed, Jun 17, 2020 at 06:31:15PM +0100, Alex Bennée wrote:
>> 
>> Hi,
>> 
>> This follows on from the discussion in the last thread I raised:
>> 
>>   Subject: Backend libraries for VirtIO device emulation
>>   Date: Fri, 06 Mar 2020 18:33:57 +0000
>>   Message-ID: <874kv15o4q.fsf@linaro.org>
>> 
>> To support the concept of a VirtIO backend having limited visibility of
>> a guests memory space there needs to be some mechanism to limit the
>> where that guest may place things. A simple VirtIO device can be
>> expressed purely in virt resources, for example:
>> 
>>    * status, feature and config fields
>>    * notification/doorbell
>>    * one or more virtqueues
>> 
>> Using a PCI backend the location of everything but the virtqueues it
>> controlled by the mapping of the PCI device so something that is
>> controllable by the host/hypervisor. However the guest is free to
>> allocate the virtqueues anywhere in the virtual address space of system
>> RAM.
>> 
>> In theory this shouldn't matter because sharing virtual pages is just a
>> matter of putting the appropriate translations in place. However there
>> are multiple ways the host and guest may interact:
>> 
>> * QEMU TCG
>> 
>> QEMU sees a block of system memory in it's virtual address space that
>> has a one to one mapping with the guests physical address space. If QEMU
>> want to share a subset of that address space it can only realistically
>> do it for a contiguous region of it's address space which implies the
>> guest must use a contiguous region of it's physical address space.
>> 
>> * QEMU KVM
>> 
>> The situation here is broadly the same - although both QEMU and the
>> guest are seeing a their own virtual views of a linear address space
>> which may well actually be a fragmented set of physical pages on the
>> host.
>> 
>> KVM based guests have additional constraints if they ever want to access
>> real hardware in the host as you need to ensure any address accessed by
>> the guest can be eventually translated into an address that can
>> physically access the bus which a device in one (for device
>> pass-through). The area also has to be DMA coherent so updates from a
>> bus are reliably visible to software accessing the same address space.
>> 
>> * Xen (and other type-1's?)
>> 
>> Here the situation is a little different because the guest explicitly
>> makes it's pages visible to other domains by way of grant tables. The
>> guest is still free to use whatever parts of its address space it wishes
>> to. Other domains then request access to those pages via the hypervisor.
>> 
>> In theory the requester is free to map the granted pages anywhere in
>> its own address space. However there are differences between the
>> architectures on how well this is supported.
>> 
>> So I think this makes a case for having a mechanism by which the guest
>> can restrict it's allocation to a specific area of the guest physical
>> address space. The question is then what is the best way to inform the
>> guest kernel of the limitation?
>
> Something that's unclear to me is whether you envision each
> device to have its own dedicated memory it can access,
> or broadly to have a couple of groups of devices,
> kind of like e.g. there are 32 bit and 64 bit DMA capable pci devices,
> or like we have devices with VIRTIO_F_ACCESS_PLATFORM and
> without it?

See the diagram I posted upthread in reply to Stefan, but yes -
potentially a different bit of dedicated memory per virtio device so
each backend can only see its particular virtqueues (and potentially
any kernel buffers it needs access to).

<snip>
>> 
>> Option 5 - Additional Device
>> ============================
>> 
>> The final approach would be to tie the allocation of virtqueues to
>> memory regions as defined by additional devices. For example the
>> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
>> a fixed non-mappable region of the address space. Other proposals like
>> virtio-mem allow for hot plugging of "physical" memory into the guest
>> (conveniently treatable as separate shareable memory objects for QEMU
>> ;-).
>
> Another approach would be supplying this information through virtio-iommu.
> That already has topology information, and can be used together with
> VIRTIO_F_ACCESS_PLATFORM to limit device access to memory.
> As virtio iommu is fairly new I kind of like this approach myself -
> not a lot of legacy to contend with.

Does anything implement this yet? I had a dig through QEMU and Linux and
couldn't see it mentioned.


-- 
Alex Bennée




* Re: [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-18 15:29               ` Michael S. Tsirkin
@ 2020-07-03 12:22                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2020-07-03 12:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jan Kiszka, Alex Bennée, virtio-dev, David Hildenbrand,
	Srivatsa Vaddagiri, Azzedine Touzni, François Ozog,
	Ilias Apalodimas, Soni, Trilok, Dr. David Alan Gilbert,
	Jean-Philippe Brucker


On Thu, Jun 18, 2020 at 11:29:07AM -0400, Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2020 at 05:22:40PM +0200, Jan Kiszka wrote:
> > On 18.06.20 17:05, Michael S. Tsirkin wrote:
> > > On Thu, Jun 18, 2020 at 04:58:40PM +0200, Jan Kiszka wrote:
> > >>>>>>> Option 5 - Additional Device
> > >>>>>>> ============================
> > >>>>>>>
> > >>>>>>> The final approach would be to tie the allocation of virtqueues to
> > >>>>>>> memory regions as defined by additional devices. For example the
> > >>>>>>> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
> > >>>>>>> a fixed non-mappable region of the address space. Other proposals like
> > >>>>>>> virtio-mem allow for hot plugging of "physical" memory into the guest
> > >>>>>>> (conveniently treatable as separate shareable memory objects for QEMU
> > >>>>>>> ;-).
> > >>>>>>>
> > >>>>>>
> > >>>>>> I think you forgot one approach: virtual IOMMU. That is the advanced
> > >>>>>> form of the grant table approach. The backend still "sees" the full
> > >>>>>> address space of the frontend, but it will not be able to access all of
> > >>>>>> it and there might even be a translation going on. Well, like IOMMUs work.
> > >>>>>>
> > >>>>>> However, this implies dynamics that are under guest control, namely of
> > >>>>>> the frontend guest. And such dynamics can be counterproductive for
> > >>>>>> certain scenarios. That's where this static windows of shared memory
> > >>>>>> came up.
> > >>>>>
> > >>>>> Yes, I think IOMMU interfaces are worth investigating more too. IOMMUs
> > >>>>> are now widely implemented in Linux and virtualization software. That
> > >>>>> means guest modifications aren't necessary and unmodified guest
> > >>>>> applications will run.
> > >>>>>
> > >>>>> Applications that need the best performance can use a static mapping
> > >>>>> while applications that want the strongest isolation can map/unmap DMA
> > >>>>> buffers dynamically.
> > >>>>
> > >>>> I do not see yet that you can model with an IOMMU a static, not guest
> > >>>> controlled window.
> > >>>
> > >>> Well basically the IOMMU will have as part of the
> > >>> topology description and range of addresses devices behind it
> > >>> are allowed to access. What's the problem with that?
> > >>>
> > >>
> > >> I didn't look at the detail of the vIOMMU from that perspective, but our
> > >> requirement would be that it would just statically communicate to the
> > >> guest where DMA windows are, rather than allowing the guest to configure
> > >> that (which is the normal usage of an IOMMU).
> > > 
> > > Right, I got that - IOMMUs aren't necessarily fully configurable though.
> > > E.g. some IOMMUs are restricted in the # of bits they can address.
> > > 
> > > 
> > >> In addition, it would only address the memory transfer topic. We would
> > >> still be left with the current issue of virtio that the hypervisor's
> > >> device model needs to understand all supported device types.
> > >>
> > >> Jan
> > > 
> > > I'd expect the DMA API would try to paper over that likely using
> > > bounce buffering. If you want to avoid copies, that's a harder
> > > problem generally.
> > > 
> > 
> > Here I was referring to the permutations of the control path in a device
> > model when switching from, say, a storage to a network virtio device.
> > With PCI and MMIO (didn't check Channel I/O, but that's not portable
> > anyway), you need to patch the "first-level" hypervisor when you want to
> > add a brand-new virtio-sound device and the hypervisor is not yet aware
> > of it. For minimized setups, I would prefer to only reconfigure it and
> > just add a new backend service app or VM. Naturally, that model also
> > shrinks the logic the core hypervisor needs to provide for virtio.
> > 
> > Jan
> 
> Hmm that went woosh over my head a bit, sorry.
> If it's important for this discussion, a diagram might help.

Kernel VFIO and VFIO-over-socket provide this sort of interface where
it's possible to add new device types without modifying the hypervisor.

vhost-user is not (yet?) a full VIRTIO device so it requires code in the
VMM to set up the VIRTIO device that the guest sees. But recent
developments in vhost-user and vDPA seem to be moving closer to a full
VIRTIO device model and not just an offload for a subset of virtqueues.

Stefan



* [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
  2020-06-19 17:35   ` Alex Bennée
@ 2020-07-03 13:14     ` Stefan Hajnoczi
  0 siblings, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2020-07-03 13:14 UTC (permalink / raw)
  To: Alex Bennée
  Cc: virtio-dev, David Hildenbrand, jan.kiszka, Srivatsa Vaddagiri,
	Azzedine Touzni, François Ozog, Ilias Apalodimas, Soni,
	Trilok, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Jean-Philippe Brucker


On Fri, Jun 19, 2020 at 06:35:39PM +0100, Alex Bennée wrote:
> 
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> 
> > On Wed, Jun 17, 2020 at 06:31:15PM +0100, Alex Bennée wrote:
> >> This follows on from the discussion in the last thread I raised:
> >> 
> >>   Subject: Backend libraries for VirtIO device emulation
> >>   Date: Fri, 06 Mar 2020 18:33:57 +0000
> >>   Message-ID: <874kv15o4q.fsf@linaro.org>
> >> 
> >> To support the concept of a VirtIO backend having limited visibility of
> >
> > It's unclear what we're discussing. Does "VirtIO backend" mean
> > vhost-user devices?
> >
> > Can you describe what you are trying to do?
> >
> 
> Yes - although eventually the vhost-user device might be hosted in a
> separate VM. See this contrived architecture diagram:
> 
>                                                    |                                                     
>                 Secure World                       |          Non-secure world                 
>                                                    |   +--------------------+  +---------------+  
>                                                    |   |c1AB                |  |cGRE           |  
>                                                    |   |                    |  |               |  
>                                                    |   |     Primary OS     |  |   Secondary   |  
>                                                    |   |      (android)     |  |      VM       |  
>          +--------------+                          |   |                    |  |               |  
>          |cYEL          |                          |   |                    |  |   (Backend)   |  
>          |              |                          |   |                    |  +---------------+  
>          |              |                          |   |                    |                   
>          |  Untrusted   |                          |   |                    |                   
>          |              |                          |   |                    |  +---------------+  
>    EL0   |   Service    |                          |   |                    |  |cGRE           |
>     .    |              |                          |   |                    |  |               |
>     .    |              |                          :   | +----------------+ |  |   Secondary   |
>     .    |              |                          |   | |{io} VirtIO     | |  |      VM       |
>    EL1   |              |                          |   | |                | |  |               |
>          |  (Backend)   |                          |   | +----------------+ |  |   (Backend)   |
>          +--------------+                          |   +----------------+---+  +---------------+
>                                                    |                                        
>          +-------------------------------------+   |   +---------------------------------------+
>          |cPNK                                 |   |   |cGRE                                   |
>    EL2   |        Secure Hypervisor            |   |   |          Non-secure Hypervisor        |
>          |                                     |   |   |                                       |
>          +-------------------------------------+   |   +---------------------------------------+
>                                                    +-----------------------------------------------
>          +-------------------------------------------------------------------------------------+
>          |cRED                                                                                 |
>    EL3   |                                  Secure Firmware                                    |
>          |                                                                                     |
>          +-------------------------------------------------------------------------------------+
>   ----=-----------------------------------------------------------------------------------------   
>          +------------------------+ +-------------------------+ +------------------------------+
>          | c444                   | | {s}                c444 | | {io}                    c444 |
>    HW    |        Compute         | |         Storage         | |             I/O              |
>          |  (CPUs, GPUs, Accel)   | |  (Flash, Secure Flash)  | |   Network, USB, Peripherals  |
>          |                        | |                         | |                              |
>          +------------------------+ +-------------------------+ +------------------------------+
> 
> Here the primary OS is connected to the work through VirtIO devices
> (acting as a common HAL). Each individual device might have a secondary
> VM associated with it. Some devices might be virtual - for example a 3rd
> party DRM module. It would be un-trusted so doesn't run as part of the
> secure firmware but it might still need to access secure resources like
> a key store or a video port.
> 
> For all these backends they should only have access to the minimum
> amount of the primary OS's memory space that they need to fulfil their
> job. While the non-secure hypervisor could be something like KVM it's
> likely the secure one will be a much more lightweight type-1 hypervisor.

This is possible with vhost-user + virtio-vhost-user:

Expose a subset of the Primary VM's memory over vhost-user to the
Secondary VM. Normally all guest RAM is exposed over vhost-user, but
it's simple to expose only a subset. The DMA region needs to be its own
file descriptor (not a larger memfd that also contains other guest RAM)
since vhost-user uses file descriptor passing to share memory.
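
Under the hood that's just memfd_create() plus SCM_RIGHTS passing over
the vhost-user UNIX socket. A minimal sketch of the Primary-side VMM
handing a region to the backend (not the actual vhost-user message
encoding, just the fd-passing mechanics):

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Create a shareable region and send its fd to the backend process.
     * The Primary keeps its own mapping of the same fd so both sides
     * see the same pages. */
    static int send_region_fd(int sock, size_t region_size)
    {
        int memfd = memfd_create("virtio-dma-window", MFD_ALLOW_SEALING);
        if (memfd < 0 || ftruncate(memfd, region_size) < 0)
            return -1;

        char iobuf[1] = { 0 };
        struct iovec iov = { .iov_base = iobuf, .iov_len = sizeof(iobuf) };
        char cmsgbuf[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = { 0 };

        memset(cmsgbuf, 0, sizeof(cmsgbuf));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cmsgbuf;
        msg.msg_controllen = sizeof(cmsgbuf);

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &memfd, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }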

The "Guest Physical Addresses" in virtqueues don't need to be
translated, they can be the same GPAs used inside the guest. The
Secondary just have access to GPAs outside the memfd region(s) that have
been provided by the Primary.

The guest drivers in the Primary will need to copy buffers to the DMA
memory region if existing applications are not aware of DMA address
constraints.

Nikos Dragazis has DPDK and SPDK code with virtio-vhost-user support, so
you could use that as a starting point to add DMA constraints. The
Secondary either emulates a virtio-net (DPDK) device or a virtio-scsi
(SPDK) device:

https://ndragazis.github.io/spdk.html

The missing pieces are:

1. A way to associate memory backends with vhost-user devices so the
   Primary knows which memory to expose:

     -object memory-backend-memfd,id=foo,...
     -device vhost-user-scsi-pci,memory-backend[0]=foo,...

   Now the vhost-user-scsi-pci device will only expose the 'foo' memfd
   over vhost-user, not all of guest RAM.

2. A way to communicate per-device DMA address restrictions to the
   Primary OS and the necessary driver memory allocation changes and/or
   bounce buffers.

Adding those two things on top of virtio-vhost-user will do what you've
described in the diagram.

> My interest in including TCG in this mix is for early prototyping and
> ease of debugging when working with this towering array of layers ;-)

Once TCG works with vhost-user it will also work with virtio-vhost-user.

> >> a guests memory space there needs to be some mechanism to limit the
> >> where that guest may place things.
> >
> > Or an enforcing IOMMU? In other words, an IOMMU that only gives access
> > to memory that has been put forth for DMA.
> >
> > This was discussed recently in the context of the ongoing
> > vfio-over-socket work ("RFC: use VFIO over a UNIX domain socket to
> > implement device offloading" on qemu-devel). The idea is to use the VFIO
> > protocol but over UNIX domain sockets to another host userspace process
> > instead of over ioctls to the kernel VFIO drivers. This would allow
> > arbitary devices to be emulated in a separate process from QEMU. As a
> > first step I suggested DMA_READ/DMA_WRITE protocol messages, even though
> > this will have poor performance.
> 
> This is still mediated by a kernel though right?

The host kernel? The guest kernel?

It's possible to have a vIOMMU in the guest and no host kernel IOMMU.

The guest kernel needs to support the vIOMMU. The infrastructure is
there in Linux since the vIOMMU can already be used for device
passthrough.

> > I think finding a solution for an enforcing IOMMU is preferrable to
> > guest cooperation. The problem with guest cooperation is that you may be
> > able to get new VIRTIO guest drivers to restrict where the virtqueues
> > are placed, but what about applications (e.g. O_DIRECT disk I/O, network
> > packets) with memory buffers at arbitrary addresses?
> 
> The virtqueues are the simple case but yes it gets complex for the rest
> of the data - the simple case is handled by a bounce buffer which the
> guest then copies from into it's own secure address space.
> 
> > Modifying guest applications to honor buffer memory restrictions is too
> > disruptive for most use cases.
> >
> >> A simple VirtIO device can be
> >> expressed purely in virt resources, for example:
> >> 
> >>    * status, feature and config fields
> >>    * notification/doorbell
> >>    * one or more virtqueues
> >> 
> >> Using a PCI backend the location of everything but the virtqueues it
> >> controlled by the mapping of the PCI device so something that is
> >> controllable by the host/hypervisor. However the guest is free to
> >> allocate the virtqueues anywhere in the virtual address space of system
> >> RAM.
> >> 
> >> In theory this shouldn't matter because sharing virtual pages is just a
> >> matter of putting the appropriate translations in place. However there
> >> are multiple ways the host and guest may interact:
> >> 
> >> * QEMU TCG
> >> 
> >> QEMU sees a block of system memory in it's virtual address space that
> >> has a one to one mapping with the guests physical address space. If QEMU
> >> want to share a subset of that address space it can only realistically
> >> do it for a contiguous region of it's address space which implies the
> >> guest must use a contiguous region of it's physical address space.
> >
> > This paragraph doesn't reflect my understanding. There can be multiple
> > RAMBlocks. There isn't necessarily just 1 contiguous piece of RAM.
> >
> >> 
> >> * QEMU KVM
> >> 
> >> The situation here is broadly the same - although both QEMU and the
> >> guest are seeing a their own virtual views of a linear address space
> >> which may well actually be a fragmented set of physical pages on the
> >> host.
> >
> > I don't understand the "although" part. Isn't the situation the same as
> > with TCG, where guest physical memory ranges can cross RAMBlock
> > boundaries?
> 
> You are correct - I was over simplifying. This is why I was thinking
> about the virtio-mem device. That would have it's own RAMBlock which
> could be the only one with an associated shared memory object.
> 
> >> KVM based guests have additional constraints if they ever want to access
> >> real hardware in the host as you need to ensure any address accessed by
> >> the guest can be eventually translated into an address that can
> >> physically access the bus which a device in one (for device
> >> pass-through). The area also has to be DMA coherent so updates from a
> >> bus are reliably visible to software accessing the same address space.
> >
> > I'm surprised about the DMA coherency sentence. Dont't VFIO and other
> > userspace I/O APIs provide the DMA APIs allowing applications to deal
> > with caches/coherency?
> 
> Yes - but the kernel has to ensure the buffers used by these APIs are
> allocated in regions that meet the requirements.

Is coherency an issue for software devices? Normally software device
implementations respect memory ordering (e.g. by using memory barriers)
but there is nothing else to worry about.
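
In other words, the only thing a software implementation needs is the
usual publish/consume ordering around the rings, along these lines (a
sketch using C11 atomics rather than any particular implementation's
barrier macros):

    #include <stdatomic.h>
    #include <stdint.h>

    struct ring {
        uint64_t desc_addr[256];  /* descriptor slots shared with the peer */
        _Atomic uint16_t idx;     /* producer index the peer reads         */
    };

    static void publish(struct ring *r, uint16_t slot, uint64_t addr)
    {
        r->desc_addr[slot] = addr;   /* fill the descriptor first...       */
        /* ...then make it visible before the index update that exposes it */
        atomic_store_explicit(&r->idx, (uint16_t)(slot + 1u),
                              memory_order_release);
    }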

> >
> >> 
> >> * Xen (and other type-1's?)
> >> 
> >> Here the situation is a little different because the guest explicitly
> >> makes it's pages visible to other domains by way of grant tables. The
> >> guest is still free to use whatever parts of its address space it wishes
> >> to. Other domains then request access to those pages via the hypervisor.
> >> 
> >> In theory the requester is free to map the granted pages anywhere in
> >> its own address space. However there are differences between the
> >> architectures on how well this is supported.
> >> 
> >> So I think this makes a case for having a mechanism by which the guest
> >> can restrict it's allocation to a specific area of the guest physical
> >> address space. The question is then what is the best way to inform the
> >> guest kernel of the limitation?
> >
> > As mentioned above, I don't think it's possible to do this without
> > modifying applications - which is not possible in many use cases.
> > Instead we could improve IOMMU support so that this works transparently.
> 
> Yes. So the IOMMU allows the guest to mark all the pages associated with
> a particular device and it's transactions but how do we map that to the
> userspace view which in controlled in software?

The userspace device implementation in the Secondary? vhost-user
supports vIOMMU address translation. It asks the vIOMMU to translate an
IOVA to a GPA. The problem is that the device can still access all of
guest RAM. I'm not aware of a fast and secure interface for changing
mmaps in another process :( so it seems tricky to achieve this.

If you care more about security than performance you can add DMA
read/write messages to the vhost-user protocol. The Secondary will then
perform each DMA read/write by sending a message to the Primary,
including the data that needs to be transferred.

To get slightly better performance this could be enhanced with
traditional shared memory regions for the virtqueues so that at least
the vring can be accessed via shared memory. Then only indirect
descriptor tables and the actual data buffers need to take the slow
path through the Primary.
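
For concreteness, such messages might look something like this - the
message IDs and wire layout below are made up, nothing like this exists
in the vhost-user protocol today:

    #include <stdint.h>

    /* Hypothetical backend-to-VMM requests: the Secondary asks the
     * Primary to perform the guest memory access on its behalf. */
    enum {
        VHOST_USER_SLAVE_DMA_READ  = 100,   /* assumed, unallocated IDs */
        VHOST_USER_SLAVE_DMA_WRITE = 101,
    };

    struct vhost_user_dma_req {
        uint64_t guest_addr;   /* GPA (or IOVA if a vIOMMU is in use)   */
        uint32_t len;          /* number of bytes to transfer           */
        uint32_t flags;
        uint8_t  data[];       /* payload for writes; the reply carries
                                  the data for reads                    */
    };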

Stefan


