* [virtio-dev] Enabling hypervisor agnosticism for VirtIO backends
@ 2021-08-04  9:04 Alex Bennée
  2021-08-04 19:20 ` Stefano Stabellini
                   ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Alex Bennée @ 2021-08-04  9:04 UTC (permalink / raw)
  To: Stratos Mailing List, virtio-dev
  Cc: Arnd Bergmann, Viresh Kumar, AKASHI Takahiro, Stefano Stabellini,
	stefanha, Jan Kiszka, Carl van Schaik, pratikp,
	Srivatsa Vaddagiri, Jean-Philippe Brucker, Mathieu Poirier

Hi,

One of the goals of Project Stratos is to enable hypervisor-agnostic
backends so we can re-use as much code as possible and avoid
repeating ourselves. This is the flip side of the front end, where
multiple front-end implementations are required - one per OS, assuming
you don't just want Linux guests. The resultant guests are trivially
movable between hypervisors modulo any abstracted paravirt type
interfaces.

In my original thumbnail sketch of a solution I envisioned vhost-user
daemons running in a broadly POSIX-like environment. The interface to
the daemon is fairly simple, requiring only some mapped memory and some
sort of signalling for events (on Linux this is eventfd). The idea was
that a stub binary would be responsible for any hypervisor-specific
setup and then launch a common binary to deal with the actual virtqueue
requests themselves.
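
To make that interface concrete, here is a rough C sketch of what the
hypervisor-specific stub might hand over to the common daemon. The
structure and names are purely illustrative assumptions, not an
existing API:

  #include <stdint.h>
  #include <stddef.h>
  #include <unistd.h>

  /* Hypothetical hand-off from the hypervisor-specific stub to the
   * common vhost-user style daemon: mapped guest memory plus two
   * event file descriptors (eventfds on Linux). */
  struct be_handoff {
      void   *guest_mem;   /* FE guest memory mapped into our address space */
      size_t  guest_len;
      int     kick_fd;     /* FE -> BE notifications */
      int     call_fd;     /* BE -> FE notifications */
  };

  static void daemon_loop(struct be_handoff *h)
  {
      uint64_t n, one = 1;
      /* Block until the front end kicks us, then service the virtqueues. */
      while (read(h->kick_fd, &n, sizeof(n)) == sizeof(n)) {
          /* ... parse the available ring in h->guest_mem, handle requests ... */
          write(h->call_fd, &one, sizeof(one));   /* signal used buffers back */
      }
  }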

Since that original sketch we've seen an expansion in the ways
backends could be created. There is interest in encapsulating backends
in RTOSes or unikernels for solutions like SCMI. The interest in Rust
has prompted ideas of using the trait interface to abstract differences
away, as well as the idea of bare-metal Rust backends.

We have a card (STR-12) called "Hypercall Standardisation" which
calls for a description of the APIs needed from the hypervisor side to
support VirtIO guests and their backends. However we are some way off
from that at the moment as I think we need to at least demonstrate one
portable backend before we start codifying requirements. To that end I
want to think about what we need for a backend to function.

Configuration
=============

In the type-2 setup this is typically fairly simple because the host
system can orchestrate the various modules that make up the complete
system. In the type-1 case (or even type-2 with delegated service VMs)
we need some sort of mechanism to inform the backend VM of key
details about the system:

  - where virtqueue memory is in its address space
  - how it's going to receive (interrupt) and trigger (kick) events
  - what (if any) resources the backend needs to connect to

Obviously you can sidestep configuration issues by having static
configurations and baking the assumptions into your guest images;
however this isn't scalable in the long term. The obvious solution
seems to be extending a subset of Device Tree data to user space, but
perhaps there are other approaches?
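
As one possible shape for that, a backend could read its configuration
straight out of the flattened tree the kernel already exposes under
/proc/device-tree. The node and property names below are invented for
illustration only:

  #include <stdio.h>
  #include <stdint.h>
  #include <arpa/inet.h>   /* ntohl: DT cells are big-endian */

  /* Read a single 32-bit cell from a hypothetical "virtio-be@0" node. */
  static int dt_read_u32(const char *prop, uint32_t *out)
  {
      char path[256];
      uint32_t cell;
      snprintf(path, sizeof(path),
               "/proc/device-tree/virtio-be@0/%s", prop);
      FILE *f = fopen(path, "rb");
      if (!f || fread(&cell, sizeof(cell), 1, f) != 1) {
          if (f) fclose(f);
          return -1;
      }
      fclose(f);
      *out = ntohl(cell);   /* convert from DT big-endian */
      return 0;
  }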

Before any virtio transactions can take place the appropriate memory
mappings need to be made between the FE (front-end) guest and the BE
(back-end) guest. Currently the whole of the FE guest's address space
needs to be visible to whatever is serving the virtio requests. I can
envision 3 approaches:

 * BE guest boots with memory already mapped

 This would entail the guest OS knowing which parts of its Guest
 Physical Address space are already taken up and avoiding clashes. I
 would assume in this case you would want a standard interface to
 userspace to then make that address space visible to the backend
 daemon.

 * BE guest boots with a hypervisor handle to memory

 The BE guest is then free to map the FE's memory to where it wants in
 the BE's guest physical address space. Activating the mapping will
 require some sort of hypercall to the hypervisor. I can see two options
 at this point:

  - expose the handle to userspace for a daemon/helper to trigger the
    mapping via existing hypercall interfaces. If using a helper you
    would have a hypervisor-specific one to avoid the daemon having to
    care too much about the details, or push that complexity into a
    compile-time option for the daemon, which would result in different
    binaries albeit from a common source base.

  - expose a new kernel ABI to abstract the hypercall differences away
    in the guest kernel. In this case userspace would essentially ask
    for an abstract "map guest N memory to userspace ptr" and let the
    kernel deal with the different hypercall interfaces (see the sketch
    after this list). This of course assumes the majority of BE guests
    would be Linux kernels and leaves the bare-metal/unikernel
    approaches to their own devices.
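
A sketch of what that abstract kernel ABI might look like from
userspace, assuming a hypothetical /dev/guest-mem character device and
ioctl - none of this exists today:

  #include <fcntl.h>
  #include <stdint.h>
  #include <stddef.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  /* Hypothetical request: make guest N's memory mappable, whatever
   * hypercalls that involves underneath (Xen, KVM, ...). */
  struct guest_mem_map {
      uint32_t guest_id;    /* "guest N" */
      uint64_t gpa;         /* guest physical address to map */
      uint64_t len;
  };
  #define GUEST_MEM_MAP _IOW('G', 0x01, struct guest_mem_map)

  static void *map_guest_memory(uint32_t guest_id, uint64_t gpa, uint64_t len)
  {
      int fd = open("/dev/guest-mem", O_RDWR);          /* hypothetical device */
      struct guest_mem_map req = { guest_id, gpa, len };
      if (fd < 0 || ioctl(fd, GUEST_MEM_MAP, &req) < 0)
          return MAP_FAILED;
      /* The kernel hides which hypercall interface backs this mmap(). */
      return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  }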

Operation
=========

The core of the operation of VirtIO is fairly simple. Once the
vhost-user feature negotiation is done it's a case of receiving update
events and parsing the resultant virtqueue for data. The vhost-user
specification handles a bunch of setup before that point, mostly to
detail where the virtqueues are and to set up FDs for memory and event
communication. This is where the envisioned stub process would be
responsible for getting the daemon up and ready to run. This is
currently done inside a big VMM like QEMU, but I suspect a more modern
approach would be to use the rust-vmm vhost crate. It would then either
communicate with the kernel's abstracted ABI or be re-targeted as a
build option for the various hypervisors.
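
For reference, the per-kick work amounts to walking the split-ring
structures from the virtio spec. A simplified C sketch (little-endian
host assumed, memory barriers omitted) of what the common daemon does
on each notification:

  #include <stdint.h>

  /* Split-ring layout as per the virtio spec (simplified). */
  struct vring_desc      { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
  struct vring_avail     { uint16_t flags; uint16_t idx; uint16_t ring[]; };
  struct vring_used_elem { uint32_t id; uint32_t len; };
  struct vring_used      { uint16_t flags; uint16_t idx; struct vring_used_elem ring[]; };

  struct vq {
      unsigned int        num;        /* queue size */
      struct vring_desc  *desc;
      struct vring_avail *avail;
      struct vring_used  *used;
      uint16_t            last_avail; /* how far we got last time */
  };

  static void process_queue(struct vq *q)
  {
      while (q->last_avail != q->avail->idx) {
          uint16_t head = q->avail->ring[q->last_avail++ % q->num];
          /* ... walk the descriptor chain at q->desc[head], translate
           *     guest addresses, perform the actual I/O ... */
          struct vring_used_elem *e = &q->used->ring[q->used->idx % q->num];
          e->id  = head;
          e->len = 0;            /* bytes written back to the guest */
          q->used->idx++;        /* needs a write barrier in real code */
      }
  }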

One question is how to best handle notification and kicks. The existing
vhost-user framework uses eventfd to signal the daemon (although QEMU
is quite capable of simulating them when you use TCG). Xen has its own
IOREQ mechanism. However, latency is an important factor and having
events go through the stub would add quite a lot of it.

Could we consider the kernel internally converting IOREQ messages from
the Xen hypervisor to eventfd events? Would this scale with other kernel
hypercall interfaces?
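
To make the question concrete: from the daemon's point of view such a
bridge could look like a small character-device ABI that binds an IOREQ
(or equivalent) notification source to an eventfd. Everything below is
a hypothetical sketch, not an existing interface:

  #include <stdint.h>
  #include <sys/eventfd.h>
  #include <sys/ioctl.h>

  /* Hypothetical: the kernel registers as the IOREQ server with Xen (or
   * the equivalent on another hypervisor) and signals this eventfd
   * whenever a request for the given queue is pending. */
  struct ioreq_eventfd_bind {
      uint32_t vq_index;   /* which queue's notifications we want */
      int32_t  eventfd;    /* eventfd to signal */
  };
  #define HYP_SET_KICK_FD _IOW('H', 0x20, struct ioreq_eventfd_bind)

  static int bind_kick(int hyp_fd, uint32_t vq_index)
  {
      int efd = eventfd(0, EFD_CLOEXEC);
      struct ioreq_eventfd_bind bind = { vq_index, efd };
      if (efd < 0 || ioctl(hyp_fd, HYP_SET_KICK_FD, &bind) < 0)
          return -1;
      return efd;   /* the daemon can now poll this like any vhost-user kick */
  }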

So any thoughts on what directions are worth experimenting with?

-- 
Alex Bennée


* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-04  9:04 [virtio-dev] Enabling hypervisor agnosticism for VirtIO backends Alex Bennée
@ 2021-08-04 19:20 ` Stefano Stabellini
  2021-08-11  6:27   ` AKASHI Takahiro
                     ` (2 more replies)
  2021-08-05 15:48 ` [virtio-dev] " Stefan Hajnoczi
  2021-08-19  9:11 ` [virtio-dev] " Matias Ezequiel Vara Larsen
  2 siblings, 3 replies; 66+ messages in thread
From: Stefano Stabellini @ 2021-08-04 19:20 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	AKASHI Takahiro, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel


CCing people working on Xen+VirtIO and IOREQs. Not trimming the original
email to let them read the full context.

My comments below are related to a potential Xen implementation, not
because it is the only implementation that matters, but because it is
the one I know best.

Also, please see this relevant email thread:
https://marc.info/?l=xen-devel&m=162373754705233&w=2


On Wed, 4 Aug 2021, Alex Bennée wrote:
> Hi,
> 
> One of the goals of Project Stratos is to enable hypervisor agnostic
> backends so we can enable as much re-use of code as possible and avoid
> repeating ourselves. This is the flip side of the front end where
> multiple front-end implementations are required - one per OS, assuming
> you don't just want Linux guests. The resultant guests are trivially
> movable between hypervisors modulo any abstracted paravirt type
> interfaces.
> 
> In my original thumb nail sketch of a solution I envisioned vhost-user
> daemons running in a broadly POSIX like environment. The interface to
> the daemon is fairly simple requiring only some mapped memory and some
> sort of signalling for events (on Linux this is eventfd). The idea was a
> stub binary would be responsible for any hypervisor specific setup and
> then launch a common binary to deal with the actual virtqueue requests
> themselves.
> 
> Since that original sketch we've seen an expansion in the sort of ways
> backends could be created. There is interest in encapsulating backends
> in RTOSes or unikernels for solutions like SCMI. There interest in Rust
> has prompted ideas of using the trait interface to abstract differences
> away as well as the idea of bare-metal Rust backends.
> 
> We have a card (STR-12) called "Hypercall Standardisation" which
> calls for a description of the APIs needed from the hypervisor side to
> support VirtIO guests and their backends. However we are some way off
> from that at the moment as I think we need to at least demonstrate one
> portable backend before we start codifying requirements. To that end I
> want to think about what we need for a backend to function.
> 
> Configuration
> =============
> 
> In the type-2 setup this is typically fairly simple because the host
> system can orchestrate the various modules that make up the complete
> system. In the type-1 case (or even type-2 with delegated service VMs)
> we need some sort of mechanism to inform the backend VM about key
> details about the system:
> 
>   - where virt queue memory is in it's address space
>   - how it's going to receive (interrupt) and trigger (kick) events
>   - what (if any) resources the backend needs to connect to
> 
> Obviously you can elide over configuration issues by having static
> configurations and baking the assumptions into your guest images however
> this isn't scalable in the long term. The obvious solution seems to be
> extending a subset of Device Tree data to user space but perhaps there
> are other approaches?
> 
> Before any virtio transactions can take place the appropriate memory
> mappings need to be made between the FE guest and the BE guest.

> Currently the whole of the FE guests address space needs to be visible
> to whatever is serving the virtio requests. I can envision 3 approaches:
> 
>  * BE guest boots with memory already mapped
> 
>  This would entail the guest OS knowing where in it's Guest Physical
>  Address space is already taken up and avoiding clashing. I would assume
>  in this case you would want a standard interface to userspace to then
>  make that address space visible to the backend daemon.
> 
>  * BE guests boots with a hypervisor handle to memory
> 
>  The BE guest is then free to map the FE's memory to where it wants in
>  the BE's guest physical address space.

I cannot see how this could work for Xen. There is no "handle" to give
to the backend if the backend is not running in dom0. So for Xen I think
the memory has to be already mapped and the mapping probably done by the
toolstack (also see below.) Or we would have to invent a new Xen
hypervisor interface and Xen virtual machine privileges to allow this
kind of mapping.

If we run the backend in Dom0 then we have no problems, of course.


> To activate the mapping will
>  require some sort of hypercall to the hypervisor. I can see two options
>  at this point:
> 
>   - expose the handle to userspace for daemon/helper to trigger the
>     mapping via existing hypercall interfaces. If using a helper you
>     would have a hypervisor specific one to avoid the daemon having to
>     care too much about the details or push that complexity into a
>     compile time option for the daemon which would result in different
>     binaries although a common source base.
> 
>   - expose a new kernel ABI to abstract the hypercall differences away
>     in the guest kernel. In this case the userspace would essentially
>     ask for an abstract "map guest N memory to userspace ptr" and let
>     the kernel deal with the different hypercall interfaces. This of
>     course assumes the majority of BE guests would be Linux kernels and
>     leaves the bare-metal/unikernel approaches to their own devices.
> 
> Operation
> =========
> 
> The core of the operation of VirtIO is fairly simple. Once the
> vhost-user feature negotiation is done it's a case of receiving update
> events and parsing the resultant virt queue for data. The vhost-user
> specification handles a bunch of setup before that point, mostly to
> detail where the virt queues are set up FD's for memory and event
> communication. This is where the envisioned stub process would be
> responsible for getting the daemon up and ready to run. This is
> currently done inside a big VMM like QEMU but I suspect a modern
> approach would be to use the rust-vmm vhost crate. It would then either
> communicate with the kernel's abstracted ABI or be re-targeted as a
> build option for the various hypervisors.

One thing I mentioned before to Alex is that Xen doesn't have VMMs the
way they are typically envisioned and described in other environments.
Instead, Xen has IOREQ servers. Each of them connects independently to
Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as
emulators for a single Xen VM, each of them connecting to Xen
independently via the IOREQ interface.

The component responsible for starting a daemon and/or setting up shared
interfaces is the toolstack: the xl command and the libxl/libxc
libraries.

Oleksandr and others I CCed have been working on ways for the toolstack
to create virtio backends and setup memory mappings. They might be able
to provide more info on the subject. I do think we miss a way to provide
the configuration to the backend and anything else that the backend
might require to start doing its job.


> One question is how to best handle notification and kicks. The existing
> vhost-user framework uses eventfd to signal the daemon (although QEMU
> is quite capable of simulating them when you use TCG). Xen has it's own
> IOREQ mechanism. However latency is an important factor and having
> events go through the stub would add quite a lot.

Yeah I think, regardless of anything else, we want the backends to
connect directly to the Xen hypervisor.


> Could we consider the kernel internally converting IOREQ messages from
> the Xen hypervisor to eventfd events? Would this scale with other kernel
> hypercall interfaces?
> 
> So any thoughts on what directions are worth experimenting with?
 
One option we should consider is for each backend to connect to Xen via
the IOREQ interface. We could generalize the IOREQ interface and make it
hypervisor agnostic. The interface is really trivial and easy to add.
The only Xen-specific part is the notification mechanism, which is an
event channel. If we replaced the event channel with something else the
interface would be generic. See:
https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52

I don't think that translating IOREQs to eventfd in the kernel is a
good idea: it feels like it would be extra complexity and that the
kernel shouldn't be involved, as this is a backend-hypervisor interface.
Also, eventfd is very Linux-centric and we are trying to design an
interface that could work well for RTOSes too. If we want to do
something different, both OS-agnostic and hypervisor-agnostic, perhaps
we could design a new interface. One that could be implementable in the
Xen hypervisor itself (like IOREQ) and of course any other hypervisor
too.
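
To make that concrete, a hypervisor-agnostic descriptor could plausibly
keep a Xen-like layout and just abstract the notification field. This is
a hypothetical sketch, not the actual contents of ioreq.h:

  #include <stdint.h>

  /* Hypothetical hypervisor-agnostic I/O request, loosely modelled on
   * Xen's ioreq but with the event-channel specifics replaced by an
   * opaque, transport-defined notification handle. */
  struct hyp_ioreq {
      uint64_t addr;         /* guest physical address of the access */
      uint64_t data;         /* data, or guest paddr of data if data_is_ptr */
      uint32_t size;         /* access size in bytes */
      uint32_t count;        /* repeat count */
      uint8_t  dir;          /* 0 = write to device, 1 = read from device */
      uint8_t  data_is_ptr;  /* data field is an address, not a value */
      uint8_t  state;        /* none / ready / in-service / done */
      uint8_t  type;         /* MMIO, PIO, config access, ... */
      uint64_t notify;       /* opaque: event channel on Xen, eventfd elsewhere */
  };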


There is also another problem. IOREQ is probably not the only
interface needed. Have a look at
https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
an interface for the backend to inject interrupts into the frontend? And
if the backend requires dynamic memory mappings of frontend pages, then
we would also need an interface to map/unmap domU pages.

These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
and self-contained. It is easy to add anywhere. A new interface to
inject interrupts or map pages is more difficult to manage because it
would require changes scattered across the various emulators.


* [virtio-dev] Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-04  9:04 [virtio-dev] Enabling hypervisor agnosticism for VirtIO backends Alex Bennée
  2021-08-04 19:20 ` Stefano Stabellini
@ 2021-08-05 15:48 ` Stefan Hajnoczi
  2021-08-19  9:11 ` [virtio-dev] " Matias Ezequiel Vara Larsen
  2 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-08-05 15:48 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	AKASHI Takahiro, Stefano Stabellini, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier


On Wed, Aug 04, 2021 at 10:04:30AM +0100, Alex Bennée wrote:
> Hi,
> 
> One of the goals of Project Stratos is to enable hypervisor agnostic
> backends so we can enable as much re-use of code as possible and avoid
> repeating ourselves. This is the flip side of the front end where
> multiple front-end implementations are required - one per OS, assuming
> you don't just want Linux guests. The resultant guests are trivially
> movable between hypervisors modulo any abstracted paravirt type
> interfaces.
> 
> In my original thumb nail sketch of a solution I envisioned vhost-user
> daemons running in a broadly POSIX like environment. The interface to
> the daemon is fairly simple requiring only some mapped memory and some
> sort of signalling for events (on Linux this is eventfd). The idea was a
> stub binary would be responsible for any hypervisor specific setup and
> then launch a common binary to deal with the actual virtqueue requests
> themselves.
> 
> Since that original sketch we've seen an expansion in the sort of ways
> backends could be created. There is interest in encapsulating backends
> in RTOSes or unikernels for solutions like SCMI. There interest in Rust
> has prompted ideas of using the trait interface to abstract differences
> away as well as the idea of bare-metal Rust backends.
> 
> We have a card (STR-12) called "Hypercall Standardisation" which
> calls for a description of the APIs needed from the hypervisor side to
> support VirtIO guests and their backends. However we are some way off
> from that at the moment as I think we need to at least demonstrate one
> portable backend before we start codifying requirements. To that end I
> want to think about what we need for a backend to function.
> 
> Configuration
> =============
> 
> In the type-2 setup this is typically fairly simple because the host
> system can orchestrate the various modules that make up the complete
> system. In the type-1 case (or even type-2 with delegated service VMs)
> we need some sort of mechanism to inform the backend VM about key
> details about the system:
> 
>   - where virt queue memory is in it's address space
>   - how it's going to receive (interrupt) and trigger (kick) events
>   - what (if any) resources the backend needs to connect to
> 
> Obviously you can elide over configuration issues by having static
> configurations and baking the assumptions into your guest images however
> this isn't scalable in the long term. The obvious solution seems to be
> extending a subset of Device Tree data to user space but perhaps there
> are other approaches?
> 
> Before any virtio transactions can take place the appropriate memory
> mappings need to be made between the FE guest and the BE guest.
> Currently the whole of the FE guests address space needs to be visible
> to whatever is serving the virtio requests. I can envision 3 approaches:
> 
>  * BE guest boots with memory already mapped
> 
>  This would entail the guest OS knowing where in it's Guest Physical
>  Address space is already taken up and avoiding clashing. I would assume
>  in this case you would want a standard interface to userspace to then
>  make that address space visible to the backend daemon.
> 
>  * BE guests boots with a hypervisor handle to memory
> 
>  The BE guest is then free to map the FE's memory to where it wants in
>  the BE's guest physical address space. To activate the mapping will
>  require some sort of hypercall to the hypervisor. I can see two options
>  at this point:
> 
>   - expose the handle to userspace for daemon/helper to trigger the
>     mapping via existing hypercall interfaces. If using a helper you
>     would have a hypervisor specific one to avoid the daemon having to
>     care too much about the details or push that complexity into a
>     compile time option for the daemon which would result in different
>     binaries although a common source base.
> 
>   - expose a new kernel ABI to abstract the hypercall differences away
>     in the guest kernel. In this case the userspace would essentially
>     ask for an abstract "map guest N memory to userspace ptr" and let
>     the kernel deal with the different hypercall interfaces. This of
>     course assumes the majority of BE guests would be Linux kernels and
>     leaves the bare-metal/unikernel approaches to their own devices.

VIRTIO typically uses the vring memory layout but doesn't need to. The
VIRTIO device model deals with virtqueues. The shared memory vring
layout is part of the VIRTIO transport (PCI, MMIO, and CCW use vrings).
Alternative transports with other virtqueue representations are possible
(e.g. VIRTIO-over-TCP). They don't need to involve a BE mapping shared
memory and processing a vring owned by the FE.

For example, there could be BE hypercalls to pop virtqueue elements,
push virtqueue elements, and access buffers (basically DMA
read/write). The FE could either be a traditional virtio-mmio/pci
device with a vring or use FE hypercalls to add available elements to a
virtqueue and get used elements.
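
As a purely illustrative shape for such an interface (these hypercalls
don't exist anywhere today), the BE side might look like:

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical BE-side hypercalls for a transport where the hypervisor
   * owns the virtqueue representation and the BE never maps FE memory. */
  struct vq_elem {
      uint16_t index;      /* element handle to pass back on push */
      uint32_t in_len;     /* bytes the FE asks us to read */
      uint32_t out_len;    /* bytes we may write back */
  };

  /* Pop the next available element, if any. */
  int hyp_vq_pop(uint32_t vq, struct vq_elem *elem);

  /* Copy data between BE buffers and the element (basically DMA r/w). */
  int hyp_vq_read(uint32_t vq, uint16_t elem, uint64_t off, void *buf, size_t len);
  int hyp_vq_write(uint32_t vq, uint16_t elem, uint64_t off, const void *buf, size_t len);

  /* Return the element to the FE as used, with the bytes written. */
  int hyp_vq_push(uint32_t vq, uint16_t elem, uint32_t written);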

I don't know the goals of Project Stratos or whether this helps, but it
might allow other architectures that have different security,
complexity, etc. properties.

Stefan


* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-04 19:20 ` Stefano Stabellini
@ 2021-08-11  6:27   ` AKASHI Takahiro
  2021-08-14 15:37     ` Oleksandr Tyshchenko
       [not found]   ` <0100017b33e585a5-06d4248e-b1a7-485e-800c-7ead89e5f916-000000@email.amazonses.com>
  2021-08-17 10:41     ` [virtio-dev] " Stefan Hajnoczi
  2 siblings, 1 reply; 66+ messages in thread
From: AKASHI Takahiro @ 2021-08-11  6:27 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Alex Bennée, Stratos Mailing List, virtio-dev, Arnd Bergmann,
	Viresh Kumar, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel

On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> CCing people working on Xen+VirtIO and IOREQs. Not trimming the original
> email to let them read the full context.
> 
> My comments below are related to a potential Xen implementation, not
> because it is the only implementation that matters, but because it is
> the one I know best.

Please note that my proposal (and hence the working prototype)[1]
is based on Xen's virtio implementation (i.e. IOREQ) and particularly
EPAM's virtio-disk application (backend server).
It has been, I believe, well generalized but is still a bit biased
toward this original design.

So I hope you like my approach :)

[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000546.html

Let me take this opportunity to explain a bit more about my approach below.

> Also, please see this relevant email thread:
> https://marc.info/?l=xen-devel&m=162373754705233&w=2
> 
> 
> On Wed, 4 Aug 2021, Alex Bennée wrote:
> > Hi,
> > 
> > One of the goals of Project Stratos is to enable hypervisor agnostic
> > backends so we can enable as much re-use of code as possible and avoid
> > repeating ourselves. This is the flip side of the front end where
> > multiple front-end implementations are required - one per OS, assuming
> > you don't just want Linux guests. The resultant guests are trivially
> > movable between hypervisors modulo any abstracted paravirt type
> > interfaces.
> > 
> > In my original thumb nail sketch of a solution I envisioned vhost-user
> > daemons running in a broadly POSIX like environment. The interface to
> > the daemon is fairly simple requiring only some mapped memory and some
> > sort of signalling for events (on Linux this is eventfd). The idea was a
> > stub binary would be responsible for any hypervisor specific setup and
> > then launch a common binary to deal with the actual virtqueue requests
> > themselves.
> > 
> > Since that original sketch we've seen an expansion in the sort of ways
> > backends could be created. There is interest in encapsulating backends
> > in RTOSes or unikernels for solutions like SCMI. There interest in Rust
> > has prompted ideas of using the trait interface to abstract differences
> > away as well as the idea of bare-metal Rust backends.
> > 
> > We have a card (STR-12) called "Hypercall Standardisation" which
> > calls for a description of the APIs needed from the hypervisor side to
> > support VirtIO guests and their backends. However we are some way off
> > from that at the moment as I think we need to at least demonstrate one
> > portable backend before we start codifying requirements. To that end I
> > want to think about what we need for a backend to function.
> > 
> > Configuration
> > =============
> > 
> > In the type-2 setup this is typically fairly simple because the host
> > system can orchestrate the various modules that make up the complete
> > system. In the type-1 case (or even type-2 with delegated service VMs)
> > we need some sort of mechanism to inform the backend VM about key
> > details about the system:
> > 
> >   - where virt queue memory is in it's address space
> >   - how it's going to receive (interrupt) and trigger (kick) events
> >   - what (if any) resources the backend needs to connect to
> > 
> > Obviously you can elide over configuration issues by having static
> > configurations and baking the assumptions into your guest images however
> > this isn't scalable in the long term. The obvious solution seems to be
> > extending a subset of Device Tree data to user space but perhaps there
> > are other approaches?
> > 
> > Before any virtio transactions can take place the appropriate memory
> > mappings need to be made between the FE guest and the BE guest.
> 
> > Currently the whole of the FE guests address space needs to be visible
> > to whatever is serving the virtio requests. I can envision 3 approaches:
> > 
> >  * BE guest boots with memory already mapped
> > 
> >  This would entail the guest OS knowing where in it's Guest Physical
> >  Address space is already taken up and avoiding clashing. I would assume
> >  in this case you would want a standard interface to userspace to then
> >  make that address space visible to the backend daemon.

Yet another way here is that we would have well-known "shared memory" between
VMs. I think that Jailhouse's ivshmem gives us good insights on this matter
and that it can even be an alternative for a hypervisor-agnostic solution.

(Please note memory regions in ivshmem appear as a PCI device and can be
mapped locally.)

I want to add this shared memory aspect to my virtio-proxy, but
the resultant solution would eventually look similar to ivshmem.
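
(For context, consuming such an ivshmem-style region from Linux userspace
is just an mmap() of the PCI BAR exposed through sysfs; the bus address
below is made up:)

  #include <fcntl.h>
  #include <stddef.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* Map BAR2 (the shared-memory region of an ivshmem device); the PCI
   * address 0000:00:04.0 is only an example. */
  static void *map_ivshmem(size_t len)
  {
      int fd = open("/sys/bus/pci/devices/0000:00:04.0/resource2", O_RDWR);
      if (fd < 0)
          return NULL;
      void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      close(fd);                       /* mapping stays valid after close */
      return p == MAP_FAILED ? NULL : p;
  }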

> >  * BE guests boots with a hypervisor handle to memory
> > 
> >  The BE guest is then free to map the FE's memory to where it wants in
> >  the BE's guest physical address space.
> 
> I cannot see how this could work for Xen. There is no "handle" to give
> to the backend if the backend is not running in dom0. So for Xen I think
> the memory has to be already mapped

In Xen's IOREQ solution (virtio-blk), the following information is expected
to be exposed to the BE via Xenstore:
(I know that this is a tentative approach though.)
   - the start address of configuration space
   - interrupt number
   - file path for backing storage
   - read-only flag
And the BE server has to call a particular hypervisor interface to
map the configuration space.
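
A rough sketch of how a BE might pick those keys up with libxenstore;
the exact Xenstore path layout here is my assumption, not a settled
interface:

  #include <stdio.h>
  #include <xenstore.h>   /* libxenstore: xs_open(), xs_read() */

  /* Read one of the tentative per-device keys published by the toolstack. */
  static char *be_read_key(struct xs_handle *xs, int fe_domid, const char *key)
  {
      char path[128];
      unsigned int len;
      snprintf(path, sizeof(path),
               "/local/domain/%d/device/virtio-disk/0/%s", fe_domid, key);
      return xs_read(xs, XBT_NULL, path, &len);   /* caller frees */
  }

  /* e.g.  struct xs_handle *xs = xs_open(0);
   *       char *irq = be_read_key(xs, fe_domid, "interrupt");       */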

In my approach (virtio-proxy), all those Xen- (or hypervisor-)specific
details are contained in virtio-proxy, yet another VM, to hide them all.

# My point is that a "handle" is not mandatory for performing the mapping.

> and the mapping probably done by the
> toolstack (also see below.) Or we would have to invent a new Xen
> hypervisor interface and Xen virtual machine privileges to allow this
> kind of mapping.

> If we run the backend in Dom0 that we have no problems of course.

One of the difficulties on Xen that I found in my approach is that calling
such hypervisor interfaces (registering IOREQ, mapping memory) is only
allowed from the BE servers themselves, and so we will have to extend those
interfaces.
This, however, will raise some concerns about security and privilege
distribution, as Stefan suggested.
> 
> 
> > To activate the mapping will
> >  require some sort of hypercall to the hypervisor. I can see two options
> >  at this point:
> > 
> >   - expose the handle to userspace for daemon/helper to trigger the
> >     mapping via existing hypercall interfaces. If using a helper you
> >     would have a hypervisor specific one to avoid the daemon having to
> >     care too much about the details or push that complexity into a
> >     compile time option for the daemon which would result in different
> >     binaries although a common source base.
> > 
> >   - expose a new kernel ABI to abstract the hypercall differences away
> >     in the guest kernel. In this case the userspace would essentially
> >     ask for an abstract "map guest N memory to userspace ptr" and let
> >     the kernel deal with the different hypercall interfaces. This of
> >     course assumes the majority of BE guests would be Linux kernels and
> >     leaves the bare-metal/unikernel approaches to their own devices.
> > 
> > Operation
> > =========
> > 
> > The core of the operation of VirtIO is fairly simple. Once the
> > vhost-user feature negotiation is done it's a case of receiving update
> > events and parsing the resultant virt queue for data. The vhost-user
> > specification handles a bunch of setup before that point, mostly to
> > detail where the virt queues are set up FD's for memory and event
> > communication. This is where the envisioned stub process would be
> > responsible for getting the daemon up and ready to run. This is
> > currently done inside a big VMM like QEMU but I suspect a modern
> > approach would be to use the rust-vmm vhost crate. It would then either
> > communicate with the kernel's abstracted ABI or be re-targeted as a
> > build option for the various hypervisors.
> 
> One thing I mentioned before to Alex is that Xen doesn't have VMMs the
> way they are typically envisioned and described in other environments.
> Instead, Xen has IOREQ servers. Each of them connects independently to
> Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as
> emulators for a single Xen VM, each of them connecting to Xen
> independently via the IOREQ interface.
> 
> The component responsible for starting a daemon and/or setting up shared
> interfaces is the toolstack: the xl command and the libxl/libxc
> libraries.

I think that VM configuration management (or orchestration in Stratos
jargon?) is a subject to be debated in parallel.
Otherwise, is there any good assumption that would let us avoid it right now?

> Oleksandr and others I CCed have been working on ways for the toolstack
> to create virtio backends and setup memory mappings. They might be able
> to provide more info on the subject. I do think we miss a way to provide
> the configuration to the backend and anything else that the backend
> might require to start doing its job.
> 
> 
> > One question is how to best handle notification and kicks. The existing
> > vhost-user framework uses eventfd to signal the daemon (although QEMU
> > is quite capable of simulating them when you use TCG). Xen has it's own
> > IOREQ mechanism. However latency is an important factor and having
> > events go through the stub would add quite a lot.
> 
> Yeah I think, regardless of anything else, we want the backends to
> connect directly to the Xen hypervisor.

In my approach,
 a) BE -> FE: interrupts triggered by the BE calling a hypervisor interface
              via virtio-proxy
 b) FE -> BE: MMIO to the config space raises events (in event channels),
              which are converted to a callback to the BE via virtio-proxy
              (Xen's event channel is internally implemented by interrupts.)

I don't know what "connect directly" means here, but sending interrupts
to the opposite side would be the most efficient.
Ivshmem, I suppose, takes this approach by utilizing PCI's MSI-X mechanism.

> 
> > Could we consider the kernel internally converting IOREQ messages from
> > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > hypercall interfaces?
> > 
> > So any thoughts on what directions are worth experimenting with?
>  
> One option we should consider is for each backend to connect to Xen via
> the IOREQ interface. We could generalize the IOREQ interface and make it
> hypervisor agnostic. The interface is really trivial and easy to add.

As I said above, my proposal does the same thing that you mentioned here :)
The difference is that I do call hypervisor interfaces via virtio-proxy.

> The only Xen-specific part is the notification mechanism, which is an
> event channel. If we replaced the event channel with something else the
> interface would be generic. See:
> https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> 
> I don't think that translating IOREQs to eventfd in the kernel is a
> good idea: if feels like it would be extra complexity and that the
> kernel shouldn't be involved as this is a backend-hypervisor interface.

Given that we may want to implement the BE as a bare-metal application,
as I did on Zephyr, I don't think that the translation would be
a big issue, especially on RTOSes.
It would be some kind of abstraction layer for interrupt handling
(or nothing but a callback mechanism).
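
i.e. something as small as the following illustrative sketch (not a
defined interface) would do:

  /* Minimal OS-agnostic notification abstraction: the hypervisor-specific
   * layer (IOREQ/event channel, eventfd, bare-metal IRQ, ...) implements
   * these hooks and the portable BE core only ever sees callbacks. */
  struct be_notify_ops {
      /* Called by the BE core to kick the front end. */
      void (*notify_fe)(void *ctx, unsigned int vq);
      /* Supplied by the BE core; invoked when the front end kicks us. */
      void (*on_fe_kick)(void *ctx, unsigned int vq);
  };

  int be_notify_register(const struct be_notify_ops *ops, void *ctx);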

> Also, eventfd is very Linux-centric and we are trying to design an
> interface that could work well for RTOSes too. If we want to do
> something different, both OS-agnostic and hypervisor-agnostic, perhaps
> we could design a new interface. One that could be implementable in the
> Xen hypervisor itself (like IOREQ) and of course any other hypervisor
> too.
> 
> 
> There is also another problem. IOREQ is probably not be the only
> interface needed. Have a look at
> https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> an interface for the backend to inject interrupts into the frontend? And
> if the backend requires dynamic memory mappings of frontend pages, then
> we would also need an interface to map/unmap domU pages.

My proposal document might help here; all the interfaces required for
virtio-proxy (or hypervisor-related interfaces) are listed as
RPC protocols :)

> These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> and self-contained. It is easy to add anywhere. A new interface to
> inject interrupts or map pages is more difficult to manage because it
> would require changes scattered across the various emulators.

Exactly. I am not yet confident that my approach will also apply
to hypervisors other than Xen.
Technically, yes, but whether people can accept it or not is a different
matter.

Thanks,
-Takahiro Akashi




* Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
       [not found]   ` <0100017b33e585a5-06d4248e-b1a7-485e-800c-7ead89e5f916-000000@email.amazonses.com>
@ 2021-08-12  7:55     ` François Ozog
  2021-08-13  5:10       ` AKASHI Takahiro
  0 siblings, 1 reply; 66+ messages in thread
From: François Ozog @ 2021-08-12  7:55 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Stefano Stabellini, paul, Stratos Mailing List, virtio-dev,
	Jan Kiszka, Arnd Bergmann, jgross, julien, Carl van Schaik,
	Bertrand.Marquis, stefanha, Artem_Mygaiev, xen-devel, olekstysh,
	Oleksandr_Tyshchenko


I top post as I find it difficult to identify where to make the comments.

1) BE acceleration
Network and storage backends may actually be executed in SmartNICs. As
virtio 1.1 is hardware friendly, there may be SmartNICs with virtio 1.1 PCI
VFs. Is it a valid use case for the generic BE framework to be used in this
context?
DPDK is used in some BEs to significantly accelerate switching. DPDK is also
sometimes used in guests. In that case, there is no event injection, just a
high-performance memory scheme. Is this considered a use case?

2) Virtio as OS HAL
Panasonic's CTO has been calling for a virtio-based HAL and, based on the
teachings of Google's GKI, an internal HAL seems inevitable in the long term.
Virtio is then a contender to the Google-promoted Android HAL. Could the
framework be used in that context?

On Wed, 11 Aug 2021 at 08:28, AKASHI Takahiro via Stratos-dev <
stratos-dev@op-lists.linaro.org> wrote:

> On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > CCing people working on Xen+VirtIO and IOREQs. Not trimming the original
> > email to let them read the full context.
> >
> > My comments below are related to a potential Xen implementation, not
> > because it is the only implementation that matters, but because it is
> > the one I know best.
>
> Please note that my proposal (and hence the working prototype)[1]
> is based on Xen's virtio implementation (i.e. IOREQ) and particularly
> EPAM's virtio-disk application (backend server).
> It has been, I believe, well generalized but is still a bit biased
> toward this original design.
>
> So I hope you like my approach :)
>
> [1]
> https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000546.html
>
> Let me take this opportunity to explain a bit more about my approach below.
>
> > Also, please see this relevant email thread:
> > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> >
> >
> > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > Hi,
> > >
> > > One of the goals of Project Stratos is to enable hypervisor agnostic
> > > backends so we can enable as much re-use of code as possible and avoid
> > > repeating ourselves. This is the flip side of the front end where
> > > multiple front-end implementations are required - one per OS, assuming
> > > you don't just want Linux guests. The resultant guests are trivially
> > > movable between hypervisors modulo any abstracted paravirt type
> > > interfaces.
> > >
> > > In my original thumb nail sketch of a solution I envisioned vhost-user
> > > daemons running in a broadly POSIX like environment. The interface to
> > > the daemon is fairly simple requiring only some mapped memory and some
> > > sort of signalling for events (on Linux this is eventfd). The idea was
> a
> > > stub binary would be responsible for any hypervisor specific setup and
> > > then launch a common binary to deal with the actual virtqueue requests
> > > themselves.
> > >
> > > Since that original sketch we've seen an expansion in the sort of ways
> > > backends could be created. There is interest in encapsulating backends
> > > in RTOSes or unikernels for solutions like SCMI. There interest in Rust
> > > has prompted ideas of using the trait interface to abstract differences
> > > away as well as the idea of bare-metal Rust backends.
> > >
> > > We have a card (STR-12) called "Hypercall Standardisation" which
> > > calls for a description of the APIs needed from the hypervisor side to
> > > support VirtIO guests and their backends. However we are some way off
> > > from that at the moment as I think we need to at least demonstrate one
> > > portable backend before we start codifying requirements. To that end I
> > > want to think about what we need for a backend to function.
> > >
> > > Configuration
> > > =============
> > >
> > > In the type-2 setup this is typically fairly simple because the host
> > > system can orchestrate the various modules that make up the complete
> > > system. In the type-1 case (or even type-2 with delegated service VMs)
> > > we need some sort of mechanism to inform the backend VM about key
> > > details about the system:
> > >
> > >   - where virt queue memory is in it's address space
> > >   - how it's going to receive (interrupt) and trigger (kick) events
> > >   - what (if any) resources the backend needs to connect to
> > >
> > > Obviously you can elide over configuration issues by having static
> > > configurations and baking the assumptions into your guest images
> however
> > > this isn't scalable in the long term. The obvious solution seems to be
> > > extending a subset of Device Tree data to user space but perhaps there
> > > are other approaches?
> > >
> > > Before any virtio transactions can take place the appropriate memory
> > > mappings need to be made between the FE guest and the BE guest.
> >
> > > Currently the whole of the FE guests address space needs to be visible
> > > to whatever is serving the virtio requests. I can envision 3
> approaches:
> > >
> > >  * BE guest boots with memory already mapped
> > >
> > >  This would entail the guest OS knowing where in it's Guest Physical
> > >  Address space is already taken up and avoiding clashing. I would
> assume
> > >  in this case you would want a standard interface to userspace to then
> > >  make that address space visible to the backend daemon.
>
> Yet another way here is that we would have well known "shared memory"
> between
> VMs. I think that Jailhouse's ivshmem gives us good insights on this matter
> and that it can even be an alternative for hypervisor-agnostic solution.
>
> (Please note memory regions in ivshmem appear as a PCI device and can be
> mapped locally.)
>
> I want to add this shared memory aspect to my virtio-proxy, but
> the resultant solution would eventually look similar to ivshmem.
>
> > >  * BE guests boots with a hypervisor handle to memory
> > >
> > >  The BE guest is then free to map the FE's memory to where it wants in
> > >  the BE's guest physical address space.
> >
> > I cannot see how this could work for Xen. There is no "handle" to give
> > to the backend if the backend is not running in dom0. So for Xen I think
> > the memory has to be already mapped
>
> In Xen's IOREQ solution (virtio-blk), the following information is expected
> to be exposed to BE via Xenstore:
> (I know that this is a tentative approach though.)
>    - the start address of configuration space
>    - interrupt number
>    - file path for backing storage
>    - read-only flag
> And the BE server have to call a particular hypervisor interface to
> map the configuration space.
>
> In my approach (virtio-proxy), all those Xen (or hypervisor)-specific
> stuffs are contained in virtio-proxy, yet another VM, to hide all details.
>
> # My point is that a "handle" is not mandatory for executing mapping.
>
> > and the mapping probably done by the
> > toolstack (also see below.) Or we would have to invent a new Xen
> > hypervisor interface and Xen virtual machine privileges to allow this
> > kind of mapping.
>
> > If we run the backend in Dom0 that we have no problems of course.
>
> One of difficulties on Xen that I found in my approach is that calling
> such hypervisor intefaces (registering IOREQ, mapping memory) is only
> allowed on BE servers themselvies and so we will have to extend those
> interfaces.
> This, however, will raise some concern on security and privilege
> distribution
> as Stefan suggested.
> >
> >
> > > To activate the mapping will
> > >  require some sort of hypercall to the hypervisor. I can see two
> options
> > >  at this point:
> > >
> > >   - expose the handle to userspace for daemon/helper to trigger the
> > >     mapping via existing hypercall interfaces. If using a helper you
> > >     would have a hypervisor specific one to avoid the daemon having to
> > >     care too much about the details or push that complexity into a
> > >     compile time option for the daemon which would result in different
> > >     binaries although a common source base.
> > >
> > >   - expose a new kernel ABI to abstract the hypercall differences away
> > >     in the guest kernel. In this case the userspace would essentially
> > >     ask for an abstract "map guest N memory to userspace ptr" and let
> > >     the kernel deal with the different hypercall interfaces. This of
> > >     course assumes the majority of BE guests would be Linux kernels and
> > >     leaves the bare-metal/unikernel approaches to their own devices.
> > >
> > > Operation
> > > =========
> > >
> > > The core of the operation of VirtIO is fairly simple. Once the
> > > vhost-user feature negotiation is done it's a case of receiving update
> > > events and parsing the resultant virt queue for data. The vhost-user
> > > specification handles a bunch of setup before that point, mostly to
> > > detail where the virt queues are set up FD's for memory and event
> > > communication. This is where the envisioned stub process would be
> > > responsible for getting the daemon up and ready to run. This is
> > > currently done inside a big VMM like QEMU but I suspect a modern
> > > approach would be to use the rust-vmm vhost crate. It would then either
> > > communicate with the kernel's abstracted ABI or be re-targeted as a
> > > build option for the various hypervisors.
> >
> > One thing I mentioned before to Alex is that Xen doesn't have VMMs the
> > way they are typically envisioned and described in other environments.
> > Instead, Xen has IOREQ servers. Each of them connects independently to
> > Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as
> > emulators for a single Xen VM, each of them connecting to Xen
> > independently via the IOREQ interface.
> >
> > The component responsible for starting a daemon and/or setting up shared
> > interfaces is the toolstack: the xl command and the libxl/libxc
> > libraries.
>
> I think that VM configuration management (or orchestration in Startos
> jargon?) is a subject to debate in parallel.
> Otherwise, is there any good assumption to avoid it right now?
>
> > Oleksandr and others I CCed have been working on ways for the toolstack
> > to create virtio backends and setup memory mappings. They might be able
> > to provide more info on the subject. I do think we miss a way to provide
> > the configuration to the backend and anything else that the backend
> > might require to start doing its job.
> >
> >
> > > One question is how to best handle notification and kicks. The existing
> > > vhost-user framework uses eventfd to signal the daemon (although QEMU
> > > is quite capable of simulating them when you use TCG). Xen has it's own
> > > IOREQ mechanism. However latency is an important factor and having
> > > events go through the stub would add quite a lot.
> >
> > Yeah I think, regardless of anything else, we want the backends to
> > connect directly to the Xen hypervisor.
>
> In my approach,
>  a) BE -> FE: interrupts triggered by BE calling a hypervisor interface
>               via virtio-proxy
>  b) FE -> BE: MMIO to config raises events (in event channels), which is
>               converted to a callback to BE via virtio-proxy
>               (Xen's event channel is internnally implemented by
> interrupts.)
>
> I don't know what "connect directly" means here, but sending interrupts
> to the opposite side would be best efficient.
> Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x mechanism.
>
> >
> > > Could we consider the kernel internally converting IOREQ messages from
> > > the Xen hypervisor to eventfd events? Would this scale with other
> kernel
> > > hypercall interfaces?
> > >
> > > So any thoughts on what directions are worth experimenting with?
> >
> > One option we should consider is for each backend to connect to Xen via
> > the IOREQ interface. We could generalize the IOREQ interface and make it
> > hypervisor agnostic. The interface is really trivial and easy to add.
>
> As I said above, my proposal does the same thing that you mentioned here :)
> The difference is that I do call hypervisor interfaces via virtio-proxy.
>
> > The only Xen-specific part is the notification mechanism, which is an
> > event channel. If we replaced the event channel with something else the
> > interface would be generic. See:
> >
> https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> >
> > I don't think that translating IOREQs to eventfd in the kernel is a
> > good idea: if feels like it would be extra complexity and that the
> > kernel shouldn't be involved as this is a backend-hypervisor interface.
>
> Given that we may want to implement BE as a bare-metal application
> as I did on Zephyr, I don't think that the translation would not be
> a big issue, especially on RTOS's.
> It will be some kind of abstraction layer of interrupt handling
> (or nothing but a callback mechanism).
>
> > Also, eventfd is very Linux-centric and we are trying to design an
> > interface that could work well for RTOSes too. If we want to do
> > something different, both OS-agnostic and hypervisor-agnostic, perhaps
> > we could design a new interface. One that could be implementable in the
> > Xen hypervisor itself (like IOREQ) and of course any other hypervisor
> > too.
> >
> >
> > There is also another problem. IOREQ is probably not be the only
> > interface needed. Have a look at
> > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> > an interface for the backend to inject interrupts into the frontend? And
> > if the backend requires dynamic memory mappings of frontend pages, then
> > we would also need an interface to map/unmap domU pages.
>
> My proposal document might help here; All the interfaces required for
> virtio-proxy (or hypervisor-related interfaces) are listed as
> RPC protocols :)
>
> > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> > and self-contained. It is easy to add anywhere. A new interface to
> > inject interrupts or map pages is more difficult to manage because it
> > would require changes scattered across the various emulators.
>
> Exactly. I have no confident yet that my approach will also apply
> to other hypervisors than Xen.
> Technically, yes, but whether people can accept it or not is a different
> matter.
>
> Thanks,
> -Takahiro Akashi
>
> --
> Stratos-dev mailing list
> Stratos-dev@op-lists.linaro.org
> https://op-lists.linaro.org/mailman/listinfo/stratos-dev
>


-- 
François-Frédéric Ozog | *Director Business Development*
T: +33.67221.6485
francois.ozog@linaro.org | Skype: ffozog


* Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
  2021-08-12  7:55     ` [Stratos-dev] " François Ozog
@ 2021-08-13  5:10       ` AKASHI Takahiro
  2021-09-01  8:57           ` [virtio-dev] " Alex Bennée
  0 siblings, 1 reply; 66+ messages in thread
From: AKASHI Takahiro @ 2021-08-13  5:10 UTC (permalink / raw)
  To: François Ozog
  Cc: Stefano Stabellini, paul, Stratos Mailing List, virtio-dev,
	Jan Kiszka, Arnd Bergmann, jgross, julien, Carl van Schaik,
	Bertrand.Marquis, stefanha, Artem_Mygaiev, xen-devel, olekstysh,
	Oleksandr_Tyshchenko

Hi François,

On Thu, Aug 12, 2021 at 09:55:52AM +0200, François Ozog wrote:
> I top post as I find it difficult to identify where to make the comments.

Thank you for the posting.
I think that we should first discuss the goals/requirements/practical
use cases for the framework in more detail.

> 1) BE acceleration
> Network and storage backends may actually be executed in SmartNICs. As
> virtio 1.1 is hardware friendly, there may be SmartNICs with virtio 1.1 PCI
> VFs. Is it a valid use case for the generic BE framework to be used in this
> context?
> DPDK is used in some BE to significantly accelerate switching. DPDK is also
> used sometimes in guests. In that case, there are no event injection but
> just high performance memory scheme. Is this considered as a use case?

I'm not very familiar with DPDK, but it seems to rely heavily not only
on virtqueues but also on KVM/Linux features/functionality, judging,
say, by [1].
I'm afraid that DPDK is not suitable as the primary (at least, initial)
target use case.
# In my proposal, virtio-proxy, I have in mind the assumption that we would
# create the BE VM as a bare-metal application on an RTOS (and/or unikernel.)

But as far as virtqueues are concerned, I think we can discuss the general
technical details, as Alex suggested, including:
- sharing or mapping memory regions for the data payload
- an efficient notification mechanism

[1] https://www.redhat.com/en/blog/journey-vhost-users-realm

> 2) Virtio as OS HAL
> Panasonic CTO has been calling for a virtio based HAL and based on the
> teachings of Google GKI, an internal HAL seem inevitable in the long term.
> Virtio is then a contender to Google promoted Android HAL. Could the
> framework be used in that context?

In this case, where will the implementation of "HAL" reside?
I don't think the portability of "HAL" code (as a set of virtio BEs)
is a requirement here.

-Takahiro Akashi

> On Wed, 11 Aug 2021 at 08:28, AKASHI Takahiro via Stratos-dev <
> stratos-dev@op-lists.linaro.org> wrote:
> 
> > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > > CCing people working on Xen+VirtIO and IOREQs. Not trimming the original
> > > email to let them read the full context.
> > >
> > > My comments below are related to a potential Xen implementation, not
> > > because it is the only implementation that matters, but because it is
> > > the one I know best.
> >
> > Please note that my proposal (and hence the working prototype)[1]
> > is based on Xen's virtio implementation (i.e. IOREQ) and particularly
> > EPAM's virtio-disk application (backend server).
> > It has been, I believe, well generalized but is still a bit biased
> > toward this original design.
> >
> > So I hope you like my approach :)
> >
> > [1]
> > https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000546.html
> >
> > Let me take this opportunity to explain a bit more about my approach below.
> >
> > > Also, please see this relevant email thread:
> > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > >
> > >
> > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > Hi,
> > > >
> > > > One of the goals of Project Stratos is to enable hypervisor agnostic
> > > > backends so we can enable as much re-use of code as possible and avoid
> > > > repeating ourselves. This is the flip side of the front end where
> > > > multiple front-end implementations are required - one per OS, assuming
> > > > you don't just want Linux guests. The resultant guests are trivially
> > > > movable between hypervisors modulo any abstracted paravirt type
> > > > interfaces.
> > > >
> > > > In my original thumb nail sketch of a solution I envisioned vhost-user
> > > > daemons running in a broadly POSIX like environment. The interface to
> > > > the daemon is fairly simple requiring only some mapped memory and some
> > > > sort of signalling for events (on Linux this is eventfd). The idea was
> > a
> > > > stub binary would be responsible for any hypervisor specific setup and
> > > > then launch a common binary to deal with the actual virtqueue requests
> > > > themselves.
> > > >
> > > > Since that original sketch we've seen an expansion in the sort of ways
> > > > backends could be created. There is interest in encapsulating backends
> > > > in RTOSes or unikernels for solutions like SCMI. There interest in Rust
> > > > has prompted ideas of using the trait interface to abstract differences
> > > > away as well as the idea of bare-metal Rust backends.
> > > >
> > > > We have a card (STR-12) called "Hypercall Standardisation" which
> > > > calls for a description of the APIs needed from the hypervisor side to
> > > > support VirtIO guests and their backends. However we are some way off
> > > > from that at the moment as I think we need to at least demonstrate one
> > > > portable backend before we start codifying requirements. To that end I
> > > > want to think about what we need for a backend to function.
> > > >
> > > > Configuration
> > > > =============
> > > >
> > > > In the type-2 setup this is typically fairly simple because the host
> > > > system can orchestrate the various modules that make up the complete
> > > > system. In the type-1 case (or even type-2 with delegated service VMs)
> > > > we need some sort of mechanism to inform the backend VM about key
> > > > details about the system:
> > > >
> > > >   - where virt queue memory is in it's address space
> > > >   - how it's going to receive (interrupt) and trigger (kick) events
> > > >   - what (if any) resources the backend needs to connect to
> > > >
> > > > Obviously you can elide over configuration issues by having static
> > > > configurations and baking the assumptions into your guest images
> > however
> > > > this isn't scalable in the long term. The obvious solution seems to be
> > > > extending a subset of Device Tree data to user space but perhaps there
> > > > are other approaches?
> > > >
> > > > Before any virtio transactions can take place the appropriate memory
> > > > mappings need to be made between the FE guest and the BE guest.
> > >
> > > > Currently the whole of the FE guests address space needs to be visible
> > > > to whatever is serving the virtio requests. I can envision 3
> > approaches:
> > > >
> > > >  * BE guest boots with memory already mapped
> > > >
> > > >  This would entail the guest OS knowing where in it's Guest Physical
> > > >  Address space is already taken up and avoiding clashing. I would
> > assume
> > > >  in this case you would want a standard interface to userspace to then
> > > >  make that address space visible to the backend daemon.
> >
> > Yet another way here is that we would have well known "shared memory"
> > between
> > VMs. I think that Jailhouse's ivshmem gives us good insights on this matter
> > and that it can even be an alternative for hypervisor-agnostic solution.
> >
> > (Please note memory regions in ivshmem appear as a PCI device and can be
> > mapped locally.)
> >
> > I want to add this shared memory aspect to my virtio-proxy, but
> > the resultant solution would eventually look similar to ivshmem.
> >
> > > >  * BE guests boots with a hypervisor handle to memory
> > > >
> > > >  The BE guest is then free to map the FE's memory to where it wants in
> > > >  the BE's guest physical address space.
> > >
> > > I cannot see how this could work for Xen. There is no "handle" to give
> > > to the backend if the backend is not running in dom0. So for Xen I think
> > > the memory has to be already mapped
> >
> > In Xen's IOREQ solution (virtio-blk), the following information is expected
> > to be exposed to BE via Xenstore:
> > (I know that this is a tentative approach though.)
> >    - the start address of configuration space
> >    - interrupt number
> >    - file path for backing storage
> >    - read-only flag
> > And the BE server have to call a particular hypervisor interface to
> > map the configuration space.
> >
> > In my approach (virtio-proxy), all those Xen (or hypervisor)-specific
> > stuffs are contained in virtio-proxy, yet another VM, to hide all details.
> >
> > # My point is that a "handle" is not mandatory for executing mapping.
> >
> > > and the mapping probably done by the
> > > toolstack (also see below.) Or we would have to invent a new Xen
> > > hypervisor interface and Xen virtual machine privileges to allow this
> > > kind of mapping.
> >
> > > If we run the backend in Dom0 that we have no problems of course.
> >
> > One of difficulties on Xen that I found in my approach is that calling
> > such hypervisor intefaces (registering IOREQ, mapping memory) is only
> > allowed on BE servers themselvies and so we will have to extend those
> > interfaces.
> > This, however, will raise some concern on security and privilege
> > distribution
> > as Stefan suggested.
> > >
> > >
> > > > To activate the mapping will
> > > >  require some sort of hypercall to the hypervisor. I can see two
> > options
> > > >  at this point:
> > > >
> > > >   - expose the handle to userspace for daemon/helper to trigger the
> > > >     mapping via existing hypercall interfaces. If using a helper you
> > > >     would have a hypervisor specific one to avoid the daemon having to
> > > >     care too much about the details or push that complexity into a
> > > >     compile time option for the daemon which would result in different
> > > >     binaries although a common source base.
> > > >
> > > >   - expose a new kernel ABI to abstract the hypercall differences away
> > > >     in the guest kernel. In this case the userspace would essentially
> > > >     ask for an abstract "map guest N memory to userspace ptr" and let
> > > >     the kernel deal with the different hypercall interfaces. This of
> > > >     course assumes the majority of BE guests would be Linux kernels and
> > > >     leaves the bare-metal/unikernel approaches to their own devices.
> > > >
> > > > Operation
> > > > =========
> > > >
> > > > The core of the operation of VirtIO is fairly simple. Once the
> > > > vhost-user feature negotiation is done it's a case of receiving update
> > > > events and parsing the resultant virt queue for data. The vhost-user
> > > > specification handles a bunch of setup before that point, mostly to
> > > > detail where the virt queues are set up FD's for memory and event
> > > > communication. This is where the envisioned stub process would be
> > > > responsible for getting the daemon up and ready to run. This is
> > > > currently done inside a big VMM like QEMU but I suspect a modern
> > > > approach would be to use the rust-vmm vhost crate. It would then either
> > > > communicate with the kernel's abstracted ABI or be re-targeted as a
> > > > build option for the various hypervisors.
> > >
> > > One thing I mentioned before to Alex is that Xen doesn't have VMMs the
> > > way they are typically envisioned and described in other environments.
> > > Instead, Xen has IOREQ servers. Each of them connects independently to
> > > Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as
> > > emulators for a single Xen VM, each of them connecting to Xen
> > > independently via the IOREQ interface.
> > >
> > > The component responsible for starting a daemon and/or setting up shared
> > > interfaces is the toolstack: the xl command and the libxl/libxc
> > > libraries.
> >
> > I think that VM configuration management (or orchestration in Startos
> > jargon?) is a subject to debate in parallel.
> > Otherwise, is there any good assumption to avoid it right now?
> >
> > > Oleksandr and others I CCed have been working on ways for the toolstack
> > > to create virtio backends and setup memory mappings. They might be able
> > > to provide more info on the subject. I do think we miss a way to provide
> > > the configuration to the backend and anything else that the backend
> > > might require to start doing its job.
> > >
> > >
> > > > One question is how to best handle notification and kicks. The existing
> > > > vhost-user framework uses eventfd to signal the daemon (although QEMU
> > > > is quite capable of simulating them when you use TCG). Xen has it's own
> > > > IOREQ mechanism. However latency is an important factor and having
> > > > events go through the stub would add quite a lot.
> > >
> > > Yeah I think, regardless of anything else, we want the backends to
> > > connect directly to the Xen hypervisor.
> >
> > In my approach,
> >  a) BE -> FE: interrupts triggered by BE calling a hypervisor interface
> >               via virtio-proxy
> >  b) FE -> BE: MMIO to config raises events (in event channels), which is
> >               converted to a callback to BE via virtio-proxy
> >               (Xen's event channel is internnally implemented by
> > interrupts.)
> >
> > I don't know what "connect directly" means here, but sending interrupts
> > to the opposite side would be best efficient.
> > Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x mechanism.
> >
> > >
> > > > Could we consider the kernel internally converting IOREQ messages from
> > > > the Xen hypervisor to eventfd events? Would this scale with other
> > kernel
> > > > hypercall interfaces?
> > > >
> > > > So any thoughts on what directions are worth experimenting with?
> > >
> > > One option we should consider is for each backend to connect to Xen via
> > > the IOREQ interface. We could generalize the IOREQ interface and make it
> > > hypervisor agnostic. The interface is really trivial and easy to add.
> >
> > As I said above, my proposal does the same thing that you mentioned here :)
> > The difference is that I do call hypervisor interfaces via virtio-proxy.
> >
> > > The only Xen-specific part is the notification mechanism, which is an
> > > event channel. If we replaced the event channel with something else the
> > > interface would be generic. See:
> > >
> > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > >
> > > I don't think that translating IOREQs to eventfd in the kernel is a
> > > good idea: if feels like it would be extra complexity and that the
> > > kernel shouldn't be involved as this is a backend-hypervisor interface.
> >
> > Given that we may want to implement BE as a bare-metal application
> > as I did on Zephyr, I don't think that the translation would not be
> > a big issue, especially on RTOS's.
> > It will be some kind of abstraction layer of interrupt handling
> > (or nothing but a callback mechanism).
> >
> > > Also, eventfd is very Linux-centric and we are trying to design an
> > > interface that could work well for RTOSes too. If we want to do
> > > something different, both OS-agnostic and hypervisor-agnostic, perhaps
> > > we could design a new interface. One that could be implementable in the
> > > Xen hypervisor itself (like IOREQ) and of course any other hypervisor
> > > too.
> > >
> > >
> > > There is also another problem. IOREQ is probably not be the only
> > > interface needed. Have a look at
> > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> > > an interface for the backend to inject interrupts into the frontend? And
> > > if the backend requires dynamic memory mappings of frontend pages, then
> > > we would also need an interface to map/unmap domU pages.
> >
> > My proposal document might help here; All the interfaces required for
> > virtio-proxy (or hypervisor-related interfaces) are listed as
> > RPC protocols :)
> >
> > > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> > > and self-contained. It is easy to add anywhere. A new interface to
> > > inject interrupts or map pages is more difficult to manage because it
> > > would require changes scattered across the various emulators.
> >
> > Exactly. I have no confident yet that my approach will also apply
> > to other hypervisors than Xen.
> > Technically, yes, but whether people can accept it or not is a different
> > matter.
> >
> > Thanks,
> > -Takahiro Akashi
> >
> > --
> > Stratos-dev mailing list
> > Stratos-dev@op-lists.linaro.org
> > https://op-lists.linaro.org/mailman/listinfo/stratos-dev
> >
> 
> 
> -- 
> François-Frédéric Ozog | *Director Business Development*
> T: +33.67221.6485
> francois.ozog@linaro.org | Skype: ffozog



* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-11  6:27   ` AKASHI Takahiro
@ 2021-08-14 15:37     ` Oleksandr Tyshchenko
  2021-08-16 10:04       ` Wei Chen
  0 siblings, 1 reply; 66+ messages in thread
From: Oleksandr Tyshchenko @ 2021-08-14 15:37 UTC (permalink / raw)
  To: AKASHI Takahiro, Stefano Stabellini
  Cc: Alex Bennée, Stratos Mailing List, virtio-dev, Arnd Bergmann,
	Viresh Kumar, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei Chen,
	Oleksandr Tyshchenko, Bertrand Marquis, Artem Mygaiev,
	Julien Grall, Juergen Gross, Paul Durrant, Xen Devel


Hello, all.

Please see some comments below. And sorry for the possible format issues.

On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro <takahiro.akashi@linaro.org>
wrote:

> On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > CCing people working on Xen+VirtIO and IOREQs. Not trimming the original
> > email to let them read the full context.
> >
> > My comments below are related to a potential Xen implementation, not
> > because it is the only implementation that matters, but because it is
> > the one I know best.
>
> Please note that my proposal (and hence the working prototype)[1]
> is based on Xen's virtio implementation (i.e. IOREQ) and particularly
> EPAM's virtio-disk application (backend server).
> It has been, I believe, well generalized but is still a bit biased
> toward this original design.
>
> So I hope you like my approach :)
>
> [1]
> https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000546.html
>
> Let me take this opportunity to explain a bit more about my approach below.
>
> > Also, please see this relevant email thread:
> > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> >
> >
> > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > Hi,
> > >
> > > One of the goals of Project Stratos is to enable hypervisor agnostic
> > > backends so we can enable as much re-use of code as possible and avoid
> > > repeating ourselves. This is the flip side of the front end where
> > > multiple front-end implementations are required - one per OS, assuming
> > > you don't just want Linux guests. The resultant guests are trivially
> > > movable between hypervisors modulo any abstracted paravirt type
> > > interfaces.
> > >
> > > In my original thumb nail sketch of a solution I envisioned vhost-user
> > > daemons running in a broadly POSIX like environment. The interface to
> > > the daemon is fairly simple requiring only some mapped memory and some
> > > sort of signalling for events (on Linux this is eventfd). The idea was
> a
> > > stub binary would be responsible for any hypervisor specific setup and
> > > then launch a common binary to deal with the actual virtqueue requests
> > > themselves.
> > >
> > > Since that original sketch we've seen an expansion in the sort of ways
> > > backends could be created. There is interest in encapsulating backends
> > > in RTOSes or unikernels for solutions like SCMI. There interest in Rust
> > > has prompted ideas of using the trait interface to abstract differences
> > > away as well as the idea of bare-metal Rust backends.
> > >
> > > We have a card (STR-12) called "Hypercall Standardisation" which
> > > calls for a description of the APIs needed from the hypervisor side to
> > > support VirtIO guests and their backends. However we are some way off
> > > from that at the moment as I think we need to at least demonstrate one
> > > portable backend before we start codifying requirements. To that end I
> > > want to think about what we need for a backend to function.
> > >
> > > Configuration
> > > =============
> > >
> > > In the type-2 setup this is typically fairly simple because the host
> > > system can orchestrate the various modules that make up the complete
> > > system. In the type-1 case (or even type-2 with delegated service VMs)
> > > we need some sort of mechanism to inform the backend VM about key
> > > details about the system:
> > >
> > >   - where virt queue memory is in it's address space
> > >   - how it's going to receive (interrupt) and trigger (kick) events
> > >   - what (if any) resources the backend needs to connect to
> > >
> > > Obviously you can elide over configuration issues by having static
> > > configurations and baking the assumptions into your guest images
> however
> > > this isn't scalable in the long term. The obvious solution seems to be
> > > extending a subset of Device Tree data to user space but perhaps there
> > > are other approaches?
> > >
> > > Before any virtio transactions can take place the appropriate memory
> > > mappings need to be made between the FE guest and the BE guest.
> >
> > > Currently the whole of the FE guests address space needs to be visible
> > > to whatever is serving the virtio requests. I can envision 3
> approaches:
> > >
> > >  * BE guest boots with memory already mapped
> > >
> > >  This would entail the guest OS knowing where in it's Guest Physical
> > >  Address space is already taken up and avoiding clashing. I would
> assume
> > >  in this case you would want a standard interface to userspace to then
> > >  make that address space visible to the backend daemon.
>
> Yet another way here is that we would have well known "shared memory"
> between
> VMs. I think that Jailhouse's ivshmem gives us good insights on this matter
> and that it can even be an alternative for hypervisor-agnostic solution.
>
> (Please note memory regions in ivshmem appear as a PCI device and can be
> mapped locally.)
>
> I want to add this shared memory aspect to my virtio-proxy, but
> the resultant solution would eventually look similar to ivshmem.
>
> > >  * BE guests boots with a hypervisor handle to memory
> > >
> > >  The BE guest is then free to map the FE's memory to where it wants in
> > >  the BE's guest physical address space.
> >
> > I cannot see how this could work for Xen. There is no "handle" to give
> > to the backend if the backend is not running in dom0. So for Xen I think
> > the memory has to be already mapped
>
> In Xen's IOREQ solution (virtio-blk), the following information is expected
> to be exposed to BE via Xenstore:
> (I know that this is a tentative approach though.)
>    - the start address of configuration space
>    - interrupt number
>    - file path for backing storage
>    - read-only flag
> And the BE server have to call a particular hypervisor interface to
> map the configuration space.
>

Yes, Xenstore was chosen as a simple way to pass configuration info to the
backend running in a non-toolstack domain.
I remember there was a wish to avoid using Xenstore in the Virtio backend
itself if possible, so for a non-toolstack domain this could be done by
adjusting devd (the daemon that listens for devices and launches backends)
to read the backend configuration from Xenstore anyway and pass it to the
backend via command line arguments.
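
Purely for illustration (all key names, paths and flags below are invented
and do not necessarily match the virtio-disk PoC), the flow described above
might look roughly like this:

    # Hypothetical Xenstore entries for one virtio-disk backend instance:
    backend/virtio-disk/<fe-domid>/0/base     = "0x2000000"
    backend/virtio-disk/<fe-domid>/0/irq      = "33"
    backend/virtio-disk/<fe-domid>/0/file     = "/dev/vg0/guest-disk"
    backend/virtio-disk/<fe-domid>/0/readonly = "1"

    # devd could then launch the backend with equivalent arguments
    # (the flag names are also hypothetical):
    virtio-disk-be --fe-domid <fe-domid> --base 0x2000000 --irq 33 \
                   --file /dev/vg0/guest-disk --readonly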

But, if ...


>
> In my approach (virtio-proxy), all those Xen (or hypervisor)-specific
> stuffs are contained in virtio-proxy, yet another VM, to hide all details.
>

... the solution for how to overcome that has already been found and proven
to work, then even better.



>
> # My point is that a "handle" is not mandatory for executing mapping.
>
> > and the mapping probably done by the
> > toolstack (also see below.) Or we would have to invent a new Xen
> > hypervisor interface and Xen virtual machine privileges to allow this
> > kind of mapping.
>
> > If we run the backend in Dom0 that we have no problems of course.
>
> One of difficulties on Xen that I found in my approach is that calling
> such hypervisor intefaces (registering IOREQ, mapping memory) is only
> allowed on BE servers themselvies and so we will have to extend those
> interfaces.
> This, however, will raise some concern on security and privilege
> distribution
> as Stefan suggested.
>

We also faced policy-related issues with the Virtio backend running in a
domain other than Dom0 in the "dummy" xsm mode. In our target system we run
the backend in a driver domain (we call it DomD) where the underlying H/W
resides. We trust it, so we wrote policy rules (to be used in the "flask"
xsm mode) to give it a little more privilege than a simple DomU has.
It is now permitted to issue device-model, resource and memory mapping
calls, etc.


> >
> >
> > > To activate the mapping will
> > >  require some sort of hypercall to the hypervisor. I can see two
> options
> > >  at this point:
> > >
> > >   - expose the handle to userspace for daemon/helper to trigger the
> > >     mapping via existing hypercall interfaces. If using a helper you
> > >     would have a hypervisor specific one to avoid the daemon having to
> > >     care too much about the details or push that complexity into a
> > >     compile time option for the daemon which would result in different
> > >     binaries although a common source base.
> > >
> > >   - expose a new kernel ABI to abstract the hypercall differences away
> > >     in the guest kernel. In this case the userspace would essentially
> > >     ask for an abstract "map guest N memory to userspace ptr" and let
> > >     the kernel deal with the different hypercall interfaces. This of
> > >     course assumes the majority of BE guests would be Linux kernels and
> > >     leaves the bare-metal/unikernel approaches to their own devices.
> > >
> > > Operation
> > > =========
> > >
> > > The core of the operation of VirtIO is fairly simple. Once the
> > > vhost-user feature negotiation is done it's a case of receiving update
> > > events and parsing the resultant virt queue for data. The vhost-user
> > > specification handles a bunch of setup before that point, mostly to
> > > detail where the virt queues are set up FD's for memory and event
> > > communication. This is where the envisioned stub process would be
> > > responsible for getting the daemon up and ready to run. This is
> > > currently done inside a big VMM like QEMU but I suspect a modern
> > > approach would be to use the rust-vmm vhost crate. It would then either
> > > communicate with the kernel's abstracted ABI or be re-targeted as a
> > > build option for the various hypervisors.
> >
> > One thing I mentioned before to Alex is that Xen doesn't have VMMs the
> > way they are typically envisioned and described in other environments.
> > Instead, Xen has IOREQ servers. Each of them connects independently to
> > Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as
> > emulators for a single Xen VM, each of them connecting to Xen
> > independently via the IOREQ interface.
> >
> > The component responsible for starting a daemon and/or setting up shared
> > interfaces is the toolstack: the xl command and the libxl/libxc
> > libraries.
>
> I think that VM configuration management (or orchestration in Startos
> jargon?) is a subject to debate in parallel.
> Otherwise, is there any good assumption to avoid it right now?
>
> > Oleksandr and others I CCed have been working on ways for the toolstack
> > to create virtio backends and setup memory mappings. They might be able
> > to provide more info on the subject. I do think we miss a way to provide
> > the configuration to the backend and anything else that the backend
> > might require to start doing its job.
>

Yes, some work has been done for the toolstack to handle Virtio MMIO
devices in general and Virtio block devices in particular. However, it has
not been upstreamed yet.
Updated patches are under review now:
https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-olekstysh@gmail.com/

There is an additional (also important) activity to improve/fix foreign
memory mapping on Arm which I am also involved in.
The foreign memory mapping is proposed to be used for Virtio backends
(device emulators) if there is a need to run a guest OS completely
unmodified. Of course, the more secure way would be to use grant memory
mapping. Briefly, the main difference between them is that with foreign
mapping the backend can map any guest memory it wants to map, but with
grant mapping it is allowed to map only what was previously granted by the
frontend. (A rough sketch of the two mapping paths is shown below.)

So, there might be a problem if we want to pre-map some guest memory in
advance or to cache mappings in the backend in order to improve performance
(because mapping/unmapping guest pages on every request requires a lot of
back and forth to Xen + P2M updates). In a nutshell, currently, in order to
map a guest page into the backend address space we need to steal a real
physical page from the backend domain. So, with the said optimizations we
might end up with no free memory in the backend domain (see XSA-300).
What we try to achieve is not to waste any real domain memory at all by
providing safe, not-yet-allocated (so unused) address space for the foreign
(and grant) pages to be mapped into; this enabling work implies Xen and
Linux (and likely DTB bindings) changes. However, as it turned out, for
this to work in a proper and safe way some prerequisite work needs to be
done.
You can find the related Xen discussion at:
https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/
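
As a rough illustration of that difference from a Linux userspace backend's
point of view, here is a sketch (error handling omitted; the frame and
grant-ref values are made up, and it assumes the libxenforeignmemory /
libxengnttab calls roughly as shown):

    /* Sketch: two ways a userspace BE on Xen can reach FE memory. */
    #include <stdint.h>
    #include <sys/mman.h>          /* PROT_READ, PROT_WRITE */
    #include <xenforeignmemory.h>  /* foreign mapping: any guest page */
    #include <xengnttab.h>         /* grant mapping: only granted pages */

    static void map_examples(uint32_t fe_domid)
    {
        /* Foreign mapping: the BE chooses which guest frame to map. */
        xenforeignmemory_handle *fmem = xenforeignmemory_open(NULL, 0);
        xen_pfn_t gfn = 0x48000;               /* hypothetical FE frame */
        int err;
        void *p = xenforeignmemory_map(fmem, fe_domid,
                                       PROT_READ | PROT_WRITE,
                                       1, &gfn, &err);
        /* ... access FE memory through p ..., then: */
        xenforeignmemory_unmap(fmem, p, 1);

        /* Grant mapping: the BE may only map what the FE has granted. */
        xengnttab_handle *xgt = xengnttab_open(NULL, 0);
        uint32_t gref = 42;                    /* hypothetical grant ref */
        void *q = xengnttab_map_grant_ref(xgt, fe_domid, gref,
                                          PROT_READ | PROT_WRITE);
        /* ... access the granted page through q ..., then: */
        xengnttab_unmap(xgt, q, 1);
    }

In both cases the mapping currently ends up consuming a real page of the
backend domain on Arm, which is exactly the problem the address-space work
above is meant to address.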



> >
> >
> > > One question is how to best handle notification and kicks. The existing
> > > vhost-user framework uses eventfd to signal the daemon (although QEMU
> > > is quite capable of simulating them when you use TCG). Xen has it's own
> > > IOREQ mechanism. However latency is an important factor and having
> > > events go through the stub would add quite a lot.
> >
> > Yeah I think, regardless of anything else, we want the backends to
> > connect directly to the Xen hypervisor.
>
> In my approach,
>  a) BE -> FE: interrupts triggered by BE calling a hypervisor interface
>               via virtio-proxy
>  b) FE -> BE: MMIO to config raises events (in event channels), which is
>               converted to a callback to BE via virtio-proxy
>               (Xen's event channel is internnally implemented by
> interrupts.)
>
> I don't know what "connect directly" means here, but sending interrupts
> to the opposite side would be best efficient.
> Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x mechanism.
>

Agreed that MSI would be more efficient than SPI...
At the moment, in order to notify the frontend, the backend issues a
specific device-model call asking Xen to inject the corresponding SPI into
the guest.



>
> >
> > > Could we consider the kernel internally converting IOREQ messages from
> > > the Xen hypervisor to eventfd events? Would this scale with other
> kernel
> > > hypercall interfaces?
> > >
> > > So any thoughts on what directions are worth experimenting with?
> >
> > One option we should consider is for each backend to connect to Xen via
> > the IOREQ interface. We could generalize the IOREQ interface and make it
> > hypervisor agnostic. The interface is really trivial and easy to add.
>
> As I said above, my proposal does the same thing that you mentioned here :)
> The difference is that I do call hypervisor interfaces via virtio-proxy.
>
> > The only Xen-specific part is the notification mechanism, which is an
> > event channel. If we replaced the event channel with something else the
> > interface would be generic. See:
> >
> https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> >
> > I don't think that translating IOREQs to eventfd in the kernel is a
> > good idea: if feels like it would be extra complexity and that the
> > kernel shouldn't be involved as this is a backend-hypervisor interface.
>
> Given that we may want to implement BE as a bare-metal application
> as I did on Zephyr, I don't think that the translation would not be
> a big issue, especially on RTOS's.
> It will be some kind of abstraction layer of interrupt handling
> (or nothing but a callback mechanism).
>
> > Also, eventfd is very Linux-centric and we are trying to design an
> > interface that could work well for RTOSes too. If we want to do
> > something different, both OS-agnostic and hypervisor-agnostic, perhaps
> > we could design a new interface. One that could be implementable in the
> > Xen hypervisor itself (like IOREQ) and of course any other hypervisor
> > too.
> >
> >
> > There is also another problem. IOREQ is probably not be the only
> > interface needed. Have a look at
> > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> > an interface for the backend to inject interrupts into the frontend? And
> > if the backend requires dynamic memory mappings of frontend pages, then
> > we would also need an interface to map/unmap domU pages.
>
> My proposal document might help here; All the interfaces required for
> virtio-proxy (or hypervisor-related interfaces) are listed as
> RPC protocols :)
>
> > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> > and self-contained. It is easy to add anywhere. A new interface to
> > inject interrupts or map pages is more difficult to manage because it
> > would require changes scattered across the various emulators.
>
> Exactly. I have no confident yet that my approach will also apply
> to other hypervisors than Xen.
> Technically, yes, but whether people can accept it or not is a different
> matter.
>
> Thanks,
> -Takahiro Akashi
>
>

-- 
Regards,

Oleksandr Tyshchenko



* RE: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-14 15:37     ` Oleksandr Tyshchenko
@ 2021-08-16 10:04       ` Wei Chen
  2021-08-17  8:07         ` AKASHI Takahiro
  0 siblings, 1 reply; 66+ messages in thread
From: Wei Chen @ 2021-08-16 10:04 UTC (permalink / raw)
  To: Oleksandr Tyshchenko, AKASHI Takahiro, Stefano Stabellini
  Cc: Alex Bennée, Stratos Mailing List, virtio-dev, Arnd Bergmann,
	Viresh Kumar, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Oleksandr Tyshchenko,
	Bertrand Marquis, Artem Mygaiev, Julien Grall, Juergen Gross,
	Paul Durrant, Xen Devel

Hi All,

Thanks to Stefano for linking my kvmtool for Xen proposal here.
This proposal is still being discussed in the Xen and KVM communities.
The main work is to decouple kvmtool from KVM so that other hypervisors
can reuse the virtual device implementations.

In this case, we need to introduce an intermediate hypervisor layer for
VMM abstraction, which, I think, is very close to Stratos' virtio
hypervisor agnosticism work.
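
To give a flavour of what that intermediate layer could look like, here is
a toy sketch in C; apart from the "struct vmm_impl" name itself, every
field and symbol below is invented for illustration and does not reflect
the actual proposal:

    /*
     * Hypothetical ops table a decoupled kvmtool-like VMM could call
     * instead of issuing KVM ioctls directly.
     */
    #include <stddef.h>
    #include <stdint.h>

    struct vmm_impl {
        const char *name;                      /* "kvm", "xen", ... */
        int   (*init)(void);
        int   (*create_vm)(void);
        void *(*map_guest_mem)(uint32_t vm, uint64_t gpa, size_t len);
        int   (*register_mmio)(uint32_t vm, uint64_t gpa, size_t len,
                               void (*handler)(void *opaque), void *opaque);
        int   (*inject_irq)(uint32_t vm, uint32_t irq);
    };

    /*
     * Each hypervisor would provide its own instance; a Xen build, for
     * example, could implement these on top of IOREQ + event channels.
     */
    extern const struct vmm_impl vmm_kvm;
    extern const struct vmm_impl vmm_xen;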


> From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> Sent: 14 August 2021 23:38
> To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano Stabellini <sstabellini@kernel.org>
> Cc: Alex Benn??e <alex.bennee@linaro.org>; Stratos Mailing List <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>; pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>; Oleksandr Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
>
> Hello, all.
>
> Please see some comments below. And sorry for the possible format issues.
>
> > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro <mailto:takahiro.akashi@linaro.org> wrote:
> > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > > CCing people working on Xen+VirtIO and IOREQs. Not trimming the original
> > > email to let them read the full context.
> > >
> > > My comments below are related to a potential Xen implementation, not
> > > because it is the only implementation that matters, but because it is
> > > the one I know best.
> >
> > Please note that my proposal (and hence the working prototype)[1]
> > is based on Xen's virtio implementation (i.e. IOREQ) and particularly
> > EPAM's virtio-disk application (backend server).
> > It has been, I believe, well generalized but is still a bit biased
> > toward this original design.
> >
> > So I hope you like my approach :)
> >
> > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000546.html
> >
> > Let me take this opportunity to explain a bit more about my approach below.
> >
> > > Also, please see this relevant email thread:
> > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > >
> > >
> > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > Hi,
> > > >
> > > > One of the goals of Project Stratos is to enable hypervisor agnostic
> > > > backends so we can enable as much re-use of code as possible and avoid
> > > > repeating ourselves. This is the flip side of the front end where
> > > > multiple front-end implementations are required - one per OS, assuming
> > > > you don't just want Linux guests. The resultant guests are trivially
> > > > movable between hypervisors modulo any abstracted paravirt type
> > > > interfaces.
> > > >
> > > > In my original thumb nail sketch of a solution I envisioned vhost-user
> > > > daemons running in a broadly POSIX like environment. The interface to
> > > > the daemon is fairly simple requiring only some mapped memory and some
> > > > sort of signalling for events (on Linux this is eventfd). The idea was a
> > > > stub binary would be responsible for any hypervisor specific setup and
> > > > then launch a common binary to deal with the actual virtqueue requests
> > > > themselves.
> > > >
> > > > Since that original sketch we've seen an expansion in the sort of ways
> > > > backends could be created. There is interest in encapsulating backends
> > > > in RTOSes or unikernels for solutions like SCMI. There interest in Rust
> > > > has prompted ideas of using the trait interface to abstract differences
> > > > away as well as the idea of bare-metal Rust backends.
> > > >
> > > > We have a card (STR-12) called "Hypercall Standardisation" which
> > > > calls for a description of the APIs needed from the hypervisor side to
> > > > support VirtIO guests and their backends. However we are some way off
> > > > from that at the moment as I think we need to at least demonstrate one
> > > > portable backend before we start codifying requirements. To that end I
> > > > want to think about what we need for a backend to function.
> > > >
> > > > Configuration
> > > > =============
> > > >
> > > > In the type-2 setup this is typically fairly simple because the host
> > > > system can orchestrate the various modules that make up the complete
> > > > system. In the type-1 case (or even type-2 with delegated service VMs)
> > > > we need some sort of mechanism to inform the backend VM about key
> > > > details about the system:
> > > >
> > > >   - where virt queue memory is in it's address space
> > > >   - how it's going to receive (interrupt) and trigger (kick) events
> > > >   - what (if any) resources the backend needs to connect to
> > > >
> > > > Obviously you can elide over configuration issues by having static
> > > > configurations and baking the assumptions into your guest images however
> > > > this isn't scalable in the long term. The obvious solution seems to be
> > > > extending a subset of Device Tree data to user space but perhaps there
> > > > are other approaches?
> > > >
> > > > Before any virtio transactions can take place the appropriate memory
> > > > mappings need to be made between the FE guest and the BE guest.
> > >
> > > > Currently the whole of the FE guests address space needs to be visible
> > > > to whatever is serving the virtio requests. I can envision 3 approaches:
> > > >
> > > >  * BE guest boots with memory already mapped
> > > >
> > > >  This would entail the guest OS knowing where in it's Guest Physical
> > > >  Address space is already taken up and avoiding clashing. I would assume
> > > >  in this case you would want a standard interface to userspace to then
> > > >  make that address space visible to the backend daemon.
> >
> > Yet another way here is that we would have well known "shared memory" between
> > VMs. I think that Jailhouse's ivshmem gives us good insights on this matter
> > and that it can even be an alternative for hypervisor-agnostic solution.
> >
> > (Please note memory regions in ivshmem appear as a PCI device and can be
> > mapped locally.)
> >
> > I want to add this shared memory aspect to my virtio-proxy, but
> > the resultant solution would eventually look similar to ivshmem.
> >
> > > >  * BE guests boots with a hypervisor handle to memory
> > > >
> > > >  The BE guest is then free to map the FE's memory to where it wants in
> > > >  the BE's guest physical address space.
> > >
> > > I cannot see how this could work for Xen. There is no "handle" to give
> > > to the backend if the backend is not running in dom0. So for Xen I think
> > > the memory has to be already mapped
> >
> > In Xen's IOREQ solution (virtio-blk), the following information is expected
> > to be exposed to BE via Xenstore:
> > (I know that this is a tentative approach though.)
> >    - the start address of configuration space
> >    - interrupt number
> >    - file path for backing storage
> >    - read-only flag
> > And the BE server have to call a particular hypervisor interface to
> > map the configuration space.
>
> Yes, Xenstore was chosen as a simple way to pass configuration info to the backend running in a non-toolstack domain.
> I remember, there was a wish to avoid using Xenstore in Virtio backend itself if possible, so for non-toolstack domain, this could done with adjusting devd (daemon that listens for devices and launches backends)
> to read backend configuration from the Xenstore anyway and pass it to the backend via command line arguments.
>

Yes, in the current PoC code we're using xenstore to pass the device
configuration. We also designed a static device configuration parsing
method for Dom0less or other scenarios that don't have the Xen toolstack;
yes, it comes from the device model command line or a config file.
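
Purely as an illustration of the static case, such an entry might look
something like the following (the format and key names here are invented;
the PoC may use something different):

    # Hypothetical static configuration entry for a Dom0less setup:
    [virtio-disk.0]
    fe-domid = 1
    base     = 0x2000000
    irq      = 33
    backing  = /dev/vg0/guest-disk
    readonly = true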

> But, if ...
>
> >
> > In my approach (virtio-proxy), all those Xen (or hypervisor)-specific
> > stuffs are contained in virtio-proxy, yet another VM, to hide all details.
>
> ... the solution how to overcome that is already found and proven to work then even better.
>
>
>
> > # My point is that a "handle" is not mandatory for executing mapping.
> >
> > > and the mapping probably done by the
> > > toolstack (also see below.) Or we would have to invent a new Xen
> > > hypervisor interface and Xen virtual machine privileges to allow this
> > > kind of mapping.
> >
> > > If we run the backend in Dom0 that we have no problems of course.
> >
> > One of difficulties on Xen that I found in my approach is that calling
> > such hypervisor intefaces (registering IOREQ, mapping memory) is only
> > allowed on BE servers themselvies and so we will have to extend those
> > interfaces.
> > This, however, will raise some concern on security and privilege distribution
> > as Stefan suggested.
>
> We also faced policy related issues with Virtio backend running in other than Dom0 domain in a "dummy" xsm mode. In our target system we run the backend in a driver
> domain (we call it DomD) where the underlying H/W resides. We trust it, so we wrote policy rules (to be used in "flask" xsm mode) to provide it with a little bit more privileges than a simple DomU had.
> Now it is permitted to issue device-model, resource and memory mappings, etc calls.
>
> > >
> > >
> > > > To activate the mapping will
> > > >  require some sort of hypercall to the hypervisor. I can see two options
> > > >  at this point:
> > > >
> > > >   - expose the handle to userspace for daemon/helper to trigger the
> > > >     mapping via existing hypercall interfaces. If using a helper you
> > > >     would have a hypervisor specific one to avoid the daemon having to
> > > >     care too much about the details or push that complexity into a
> > > >     compile time option for the daemon which would result in different
> > > >     binaries although a common source base.
> > > >
> > > >   - expose a new kernel ABI to abstract the hypercall differences away
> > > >     in the guest kernel. In this case the userspace would essentially
> > > >     ask for an abstract "map guest N memory to userspace ptr" and let
> > > >     the kernel deal with the different hypercall interfaces. This of
> > > >     course assumes the majority of BE guests would be Linux kernels and
> > > >     leaves the bare-metal/unikernel approaches to their own devices.
> > > >
> > > > Operation
> > > > =========
> > > >
> > > > The core of the operation of VirtIO is fairly simple. Once the
> > > > vhost-user feature negotiation is done it's a case of receiving update
> > > > events and parsing the resultant virt queue for data. The vhost-user
> > > > specification handles a bunch of setup before that point, mostly to
> > > > detail where the virt queues are set up FD's for memory and event
> > > > communication. This is where the envisioned stub process would be
> > > > responsible for getting the daemon up and ready to run. This is
> > > > currently done inside a big VMM like QEMU but I suspect a modern
> > > > approach would be to use the rust-vmm vhost crate. It would then either
> > > > communicate with the kernel's abstracted ABI or be re-targeted as a
> > > > build option for the various hypervisors.
> > >
> > > One thing I mentioned before to Alex is that Xen doesn't have VMMs the
> > > way they are typically envisioned and described in other environments.
> > > Instead, Xen has IOREQ servers. Each of them connects independently to
> > > Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as
> > > emulators for a single Xen VM, each of them connecting to Xen
> > > independently via the IOREQ interface.
> > >
> > > The component responsible for starting a daemon and/or setting up shared
> > > interfaces is the toolstack: the xl command and the libxl/libxc
> > > libraries.
> >
> > I think that VM configuration management (or orchestration in Startos
> > jargon?) is a subject to debate in parallel.
> > Otherwise, is there any good assumption to avoid it right now?
> >
> > > Oleksandr and others I CCed have been working on ways for the toolstack
> > > to create virtio backends and setup memory mappings. They might be able
> > > to provide more info on the subject. I do think we miss a way to provide
> > > the configuration to the backend and anything else that the backend
> > > might require to start doing its job.
>
> Yes, some work has been done for the toolstack to handle Virtio MMIO devices in
> general and Virtio block devices in particular. However, it has not been upstreaned yet.
> Updated patches on review now:
> https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-olekstysh@gmail.com/
>
> There is an additional (also important) activity to improve/fix foreign memory mapping on Arm which I am also involved in.
> The foreign memory mapping is proposed to be used for Virtio backends (device emulators) if there is a need to run guest OS completely unmodified.
> Of course, the more secure way would be to use grant memory mapping. Brietly, the main difference between them is that with foreign mapping the backend
> can map any guest memory it wants to map, but with grant mapping it is allowed to map only what was previously granted by the frontend.
>
> So, there might be a problem if we want to pre-map some guest memory in advance or to cache mappings in the backend in order to improve performance (because the mapping/unmapping guest pages every request requires a lot of back and forth to Xen + P2M updates). In a nutshell, currently, in order to map a guest page into the backend address space we need to steal a real physical page from the backend domain. So, with the said optimizations we might end up with no free memory in the backend domain (see XSA-300). And what we try to achieve is to not waste a real domain memory at all by providing safe non-allocated-yet (so unused) address space for the foreign (and grant) pages to be mapped into, this enabling work implies Xen and Linux (and likely DTB bindings) changes. However, as it turned out, for this to work in a proper and safe way some prereq work needs to be done.
> You can find the related Xen discussion at:
> https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/
>
>
> > >
> > >
> > > > One question is how to best handle notification and kicks. The existing
> > > > vhost-user framework uses eventfd to signal the daemon (although QEMU
> > > > is quite capable of simulating them when you use TCG). Xen has it's own
> > > > IOREQ mechanism. However latency is an important factor and having
> > > > events go through the stub would add quite a lot.
> > >
> > > Yeah I think, regardless of anything else, we want the backends to
> > > connect directly to the Xen hypervisor.
> >
> > In my approach,
> >  a) BE -> FE: interrupts triggered by BE calling a hypervisor interface
> >               via virtio-proxy
> >  b) FE -> BE: MMIO to config raises events (in event channels), which is
> >               converted to a callback to BE via virtio-proxy
> >               (Xen's event channel is internnally implemented by interrupts.)
> >
> > I don't know what "connect directly" means here, but sending interrupts
> > to the opposite side would be best efficient.
> > Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x mechanism.
>
> Agree that MSI would be more efficient than SPI...
> At the moment, in order to notify the frontend, the backend issues a specific device-model call to query Xen to inject a corresponding SPI to the guest.
>
>
>
> > >
> > > > Could we consider the kernel internally converting IOREQ messages from
> > > > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > > > hypercall interfaces?
> > > >
> > > > So any thoughts on what directions are worth experimenting with?
> > >
> > > One option we should consider is for each backend to connect to Xen via
> > > the IOREQ interface. We could generalize the IOREQ interface and make it
> > > hypervisor agnostic. The interface is really trivial and easy to add.
> >
> > As I said above, my proposal does the same thing that you mentioned here :)
> > The difference is that I do call hypervisor interfaces via virtio-proxy.
> >
> > > The only Xen-specific part is the notification mechanism, which is an
> > > event channel. If we replaced the event channel with something else the
> > > interface would be generic. See:
> > > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > >
> > > I don't think that translating IOREQs to eventfd in the kernel is a
> > > good idea: if feels like it would be extra complexity and that the
> > > kernel shouldn't be involved as this is a backend-hypervisor interface.
> >
> > Given that we may want to implement BE as a bare-metal application
> > as I did on Zephyr, I don't think that the translation would not be
> > a big issue, especially on RTOS's.
> > It will be some kind of abstraction layer of interrupt handling
> > (or nothing but a callback mechanism).
> >
> > > Also, eventfd is very Linux-centric and we are trying to design an
> > > interface that could work well for RTOSes too. If we want to do
> > > something different, both OS-agnostic and hypervisor-agnostic, perhaps
> > > we could design a new interface. One that could be implementable in the
> > > Xen hypervisor itself (like IOREQ) and of course any other hypervisor
> > > too.
> > >
> > >
> > > There is also another problem. IOREQ is probably not be the only
> > > interface needed. Have a look at
> > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> > > an interface for the backend to inject interrupts into the frontend? And
> > > if the backend requires dynamic memory mappings of frontend pages, then
> > > we would also need an interface to map/unmap domU pages.
> >
> > My proposal document might help here; All the interfaces required for
> > virtio-proxy (or hypervisor-related interfaces) are listed as
> > RPC protocols :)
> >
> > > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> > > and self-contained. It is easy to add anywhere. A new interface to
> > > inject interrupts or map pages is more difficult to manage because it
> > > would require changes scattered across the various emulators.
> >
> > Exactly. I have no confident yet that my approach will also apply
> > to other hypervisors than Xen.
> > Technically, yes, but whether people can accept it or not is a different
> > matter.
> >
> > Thanks,
> > -Takahiro Akashi
>
>
>
> --
> Regards,
>
> Oleksandr Tyshchenko


* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-16 10:04       ` Wei Chen
@ 2021-08-17  8:07         ` AKASHI Takahiro
  2021-08-17  8:39           ` Wei Chen
  0 siblings, 1 reply; 66+ messages in thread
From: AKASHI Takahiro @ 2021-08-17  8:07 UTC (permalink / raw)
  To: Wei Chen
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Alex Bennée,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant,
	Xen Devel

Hi Wei, Oleksandr,

On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> Hi All,
> 
> Thanks for Stefano to link my kvmtool for Xen proposal here.
> This proposal is still discussing in Xen and KVM communities.
> The main work is to decouple the kvmtool from KVM and make
> other hypervisors can reuse the virtual device implementations.
> 
> In this case, we need to introduce an intermediate hypervisor
> layer for VMM abstraction, Which is, I think it's very close
> to stratos' virtio hypervisor agnosticism work.

# My proposal[1] comes from my own idea and doesn't necessarily represent
# Linaro's view on this subject, nor reflect Alex's concerns. Nevertheless,

Your idea and my proposal seem to share the same background.
Both have a similar goal and currently start with Xen, and both are
based on kvm-tool. (Actually, my work is derived from EPAM's
virtio-disk, which is itself based on kvm-tool.)

In particular, the abstraction of hypervisor interfaces ends up with a
similar set of interfaces (your "struct vmm_impl" and my "RPC interfaces").
This is no coincidence, as we both share the same origin as I said above.
And so we will also share the same issues. One of them is the way of
"sharing/mapping FE's memory". There is some trade-off between
portability and performance impact.
So we can discuss that topic here on this ML, too.
(See Alex's original email, too.)

On the other hand, my approach aims to create a "single-binary" solution
in which the same BE VM binary could run on any hypervisor.
It is somewhat similar to your "proposal-#2" in [2], but in my solution
all the hypervisor-specific code would be put into another entity (VM),
named "virtio-proxy", and the abstracted operations are served via RPC.
(In this sense, the BE is hypervisor-agnostic but might have an OS
dependency.)
But I know that we still need to discuss whether this is a requirement
in the Stratos project or not. (Maybe not.)
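
For what it's worth, here is a toy sketch of what one such RPC between the
BE and virtio-proxy might carry; the message layout below is invented for
illustration and is not the actual protocol defined in [1]:

    /*
     * Hypothetical wire format for a virtio-proxy RPC: the BE asks the
     * proxy to perform hypervisor-specific operations on its behalf.
     */
    #include <stdint.h>

    enum vproxy_op {
        VPROXY_OP_MAP_FE_MEM   = 1,   /* map an FE region into the BE  */
        VPROXY_OP_UNMAP_FE_MEM = 2,
        VPROXY_OP_NOTIFY_FE    = 3,   /* inject an interrupt to the FE */
    };

    struct vproxy_req {
        uint32_t op;        /* enum vproxy_op                          */
        uint32_t fe_id;     /* which frontend VM                       */
        uint64_t fe_gpa;    /* FE guest-physical address (map/unmap)   */
        uint64_t len;
        uint32_t irq;       /* notification target (notify)            */
    };

    struct vproxy_resp {
        int32_t  ret;       /* 0 on success, negative error otherwise   */
        uint64_t be_gpa;    /* where the region appears to the BE (map) */
    };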

Speaking specifically about kvm-tool, I have a concern about its
license terms: targeting different hypervisors and different OSs
(which I assume includes RTOSes), the resultant library should be
permissively licensed, and kvm-tool's GPL might be an issue.
Any thoughts?

-Takahiro Akashi


[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000548.html
[2] https://marc.info/?l=xen-devel&m=162373754705233&w=2

> 
> > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > Sent: 14 August 2021 23:38
> > To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano Stabellini <sstabellini@kernel.org>
> > Cc: Alex Benn??e <alex.bennee@linaro.org>; Stratos Mailing List <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>; pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>; Oleksandr Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> >
> > Hello, all.
> >
> > Please see some comments below. And sorry for the possible format issues.
> >
> > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro <mailto:takahiro.akashi@linaro.org> wrote:
> > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > > > CCing people working on Xen+VirtIO and IOREQs. Not trimming the original
> > > > email to let them read the full context.
> > > >
> > > > My comments below are related to a potential Xen implementation, not
> > > > because it is the only implementation that matters, but because it is
> > > > the one I know best.
> > >
> > > Please note that my proposal (and hence the working prototype)[1]
> > > is based on Xen's virtio implementation (i.e. IOREQ) and particularly
> > > EPAM's virtio-disk application (backend server).
> > > It has been, I believe, well generalized but is still a bit biased
> > > toward this original design.
> > >
> > > So I hope you like my approach :)
> > >
> > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000546.html
> > >
> > > Let me take this opportunity to explain a bit more about my approach below.
> > >
> > > > Also, please see this relevant email thread:
> > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > >
> > > >
> > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > Hi,
> > > > >
> > > > > One of the goals of Project Stratos is to enable hypervisor agnostic
> > > > > backends so we can enable as much re-use of code as possible and avoid
> > > > > repeating ourselves. This is the flip side of the front end where
> > > > > multiple front-end implementations are required - one per OS, assuming
> > > > > you don't just want Linux guests. The resultant guests are trivially
> > > > > movable between hypervisors modulo any abstracted paravirt type
> > > > > interfaces.
> > > > >
> > > > > In my original thumb nail sketch of a solution I envisioned vhost-user
> > > > > daemons running in a broadly POSIX like environment. The interface to
> > > > > the daemon is fairly simple requiring only some mapped memory and some
> > > > > sort of signalling for events (on Linux this is eventfd). The idea was a
> > > > > stub binary would be responsible for any hypervisor specific setup and
> > > > > then launch a common binary to deal with the actual virtqueue requests
> > > > > themselves.
> > > > >
> > > > > Since that original sketch we've seen an expansion in the sort of ways
> > > > > backends could be created. There is interest in encapsulating backends
> > > > > in RTOSes or unikernels for solutions like SCMI. There interest in Rust
> > > > > has prompted ideas of using the trait interface to abstract differences
> > > > > away as well as the idea of bare-metal Rust backends.
> > > > >
> > > > > We have a card (STR-12) called "Hypercall Standardisation" which
> > > > > calls for a description of the APIs needed from the hypervisor side to
> > > > > support VirtIO guests and their backends. However we are some way off
> > > > > from that at the moment as I think we need to at least demonstrate one
> > > > > portable backend before we start codifying requirements. To that end I
> > > > > want to think about what we need for a backend to function.
> > > > >
> > > > > Configuration
> > > > > =============
> > > > >
> > > > > In the type-2 setup this is typically fairly simple because the host
> > > > > system can orchestrate the various modules that make up the complete
> > > > > system. In the type-1 case (or even type-2 with delegated service VMs)
> > > > > we need some sort of mechanism to inform the backend VM about key
> > > > > details about the system:
> > > > >
> > > > >   - where virt queue memory is in it's address space
> > > > >   - how it's going to receive (interrupt) and trigger (kick) events
> > > > >   - what (if any) resources the backend needs to connect to
> > > > >
> > > > > Obviously you can elide over configuration issues by having static
> > > > > configurations and baking the assumptions into your guest images however
> > > > > this isn't scalable in the long term. The obvious solution seems to be
> > > > > extending a subset of Device Tree data to user space but perhaps there
> > > > > are other approaches?
> > > > >
> > > > > Before any virtio transactions can take place the appropriate memory
> > > > > mappings need to be made between the FE guest and the BE guest.
> > > >
> > > > > Currently the whole of the FE guests address space needs to be visible
> > > > > to whatever is serving the virtio requests. I can envision 3 approaches:
> > > > >
> > > > >  * BE guest boots with memory already mapped
> > > > >
> > > > >  This would entail the guest OS knowing where in it's Guest Physical
> > > > >  Address space is already taken up and avoiding clashing. I would assume
> > > > >  in this case you would want a standard interface to userspace to then
> > > > >  make that address space visible to the backend daemon.
> > >
> > > Yet another way here is that we would have well known "shared memory" between
> > > VMs. I think that Jailhouse's ivshmem gives us good insights on this matter
> > > and that it can even be an alternative for hypervisor-agnostic solution.
> > >
> > > (Please note memory regions in ivshmem appear as a PCI device and can be
> > > mapped locally.)
> > >
> > > I want to add this shared memory aspect to my virtio-proxy, but
> > > the resultant solution would eventually look similar to ivshmem.
> > >
> > > > >  * BE guests boots with a hypervisor handle to memory
> > > > >
> > > > >  The BE guest is then free to map the FE's memory to where it wants in
> > > > >  the BE's guest physical address space.
> > > >
> > > > I cannot see how this could work for Xen. There is no "handle" to give
> > > > to the backend if the backend is not running in dom0. So for Xen I think
> > > > the memory has to be already mapped
> > >
> > > In Xen's IOREQ solution (virtio-blk), the following information is expected
> > > to be exposed to BE via Xenstore:
> > > (I know that this is a tentative approach though.)
> > >    - the start address of configuration space
> > >    - interrupt number
> > >    - file path for backing storage
> > >    - read-only flag
> > > And the BE server has to call a particular hypervisor interface to
> > > map the configuration space.
> >
> > Yes, Xenstore was chosen as a simple way to pass configuration info to the backend running in a non-toolstack domain.
> > I remember there was a wish to avoid using Xenstore in the Virtio backend itself if possible, so for a non-toolstack domain this could be done by adjusting devd (the daemon that listens for devices and launches backends)
> > to read the backend configuration from Xenstore anyway and pass it to the backend via command-line arguments.
> >
> 
> Yes, in the current PoC code we're using Xenstore to pass the device configuration.
> We also designed a static device configuration parsing method for Dom0less or
> other scenarios that don't have the Xen toolstack; there the configuration comes
> from the device model command line or a config file.
> 
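As a rough illustration of the configuration hand-off being discussed (every structure member, option name and default below is invented for the sketch and is not the PoC's actual Xenstore schema or devd interface), a backend could accept the same four items from either Xenstore or the command line:

  /* Illustrative only: a container for the four items listed above,
   * filled either from Xenstore or from command-line arguments passed
   * by a launcher such as devd.  All names here are hypothetical. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  struct be_disk_config {
      uint64_t cfg_base;      /* start address of configuration space */
      uint32_t irq;           /* interrupt number                     */
      char     backing[256];  /* file path for backing storage        */
      bool     readonly;      /* read-only flag                       */
  };

  static int parse_args(int argc, char **argv, struct be_disk_config *c)
  {
      for (int i = 1; i < argc; i++) {
          if (!strcmp(argv[i], "--readonly")) {
              c->readonly = true;
          } else if (!strcmp(argv[i], "--cfg-base") && i + 1 < argc) {
              c->cfg_base = strtoull(argv[++i], NULL, 0);
          } else if (!strcmp(argv[i], "--irq") && i + 1 < argc) {
              c->irq = strtoul(argv[++i], NULL, 0);
          } else if (!strcmp(argv[i], "--backing") && i + 1 < argc) {
              snprintf(c->backing, sizeof(c->backing), "%s", argv[++i]);
          } else {
              fprintf(stderr, "unknown or incomplete option: %s\n", argv[i]);
              return -1;
          }
      }
      return 0;
  }

The same structure could equally be filled from a Xenstore walk or a static (Dom0less) table; the point is only that the backend itself need not know which source was used.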
> > But, if ...
> >
> > >
> > > In my approach (virtio-proxy), all those Xen (or hypervisor)-specific
> > > stuffs are contained in virtio-proxy, yet another VM, to hide all details.
> >
> > ... the solution how to overcome that is already found and proven to work then even better.
> >
> >
> >
> > > # My point is that a "handle" is not mandatory for executing mapping.
> > >
> > > > and the mapping probably done by the
> > > > toolstack (also see below.) Or we would have to invent a new Xen
> > > > hypervisor interface and Xen virtual machine privileges to allow this
> > > > kind of mapping.
> > >
> > > > If we run the backend in Dom0 that we have no problems of course.
> > >
> > > One of the difficulties on Xen that I found in my approach is that calling
> > > such hypervisor interfaces (registering IOREQ, mapping memory) is only
> > > allowed on the BE servers themselves, and so we will have to extend those
> > > interfaces.
> > > This, however, will raise some concerns about security and privilege distribution
> > > as Stefan suggested.
> >
> > We also faced policy-related issues with a Virtio backend running in a domain other than Dom0 in the "dummy" XSM mode. In our target system we run the backend in a driver
> > domain (we call it DomD) where the underlying H/W resides. We trust it, so we wrote policy rules (to be used in the "flask" XSM mode) to give it slightly more privileges than a plain DomU has.
> > Now it is permitted to issue device-model, resource-mapping and memory-mapping calls, etc.
> >
> > > >
> > > >
> > > > > To activate the mapping will
> > > > >  require some sort of hypercall to the hypervisor. I can see two options
> > > > >  at this point:
> > > > >
> > > > >   - expose the handle to userspace for daemon/helper to trigger the
> > > > >     mapping via existing hypercall interfaces. If using a helper you
> > > > >     would have a hypervisor specific one to avoid the daemon having to
> > > > >     care too much about the details or push that complexity into a
> > > > >     compile time option for the daemon which would result in different
> > > > >     binaries although a common source base.
> > > > >
> > > > >   - expose a new kernel ABI to abstract the hypercall differences away
> > > > >     in the guest kernel. In this case the userspace would essentially
> > > > >     ask for an abstract "map guest N memory to userspace ptr" and let
> > > > >     the kernel deal with the different hypercall interfaces. This of
> > > > >     course assumes the majority of BE guests would be Linux kernels and
> > > > >     leaves the bare-metal/unikernel approaches to their own devices.
> > > > >
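To make the second option a little more concrete, here is a sketch of what such an abstract ABI might look like from userspace. Everything in it -- the device node, the ioctl number and the structure -- is invented purely for illustration; no such interface exists today.

  /* Hypothetical "map guest N memory to a userspace pointer" ABI.
   * /dev/guest-mem, GUESTMEM_MAP and struct guestmem_map are all
   * made up; the kernel side would translate the request into the
   * hypervisor-specific hypercalls. */
  #include <fcntl.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  struct guestmem_map {
      uint32_t guest_id;   /* which FE guest (hypervisor-specific id) */
      uint64_t gpa;        /* guest physical address to map           */
      uint64_t len;        /* length of the region in bytes           */
  };
  #define GUESTMEM_MAP _IOW('G', 0x01, struct guestmem_map)

  static void *map_guest_region(uint32_t guest, uint64_t gpa, uint64_t len)
  {
      int fd = open("/dev/guest-mem", O_RDWR);
      if (fd < 0)
          return NULL;

      struct guestmem_map req = { .guest_id = guest, .gpa = gpa, .len = len };
      if (ioctl(fd, GUESTMEM_MAP, &req) < 0) {
          close(fd);
          return NULL;
      }

      /* The established mapping is then exposed via a plain mmap(). */
      return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  }

A bare-metal or unikernel backend would of course bypass this and talk to the hypervisor directly, which is exactly the trade-off described above.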
> > > > > Operation
> > > > > =========
> > > > >
> > > > > The core of the operation of VirtIO is fairly simple. Once the
> > > > > vhost-user feature negotiation is done it's a case of receiving update
> > > > > events and parsing the resultant virt queue for data. The vhost-user
> > > > > specification handles a bunch of setup before that point, mostly to
> > > > > detail where the virt queues are set up FD's for memory and event
> > > > > communication. This is where the envisioned stub process would be
> > > > > responsible for getting the daemon up and ready to run. This is
> > > > > currently done inside a big VMM like QEMU but I suspect a modern
> > > > > approach would be to use the rust-vmm vhost crate. It would then either
> > > > > communicate with the kernel's abstracted ABI or be re-targeted as a
> > > > > build option for the various hypervisors.
> > > >
> > > > One thing I mentioned before to Alex is that Xen doesn't have VMMs the
> > > > way they are typically envisioned and described in other environments.
> > > > Instead, Xen has IOREQ servers. Each of them connects independently to
> > > > Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as
> > > > emulators for a single Xen VM, each of them connecting to Xen
> > > > independently via the IOREQ interface.
> > > >
> > > > The component responsible for starting a daemon and/or setting up shared
> > > > interfaces is the toolstack: the xl command and the libxl/libxc
> > > > libraries.
> > >
> > > I think that VM configuration management (or orchestration in Stratos
> > > jargon?) is a subject to debate in parallel.
> > > Otherwise, is there any good assumption to avoid it right now?
> > >
> > > > Oleksandr and others I CCed have been working on ways for the toolstack
> > > > to create virtio backends and setup memory mappings. They might be able
> > > > to provide more info on the subject. I do think we miss a way to provide
> > > > the configuration to the backend and anything else that the backend
> > > > might require to start doing its job.
> >
> > Yes, some work has been done for the toolstack to handle Virtio MMIO devices in
> > general and Virtio block devices in particular. However, it has not been upstreamed yet.
> > Updated patches on review now:
> > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-olekstysh@gmail.com/
> >
> > There is an additional (also important) activity to improve/fix foreign memory mapping on Arm which I am also involved in.
> > The foreign memory mapping is proposed to be used for Virtio backends (device emulators) if there is a need to run guest OS completely unmodified.
> > Of course, the more secure way would be to use grant memory mapping. Briefly, the main difference between them is that with foreign mapping the backend
> > can map any guest memory it wants to map, but with grant mapping it is allowed to map only what was previously granted by the frontend.
> >
> > So, there might be a problem if we want to pre-map some guest memory in advance, or to cache mappings in the backend in order to improve performance (because mapping/unmapping guest pages on every request requires a lot of back and forth to Xen, plus P2M updates).
> > In a nutshell, currently, in order to map a guest page into the backend address space we need to steal a real physical page from the backend domain. So, with the said optimizations we might end up with no free memory in the backend domain (see XSA-300).
> > What we try to achieve is to not waste real domain memory at all, by providing safe, not-yet-allocated (so unused) address space for the foreign (and grant) pages to be mapped into; this enabling work implies Xen and Linux (and likely DTB bindings) changes. However, as it turned out, for this to work in a proper and safe way some prerequisite work needs to be done.
> > You can find the related Xen discussion at:
> > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/
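For readers not familiar with the two mapping flavours, a minimal sketch of what they look like from a userspace backend is below. The prototypes are quoted from memory and only approximate; the libxenforeignmemory and libxengnttab headers are the authoritative reference.

  #include <stdint.h>
  #include <sys/mman.h>          /* PROT_READ / PROT_WRITE */
  #include <xenforeignmemory.h>
  #include <xengnttab.h>

  /* Foreign mapping: the backend can map any frontend page it names. */
  static void *map_foreign(xenforeignmemory_handle *fmem,
                           uint32_t fe_domid, xen_pfn_t gfn)
  {
      int err = 0;
      return xenforeignmemory_map(fmem, fe_domid,
                                  PROT_READ | PROT_WRITE, 1, &gfn, &err);
  }

  /* Grant mapping: the backend may only map what the frontend granted. */
  static void *map_granted(xengnttab_handle *xgt,
                           uint32_t fe_domid, grant_ref_t ref)
  {
      return xengnttab_map_grant_ref(xgt, fe_domid, ref,
                                     PROT_READ | PROT_WRITE);
  }

The security difference described above lies entirely in who chooses the page: the backend (foreign) or the frontend (grant).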
> >
> >
> > > >
> > > >
> > > > > One question is how to best handle notification and kicks. The existing
> > > > > vhost-user framework uses eventfd to signal the daemon (although QEMU
> > > > > is quite capable of simulating them when you use TCG). Xen has it's own
> > > > > IOREQ mechanism. However latency is an important factor and having
> > > > > events go through the stub would add quite a lot.
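For comparison, the eventfd signalling that vhost-user relies on is tiny; a minimal sketch of the kick/notify pattern (plain Linux eventfd usage, nothing hypervisor-specific or taken from any of the proposals) looks like this:

  /* Minimal eventfd kick/notify pattern in the vhost-user style: the
   * frontend side writes to the fd, the backend blocks in read() (or
   * poll()s it) and then walks the virtqueue. */
  #include <stdint.h>
  #include <sys/eventfd.h>
  #include <unistd.h>

  static int make_kick_fd(void)
  {
      return eventfd(0, EFD_CLOEXEC);   /* counter starts at zero */
  }

  static void kick(int efd)
  {
      uint64_t one = 1;
      (void)write(efd, &one, sizeof(one));   /* signal the backend */
  }

  static uint64_t wait_for_kick(int efd)
  {
      uint64_t n = 0;
      (void)read(efd, &n, sizeof(n));        /* blocks, then resets */
      return n;                              /* number of pending kicks */
  }

Whatever replaces eventfd in a Xen or RTOS backend only has to reproduce this semantic: an edge-style "work is pending" signal plus a blocking or pollable wait.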
> > > >
> > > > Yeah I think, regardless of anything else, we want the backends to
> > > > connect directly to the Xen hypervisor.
> > >
> > > In my approach,
> > >  a) BE -> FE: interrupts triggered by BE calling a hypervisor interface
> > >               via virtio-proxy
> > >  b) FE -> BE: MMIO to config raises events (in event channels), which is
> > >               converted to a callback to BE via virtio-proxy
> > >               (Xen's event channel is internally implemented by interrupts.)
> > >
> > > I don't know what "connect directly" means here, but sending interrupts
> > > to the opposite side would be the most efficient.
> > > Ivshmem, I suppose, takes this approach by utilizing PCI's MSI-X mechanism.
> >
> > Agree that MSI would be more efficient than SPI...
> > At the moment, in order to notify the frontend, the backend issues a specific device-model call asking Xen to inject a corresponding SPI into the guest.
> >
> >
> >
> > > >
> > > > > Could we consider the kernel internally converting IOREQ messages from
> > > > > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > > > > hypercall interfaces?
> > > > >
> > > > > So any thoughts on what directions are worth experimenting with?
> > > >
> > > > One option we should consider is for each backend to connect to Xen via
> > > > the IOREQ interface. We could generalize the IOREQ interface and make it
> > > > hypervisor agnostic. The interface is really trivial and easy to add.
> > >
> > > As I said above, my proposal does the same thing that you mentioned here :)
> > > The difference is that I do call hypervisor interfaces via virtio-proxy.
> > >
> > > > The only Xen-specific part is the notification mechanism, which is an
> > > > event channel. If we replaced the event channel with something else the
> > > > interface would be generic. See:
> > > > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > >
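As a sketch of how small such a generalized record could be (the names and field widths below are invented; this is loosely modelled on Xen's struct ioreq but makes no claim to match the real layout):

  /* Illustrative only: a hypervisor-agnostic "I/O request" record in
   * the spirit of Xen's struct ioreq.  Only the notification mechanism
   * that delivers it would remain per-hypervisor. */
  #include <stdint.h>

  enum io_dir  { IO_READ = 0, IO_WRITE = 1 };
  enum io_kind { IO_MMIO = 0, IO_PIO   = 1 };

  struct generic_ioreq {
      uint64_t addr;   /* guest physical (or port) address            */
      uint64_t data;   /* value written, or slot for the read result  */
      uint32_t size;   /* access width in bytes                       */
      uint8_t  dir;    /* enum io_dir                                 */
      uint8_t  kind;   /* enum io_kind                                */
      uint8_t  state;  /* free -> pending -> in service -> done       */
      uint8_t  vcpu;   /* which FE vCPU issued the access             */
  };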
> > > > I don't think that translating IOREQs to eventfd in the kernel is a
> > > > good idea: it feels like it would be extra complexity and that the
> > > > kernel shouldn't be involved as this is a backend-hypervisor interface.
> > >
> > > Given that we may want to implement BE as a bare-metal application
> > > as I did on Zephyr, I don't think that the translation would be
> > > a big issue, especially on RTOSes.
> > > It will be some kind of abstraction layer of interrupt handling
> > > (or nothing but a callback mechanism).
> > >
> > > > Also, eventfd is very Linux-centric and we are trying to design an
> > > > interface that could work well for RTOSes too. If we want to do
> > > > something different, both OS-agnostic and hypervisor-agnostic, perhaps
> > > > we could design a new interface. One that could be implementable in the
> > > > Xen hypervisor itself (like IOREQ) and of course any other hypervisor
> > > > too.
> > > >
> > > >
> > > > There is also another problem. IOREQ is probably not the only
> > > > interface needed. Have a look at
> > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> > > > an interface for the backend to inject interrupts into the frontend? And
> > > > if the backend requires dynamic memory mappings of frontend pages, then
> > > > we would also need an interface to map/unmap domU pages.
> > >
> > > My proposal document might help here; all the interfaces required for
> > > virtio-proxy (or hypervisor-related interfaces) are listed as
> > > RPC protocols :)
> > >
> > > > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> > > > and self-contained. It is easy to add anywhere. A new interface to
> > > > inject interrupts or map pages is more difficult to manage because it
> > > > would require changes scattered across the various emulators.
> > >
> > > Exactly. I am not yet confident that my approach will also apply
> > > to hypervisors other than Xen.
> > > Technically, yes, but whether people can accept it or not is a different
> > > matter.
> > >
> > > Thanks,
> > > -Takahiro Akashi
> >
> >
> >
> > --
> > Regards,
> >
> > Oleksandr Tyshchenko


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-17  8:07         ` AKASHI Takahiro
@ 2021-08-17  8:39           ` Wei Chen
  2021-08-18  5:38             ` AKASHI Takahiro
  0 siblings, 1 reply; 66+ messages in thread
From: Wei Chen @ 2021-08-17  8:39 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Alex Bennée,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant,
	Xen Devel

Hi Akashi,

> -----Original Message-----
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> Sent: 17 August 2021 16:08
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>; Stratos
> Mailing List <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
> open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> <viresh.kumar@linaro.org>; Stefano Stabellini
> <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
>
> Hi Wei, Oleksandr,
>
> On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> > Hi All,
> >
> > Thanks to Stefano for linking my kvmtool-for-Xen proposal here.
> > This proposal is still being discussed in the Xen and KVM communities.
> > The main work is to decouple kvmtool from KVM so that
> > other hypervisors can reuse the virtual device implementations.
> >
> > In this case, we need to introduce an intermediate hypervisor
> > layer for VMM abstraction, which I think is very close
> > to Stratos' virtio hypervisor agnosticism work.
>
> # My proposal[1] comes from my own idea and doesn't always represent
> # Linaro's view on this subject nor reflect Alex's concerns. Nevertheless,
>
> Your idea and my proposal seem to share the same background.
> Both have a similar goal and currently start with Xen,
> and are based on kvm-tool. (Actually, my work is derived from
> EPAM's virtio-disk, which is also based on kvm-tool.)
>
> In particular, the abstraction of hypervisor interfaces has the same
> set of interfaces (for your "struct vmm_impl" and my "RPC interfaces").
> This is not a coincidence, as we both share the same origin, as I said above.
> And so we will also share the same issues. One of them is a way of
> "sharing/mapping FE's memory". There is some trade-off between
> the portability and the performance impact.
> So we can discuss the topic here in this ML, too.
> (See Alex's original email, too).
>
Yes, I agree.
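To illustrate the kind of interface set the two proposals converge on (all names below are invented for the sketch; they only approximate the idea behind "struct vmm_impl" and the RPC interfaces, and are not copied from either proposal):

  /* A hypothetical hypervisor abstraction a portable BE could be
   * written against; each hypervisor (Xen IOREQ, KVM, ...) would
   * supply its own implementation of these hooks. */
  #include <stddef.h>
  #include <stdint.h>

  struct hyp_ops {
      int   (*register_io_region)(uint32_t fe_domid,
                                  uint64_t base, uint64_t size);
      void *(*map_fe_memory)(uint32_t fe_domid, uint64_t gpa, size_t len);
      int   (*unmap_fe_memory)(void *va, size_t len);
      int   (*inject_irq)(uint32_t fe_domid, uint32_t irq);
      int   (*wait_event)(uint64_t *event_out);   /* blocks for a kick */
  };

  /* The backend proper then only ever calls through the table: */
  static int notify_frontend(const struct hyp_ops *ops,
                             uint32_t fe_domid, uint32_t irq)
  {
      return ops->inject_irq(fe_domid, irq);
  }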

> On the other hand, my approach aims to create a "single-binary" solution
> in which the same BE VM binary could run on any hypervisor.
> It is somehow similar to your "proposal-#2" in [2], but in my solution, all
> the hypervisor-specific code would be put into another entity (VM),
> named "virtio-proxy", and the abstracted operations are served via RPC.
> (In this sense, the BE is hypervisor-agnostic but might have an OS dependency.)
> But I know that we need to discuss whether this is even a requirement
> in the Stratos project or not. (Maybe not.)
>
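A rough sketch of what such an RPC boundary might carry is below; the opcodes mirror the operations discussed in this thread, but the encoding and names are made up here, and the authoritative list is the one in the virtio-proxy proposal itself.

  /* Illustrative wire format for a virtio-proxy style RPC. */
  #include <stdint.h>

  enum proxy_op {
      PROXY_REGISTER_BE = 1,   /* announce a backend for a device        */
      PROXY_MAP_FE_MEM  = 2,   /* ask the proxy to map FE memory for BE  */
      PROXY_INJECT_IRQ  = 3,   /* BE -> FE notification                  */
      PROXY_EVENT       = 4,   /* FE -> BE kick, delivered as a callback */
  };

  struct proxy_msg {
      uint32_t op;             /* enum proxy_op                          */
      uint32_t fe_domid;       /* which frontend VM this refers to       */
      uint64_t arg0;           /* e.g. gpa, irq number or event id       */
      uint64_t arg1;           /* e.g. length                            */
  };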

Sorry, I haven't had time to finish reading your virtio-proxy proposal completely
(I will do it ASAP). But from your description, it seems we need a
third VM between the FE and BE? My concern is that, if my assumption is right,
won't it increase the latency in the data transport path, even if we're
using some lightweight guest like an RTOS or unikernel?

> Specifically speaking about kvm-tool, I have a concern about its
> license terms; targeting different hypervisors and different OSes
> (which I assume includes RTOSes), the resultant library should be
> permissively licensed, and the GPL of kvm-tool might be an issue.
> Any thoughts?
>

Yes. If a user wants to implement a FreeBSD device model but the virtio
library is GPL, then the GPL would be a problem. If we have another good
candidate, I am open to it.

> -Takahiro Akashi
>
>
> [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> August/000548.html
> [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
>
> >
> > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > > Sent: 14 August 2021 23:38
> > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano Stabellini
> <sstabellini@kernel.org>
> > > Cc: Alex Bennée <alex.bennee@linaro.org>; Stratos Mailing List
> <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-open.org; Arnd
> Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> <viresh.kumar@linaro.org>; Stefano Stabellini
> <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>; Oleksandr
> Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > >
> > > Hello, all.
> > >
> > > Please see some comments below. And sorry for the possible format
> issues.
> > >
> > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> <mailto:takahiro.akashi@linaro.org> wrote:
> > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > > > > CCing people working on Xen+VirtIO and IOREQs. Not trimming the
> original
> > > > > email to let them read the full context.
> > > > >
> > > > > My comments below are related to a potential Xen implementation,
> not
> > > > > because it is the only implementation that matters, but because it
> is
> > > > > the one I know best.
> > > >
> > > > Please note that my proposal (and hence the working prototype)[1]
> > > > is based on Xen's virtio implementation (i.e. IOREQ) and
> particularly
> > > > EPAM's virtio-disk application (backend server).
> > > > It has been, I believe, well generalized but is still a bit biased
> > > > toward this original design.
> > > >
> > > > So I hope you like my approach :)
> > > >
> > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> August/000546.html
> > > >
> > > > Let me take this opportunity to explain a bit more about my approach
> below.
> > > >
> > > > > Also, please see this relevant email thread:
> > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > >
> > > > >
> > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > > Hi,
> > > > > >
> > > > > > One of the goals of Project Stratos is to enable hypervisor
> agnostic
> > > > > > backends so we can enable as much re-use of code as possible and
> avoid
> > > > > > repeating ourselves. This is the flip side of the front end
> where
> > > > > > multiple front-end implementations are required - one per OS,
> assuming
> > > > > > you don't just want Linux guests. The resultant guests are
> trivially
> > > > > > movable between hypervisors modulo any abstracted paravirt type
> > > > > > interfaces.
> > > > > >
> > > > > > In my original thumb nail sketch of a solution I envisioned
> vhost-user
> > > > > > daemons running in a broadly POSIX like environment. The
> interface to
> > > > > > the daemon is fairly simple requiring only some mapped memory
> and some
> > > > > > sort of signalling for events (on Linux this is eventfd). The
> idea was a
> > > > > > stub binary would be responsible for any hypervisor specific
> setup and
> > > > > > then launch a common binary to deal with the actual virtqueue
> requests
> > > > > > themselves.
> > > > > >
> > > > > > Since that original sketch we've seen an expansion in the sort
> of ways
> > > > > > backends could be created. There is interest in encapsulating
> backends
> > > > > > in RTOSes or unikernels for solutions like SCMI. There interest
> in Rust
> > > > > > has prompted ideas of using the trait interface to abstract
> differences
> > > > > > away as well as the idea of bare-metal Rust backends.
> > > > > >
> > > > > > We have a card (STR-12) called "Hypercall Standardisation" which
> > > > > > calls for a description of the APIs needed from the hypervisor
> side to
> > > > > > support VirtIO guests and their backends. However we are some
> way off
> > > > > > from that at the moment as I think we need to at least
> demonstrate one
> > > > > > portable backend before we start codifying requirements. To that
> end I
> > > > > > want to think about what we need for a backend to function.
> > > > > >
> > > > > > Configuration
> > > > > > =============
> > > > > >
> > > > > > In the type-2 setup this is typically fairly simple because the
> host
> > > > > > system can orchestrate the various modules that make up the
> complete
> > > > > > system. In the type-1 case (or even type-2 with delegated
> service VMs)
> > > > > > we need some sort of mechanism to inform the backend VM about
> key
> > > > > > details about the system:
> > > > > >
> > > > > >   - where virt queue memory is in it's address space
> > > > > >   - how it's going to receive (interrupt) and trigger (kick)
> events
> > > > > >   - what (if any) resources the backend needs to connect to
> > > > > >
> > > > > > Obviously you can elide over configuration issues by having
> static
> > > > > > configurations and baking the assumptions into your guest images
> however
> > > > > > this isn't scalable in the long term. The obvious solution seems
> to be
> > > > > > extending a subset of Device Tree data to user space but perhaps
> there
> > > > > > are other approaches?
> > > > > >
> > > > > > Before any virtio transactions can take place the appropriate
> memory
> > > > > > mappings need to be made between the FE guest and the BE guest.
> > > > >
> > > > > > Currently the whole of the FE guests address space needs to be
> visible
> > > > > > to whatever is serving the virtio requests. I can envision 3
> approaches:
> > > > > >
> > > > > >  * BE guest boots with memory already mapped
> > > > > >
> > > > > >  This would entail the guest OS knowing where in it's Guest
> Physical
> > > > > >  Address space is already taken up and avoiding clashing. I
> would assume
> > > > > >  in this case you would want a standard interface to userspace
> to then
> > > > > >  make that address space visible to the backend daemon.
> > > >
> > > > Yet another way here is that we would have well known "shared
> memory" between
> > > > VMs. I think that Jailhouse's ivshmem gives us good insights on this
> matter
> > > > and that it can even be an alternative for hypervisor-agnostic
> solution.
> > > >
> > > > (Please note memory regions in ivshmem appear as a PCI device and
> can be
> > > > mapped locally.)
> > > >
> > > > I want to add this shared memory aspect to my virtio-proxy, but
> > > > the resultant solution would eventually look similar to ivshmem.
> > > >
> > > > > >  * BE guests boots with a hypervisor handle to memory
> > > > > >
> > > > > >  The BE guest is then free to map the FE's memory to where it
> wants in
> > > > > >  the BE's guest physical address space.
> > > > >
> > > > > I cannot see how this could work for Xen. There is no "handle" to
> give
> > > > > to the backend if the backend is not running in dom0. So for Xen I
> think
> > > > > the memory has to be already mapped
> > > >
> > > > In Xen's IOREQ solution (virtio-blk), the following information is
> expected
> > > > to be exposed to BE via Xenstore:
> > > > (I know that this is a tentative approach though.)
> > > >    - the start address of configuration space
> > > >    - interrupt number
> > > >    - file path for backing storage
> > > >    - read-only flag
> > > > And the BE server have to call a particular hypervisor interface to
> > > > map the configuration space.
> > >
> > > Yes, Xenstore was chosen as a simple way to pass configuration info to
> the backend running in a non-toolstack domain.
> > > I remember, there was a wish to avoid using Xenstore in Virtio backend
> itself if possible, so for non-toolstack domain, this could done with
> adjusting devd (daemon that listens for devices and launches backends)
> > > to read backend configuration from the Xenstore anyway and pass it to
> the backend via command line arguments.
> > >
> >
> > Yes, in current PoC code we're using xenstore to pass device
> configuration.
> > We also designed a static device configuration parse method for Dom0less
> or
> > other scenarios don't have xentool. yes, it's from device model command
> line
> > or a config file.
> >
> > > But, if ...
> > >
> > > >
> > > > In my approach (virtio-proxy), all those Xen (or hypervisor)-
> specific
> > > > stuffs are contained in virtio-proxy, yet another VM, to hide all
> details.
> > >
> > > ... the solution how to overcome that is already found and proven to
> work then even better.
> > >
> > >
> > >
> > > > # My point is that a "handle" is not mandatory for executing mapping.
> > > >
> > > > > and the mapping probably done by the
> > > > > toolstack (also see below.) Or we would have to invent a new Xen
> > > > > hypervisor interface and Xen virtual machine privileges to allow
> this
> > > > > kind of mapping.
> > > >
> > > > > If we run the backend in Dom0 that we have no problems of course.
> > > >
> > > > One of difficulties on Xen that I found in my approach is that
> calling
> > > > such hypervisor intefaces (registering IOREQ, mapping memory) is
> only
> > > > allowed on BE servers themselvies and so we will have to extend
> those
> > > > interfaces.
> > > > This, however, will raise some concern on security and privilege
> distribution
> > > > as Stefan suggested.
> > >
> > > We also faced policy related issues with Virtio backend running in
> other than Dom0 domain in a "dummy" xsm mode. In our target system we run
> the backend in a driver
> > > domain (we call it DomD) where the underlying H/W resides. We trust it,
> so we wrote policy rules (to be used in "flask" xsm mode) to provide it
> with a little bit more privileges than a simple DomU had.
> > > Now it is permitted to issue device-model, resource and memory
> mappings, etc calls.
> > >
> > > > >
> > > > >
> > > > > > To activate the mapping will
> > > > > >  require some sort of hypercall to the hypervisor. I can see two
> options
> > > > > >  at this point:
> > > > > >
> > > > > >   - expose the handle to userspace for daemon/helper to trigger
> the
> > > > > >     mapping via existing hypercall interfaces. If using a helper
> you
> > > > > >     would have a hypervisor specific one to avoid the daemon
> having to
> > > > > >     care too much about the details or push that complexity into
> a
> > > > > >     compile time option for the daemon which would result in
> different
> > > > > >     binaries although a common source base.
> > > > > >
> > > > > >   - expose a new kernel ABI to abstract the hypercall
> differences away
> > > > > >     in the guest kernel. In this case the userspace would
> essentially
> > > > > >     ask for an abstract "map guest N memory to userspace ptr"
> and let
> > > > > >     the kernel deal with the different hypercall interfaces.
> This of
> > > > > >     course assumes the majority of BE guests would be Linux
> kernels and
> > > > > >     leaves the bare-metal/unikernel approaches to their own
> devices.
> > > > > >
> > > > > > Operation
> > > > > > =========
> > > > > >
> > > > > > The core of the operation of VirtIO is fairly simple. Once the
> > > > > > vhost-user feature negotiation is done it's a case of receiving
> update
> > > > > > events and parsing the resultant virt queue for data. The vhost-
> user
> > > > > > specification handles a bunch of setup before that point, mostly
> to
> > > > > > detail where the virt queues are set up FD's for memory and
> event
> > > > > > communication. This is where the envisioned stub process would
> be
> > > > > > responsible for getting the daemon up and ready to run. This is
> > > > > > currently done inside a big VMM like QEMU but I suspect a modern
> > > > > > approach would be to use the rust-vmm vhost crate. It would then
> either
> > > > > > communicate with the kernel's abstracted ABI or be re-targeted
> as a
> > > > > > build option for the various hypervisors.
> > > > >
> > > > > One thing I mentioned before to Alex is that Xen doesn't have VMMs
> the
> > > > > way they are typically envisioned and described in other
> environments.
> > > > > Instead, Xen has IOREQ servers. Each of them connects
> independently to
> > > > > Xen via the IOREQ interface. E.g. today multiple QEMUs could be
> used as
> > > > > emulators for a single Xen VM, each of them connecting to Xen
> > > > > independently via the IOREQ interface.
> > > > >
> > > > > The component responsible for starting a daemon and/or setting up
> shared
> > > > > interfaces is the toolstack: the xl command and the libxl/libxc
> > > > > libraries.
> > > >
> > > > I think that VM configuration management (or orchestration in
> Startos
> > > > jargon?) is a subject to debate in parallel.
> > > > Otherwise, is there any good assumption to avoid it right now?
> > > >
> > > > > Oleksandr and others I CCed have been working on ways for the
> toolstack
> > > > > to create virtio backends and setup memory mappings. They might be
> able
> > > > > to provide more info on the subject. I do think we miss a way to
> provide
> > > > > the configuration to the backend and anything else that the
> backend
> > > > > might require to start doing its job.
> > >
> > > Yes, some work has been done for the toolstack to handle Virtio MMIO
> devices in
> > > general and Virtio block devices in particular. However, it has not
> been upstreaned yet.
> > > Updated patches on review now:
> > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-
> olekstysh@gmail.com/
> > >
> > > There is an additional (also important) activity to improve/fix
> foreign memory mapping on Arm which I am also involved in.
> > > The foreign memory mapping is proposed to be used for Virtio backends
> (device emulators) if there is a need to run guest OS completely
> unmodified.
> > > Of course, the more secure way would be to use grant memory mapping.
> Brietly, the main difference between them is that with foreign mapping the
> backend
> > > can map any guest memory it wants to map, but with grant mapping it is
> allowed to map only what was previously granted by the frontend.
> > >
> > > So, there might be a problem if we want to pre-map some guest memory
> in advance or to cache mappings in the backend in order to improve
> performance (because the mapping/unmapping guest pages every request
> requires a lot of back and forth to Xen + P2M updates). In a nutshell,
> currently, in order to map a guest page into the backend address space we
> need to steal a real physical page from the backend domain. So, with the
> said optimizations we might end up with no free memory in the backend
> domain (see XSA-300). And what we try to achieve is to not waste a real
> domain memory at all by providing safe non-allocated-yet (so unused)
> address space for the foreign (and grant) pages to be mapped into, this
> enabling work implies Xen and Linux (and likely DTB bindings) changes.
> However, as it turned out, for this to work in a proper and safe way some
> prereq work needs to be done.
> > > You can find the related Xen discussion at:
> > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-
> olekstysh@gmail.com/
> > >
> > >
> > > > >
> > > > >
> > > > > > One question is how to best handle notification and kicks. The
> existing
> > > > > > vhost-user framework uses eventfd to signal the daemon (although
> QEMU
> > > > > > is quite capable of simulating them when you use TCG). Xen has
> it's own
> > > > > > IOREQ mechanism. However latency is an important factor and
> having
> > > > > > events go through the stub would add quite a lot.
> > > > >
> > > > > Yeah I think, regardless of anything else, we want the backends to
> > > > > connect directly to the Xen hypervisor.
> > > >
> > > > In my approach,
> > > >  a) BE -> FE: interrupts triggered by BE calling a hypervisor
> interface
> > > >               via virtio-proxy
> > > >  b) FE -> BE: MMIO to config raises events (in event channels),
> which is
> > > >               converted to a callback to BE via virtio-proxy
> > > >               (Xen's event channel is internnally implemented by
> interrupts.)
> > > >
> > > > I don't know what "connect directly" means here, but sending
> interrupts
> > > > to the opposite side would be best efficient.
> > > > Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x
> mechanism.
> > >
> > > Agree that MSI would be more efficient than SPI...
> > > At the moment, in order to notify the frontend, the backend issues a
> specific device-model call to query Xen to inject a corresponding SPI to
> the guest.
> > >
> > >
> > >
> > > > >
> > > > > > Could we consider the kernel internally converting IOREQ
> messages from
> > > > > > the Xen hypervisor to eventfd events? Would this scale with
> other kernel
> > > > > > hypercall interfaces?
> > > > > >
> > > > > > So any thoughts on what directions are worth experimenting with?
> > > > >
> > > > > One option we should consider is for each backend to connect to
> Xen via
> > > > > the IOREQ interface. We could generalize the IOREQ interface and
> make it
> > > > > hypervisor agnostic. The interface is really trivial and easy to
> add.
> > > >
> > > > As I said above, my proposal does the same thing that you mentioned
> here :)
> > > > The difference is that I do call hypervisor interfaces via virtio-
> proxy.
> > > >
> > > > > The only Xen-specific part is the notification mechanism, which is
> an
> > > > > event channel. If we replaced the event channel with something
> else the
> > > > > interface would be generic. See:
> > > > > https://gitlab.com/xen-project/xen/-
> /blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > >
> > > > > I don't think that translating IOREQs to eventfd in the kernel is
> a
> > > > > good idea: if feels like it would be extra complexity and that the
> > > > > kernel shouldn't be involved as this is a backend-hypervisor
> interface.
> > > >
> > > > Given that we may want to implement BE as a bare-metal application
> > > > as I did on Zephyr, I don't think that the translation would not be
> > > > a big issue, especially on RTOS's.
> > > > It will be some kind of abstraction layer of interrupt handling
> > > > (or nothing but a callback mechanism).
> > > >
> > > > > Also, eventfd is very Linux-centric and we are trying to design an
> > > > > interface that could work well for RTOSes too. If we want to do
> > > > > something different, both OS-agnostic and hypervisor-agnostic,
> perhaps
> > > > > we could design a new interface. One that could be implementable
> in the
> > > > > Xen hypervisor itself (like IOREQ) and of course any other
> hypervisor
> > > > > too.
> > > > >
> > > > >
> > > > > There is also another problem. IOREQ is probably not be the only
> > > > > interface needed. Have a look at
> > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we
> also need
> > > > > an interface for the backend to inject interrupts into the
> frontend? And
> > > > > if the backend requires dynamic memory mappings of frontend pages,
> then
> > > > > we would also need an interface to map/unmap domU pages.
> > > >
> > > > My proposal document might help here; All the interfaces required
> for
> > > > virtio-proxy (or hypervisor-related interfaces) are listed as
> > > > RPC protocols :)
> > > >
> > > > > These interfaces are a lot more problematic than IOREQ: IOREQ is
> tiny
> > > > > and self-contained. It is easy to add anywhere. A new interface to
> > > > > inject interrupts or map pages is more difficult to manage because
> it
> > > > > would require changes scattered across the various emulators.
> > > >
> > > > Exactly. I have no confident yet that my approach will also apply
> > > > to other hypervisors than Xen.
> > > > Technically, yes, but whether people can accept it or not is a
> different
> > > > matter.
> > > >
> > > > Thanks,
> > > > -Takahiro Akashi
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Oleksandr Tyshchenko

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-04 19:20 ` Stefano Stabellini
@ 2021-08-17 10:41     ` Stefan Hajnoczi
       [not found]   ` <0100017b33e585a5-06d4248e-b1a7-485e-800c-7ead89e5f916-000000@email.amazonses.com>
  2021-08-17 10:41     ` [virtio-dev] " Stefan Hajnoczi
  2 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-08-17 10:41 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Alex Bennée, Stratos Mailing List, virtio-dev,
	Arnd Bergmann, Viresh Kumar, AKASHI Takahiro, Stefano Stabellini,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel


On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > Could we consider the kernel internally converting IOREQ messages from
> > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > hypercall interfaces?
> > 
> > So any thoughts on what directions are worth experimenting with?
>  
> One option we should consider is for each backend to connect to Xen via
> the IOREQ interface. We could generalize the IOREQ interface and make it
> hypervisor agnostic. The interface is really trivial and easy to add.
> The only Xen-specific part is the notification mechanism, which is an
> event channel. If we replaced the event channel with something else the
> interface would be generic. See:
> https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52

There have been experiments with something kind of similar in KVM
recently (see struct ioregionfd_cmd):
https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanasova@gmail.com/

> There is also another problem. IOREQ is probably not the only
> interface needed. Have a look at
> https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> an interface for the backend to inject interrupts into the frontend? And
> if the backend requires dynamic memory mappings of frontend pages, then
> we would also need an interface to map/unmap domU pages.
> 
> These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> and self-contained. It is easy to add anywhere. A new interface to
> inject interrupts or map pages is more difficult to manage because it
> would require changes scattered across the various emulators.

Something like ioreq is indeed necessary to implement arbitrary devices,
but if you are willing to restrict yourself to VIRTIO then other
interfaces are possible too because the VIRTIO device model is different
from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.
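As a sketch of that reduction (names invented here for illustration only): if only VIRTIO has to be supported, the hypervisor-facing surface shrinks to queue notification, config-space access and guest interrupts.

  /* Hypothetical VIRTIO-only transport interface: no general PIO/MMIO
   * emulation, just the three things a virtio device model needs. */
  #include <stddef.h>
  #include <stdint.h>

  struct virtio_transport_ops {
      /* FE -> BE: the FE wrote the queue-notify register. */
      void (*queue_notify)(void *dev, uint16_t queue_index);

      /* FE <-> BE: device config space accesses. */
      uint64_t (*config_read)(void *dev, uint64_t offset, unsigned size);
      void     (*config_write)(void *dev, uint64_t offset,
                               uint64_t val, unsigned size);

      /* BE -> FE: raise the device interrupt (used buffer or config change). */
      void (*notify_guest)(void *dev);
  };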

Stefan


^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-17  8:39           ` Wei Chen
@ 2021-08-18  5:38             ` AKASHI Takahiro
  2021-08-18  8:35               ` Wei Chen
  0 siblings, 1 reply; 66+ messages in thread
From: AKASHI Takahiro @ 2021-08-18  5:38 UTC (permalink / raw)
  To: Wei Chen
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Alex Bennée,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant,
	Xen Devel

On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
> Hi Akashi,
> 
> > -----Original Message-----
> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > Sent: 17 August 2021 16:08
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>; Stratos
> > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
> > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > <viresh.kumar@linaro.org>; Stefano Stabellini
> > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> >
> > Hi Wei, Oleksandr,
> >
> > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> > > Hi All,
> > >
> > > Thanks for Stefano to link my kvmtool for Xen proposal here.
> > > This proposal is still discussing in Xen and KVM communities.
> > > The main work is to decouple the kvmtool from KVM and make
> > > other hypervisors can reuse the virtual device implementations.
> > >
> > > In this case, we need to introduce an intermediate hypervisor
> > > layer for VMM abstraction, Which is, I think it's very close
> > > to stratos' virtio hypervisor agnosticism work.
> >
> > # My proposal[1] comes from my own idea and doesn't always represent
> > # Linaro's view on this subject nor reflect Alex's concerns. Nevertheless,
> >
> > Your idea and my proposal seem to share the same background.
> > Both have the similar goal and currently start with, at first, Xen
> > and are based on kvm-tool. (Actually, my work is derived from
> > EPAM's virtio-disk, which is also based on kvm-tool.)
> >
> > In particular, the abstraction of hypervisor interfaces has a same
> > set of interfaces (for your "struct vmm_impl" and my "RPC interfaces").
> > This is not co-incident as we both share the same origin as I said above.
> > And so we will also share the same issues. One of them is a way of
> > "sharing/mapping FE's memory". There is some trade-off between
> > the portability and the performance impact.
> > So we can discuss the topic here in this ML, too.
> > (See Alex's original email, too).
> >
> Yes, I agree.
> 
> > On the other hand, my approach aims to create a "single-binary" solution
> > in which the same binary of BE vm could run on any hypervisors.
> > Somehow similar to your "proposal-#2" in [2], but in my solution, all
> > the hypervisor-specific code would be put into another entity (VM),
> > named "virtio-proxy" and the abstracted operations are served via RPC.
> > (In this sense, BE is hypervisor-agnostic but might have OS dependency.)
> > But I know that we need discuss if this is a requirement even
> > in Stratos project or not. (Maybe not)
> >
> 
> Sorry, I haven't had time to finish reading your virtio-proxy completely
> (I will do it ASAP). But from your description, it seems we need a
> 3rd VM between FE and BE? My concern is that, if my assumption is right,
> will it increase the latency in data transport path? Even if we're
> using some lightweight guest like RTOS or Unikernel,

Yes, you're right. But I'm afraid that it is a matter of degree.
As long as we execute 'mapping' operations at every fetch of a payload,
we will see a latency issue (even in your case), and if we have some solution
for it, we won't see it in my proposal either :)
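One way to make that concrete (only a sketch of the caching idea, not code from either proposal; map_guest_page()/unmap_guest_page() stand in for whatever hypervisor-specific primitive is available) is to keep already-mapped pages around instead of remapping per payload:

  #include <stddef.h>
  #include <stdint.h>

  #define CACHE_SLOTS 256

  extern void *map_guest_page(uint64_t gfn);   /* assumed primitive */
  extern void  unmap_guest_page(void *va);     /* assumed primitive */

  struct map_slot { uint64_t gfn; void *va; };
  static struct map_slot cache[CACHE_SLOTS];

  /* Direct-mapped cache keyed by guest frame number: a hit avoids the
   * expensive map call entirely; a miss evicts one older mapping. */
  static void *get_guest_page(uint64_t gfn)
  {
      struct map_slot *s = &cache[gfn % CACHE_SLOTS];
      if (s->va && s->gfn == gfn)
          return s->va;
      if (s->va)
          unmap_guest_page(s->va);
      s->va  = map_guest_page(gfn);
      s->gfn = gfn;
      return s->va;
  }

Of course, as noted earlier in the thread, on Xen such caching runs into the XSA-300 style resource questions, which is exactly why the safe unallocated address-space work matters.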

> > Specifically speaking about kvm-tool, I have a concern about its
> > license term; Targeting different hypervisors and different OSs
> > (which I assume includes RTOS's), the resultant library should be
> > license permissive and GPL for kvm-tool might be an issue.
> > Any thoughts?
> >
> 
> Yes. If user want to implement a FreeBSD device model, but the virtio
> library is GPL. Then GPL would be a problem. If we have another good
> candidate, I am open to it.

I have some candidates in mind, particularly for vq/vring:
* Open-AMP, or
* the corresponding FreeBSD code
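Whichever library is picked, the core data structure is fixed by the VirtIO specification; for reference, the split-ring descriptor that any vq/vring implementation has to provide is just:

  /* Split-ring descriptor as defined by the VirtIO spec (fields are
   * little-endian on the wire). */
  #include <stdint.h>

  #define VRING_DESC_F_NEXT     1   /* chain continues via 'next'      */
  #define VRING_DESC_F_WRITE    2   /* device writes into this buffer  */
  #define VRING_DESC_F_INDIRECT 4   /* buffer holds a descriptor table */

  struct vring_desc {
      uint64_t addr;    /* guest physical address of the buffer */
      uint32_t len;     /* length of the buffer in bytes        */
      uint16_t flags;   /* VRING_DESC_F_*                       */
      uint16_t next;    /* next descriptor index when chained   */
  };

So the licensing question is mostly about the surrounding helper code, not about the layout itself.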

-Takahiro Akashi


> > -Takahiro Akashi
> >
> >
> > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > August/000548.html
> > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
> >
> > >
> > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > > > Sent: 14 August 2021 23:38
> > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano Stabellini
> > <sstabellini@kernel.org>
> > > > Cc: Alex Bennée <alex.bennee@linaro.org>; Stratos Mailing List
> > <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-open.org; Arnd
> > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > <viresh.kumar@linaro.org>; Stefano Stabellini
> > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>; Oleksandr
> > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > >
> > > > Hello, all.
> > > >
> > > > Please see some comments below. And sorry for the possible format
> > issues.
> > > >
> > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> > <mailto:takahiro.akashi@linaro.org> wrote:
> > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > > > > > CCing people working on Xen+VirtIO and IOREQs. Not trimming the
> > original
> > > > > > email to let them read the full context.
> > > > > >
> > > > > > My comments below are related to a potential Xen implementation,
> > not
> > > > > > because it is the only implementation that matters, but because it
> > is
> > > > > > the one I know best.
> > > > >
> > > > > Please note that my proposal (and hence the working prototype)[1]
> > > > > is based on Xen's virtio implementation (i.e. IOREQ) and
> > particularly
> > > > > EPAM's virtio-disk application (backend server).
> > > > > It has been, I believe, well generalized but is still a bit biased
> > > > > toward this original design.
> > > > >
> > > > > So I hope you like my approach :)
> > > > >
> > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > August/000546.html
> > > > >
> > > > > Let me take this opportunity to explain a bit more about my approach
> > below.
> > > > >
> > > > > > Also, please see this relevant email thread:
> > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > >
> > > > > >
> > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > One of the goals of Project Stratos is to enable hypervisor
> > agnostic
> > > > > > > backends so we can enable as much re-use of code as possible and
> > avoid
> > > > > > > repeating ourselves. This is the flip side of the front end
> > where
> > > > > > > multiple front-end implementations are required - one per OS,
> > assuming
> > > > > > > you don't just want Linux guests. The resultant guests are
> > trivially
> > > > > > > movable between hypervisors modulo any abstracted paravirt type
> > > > > > > interfaces.
> > > > > > >
> > > > > > > In my original thumb nail sketch of a solution I envisioned
> > vhost-user
> > > > > > > daemons running in a broadly POSIX like environment. The
> > interface to
> > > > > > > the daemon is fairly simple requiring only some mapped memory
> > and some
> > > > > > > sort of signalling for events (on Linux this is eventfd). The
> > idea was a
> > > > > > > stub binary would be responsible for any hypervisor specific
> > setup and
> > > > > > > then launch a common binary to deal with the actual virtqueue
> > requests
> > > > > > > themselves.
> > > > > > >
> > > > > > > Since that original sketch we've seen an expansion in the sort
> > of ways
> > > > > > > backends could be created. There is interest in encapsulating
> > backends
> > > > > > > in RTOSes or unikernels for solutions like SCMI. There interest
> > in Rust
> > > > > > > has prompted ideas of using the trait interface to abstract
> > differences
> > > > > > > away as well as the idea of bare-metal Rust backends.
> > > > > > >
> > > > > > > We have a card (STR-12) called "Hypercall Standardisation" which
> > > > > > > calls for a description of the APIs needed from the hypervisor
> > side to
> > > > > > > support VirtIO guests and their backends. However we are some
> > way off
> > > > > > > from that at the moment as I think we need to at least
> > demonstrate one
> > > > > > > portable backend before we start codifying requirements. To that
> > end I
> > > > > > > want to think about what we need for a backend to function.
> > > > > > >
> > > > > > > Configuration
> > > > > > > =============
> > > > > > >
> > > > > > > In the type-2 setup this is typically fairly simple because the
> > host
> > > > > > > system can orchestrate the various modules that make up the
> > complete
> > > > > > > system. In the type-1 case (or even type-2 with delegated
> > service VMs)
> > > > > > > we need some sort of mechanism to inform the backend VM about
> > key
> > > > > > > details about the system:
> > > > > > >
> > > > > > >   - where virt queue memory is in it's address space
> > > > > > >   - how it's going to receive (interrupt) and trigger (kick)
> > events
> > > > > > >   - what (if any) resources the backend needs to connect to
> > > > > > >
> > > > > > > Obviously you can elide over configuration issues by having
> > static
> > > > > > > configurations and baking the assumptions into your guest images
> > however
> > > > > > > this isn't scalable in the long term. The obvious solution seems
> > to be
> > > > > > > extending a subset of Device Tree data to user space but perhaps
> > there
> > > > > > > are other approaches?
> > > > > > >
> > > > > > > Before any virtio transactions can take place the appropriate
> > memory
> > > > > > > mappings need to be made between the FE guest and the BE guest.
> > > > > >
> > > > > > > Currently the whole of the FE guests address space needs to be
> > visible
> > > > > > > to whatever is serving the virtio requests. I can envision 3
> > approaches:
> > > > > > >
> > > > > > >  * BE guest boots with memory already mapped
> > > > > > >
> > > > > > >  This would entail the guest OS knowing where in it's Guest
> > Physical
> > > > > > >  Address space is already taken up and avoiding clashing. I
> > would assume
> > > > > > >  in this case you would want a standard interface to userspace
> > to then
> > > > > > >  make that address space visible to the backend daemon.
> > > > >
> > > > > Yet another way here is that we would have well known "shared
> > memory" between
> > > > > VMs. I think that Jailhouse's ivshmem gives us good insights on this
> > matter
> > > > > and that it can even be an alternative for hypervisor-agnostic
> > solution.
> > > > >
> > > > > (Please note memory regions in ivshmem appear as a PCI device and
> > can be
> > > > > mapped locally.)
> > > > >
> > > > > I want to add this shared memory aspect to my virtio-proxy, but
> > > > > the resultant solution would eventually look similar to ivshmem.
> > > > >
> > > > > > >  * BE guests boots with a hypervisor handle to memory
> > > > > > >
> > > > > > >  The BE guest is then free to map the FE's memory to where it
> > wants in
> > > > > > >  the BE's guest physical address space.
> > > > > >
> > > > > > I cannot see how this could work for Xen. There is no "handle" to
> > give
> > > > > > to the backend if the backend is not running in dom0. So for Xen I
> > think
> > > > > > the memory has to be already mapped
> > > > >
> > > > > In Xen's IOREQ solution (virtio-blk), the following information is
> > expected
> > > > > to be exposed to BE via Xenstore:
> > > > > (I know that this is a tentative approach though.)
> > > > >    - the start address of configuration space
> > > > >    - interrupt number
> > > > >    - file path for backing storage
> > > > >    - read-only flag
> > > > > And the BE server has to call a particular hypervisor interface to
> > > > > map the configuration space.
> > > >
> > > > Yes, Xenstore was chosen as a simple way to pass configuration info to
> > the backend running in a non-toolstack domain.
> > > > I remember, there was a wish to avoid using Xenstore in Virtio backend
> > itself if possible, so for a non-toolstack domain, this could be done by
> > adjusting devd (daemon that listens for devices and launches backends)
> > > > to read backend configuration from the Xenstore anyway and pass it to
> > the backend via command line arguments.
> > > >
> > >
> > > Yes, in current PoC code we're using xenstore to pass device
> > configuration.
> > > We also designed a static device configuration parse method for Dom0less
> > or
> > other scenarios that don't have xentool. Yes, it's from the device model command
> > line
> > > or a config file.
> > >
> > > > But, if ...
> > > >
> > > > >
> > > > > In my approach (virtio-proxy), all those Xen (or hypervisor)-
> > specific
> > > > > stuff is contained in virtio-proxy, yet another VM, to hide all
> > details.
> > > >
> > > > ... the solution how to overcome that is already found and proven to
> > work then even better.
> > > >
> > > >
> > > >
> > > > > # My point is that a "handle" is not mandatory for executing mapping.
> > > > >
> > > > > > and the mapping probably done by the
> > > > > > toolstack (also see below.) Or we would have to invent a new Xen
> > > > > > hypervisor interface and Xen virtual machine privileges to allow
> > this
> > > > > > kind of mapping.
> > > > >
> > > > > > If we run the backend in Dom0 that we have no problems of course.
> > > > >
> > > > > One of the difficulties on Xen that I found in my approach is that
> > calling
> > > > > such hypervisor interfaces (registering IOREQ, mapping memory) is
> > only
> > > > > allowed on BE servers themselves and so we will have to extend
> > those
> > > > > interfaces.
> > > > > This, however, will raise some concern on security and privilege
> > distribution
> > > > > as Stefan suggested.
> > > >
> > > > We also faced policy related issues with Virtio backend running in
> > other than Dom0 domain in a "dummy" xsm mode. In our target system we run
> > the backend in a driver
> > > > domain (we call it DomD) where the underlying H/W resides. We trust it,
> > so we wrote policy rules (to be used in "flask" xsm mode) to provide it
> > with a little bit more privileges than a simple DomU had.
> > > > Now it is permitted to issue device-model, resource and memory
> > mappings, etc calls.
> > > >
> > > > > >
> > > > > >
> > > > > > > To activate the mapping will
> > > > > > >  require some sort of hypercall to the hypervisor. I can see two
> > options
> > > > > > >  at this point:
> > > > > > >
> > > > > > >   - expose the handle to userspace for daemon/helper to trigger
> > the
> > > > > > >     mapping via existing hypercall interfaces. If using a helper
> > you
> > > > > > >     would have a hypervisor specific one to avoid the daemon
> > having to
> > > > > > >     care too much about the details or push that complexity into
> > a
> > > > > > >     compile time option for the daemon which would result in
> > different
> > > > > > >     binaries although a common source base.
> > > > > > >
> > > > > > >   - expose a new kernel ABI to abstract the hypercall
> > differences away
> > > > > > >     in the guest kernel. In this case the userspace would
> > essentially
> > > > > > >     ask for an abstract "map guest N memory to userspace ptr"
> > and let
> > > > > > >     the kernel deal with the different hypercall interfaces.
> > This of
> > > > > > >     course assumes the majority of BE guests would be Linux
> > kernels and
> > > > > > >     leaves the bare-metal/unikernel approaches to their own
> > devices.
> > > > > > >
> > > > > > > Operation
> > > > > > > =========
> > > > > > >
> > > > > > > The core of the operation of VirtIO is fairly simple. Once the
> > > > > > > vhost-user feature negotiation is done it's a case of receiving
> > update
> > > > > > > events and parsing the resultant virt queue for data. The vhost-
> > user
> > > > > > > specification handles a bunch of setup before that point, mostly
> > to
> > > > > > > detail where the virt queues are set up FD's for memory and
> > event
> > > > > > > communication. This is where the envisioned stub process would
> > be
> > > > > > > responsible for getting the daemon up and ready to run. This is
> > > > > > > currently done inside a big VMM like QEMU but I suspect a modern
> > > > > > > approach would be to use the rust-vmm vhost crate. It would then
> > either
> > > > > > > communicate with the kernel's abstracted ABI or be re-targeted
> > as a
> > > > > > > build option for the various hypervisors.
> > > > > >
> > > > > > One thing I mentioned before to Alex is that Xen doesn't have VMMs
> > the
> > > > > > way they are typically envisioned and described in other
> > environments.
> > > > > > Instead, Xen has IOREQ servers. Each of them connects
> > independently to
> > > > > > Xen via the IOREQ interface. E.g. today multiple QEMUs could be
> > used as
> > > > > > emulators for a single Xen VM, each of them connecting to Xen
> > > > > > independently via the IOREQ interface.
> > > > > >
> > > > > > The component responsible for starting a daemon and/or setting up
> > shared
> > > > > > interfaces is the toolstack: the xl command and the libxl/libxc
> > > > > > libraries.
> > > > >
> > > > > I think that VM configuration management (or orchestration in
> > Stratos
> > > > > jargon?) is a subject to debate in parallel.
> > > > > Otherwise, is there any good assumption to avoid it right now?
> > > > >
> > > > > > Oleksandr and others I CCed have been working on ways for the
> > toolstack
> > > > > > to create virtio backends and setup memory mappings. They might be
> > able
> > > > > > to provide more info on the subject. I do think we miss a way to
> > provide
> > > > > > the configuration to the backend and anything else that the
> > backend
> > > > > > might require to start doing its job.
> > > >
> > > > Yes, some work has been done for the toolstack to handle Virtio MMIO
> > devices in
> > > > general and Virtio block devices in particular. However, it has not
> > been upstreamed yet.
> > > > Updated patches on review now:
> > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-
> > olekstysh@gmail.com/
> > > >
> > > > There is an additional (also important) activity to improve/fix
> > foreign memory mapping on Arm which I am also involved in.
> > > > The foreign memory mapping is proposed to be used for Virtio backends
> > (device emulators) if there is a need to run guest OS completely
> > unmodified.
> > > > Of course, the more secure way would be to use grant memory mapping.
> > Briefly, the main difference between them is that with foreign mapping the
> > backend
> > > > can map any guest memory it wants to map, but with grant mapping it is
> > allowed to map only what was previously granted by the frontend.
> > > >
> > > > So, there might be a problem if we want to pre-map some guest memory
> > in advance or to cache mappings in the backend in order to improve
> > performance (because the mapping/unmapping guest pages every request
> > requires a lot of back and forth to Xen + P2M updates). In a nutshell,
> > currently, in order to map a guest page into the backend address space we
> > need to steal a real physical page from the backend domain. So, with the
> > said optimizations we might end up with no free memory in the backend
> > domain (see XSA-300). And what we try to achieve is to not waste a real
> > domain memory at all by providing safe non-allocated-yet (so unused)
> > address space for the foreign (and grant) pages to be mapped into, this
> > enabling work implies Xen and Linux (and likely DTB bindings) changes.
> > However, as it turned out, for this to work in a proper and safe way some
> > prereq work needs to be done.
> > > > You can find the related Xen discussion at:
> > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-
> > olekstysh@gmail.com/
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > > One question is how to best handle notification and kicks. The
> > existing
> > > > > > > vhost-user framework uses eventfd to signal the daemon (although
> > QEMU
> > > > > > > is quite capable of simulating them when you use TCG). Xen has
> > it's own
> > > > > > > IOREQ mechanism. However latency is an important factor and
> > having
> > > > > > > events go through the stub would add quite a lot.
> > > > > >
> > > > > > Yeah I think, regardless of anything else, we want the backends to
> > > > > > connect directly to the Xen hypervisor.
> > > > >
> > > > > In my approach,
> > > > >  a) BE -> FE: interrupts triggered by BE calling a hypervisor
> > interface
> > > > >               via virtio-proxy
> > > > >  b) FE -> BE: MMIO to config raises events (in event channels),
> > which is
> > > > >               converted to a callback to BE via virtio-proxy
> > > > >               (Xen's event channel is internally implemented by
> > interrupts.)
> > > > >
> > > > > I don't know what "connect directly" means here, but sending
> > interrupts
> > > > > to the opposite side would be the most efficient.
> > > > > Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x
> > mechanism.
> > > >
> > > > Agree that MSI would be more efficient than SPI...
> > > > At the moment, in order to notify the frontend, the backend issues a
> > specific device-model call to query Xen to inject a corresponding SPI to
> > the guest.
> > > >
> > > >
> > > >
> > > > > >
> > > > > > > Could we consider the kernel internally converting IOREQ
> > messages from
> > > > > > > the Xen hypervisor to eventfd events? Would this scale with
> > other kernel
> > > > > > > hypercall interfaces?
> > > > > > >
> > > > > > > So any thoughts on what directions are worth experimenting with?
> > > > > >
> > > > > > One option we should consider is for each backend to connect to
> > Xen via
> > > > > > the IOREQ interface. We could generalize the IOREQ interface and
> > make it
> > > > > > hypervisor agnostic. The interface is really trivial and easy to
> > add.
> > > > >
> > > > > As I said above, my proposal does the same thing that you mentioned
> > here :)
> > > > > The difference is that I do call hypervisor interfaces via virtio-
> > proxy.
> > > > >
> > > > > > The only Xen-specific part is the notification mechanism, which is
> > an
> > > > > > event channel. If we replaced the event channel with something
> > else the
> > > > > > interface would be generic. See:
> > > > > > https://gitlab.com/xen-project/xen/-
> > /blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > > >
> > > > > > I don't think that translating IOREQs to eventfd in the kernel is
> > a
> > > > > > good idea: it feels like it would be extra complexity and that the
> > > > > > kernel shouldn't be involved as this is a backend-hypervisor
> > interface.
> > > > >
> > > > > Given that we may want to implement BE as a bare-metal application
> > > > > as I did on Zephyr, I don't think that the translation would be
> > > > > a big issue, especially on RTOS's.
> > > > > It will be some kind of abstraction layer of interrupt handling
> > > > > (or nothing but a callback mechanism).
> > > > >
> > > > > > Also, eventfd is very Linux-centric and we are trying to design an
> > > > > > interface that could work well for RTOSes too. If we want to do
> > > > > > something different, both OS-agnostic and hypervisor-agnostic,
> > perhaps
> > > > > > we could design a new interface. One that could be implementable
> > in the
> > > > > > Xen hypervisor itself (like IOREQ) and of course any other
> > hypervisor
> > > > > > too.
> > > > > >
> > > > > >
> > > > > > There is also another problem. IOREQ is probably not the only
> > > > > > interface needed. Have a look at
> > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we
> > also need
> > > > > > an interface for the backend to inject interrupts into the
> > frontend? And
> > > > > > if the backend requires dynamic memory mappings of frontend pages,
> > then
> > > > > > we would also need an interface to map/unmap domU pages.
> > > > >
> > > > > My proposal document might help here; All the interfaces required
> > for
> > > > > virtio-proxy (or hypervisor-related interfaces) are listed as
> > > > > RPC protocols :)
> > > > >
> > > > > > These interfaces are a lot more problematic than IOREQ: IOREQ is
> > tiny
> > > > > > and self-contained. It is easy to add anywhere. A new interface to
> > > > > > inject interrupts or map pages is more difficult to manage because
> > it
> > > > > > would require changes scattered across the various emulators.
> > > > >
> > > > > Exactly. I am not yet confident that my approach will also apply
> > > > > to other hypervisors than Xen.
> > > > > Technically, yes, but whether people can accept it or not is a
> > different
> > > > > matter.
> > > > >
> > > > > Thanks,
> > > > > -Takahiro Akashi
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Oleksandr Tyshchenko


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-18  5:38             ` AKASHI Takahiro
@ 2021-08-18  8:35               ` Wei Chen
  2021-08-20  6:41                 ` AKASHI Takahiro
  0 siblings, 1 reply; 66+ messages in thread
From: Wei Chen @ 2021-08-18  8:35 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Alex Bennée, Kaly Xin,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant,
	Xen Devel

Hi Akashi,

> -----Original Message-----
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> Sent: 18 August 2021 13:39
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>; Stratos
> Mailing List <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
> open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> <viresh.kumar@linaro.org>; Stefano Stabellini
> <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
>
> On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
> > Hi Akashi,
> >
> > > -----Original Message-----
> > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > Sent: 17 August 2021 16:08
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>;
> Stratos
> > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> dev@lists.oasis-
> > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> Julien
> > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > >
> > > Hi Wei, Oleksandr,
> > >
> > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> > > > Hi All,
> > > >
> > > > Thanks to Stefano for linking my kvmtool for Xen proposal here.
> > > > This proposal is still being discussed in the Xen and KVM communities.
> > > > The main work is to decouple the kvmtool from KVM and let
> > > > other hypervisors reuse the virtual device implementations.
> > > >
> > > > In this case, we need to introduce an intermediate hypervisor
> > > > layer for VMM abstraction, which is, I think, very close
> > > > to Stratos' virtio hypervisor agnosticism work.
> > >
> > > # My proposal[1] comes from my own idea and doesn't always represent
> > > # Linaro's view on this subject nor reflect Alex's concerns.
> Nevertheless,
> > >
> > > Your idea and my proposal seem to share the same background.
> > > Both have a similar goal and currently start, at first, with Xen
> > > and are based on kvm-tool. (Actually, my work is derived from
> > > EPAM's virtio-disk, which is also based on kvm-tool.)
> > >
> > > In particular, the abstraction of hypervisor interfaces has the same
> > > set of interfaces (for your "struct vmm_impl" and my "RPC interfaces").
> > > This is not a coincidence, as we both share the same origin, as I said
> above.
> > > And so we will also share the same issues. One of them is a way of
> > > "sharing/mapping FE's memory". There is some trade-off between
> > > the portability and the performance impact.
> > > So we can discuss the topic here in this ML, too.
> > > (See Alex's original email, too).
> > >
> > Yes, I agree.
> >
> > > On the other hand, my approach aims to create a "single-binary"
> solution
> > > in which the same binary of a BE VM could run on any hypervisor.
> > > Somehow similar to your "proposal-#2" in [2], but in my solution, all
> > > the hypervisor-specific code would be put into another entity (VM),
> > > named "virtio-proxy" and the abstracted operations are served via RPC.
> > > (In this sense, BE is hypervisor-agnostic but might have OS
> dependency.)
> > > But I know that we need to discuss whether this is a requirement even
> > > in Stratos project or not. (Maybe not)
> > >
> >
> > Sorry, I haven't had time to finish reading your virtio-proxy completely
> > (I will do it ASAP). But from your description, it seems we need a
> > 3rd VM between FE and BE? My concern is that, if my assumption is right,
> > will it increase the latency in the data transport path? Even if we're
> > using a lightweight guest like an RTOS or a unikernel,
>
> Yes, you're right. But I'm afraid that it is a matter of degree.
> As long as we execute 'mapping' operations at every fetch of payload,
> we will see a latency issue (even in your case), and if we have a solution
> for it, we won't see it in my proposal either :)
>

Oleksandr has sent a proposal to the Xen mailing list to reduce this kind
of "mapping/unmapping" operation. So the latency caused by this behavior
on Xen may eventually be eliminated, and Linux-KVM doesn't have that problem.
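
As a side note, the "cache mappings in the backend" idea being discussed
could look roughly like the sketch below (in C, with entirely hypothetical
names; map_guest_page() stands in for whatever foreign/grant mapping
interface the hypervisor actually provides). It only illustrates reusing
mappings across requests instead of mapping/unmapping every time:

    /* Toy cache of guest-page mappings; all names are made up. */
    #include <stdint.h>

    #define CACHE_SLOTS 256

    struct map_entry {
        uint64_t gfn;   /* guest frame number */
        void    *va;    /* backend-local mapping, or NULL if unused */
    };

    static struct map_entry cache[CACHE_SLOTS];

    /* Assumed to be provided by a hypervisor-specific layer. */
    void *map_guest_page(uint64_t gfn);

    /* Return a mapping for gfn, reusing a cached one when possible. */
    void *get_guest_page(uint64_t gfn)
    {
        struct map_entry *e = &cache[gfn % CACHE_SLOTS];

        if (e->va && e->gfn == gfn)
            return e->va;              /* hit: no hypervisor call needed */

        /* miss: a real backend would unmap the old entry here */
        e->gfn = gfn;
        e->va  = map_guest_page(gfn);  /* one hypervisor call per new page */
        return e->va;
    }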

> > > Specifically speaking about kvm-tool, I have a concern about its
> > > license terms; targeting different hypervisors and different OSs
> > > (which I assume includes RTOS's), the resultant library should be
> > > permissively licensed, and GPL for kvm-tool might be an issue.
> > > Any thoughts?
> > >
> >
> > Yes. If a user wants to implement a FreeBSD device model but the virtio
> > library is GPL, then GPL would be a problem. If we have another good
> > candidate, I am open to it.
>
> I have some candidates, particularly for vq/vring, in my mind:
> * Open-AMP, or
> * the corresponding FreeBSD code
>

Interesting, I will look into them : )

Cheers,
Wei Chen

> -Takahiro Akashi
>
>
> > > -Takahiro Akashi
> > >
> > >
> > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > > August/000548.html
> > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > >
> > > >
> > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > > > > Sent: 14 August 2021 23:38
> > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano
> Stabellini
> > > <sstabellini@kernel.org>
> > > > > Cc: Alex Bennée <alex.bennee@linaro.org>; Stratos Mailing List
> > > <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-open.org;
> Arnd
> > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>; Oleksandr
> > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> Julien
> > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > >
> > > > > Hello, all.
> > > > >
> > > > > Please see some comments below. And sorry for the possible format
> > > issues.
> > > > >
> > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> > > <mailto:takahiro.akashi@linaro.org> wrote:
> > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini
> wrote:
> > > > > > > CCing people working on Xen+VirtIO and IOREQs. Not trimming
> the
> > > original
> > > > > > > email to let them read the full context.
> > > > > > >
> > > > > > > My comments below are related to a potential Xen
> implementation,
> > > not
> > > > > > > because it is the only implementation that matters, but
> because it
> > > is
> > > > > > > the one I know best.
> > > > > >
> > > > > > Please note that my proposal (and hence the working prototype)[1]
> > > > > > is based on Xen's virtio implementation (i.e. IOREQ) and
> > > particularly
> > > > > > EPAM's virtio-disk application (backend server).
> > > > > > It has been, I believe, well generalized but is still a bit
> biased
> > > > > > toward this original design.
> > > > > >
> > > > > > So I hope you like my approach :)
> > > > > >
> > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > > August/000546.html
> > > > > >
> > > > > > Let me take this opportunity to explain a bit more about my
> approach
> > > below.
> > > > > >
> > > > > > > Also, please see this relevant email thread:
> > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > >
> > > > > > >
> > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > One of the goals of Project Stratos is to enable hypervisor
> > > agnostic
> > > > > > > > backends so we can enable as much re-use of code as possible
> and
> > > avoid
> > > > > > > > repeating ourselves. This is the flip side of the front end
> > > where
> > > > > > > > multiple front-end implementations are required - one per OS,
> > > assuming
> > > > > > > > you don't just want Linux guests. The resultant guests are
> > > trivially
> > > > > > > > movable between hypervisors modulo any abstracted paravirt
> type
> > > > > > > > interfaces.
> > > > > > > >
> > > > > > > > In my original thumb nail sketch of a solution I envisioned
> > > vhost-user
> > > > > > > > daemons running in a broadly POSIX like environment. The
> > > interface to
> > > > > > > > the daemon is fairly simple requiring only some mapped
> memory
> > > and some
> > > > > > > > sort of signalling for events (on Linux this is eventfd).
> The
> > > idea was a
> > > > > > > > stub binary would be responsible for any hypervisor specific
> > > setup and
> > > > > > > > then launch a common binary to deal with the actual
> virtqueue
> > > requests
> > > > > > > > themselves.
> > > > > > > >
> > > > > > > > Since that original sketch we've seen an expansion in the
> sort
> > > of ways
> > > > > > > > backends could be created. There is interest in
> encapsulating
> > > backends
> > > > > > > > in RTOSes or unikernels for solutions like SCMI. There
> interest
> > > in Rust
> > > > > > > > has prompted ideas of using the trait interface to abstract
> > > differences
> > > > > > > > away as well as the idea of bare-metal Rust backends.
> > > > > > > >
> > > > > > > > We have a card (STR-12) called "Hypercall Standardisation"
> which
> > > > > > > > calls for a description of the APIs needed from the
> hypervisor
> > > side to
> > > > > > > > support VirtIO guests and their backends. However we are
> some
> > > way off
> > > > > > > > from that at the moment as I think we need to at least
> > > demonstrate one
> > > > > > > > portable backend before we start codifying requirements. To
> that
> > > end I
> > > > > > > > want to think about what we need for a backend to function.
> > > > > > > >
> > > > > > > > Configuration
> > > > > > > > =============
> > > > > > > >
> > > > > > > > In the type-2 setup this is typically fairly simple because
> the
> > > host
> > > > > > > > system can orchestrate the various modules that make up the
> > > complete
> > > > > > > > system. In the type-1 case (or even type-2 with delegated
> > > service VMs)
> > > > > > > > we need some sort of mechanism to inform the backend VM
> about
> > > key
> > > > > > > > details about the system:
> > > > > > > >
> > > > > > > >   - where virt queue memory is in it's address space
> > > > > > > >   - how it's going to receive (interrupt) and trigger (kick)
> > > events
> > > > > > > >   - what (if any) resources the backend needs to connect to
> > > > > > > >
> > > > > > > > Obviously you can elide over configuration issues by having
> > > static
> > > > > > > > configurations and baking the assumptions into your guest
> images
> > > however
> > > > > > > > this isn't scalable in the long term. The obvious solution
> seems
> > > to be
> > > > > > > > extending a subset of Device Tree data to user space but
> perhaps
> > > there
> > > > > > > > are other approaches?
> > > > > > > >
> > > > > > > > Before any virtio transactions can take place the
> appropriate
> > > memory
> > > > > > > > mappings need to be made between the FE guest and the BE
> guest.
> > > > > > >
> > > > > > > > Currently the whole of the FE guests address space needs to
> be
> > > visible
> > > > > > > > to whatever is serving the virtio requests. I can envision 3
> > > approaches:
> > > > > > > >
> > > > > > > >  * BE guest boots with memory already mapped
> > > > > > > >
> > > > > > > >  This would entail the guest OS knowing where in it's Guest
> > > Physical
> > > > > > > >  Address space is already taken up and avoiding clashing. I
> > > would assume
> > > > > > > >  in this case you would want a standard interface to
> userspace
> > > to then
> > > > > > > >  make that address space visible to the backend daemon.
> > > > > >
> > > > > > Yet another way here is that we would have well known "shared
> > > memory" between
> > > > > > VMs. I think that Jailhouse's ivshmem gives us good insights on
> this
> > > matter
> > > > > > and that it can even be an alternative for hypervisor-agnostic
> > > solution.
> > > > > >
> > > > > > (Please note memory regions in ivshmem appear as a PCI device
> and
> > > can be
> > > > > > mapped locally.)
> > > > > >
> > > > > > I want to add this shared memory aspect to my virtio-proxy, but
> > > > > > the resultant solution would eventually look similar to ivshmem.
> > > > > >
> > > > > > > >  * BE guests boots with a hypervisor handle to memory
> > > > > > > >
> > > > > > > >  The BE guest is then free to map the FE's memory to where
> it
> > > wants in
> > > > > > > >  the BE's guest physical address space.
> > > > > > >
> > > > > > > I cannot see how this could work for Xen. There is no "handle"
> to
> > > give
> > > > > > > to the backend if the backend is not running in dom0. So for
> Xen I
> > > think
> > > > > > > the memory has to be already mapped
> > > > > >
> > > > > > In Xen's IOREQ solution (virtio-blk), the following information
> is
> > > expected
> > > > > > to be exposed to BE via Xenstore:
> > > > > > (I know that this is a tentative approach though.)
> > > > > >    - the start address of configuration space
> > > > > >    - interrupt number
> > > > > >    - file path for backing storage
> > > > > >    - read-only flag
> > > > > > And the BE server has to call a particular hypervisor interface
> to
> > > > > > map the configuration space.
> > > > >
> > > > > Yes, Xenstore was chosen as a simple way to pass configuration
> info to
> > > the backend running in a non-toolstack domain.
> > > > > I remember, there was a wish to avoid using Xenstore in Virtio
> backend
> > > itself if possible, so for a non-toolstack domain, this could be done by
> > > adjusting devd (daemon that listens for devices and launches backends)
> > > > > to read backend configuration from the Xenstore anyway and pass it
> to
> > > the backend via command line arguments.
> > > > >
> > > >
> > > > Yes, in current PoC code we're using xenstore to pass device
> > > configuration.
> > > > We also designed a static device configuration parse method for
> Dom0less
> > > or
> > > > other scenarios that don't have xentool. Yes, it's from the device model
> command
> > > line
> > > > or a config file.
> > > >
> > > > > But, if ...
> > > > >
> > > > > >
> > > > > > In my approach (virtio-proxy), all those Xen (or hypervisor)-
> > > specific
> > > > > > stuff is contained in virtio-proxy, yet another VM, to hide
> all
> > > details.
> > > > >
> > > > > ... the solution how to overcome that is already found and proven
> to
> > > work then even better.
> > > > >
> > > > >
> > > > >
> > > > > > # My point is that a "handle" is not mandatory for executing
> mapping.
> > > > > >
> > > > > > > and the mapping probably done by the
> > > > > > > toolstack (also see below.) Or we would have to invent a new
> Xen
> > > > > > > hypervisor interface and Xen virtual machine privileges to
> allow
> > > this
> > > > > > > kind of mapping.
> > > > > >
> > > > > > > If we run the backend in Dom0 that we have no problems of
> course.
> > > > > >
> > > > > > One of the difficulties on Xen that I found in my approach is that
> > > calling
> > > > > > such hypervisor interfaces (registering IOREQ, mapping memory) is
> > > only
> > > > > > allowed on BE servers themselves and so we will have to extend
> > > those
> > > > > > interfaces.
> > > > > > This, however, will raise some concern on security and privilege
> > > distribution
> > > > > > as Stefan suggested.
> > > > >
> > > > > We also faced policy related issues with Virtio backend running in
> > > other than Dom0 domain in a "dummy" xsm mode. In our target system we
> run
> > > the backend in a driver
> > > > > domain (we call it DomD) where the underlying H/W resides. We
> trust it,
> > > so we wrote policy rules (to be used in "flask" xsm mode) to provide
> it
> > > with a little bit more privileges than a simple DomU had.
> > > > > Now it is permitted to issue device-model, resource and memory
> > > mappings, etc calls.
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > To activate the mapping will
> > > > > > > >  require some sort of hypercall to the hypervisor. I can see
> two
> > > options
> > > > > > > >  at this point:
> > > > > > > >
> > > > > > > >   - expose the handle to userspace for daemon/helper to
> trigger
> > > the
> > > > > > > >     mapping via existing hypercall interfaces. If using a
> helper
> > > you
> > > > > > > >     would have a hypervisor specific one to avoid the daemon
> > > having to
> > > > > > > >     care too much about the details or push that complexity
> into
> > > a
> > > > > > > >     compile time option for the daemon which would result in
> > > different
> > > > > > > >     binaries although a common source base.
> > > > > > > >
> > > > > > > >   - expose a new kernel ABI to abstract the hypercall
> > > differences away
> > > > > > > >     in the guest kernel. In this case the userspace would
> > > essentially
> > > > > > > >     ask for an abstract "map guest N memory to userspace
> ptr"
> > > and let
> > > > > > > >     the kernel deal with the different hypercall interfaces.
> > > This of
> > > > > > > >     course assumes the majority of BE guests would be Linux
> > > kernels and
> > > > > > > >     leaves the bare-metal/unikernel approaches to their own
> > > devices.
> > > > > > > >
> > > > > > > > Operation
> > > > > > > > =========
> > > > > > > >
> > > > > > > > The core of the operation of VirtIO is fairly simple. Once
> the
> > > > > > > > vhost-user feature negotiation is done it's a case of
> receiving
> > > update
> > > > > > > > events and parsing the resultant virt queue for data. The
> vhost-
> > > user
> > > > > > > > specification handles a bunch of setup before that point,
> mostly
> > > to
> > > > > > > > detail where the virt queues are set up FD's for memory and
> > > event
> > > > > > > > communication. This is where the envisioned stub process
> would
> > > be
> > > > > > > > responsible for getting the daemon up and ready to run. This
> is
> > > > > > > > currently done inside a big VMM like QEMU but I suspect a
> modern
> > > > > > > > approach would be to use the rust-vmm vhost crate. It would
> then
> > > either
> > > > > > > > communicate with the kernel's abstracted ABI or be re-
> targeted
> > > as a
> > > > > > > > build option for the various hypervisors.
> > > > > > >
> > > > > > > One thing I mentioned before to Alex is that Xen doesn't have
> VMMs
> > > the
> > > > > > > way they are typically envisioned and described in other
> > > environments.
> > > > > > > Instead, Xen has IOREQ servers. Each of them connects
> > > independently to
> > > > > > > Xen via the IOREQ interface. E.g. today multiple QEMUs could
> be
> > > used as
> > > > > > > emulators for a single Xen VM, each of them connecting to Xen
> > > > > > > independently via the IOREQ interface.
> > > > > > >
> > > > > > > The component responsible for starting a daemon and/or setting
> up
> > > shared
> > > > > > > interfaces is the toolstack: the xl command and the
> libxl/libxc
> > > > > > > libraries.
> > > > > >
> > > > > > I think that VM configuration management (or orchestration in
> > > Stratos
> > > > > > jargon?) is a subject to debate in parallel.
> > > > > > Otherwise, is there any good assumption to avoid it right now?
> > > > > >
> > > > > > > Oleksandr and others I CCed have been working on ways for the
> > > toolstack
> > > > > > > to create virtio backends and setup memory mappings. They
> might be
> > > able
> > > > > > > to provide more info on the subject. I do think we miss a way
> to
> > > provide
> > > > > > > the configuration to the backend and anything else that the
> > > backend
> > > > > > > might require to start doing its job.
> > > > >
> > > > > Yes, some work has been done for the toolstack to handle Virtio
> MMIO
> > > devices in
> > > > > general and Virtio block devices in particular. However, it has
> not
> > > been upstreamed yet.
> > > > > Updated patches on review now:
> > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-
> email-
> > > olekstysh@gmail.com/
> > > > >
> > > > > There is an additional (also important) activity to improve/fix
> > > foreign memory mapping on Arm which I am also involved in.
> > > > > The foreign memory mapping is proposed to be used for Virtio
> backends
> > > (device emulators) if there is a need to run guest OS completely
> > > unmodified.
> > > > > Of course, the more secure way would be to use grant memory
> mapping.
> > > Briefly, the main difference between them is that with foreign mapping
> the
> > > backend
> > > > > can map any guest memory it wants to map, but with grant mapping
> it is
> > > allowed to map only what was previously granted by the frontend.
> > > > >
> > > > > So, there might be a problem if we want to pre-map some guest
> memory
> > > in advance or to cache mappings in the backend in order to improve
> > > performance (because the mapping/unmapping guest pages every request
> > > requires a lot of back and forth to Xen + P2M updates). In a nutshell,
> > > currently, in order to map a guest page into the backend address space
> we
> > > need to steal a real physical page from the backend domain. So, with
> the
> > > said optimizations we might end up with no free memory in the backend
> > > domain (see XSA-300). And what we try to achieve is to not waste a
> real
> > > domain memory at all by providing safe non-allocated-yet (so unused)
> > > address space for the foreign (and grant) pages to be mapped into,
> this
> > > enabling work implies Xen and Linux (and likely DTB bindings) changes.
> > > However, as it turned out, for this to work in a proper and safe way
> some
> > > prereq work needs to be done.
> > > > > You can find the related Xen discussion at:
> > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-
> email-
> > > olekstysh@gmail.com/
> > > > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > One question is how to best handle notification and kicks.
> The
> > > existing
> > > > > > > > vhost-user framework uses eventfd to signal the daemon
> (although
> > > QEMU
> > > > > > > > is quite capable of simulating them when you use TCG). Xen
> has
> > > it's own
> > > > > > > > IOREQ mechanism. However latency is an important factor and
> > > having
> > > > > > > > events go through the stub would add quite a lot.
> > > > > > >
> > > > > > > Yeah I think, regardless of anything else, we want the
> backends to
> > > > > > > connect directly to the Xen hypervisor.
> > > > > >
> > > > > > In my approach,
> > > > > >  a) BE -> FE: interrupts triggered by BE calling a hypervisor
> > > interface
> > > > > >               via virtio-proxy
> > > > > >  b) FE -> BE: MMIO to config raises events (in event channels),
> > > which is
> > > > > >               converted to a callback to BE via virtio-proxy
> > > > > >               (Xen's event channel is internally implemented by
> > > interrupts.)
> > > > > >
> > > > > > I don't know what "connect directly" means here, but sending
> > > interrupts
> > > > > > to the opposite side would be the most efficient.
> > > > > > Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x
> > > mechanism.
> > > > >
> > > > > Agree that MSI would be more efficient than SPI...
> > > > > At the moment, in order to notify the frontend, the backend issues
> a
> > > specific device-model call to query Xen to inject a corresponding SPI
> to
> > > the guest.
> > > > >
> > > > >
> > > > >
> > > > > > >
> > > > > > > > Could we consider the kernel internally converting IOREQ
> > > messages from
> > > > > > > > the Xen hypervisor to eventfd events? Would this scale with
> > > other kernel
> > > > > > > > hypercall interfaces?
> > > > > > > >
> > > > > > > > So any thoughts on what directions are worth experimenting
> with?
> > > > > > >
> > > > > > > One option we should consider is for each backend to connect
> to
> > > Xen via
> > > > > > > the IOREQ interface. We could generalize the IOREQ interface
> and
> > > make it
> > > > > > > hypervisor agnostic. The interface is really trivial and easy
> to
> > > add.
> > > > > >
> > > > > > As I said above, my proposal does the same thing that you
> mentioned
> > > here :)
> > > > > > The difference is that I do call hypervisor interfaces via
> virtio-
> > > proxy.
> > > > > >
> > > > > > > The only Xen-specific part is the notification mechanism,
> which is
> > > an
> > > > > > > event channel. If we replaced the event channel with something
> > > else the
> > > > > > > interface would be generic. See:
> > > > > > > https://gitlab.com/xen-project/xen/-
> > > /blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > > > >
> > > > > > > I don't think that translating IOREQs to eventfd in the kernel
> is
> > > a
> > > > > > > good idea: it feels like it would be extra complexity and that
> the
> > > > > > > kernel shouldn't be involved as this is a backend-hypervisor
> > > interface.
> > > > > >
> > > > > > Given that we may want to implement BE as a bare-metal
> application
> > > > > > as I did on Zephyr, I don't think that the translation would
> be
> > > > > > a big issue, especially on RTOS's.
> > > > > > It will be some kind of abstraction layer of interrupt handling
> > > > > > (or nothing but a callback mechanism).
> > > > > >
> > > > > > > Also, eventfd is very Linux-centric and we are trying to
> design an
> > > > > > > interface that could work well for RTOSes too. If we want to
> do
> > > > > > > something different, both OS-agnostic and hypervisor-agnostic,
> > > perhaps
> > > > > > > we could design a new interface. One that could be
> implementable
> > > in the
> > > > > > > Xen hypervisor itself (like IOREQ) and of course any other
> > > hypervisor
> > > > > > > too.
> > > > > > >
> > > > > > >
> > > > > > > There is also another problem. IOREQ is probably not the
> only
> > > > > > > interface needed. Have a look at
> > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we
> > > also need
> > > > > > > an interface for the backend to inject interrupts into the
> > > frontend? And
> > > > > > > if the backend requires dynamic memory mappings of frontend
> pages,
> > > then
> > > > > > > we would also need an interface to map/unmap domU pages.
> > > > > >
> > > > > > My proposal document might help here; All the interfaces
> required
> > > for
> > > > > > virtio-proxy (or hypervisor-related interfaces) are listed as
> > > > > > RPC protocols :)
> > > > > >
> > > > > > > These interfaces are a lot more problematic than IOREQ: IOREQ
> is
> > > tiny
> > > > > > > and self-contained. It is easy to add anywhere. A new
> interface to
> > > > > > > inject interrupts or map pages is more difficult to manage
> because
> > > it
> > > > > > > would require changes scattered across the various emulators.
> > > > > >
> > > > > > Exactly. I am not yet confident that my approach will also
> apply
> > > > > > to other hypervisors than Xen.
> > > > > > Technically, yes, but whether people can accept it or not is a
> > > different
> > > > > > matter.
> > > > > >
> > > > > > Thanks,
> > > > > > -Takahiro Akashi
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > Oleksandr Tyshchenko

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [virtio-dev] Enabling hypervisor agnosticism for VirtIO backends
  2021-08-04  9:04 [virtio-dev] Enabling hypervisor agnosticism for VirtIO backends Alex Bennée
  2021-08-04 19:20 ` Stefano Stabellini
  2021-08-05 15:48 ` [virtio-dev] " Stefan Hajnoczi
@ 2021-08-19  9:11 ` Matias Ezequiel Vara Larsen
       [not found]   ` <20210820060558.GB13452@laputa>
  2021-09-01  8:43   ` Alex Bennée
  2 siblings, 2 replies; 66+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2021-08-19  9:11 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	AKASHI Takahiro, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier

Hello Alex,

I can tell you about my experience from working on a PoC (library)
to allow the implementation of virtio-devices that are hypervisor/OS agnostic.
I focused on two use cases:
1. A type-1 hypervisor in which the backend is running as a VM. This
is an in-house hypervisor that does not support VMExits.
2. Linux user-space. In this case, the library is just used to
communicate between threads. The goal of this use case is merely testing.

I have chosen virtio-mmio as the way to exchange information
between the frontend and backend. I found it hard to synchronize the
access to the virtio-mmio layout without VMExits. I had to add some extra bits
to allow the frontend and backend to synchronize, which is required
during the device-status initialization. These extra bits would not be
needed if the hypervisor supported VMExits, e.g., KVM.
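
To make the extra synchronization bits more concrete, below is a minimal
sketch (in C) of the kind of polled handshake I mean. The layout and names
are illustrative only, not the actual PoC code, and a real implementation
would also need proper memory barriers:

    #include <stdint.h>

    /* Shared, virtio-mmio-like layout; status_ack is the "extra bits"
     * field needed because the hypervisor does not trap register writes.
     */
    struct shared_mmio {
        volatile uint32_t status;      /* written by the frontend */
        volatile uint32_t status_ack;  /* written by the backend  */
        /* ... features, queue addresses, etc. ... */
    };

    #define VIRTIO_STATUS_DRIVER_OK 0x4  /* as in the virtio spec */

    /* Frontend: set a device-status bit and wait for the backend. */
    static void fe_set_status(struct shared_mmio *m, uint32_t bit)
    {
        m->status |= bit;
        while ((m->status_ack & bit) == 0)
            ;  /* spin (or yield) until the backend acknowledges */
    }

    /* Backend: poll for status changes instead of getting a VMExit. */
    static void be_poll_status(struct shared_mmio *m)
    {
        uint32_t pending = m->status & ~m->status_ack;

        if (pending & VIRTIO_STATUS_DRIVER_OK) {
            /* initialization is complete, start servicing the queues */
            m->status_ack |= VIRTIO_STATUS_DRIVER_OK;
        }
    }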

Each guest has a memory region that is shared with the backend.
This memory region is used by the frontend to allocate the io-buffers. This region also
maps the virtio-mmio layout that is initialized by the backend. For the moment, this region
is defined when the guest is created. One limitation is that the memory for io-buffers is fixed.
At some point, the guest should be able to balloon this region. Notifications between
the frontend and the backend are implemented by using a hypercall. The hypercall
mechanism and the memory allocation are abstracted away by a platform layer that
exposes a hypervisor/OS-agnostic interface.
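
As an illustration of what that platform layer could look like, here is a
small sketch in C. The names and the exact set of hooks are hypothetical;
the point is only that the virtio code is written against this interface
while each hypervisor/OS supplies its own implementation:

    #include <stddef.h>
    #include <stdint.h>

    struct platform_ops {
        /* allocate/free io-buffers out of the pre-shared memory region */
        void *(*alloc_iobuf)(size_t len);
        void  (*free_iobuf)(void *buf);

        /* notify the other side; a hypercall on the in-house hypervisor,
         * e.g. an eventfd or futex in the Linux user-space test case
         */
        void  (*notify)(uint32_t dev_id);

        /* translate a frontend address into a backend-local pointer */
        void *(*gpa_to_va)(uint64_t gpa);
    };

    /* Generic virtio code only ever calls through the ops table. */
    static inline void virtio_kick(const struct platform_ops *ops,
                                   uint32_t dev_id)
    {
        ops->notify(dev_id);
    }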

I split the backend into a virtio-device driver and a
backend driver. The virtio-device driver handles the virtqueues and the
backend driver gets packets from the virtqueue for
post-processing. For example, in the case of virtio-net, the backend
driver would decide if the packet goes to the hardware or to another
virtio-net device. The virtio-device drivers may be
implemented in different ways, e.g., by using a single thread, multiple threads,
or one thread for all the virtio-devices.
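
A rough sketch of that split (in C, all names made up, not the actual PoC
interfaces) could be:

    #include <stddef.h>

    struct virtqueue;              /* owned by the virtio-device driver */

    struct iobuf {
        void   *data;
        size_t  len;
    };

    /* Assumed virtqueue helper: pop the next available buffer, if any. */
    int vq_pop(struct virtqueue *vq, struct iobuf *buf);

    /* Implemented by the backend driver, e.g. virtio-net post-processing
     * deciding whether a packet goes to hardware or to another device.
     */
    struct backend_ops {
        void (*process)(void *priv, struct iobuf *buf);
    };

    /* Virtio-device driver side: drain one queue and hand buffers over.
     * This could run in one thread per device, several threads, or a
     * single thread servicing all devices.
     */
    void vq_service(struct virtqueue *vq,
                    const struct backend_ops *ops, void *priv)
    {
        struct iobuf buf;

        while (vq_pop(vq, &buf))
            ops->process(priv, &buf);
    }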

In this PoC, I just tackled two very simple use cases. These
use cases allowed me to extract some requirements for a hypervisor to
support virtio.

Matias

On Wed, Aug 04, 2021 at 10:04:30AM +0100, Alex Bennée wrote:
> Hi,
> 
> One of the goals of Project Stratos is to enable hypervisor agnostic
> backends so we can enable as much re-use of code as possible and avoid
> repeating ourselves. This is the flip side of the front end where
> multiple front-end implementations are required - one per OS, assuming
> you don't just want Linux guests. The resultant guests are trivially
> movable between hypervisors modulo any abstracted paravirt type
> interfaces.
> 
> In my original thumb nail sketch of a solution I envisioned vhost-user
> daemons running in a broadly POSIX like environment. The interface to
> the daemon is fairly simple requiring only some mapped memory and some
> sort of signalling for events (on Linux this is eventfd). The idea was a
> stub binary would be responsible for any hypervisor specific setup and
> then launch a common binary to deal with the actual virtqueue requests
> themselves.
> 
> Since that original sketch we've seen an expansion in the sort of ways
> backends could be created. There is interest in encapsulating backends
> in RTOSes or unikernels for solutions like SCMI. There interest in Rust
> has prompted ideas of using the trait interface to abstract differences
> away as well as the idea of bare-metal Rust backends.
> 
> We have a card (STR-12) called "Hypercall Standardisation" which
> calls for a description of the APIs needed from the hypervisor side to
> support VirtIO guests and their backends. However we are some way off
> from that at the moment as I think we need to at least demonstrate one
> portable backend before we start codifying requirements. To that end I
> want to think about what we need for a backend to function.
> 
> Configuration
> =============
> 
> In the type-2 setup this is typically fairly simple because the host
> system can orchestrate the various modules that make up the complete
> system. In the type-1 case (or even type-2 with delegated service VMs)
> we need some sort of mechanism to inform the backend VM about key
> details about the system:
> 
>   - where virt queue memory is in it's address space
>   - how it's going to receive (interrupt) and trigger (kick) events
>   - what (if any) resources the backend needs to connect to
> 
> Obviously you can elide over configuration issues by having static
> configurations and baking the assumptions into your guest images however
> this isn't scalable in the long term. The obvious solution seems to be
> extending a subset of Device Tree data to user space but perhaps there
> are other approaches?
> 
> Before any virtio transactions can take place the appropriate memory
> mappings need to be made between the FE guest and the BE guest.
> Currently the whole of the FE guests address space needs to be visible
> to whatever is serving the virtio requests. I can envision 3 approaches:
> 
>  * BE guest boots with memory already mapped
> 
>  This would entail the guest OS knowing where in it's Guest Physical
>  Address space is already taken up and avoiding clashing. I would assume
>  in this case you would want a standard interface to userspace to then
>  make that address space visible to the backend daemon.
> 
>  * BE guests boots with a hypervisor handle to memory
> 
>  The BE guest is then free to map the FE's memory to where it wants in
>  the BE's guest physical address space. To activate the mapping will
>  require some sort of hypercall to the hypervisor. I can see two options
>  at this point:
> 
>   - expose the handle to userspace for daemon/helper to trigger the
>     mapping via existing hypercall interfaces. If using a helper you
>     would have a hypervisor specific one to avoid the daemon having to
>     care too much about the details or push that complexity into a
>     compile time option for the daemon which would result in different
>     binaries although a common source base.
> 
>   - expose a new kernel ABI to abstract the hypercall differences away
>     in the guest kernel. In this case the userspace would essentially
>     ask for an abstract "map guest N memory to userspace ptr" and let
>     the kernel deal with the different hypercall interfaces. This of
>     course assumes the majority of BE guests would be Linux kernels and
>     leaves the bare-metal/unikernel approaches to their own devices.
> 
> Operation
> =========
> 
> The core of the operation of VirtIO is fairly simple. Once the
> vhost-user feature negotiation is done it's a case of receiving update
> events and parsing the resultant virt queue for data. The vhost-user
> specification handles a bunch of setup before that point, mostly to
> detail where the virt queues are set up FD's for memory and event
> communication. This is where the envisioned stub process would be
> responsible for getting the daemon up and ready to run. This is
> currently done inside a big VMM like QEMU but I suspect a modern
> approach would be to use the rust-vmm vhost crate. It would then either
> communicate with the kernel's abstracted ABI or be re-targeted as a
> build option for the various hypervisors.
> 
> One question is how to best handle notification and kicks. The existing
> vhost-user framework uses eventfd to signal the daemon (although QEMU
> is quite capable of simulating them when you use TCG). Xen has it's own
> IOREQ mechanism. However latency is an important factor and having
> events go through the stub would add quite a lot.
> 
> Could we consider the kernel internally converting IOREQ messages from
> the Xen hypervisor to eventfd events? Would this scale with other kernel
> hypercall interfaces?
> 
> So any thoughts on what directions are worth experimenting with?
> 
> -- 
> Alex Bennée
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-18  8:35               ` Wei Chen
@ 2021-08-20  6:41                 ` AKASHI Takahiro
  2021-08-26  9:40                   ` AKASHI Takahiro
  0 siblings, 1 reply; 66+ messages in thread
From: AKASHI Takahiro @ 2021-08-20  6:41 UTC (permalink / raw)
  To: Wei Chen
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Alex Bennée, Kaly Xin,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant,
	Xen Devel

On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
> Hi Akashi,
> 
> > -----Original Message-----
> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > Sent: 18 August 2021 13:39
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>; Stratos
> > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
> > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > <viresh.kumar@linaro.org>; Stefano Stabellini
> > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> >
> > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
> > > Hi Akashi,
> > >
> > > > -----Original Message-----
> > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > Sent: 17 August 2021 16:08
> > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > > > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>;
> > Stratos
> > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > dev@lists.oasis-
> > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> > Julien
> > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > >
> > > > Hi Wei, Oleksandr,
> > > >
> > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> > > > > Hi All,
> > > > >
> > > > > Thanks for Stefano to link my kvmtool for Xen proposal here.
> > > > > This proposal is still discussing in Xen and KVM communities.
> > > > > The main work is to decouple the kvmtool from KVM and make
> > > > > other hypervisors can reuse the virtual device implementations.
> > > > >
> > > > > In this case, we need to introduce an intermediate hypervisor
> > > > > layer for VMM abstraction, Which is, I think it's very close
> > > > > to stratos' virtio hypervisor agnosticism work.
> > > >
> > > > # My proposal[1] comes from my own idea and doesn't always represent
> > > > # Linaro's view on this subject nor reflect Alex's concerns.
> > Nevertheless,
> > > >
> > > > Your idea and my proposal seem to share the same background.
> > > > Both have the similar goal and currently start with, at first, Xen
> > > > and are based on kvm-tool. (Actually, my work is derived from
> > > > EPAM's virtio-disk, which is also based on kvm-tool.)
> > > >
> > > > In particular, the abstraction of hypervisor interfaces has a same
> > > > set of interfaces (for your "struct vmm_impl" and my "RPC interfaces").
> > > > This is not co-incident as we both share the same origin as I said
> > above.
> > > > And so we will also share the same issues. One of them is a way of
> > > > "sharing/mapping FE's memory". There is some trade-off between
> > > > the portability and the performance impact.
> > > > So we can discuss the topic here in this ML, too.
> > > > (See Alex's original email, too).
> > > >
> > > Yes, I agree.
> > >
> > > > On the other hand, my approach aims to create a "single-binary"
> > solution
> > > > in which the same binary of BE vm could run on any hypervisors.
> > > > Somehow similar to your "proposal-#2" in [2], but in my solution, all
> > > > the hypervisor-specific code would be put into another entity (VM),
> > > > named "virtio-proxy" and the abstracted operations are served via RPC.
> > > > (In this sense, BE is hypervisor-agnostic but might have OS
> > dependency.)
> > > > But I know that we need discuss if this is a requirement even
> > > > in Stratos project or not. (Maybe not)
> > > >
> > >
> > > Sorry, I haven't had time to finish reading your virtio-proxy completely
> > > (I will do it ASAP). But from your description, it seems we need a
> > > 3rd VM between FE and BE? My concern is that, if my assumption is right,
> > > will it increase the latency in data transport path? Even if we're
> > > using some lightweight guest like RTOS or Unikernel,
> >
> > Yes, you're right. But I'm afraid that it is a matter of degree.
> > As far as we execute 'mapping' operations at every fetch of payload,
> > we will see latency issue (even in your case) and if we have some solution
> > for it, we won't see it neither in my proposal :)
> >
> 
> Oleksandr has sent a proposal to Xen mailing list to reduce this kind
> of "mapping/unmapping" operations. So the latency caused by this behavior
> on Xen may eventually be eliminated, and Linux-KVM doesn't have that problem.

Obviously, I have not yet caught up with that discussion.
Which patch specifically?

-Takahiro Akashi

> > > > Specifically speaking about kvm-tool, I have a concern about its
> > > > license term; Targeting different hypervisors and different OSs
> > > > (which I assume includes RTOS's), the resultant library should be
> > > > license permissive and GPL for kvm-tool might be an issue.
> > > > Any thoughts?
> > > >
> > >
> > > Yes. If user want to implement a FreeBSD device model, but the virtio
> > > library is GPL. Then GPL would be a problem. If we have another good
> > > candidate, I am open to it.
> >
> > I have some candidates, particularly for vq/vring, in my mind:
> > * Open-AMP, or
> > * corresponding Free-BSD code
> >
> 
> Interesting, I will look into them : )
> 
> Cheers,
> Wei Chen
> 
> > -Takahiro Akashi
> >
> >
> > > > -Takahiro Akashi
> > > >
> > > >
> > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > > > August/000548.html
> > > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > >
> > > > >
> > > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > > > > > > Sent: 14 August 2021 23:38
> > > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano
> > Stabellini
> > > > <sstabellini@kernel.org>
> > > > > > > Cc: Alex Bennée <alex.bennee@linaro.org>; Stratos Mailing List
> > > > <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-open.org;
> > Arnd
> > > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>; Oleksandr
> > > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> > Julien
> > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > > >
> > > > > > Hello, all.
> > > > > >
> > > > > > Please see some comments below. And sorry for the possible format
> > > > issues.
> > > > > >
> > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> > > > <mailto:takahiro.akashi@linaro.org> wrote:
> > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini
> > wrote:
> > > > > > > > CCing people working on Xen+VirtIO and IOREQs. Not trimming
> > the
> > > > original
> > > > > > > > email to let them read the full context.
> > > > > > > >
> > > > > > > > My comments below are related to a potential Xen
> > implementation,
> > > > not
> > > > > > > > because it is the only implementation that matters, but
> > because it
> > > > is
> > > > > > > > the one I know best.
> > > > > > >
> > > > > > > Please note that my proposal (and hence the working prototype)[1]
> > > > > > > is based on Xen's virtio implementation (i.e. IOREQ) and
> > > > particularly
> > > > > > > EPAM's virtio-disk application (backend server).
> > > > > > > It has been, I believe, well generalized but is still a bit
> > biased
> > > > > > > toward this original design.
> > > > > > >
> > > > > > > So I hope you like my approach :)
> > > > > > >
> > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > > > August/000546.html
> > > > > > >
> > > > > > > Let me take this opportunity to explain a bit more about my
> > approach
> > > > below.
> > > > > > >
> > > > > > > > Also, please see this relevant email thread:
> > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > One of the goals of Project Stratos is to enable hypervisor
> > > > agnostic
> > > > > > > > > backends so we can enable as much re-use of code as possible
> > and
> > > > avoid
> > > > > > > > > repeating ourselves. This is the flip side of the front end
> > > > where
> > > > > > > > > multiple front-end implementations are required - one per OS,
> > > > assuming
> > > > > > > > > you don't just want Linux guests. The resultant guests are
> > > > trivially
> > > > > > > > > movable between hypervisors modulo any abstracted paravirt
> > type
> > > > > > > > > interfaces.
> > > > > > > > >
> > > > > > > > > In my original thumb nail sketch of a solution I envisioned
> > > > vhost-user
> > > > > > > > > daemons running in a broadly POSIX like environment. The
> > > > interface to
> > > > > > > > > the daemon is fairly simple requiring only some mapped
> > memory
> > > > and some
> > > > > > > > > sort of signalling for events (on Linux this is eventfd).
> > The
> > > > idea was a
> > > > > > > > > stub binary would be responsible for any hypervisor specific
> > > > setup and
> > > > > > > > > then launch a common binary to deal with the actual
> > virtqueue
> > > > requests
> > > > > > > > > themselves.
> > > > > > > > >
> > > > > > > > > Since that original sketch we've seen an expansion in the
> > sort
> > > > of ways
> > > > > > > > > backends could be created. There is interest in
> > encapsulating
> > > > backends
> > > > > > > > > in RTOSes or unikernels for solutions like SCMI. There
> > interest
> > > > in Rust
> > > > > > > > > has prompted ideas of using the trait interface to abstract
> > > > differences
> > > > > > > > > away as well as the idea of bare-metal Rust backends.
> > > > > > > > >
> > > > > > > > > We have a card (STR-12) called "Hypercall Standardisation"
> > which
> > > > > > > > > calls for a description of the APIs needed from the
> > hypervisor
> > > > side to
> > > > > > > > > support VirtIO guests and their backends. However we are
> > some
> > > > way off
> > > > > > > > > from that at the moment as I think we need to at least
> > > > demonstrate one
> > > > > > > > > portable backend before we start codifying requirements. To
> > that
> > > > end I
> > > > > > > > > want to think about what we need for a backend to function.
> > > > > > > > >
> > > > > > > > > Configuration
> > > > > > > > > =============
> > > > > > > > >
> > > > > > > > > In the type-2 setup this is typically fairly simple because
> > the
> > > > host
> > > > > > > > > system can orchestrate the various modules that make up the
> > > > complete
> > > > > > > > > system. In the type-1 case (or even type-2 with delegated
> > > > service VMs)
> > > > > > > > > we need some sort of mechanism to inform the backend VM
> > about
> > > > key
> > > > > > > > > details about the system:
> > > > > > > > >
> > > > > > > > >   - where virt queue memory is in it's address space
> > > > > > > > >   - how it's going to receive (interrupt) and trigger (kick)
> > > > events
> > > > > > > > >   - what (if any) resources the backend needs to connect to
> > > > > > > > >
> > > > > > > > > Obviously you can elide over configuration issues by having
> > > > static
> > > > > > > > > configurations and baking the assumptions into your guest
> > images
> > > > however
> > > > > > > > > this isn't scalable in the long term. The obvious solution
> > seems
> > > > to be
> > > > > > > > > extending a subset of Device Tree data to user space but
> > perhaps
> > > > there
> > > > > > > > > are other approaches?
> > > > > > > > >
> > > > > > > > > Before any virtio transactions can take place the
> > appropriate
> > > > memory
> > > > > > > > > mappings need to be made between the FE guest and the BE
> > guest.
> > > > > > > >
> > > > > > > > > Currently the whole of the FE guests address space needs to
> > be
> > > > visible
> > > > > > > > > to whatever is serving the virtio requests. I can envision 3
> > > > approaches:
> > > > > > > > >
> > > > > > > > >  * BE guest boots with memory already mapped
> > > > > > > > >
> > > > > > > > >  This would entail the guest OS knowing where in it's Guest
> > > > Physical
> > > > > > > > >  Address space is already taken up and avoiding clashing. I
> > > > would assume
> > > > > > > > >  in this case you would want a standard interface to
> > userspace
> > > > to then
> > > > > > > > >  make that address space visible to the backend daemon.
> > > > > > >
> > > > > > > Yet another way here is that we would have well known "shared
> > > > memory" between
> > > > > > > VMs. I think that Jailhouse's ivshmem gives us good insights on
> > this
> > > > matter
> > > > > > > and that it can even be an alternative for hypervisor-agnostic
> > > > solution.
> > > > > > >
> > > > > > > (Please note memory regions in ivshmem appear as a PCI device
> > and
> > > > can be
> > > > > > > mapped locally.)
> > > > > > >
> > > > > > > I want to add this shared memory aspect to my virtio-proxy, but
> > > > > > > the resultant solution would eventually look similar to ivshmem.
> > > > > > >
> > > > > > > > >  * BE guests boots with a hypervisor handle to memory
> > > > > > > > >
> > > > > > > > >  The BE guest is then free to map the FE's memory to where
> > it
> > > > wants in
> > > > > > > > >  the BE's guest physical address space.
> > > > > > > >
> > > > > > > > I cannot see how this could work for Xen. There is no "handle"
> > to
> > > > give
> > > > > > > > to the backend if the backend is not running in dom0. So for
> > Xen I
> > > > think
> > > > > > > > the memory has to be already mapped
> > > > > > >
> > > > > > > In Xen's IOREQ solution (virtio-blk), the following information
> > is
> > > > expected
> > > > > > > to be exposed to BE via Xenstore:
> > > > > > > (I know that this is a tentative approach though.)
> > > > > > >    - the start address of configuration space
> > > > > > >    - interrupt number
> > > > > > >    - file path for backing storage
> > > > > > >    - read-only flag
> > > > > > > And the BE server have to call a particular hypervisor interface
> > to
> > > > > > > map the configuration space.
> > > > > >
> > > > > > Yes, Xenstore was chosen as a simple way to pass configuration
> > info to
> > > > the backend running in a non-toolstack domain.
> > > > > > I remember, there was a wish to avoid using Xenstore in Virtio
> > backend
> > > > itself if possible, so for non-toolstack domain, this could done with
> > > > adjusting devd (daemon that listens for devices and launches backends)
> > > > > > to read backend configuration from the Xenstore anyway and pass it
> > to
> > > > the backend via command line arguments.
> > > > > >
> > > > >
> > > > > Yes, in current PoC code we're using xenstore to pass device
> > > > configuration.
> > > > > We also designed a static device configuration parse method for
> > Dom0less
> > > > or
> > > > > other scenarios don't have xentool. yes, it's from device model
> > command
> > > > line
> > > > > or a config file.
> > > > >
> > > > > > But, if ...
> > > > > >
> > > > > > >
> > > > > > > In my approach (virtio-proxy), all those Xen (or hypervisor)-
> > > > specific
> > > > > > > stuffs are contained in virtio-proxy, yet another VM, to hide
> > all
> > > > details.
> > > > > >
> > > > > > ... the solution how to overcome that is already found and proven
> > to
> > > > work then even better.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > # My point is that a "handle" is not mandatory for executing
> > mapping.
> > > > > > >
> > > > > > > > and the mapping probably done by the
> > > > > > > > toolstack (also see below.) Or we would have to invent a new
> > Xen
> > > > > > > > hypervisor interface and Xen virtual machine privileges to
> > allow
> > > > this
> > > > > > > > kind of mapping.
> > > > > > >
> > > > > > > > If we run the backend in Dom0 that we have no problems of
> > course.
> > > > > > >
> > > > > > > One of difficulties on Xen that I found in my approach is that
> > > > calling
> > > > > > > such hypervisor intefaces (registering IOREQ, mapping memory) is
> > > > only
> > > > > > > allowed on BE servers themselvies and so we will have to extend
> > > > those
> > > > > > > interfaces.
> > > > > > > This, however, will raise some concern on security and privilege
> > > > distribution
> > > > > > > as Stefan suggested.
> > > > > >
> > > > > > We also faced policy related issues with Virtio backend running in
> > > > other than Dom0 domain in a "dummy" xsm mode. In our target system we
> > run
> > > > the backend in a driver
> > > > > > domain (we call it DomD) where the underlying H/W resides. We
> > trust it,
> > > > so we wrote policy rules (to be used in "flask" xsm mode) to provide
> > it
> > > > with a little bit more privileges than a simple DomU had.
> > > > > > Now it is permitted to issue device-model, resource and memory
> > > > mappings, etc calls.
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > To activate the mapping will
> > > > > > > > >  require some sort of hypercall to the hypervisor. I can see
> > two
> > > > options
> > > > > > > > >  at this point:
> > > > > > > > >
> > > > > > > > >   - expose the handle to userspace for daemon/helper to
> > trigger
> > > > the
> > > > > > > > >     mapping via existing hypercall interfaces. If using a
> > helper
> > > > you
> > > > > > > > >     would have a hypervisor specific one to avoid the daemon
> > > > having to
> > > > > > > > >     care too much about the details or push that complexity
> > into
> > > > a
> > > > > > > > >     compile time option for the daemon which would result in
> > > > different
> > > > > > > > >     binaries although a common source base.
> > > > > > > > >
> > > > > > > > >   - expose a new kernel ABI to abstract the hypercall
> > > > differences away
> > > > > > > > >     in the guest kernel. In this case the userspace would
> > > > essentially
> > > > > > > > >     ask for an abstract "map guest N memory to userspace
> > ptr"
> > > > and let
> > > > > > > > >     the kernel deal with the different hypercall interfaces.
> > > > This of
> > > > > > > > >     course assumes the majority of BE guests would be Linux
> > > > kernels and
> > > > > > > > >     leaves the bare-metal/unikernel approaches to their own
> > > > devices.
> > > > > > > > >
> > > > > > > > > Operation
> > > > > > > > > =========
> > > > > > > > >
> > > > > > > > > The core of the operation of VirtIO is fairly simple. Once
> > the
> > > > > > > > > vhost-user feature negotiation is done it's a case of
> > receiving
> > > > update
> > > > > > > > > events and parsing the resultant virt queue for data. The
> > vhost-
> > > > user
> > > > > > > > > specification handles a bunch of setup before that point,
> > mostly
> > > > to
> > > > > > > > > detail where the virt queues are set up FD's for memory and
> > > > event
> > > > > > > > > communication. This is where the envisioned stub process
> > would
> > > > be
> > > > > > > > > responsible for getting the daemon up and ready to run. This
> > is
> > > > > > > > > currently done inside a big VMM like QEMU but I suspect a
> > modern
> > > > > > > > > approach would be to use the rust-vmm vhost crate. It would
> > then
> > > > either
> > > > > > > > > communicate with the kernel's abstracted ABI or be re-
> > targeted
> > > > as a
> > > > > > > > > build option for the various hypervisors.
> > > > > > > >
> > > > > > > > One thing I mentioned before to Alex is that Xen doesn't have
> > VMMs
> > > > the
> > > > > > > > way they are typically envisioned and described in other
> > > > environments.
> > > > > > > > Instead, Xen has IOREQ servers. Each of them connects
> > > > independently to
> > > > > > > > Xen via the IOREQ interface. E.g. today multiple QEMUs could
> > be
> > > > used as
> > > > > > > > emulators for a single Xen VM, each of them connecting to Xen
> > > > > > > > independently via the IOREQ interface.
> > > > > > > >
> > > > > > > > The component responsible for starting a daemon and/or setting
> > up
> > > > shared
> > > > > > > > interfaces is the toolstack: the xl command and the
> > libxl/libxc
> > > > > > > > libraries.
> > > > > > >
> > > > > > > I think that VM configuration management (or orchestration in
> > > > Startos
> > > > > > > jargon?) is a subject to debate in parallel.
> > > > > > > Otherwise, is there any good assumption to avoid it right now?
> > > > > > >
> > > > > > > > Oleksandr and others I CCed have been working on ways for the
> > > > toolstack
> > > > > > > > to create virtio backends and setup memory mappings. They
> > might be
> > > > able
> > > > > > > > to provide more info on the subject. I do think we miss a way
> > to
> > > > provide
> > > > > > > > the configuration to the backend and anything else that the
> > > > backend
> > > > > > > > might require to start doing its job.
> > > > > >
> > > > > > Yes, some work has been done for the toolstack to handle Virtio
> > MMIO
> > > > devices in
> > > > > > general and Virtio block devices in particular. However, it has
> > not
> > > > been upstreaned yet.
> > > > > > Updated patches on review now:
> > > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-
> > email-
> > > > olekstysh@gmail.com/
> > > > > >
> > > > > > There is an additional (also important) activity to improve/fix
> > > > foreign memory mapping on Arm which I am also involved in.
> > > > > > The foreign memory mapping is proposed to be used for Virtio
> > backends
> > > > (device emulators) if there is a need to run guest OS completely
> > > > unmodified.
> > > > > > Of course, the more secure way would be to use grant memory
> > mapping.
> > > > Brietly, the main difference between them is that with foreign mapping
> > the
> > > > backend
> > > > > > can map any guest memory it wants to map, but with grant mapping
> > it is
> > > > allowed to map only what was previously granted by the frontend.
> > > > > >
> > > > > > So, there might be a problem if we want to pre-map some guest
> > memory
> > > > in advance or to cache mappings in the backend in order to improve
> > > > performance (because the mapping/unmapping guest pages every request
> > > > requires a lot of back and forth to Xen + P2M updates). In a nutshell,
> > > > currently, in order to map a guest page into the backend address space
> > we
> > > > need to steal a real physical page from the backend domain. So, with
> > the
> > > > said optimizations we might end up with no free memory in the backend
> > > > domain (see XSA-300). And what we try to achieve is to not waste a
> > real
> > > > domain memory at all by providing safe non-allocated-yet (so unused)
> > > > address space for the foreign (and grant) pages to be mapped into,
> > this
> > > > enabling work implies Xen and Linux (and likely DTB bindings) changes.
> > > > However, as it turned out, for this to work in a proper and safe way
> > some
> > > > prereq work needs to be done.
> > > > > > You can find the related Xen discussion at:
> > > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-
> > email-
> > > > olekstysh@gmail.com/
> > > > > >
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > One question is how to best handle notification and kicks.
> > The
> > > > existing
> > > > > > > > > vhost-user framework uses eventfd to signal the daemon
> > (although
> > > > QEMU
> > > > > > > > > is quite capable of simulating them when you use TCG). Xen
> > has
> > > > it's own
> > > > > > > > > IOREQ mechanism. However latency is an important factor and
> > > > having
> > > > > > > > > events go through the stub would add quite a lot.
> > > > > > > >
> > > > > > > > Yeah I think, regardless of anything else, we want the
> > backends to
> > > > > > > > connect directly to the Xen hypervisor.
> > > > > > >
> > > > > > > In my approach,
> > > > > > >  a) BE -> FE: interrupts triggered by BE calling a hypervisor
> > > > interface
> > > > > > >               via virtio-proxy
> > > > > > >  b) FE -> BE: MMIO to config raises events (in event channels),
> > > > which is
> > > > > > >               converted to a callback to BE via virtio-proxy
> > > > > > >               (Xen's event channel is internnally implemented by
> > > > interrupts.)
> > > > > > >
> > > > > > > I don't know what "connect directly" means here, but sending
> > > > interrupts
> > > > > > > to the opposite side would be best efficient.
> > > > > > > Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x
> > > > mechanism.
> > > > > >
> > > > > > Agree that MSI would be more efficient than SPI...
> > > > > > At the moment, in order to notify the frontend, the backend issues
> > a
> > > > specific device-model call to query Xen to inject a corresponding SPI
> > to
> > > > the guest.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > Could we consider the kernel internally converting IOREQ
> > > > messages from
> > > > > > > > > the Xen hypervisor to eventfd events? Would this scale with
> > > > other kernel
> > > > > > > > > hypercall interfaces?
> > > > > > > > >
> > > > > > > > > So any thoughts on what directions are worth experimenting
> > with?
> > > > > > > >
> > > > > > > > One option we should consider is for each backend to connect
> > to
> > > > Xen via
> > > > > > > > the IOREQ interface. We could generalize the IOREQ interface
> > and
> > > > make it
> > > > > > > > hypervisor agnostic. The interface is really trivial and easy
> > to
> > > > add.
> > > > > > >
> > > > > > > As I said above, my proposal does the same thing that you
> > mentioned
> > > > here :)
> > > > > > > The difference is that I do call hypervisor interfaces via
> > virtio-
> > > > proxy.
> > > > > > >
> > > > > > > > The only Xen-specific part is the notification mechanism,
> > which is
> > > > an
> > > > > > > > event channel. If we replaced the event channel with something
> > > > else the
> > > > > > > > interface would be generic. See:
> > > > > > > > https://gitlab.com/xen-project/xen/-
> > > > /blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > > > > >
> > > > > > > > I don't think that translating IOREQs to eventfd in the kernel
> > is
> > > > a
> > > > > > > > good idea: if feels like it would be extra complexity and that
> > the
> > > > > > > > kernel shouldn't be involved as this is a backend-hypervisor
> > > > interface.
> > > > > > >
> > > > > > > Given that we may want to implement BE as a bare-metal
> > application
> > > > > > > as I did on Zephyr, I don't think that the translation would not
> > be
> > > > > > > a big issue, especially on RTOS's.
> > > > > > > It will be some kind of abstraction layer of interrupt handling
> > > > > > > (or nothing but a callback mechanism).
> > > > > > >
> > > > > > > > Also, eventfd is very Linux-centric and we are trying to
> > design an
> > > > > > > > interface that could work well for RTOSes too. If we want to
> > do
> > > > > > > > something different, both OS-agnostic and hypervisor-agnostic,
> > > > perhaps
> > > > > > > > we could design a new interface. One that could be
> > implementable
> > > > in the
> > > > > > > > Xen hypervisor itself (like IOREQ) and of course any other
> > > > hypervisor
> > > > > > > > too.
> > > > > > > >
> > > > > > > >
> > > > > > > > There is also another problem. IOREQ is probably not be the
> > only
> > > > > > > > interface needed. Have a look at
> > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we
> > > > also need
> > > > > > > > an interface for the backend to inject interrupts into the
> > > > frontend? And
> > > > > > > > if the backend requires dynamic memory mappings of frontend
> > pages,
> > > > then
> > > > > > > > we would also need an interface to map/unmap domU pages.
> > > > > > >
> > > > > > > My proposal document might help here; All the interfaces
> > required
> > > > for
> > > > > > > virtio-proxy (or hypervisor-related interfaces) are listed as
> > > > > > > RPC protocols :)
> > > > > > >
> > > > > > > > These interfaces are a lot more problematic than IOREQ: IOREQ
> > is
> > > > tiny
> > > > > > > > and self-contained. It is easy to add anywhere. A new
> > interface to
> > > > > > > > inject interrupts or map pages is more difficult to manage
> > because
> > > > it
> > > > > > > > would require changes scattered across the various emulators.
> > > > > > >
> > > > > > > Exactly. I have no confident yet that my approach will also
> > apply
> > > > > > > to other hypervisors than Xen.
> > > > > > > Technically, yes, but whether people can accept it or not is a
> > > > different
> > > > > > > matter.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > -Takahiro Akashi
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > >
> > > > > > Oleksandr Tyshchenko


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [virtio-dev] Enabling hypervisor agnosticism for VirtIO backends
       [not found]   ` <20210820060558.GB13452@laputa>
@ 2021-08-21 14:08     ` Matias Ezequiel Vara Larsen
       [not found]       ` <20210823012029.GB40863@laputa>
  0 siblings, 1 reply; 66+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2021-08-21 14:08 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Alex Bennée, Stratos Mailing List, virtio-dev, Arnd Bergmann,
	Viresh Kumar, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier

Hello,

On Fri, Aug 20, 2021 at 03:05:58PM +0900, AKASHI Takahiro wrote:
> Hi Matias,
> 
> On Thu, Aug 19, 2021 at 11:11:55AM +0200, Matias Ezequiel Vara Larsen wrote:
> > Hello Alex,
> > 
> > I can tell you my experience from working on a PoC (library) 
> > to allow the implementation of virtio-devices that are hypervisor/OS agnostic. 
> 
> What hypervisor are you using for your PoC here?
> 

I am using an in-house hypervisor, which is similar to Jailhouse.

> > I focused on two use cases:
> > 1. type-I hypervisor in which the backend is running as a VM. This
> > is an in-house hypervisor that does not support VMExits.
> > 2. Linux user-space. In this case, the library is just used to
> > communicate threads. The goal of this use case is merely testing.
> > 
> > I have chosen virtio-mmio as the way to exchange information
> > between the frontend and backend. I found it hard to synchronize the
> > access to the virtio-mmio layout without VMExits. I had to add some extra bits to allow 
> 
> Can you explain how MMIOs to registers in virito-mmio layout
> (which I think means a configuration space?) will be propagated to BE?
> 

In this PoC, the BE guest is created with a fixed number of memory
regions, one representing each device. The BE initializes these regions
and then waits for the FEs to begin the initialization.
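
Roughly, the layout looks like the sketch below (simplified, and the
names are not the ones in the PoC): each region starts with the
virtio-mmio register window, initialized by the BE, followed by the area
the FE will later carve io-buffers out of.

    #include <stdint.h>

    #define MMIO_WINDOW_SIZE 0x200    /* virtio-mmio registers at the start */

    struct device_region {
            volatile uint8_t *base;   /* start of the shared region */
            uint64_t          size;   /* registers + io-buffer area */
    };

    static void be_init_region(struct device_region *r, uint32_t device_id)
    {
            volatile uint32_t *regs = (volatile uint32_t *)r->base;

            regs[0] = 0x74726976;     /* MagicValue ("virt") */
            regs[1] = 2;              /* Version */
            regs[2] = device_id;      /* DeviceID, e.g. 1 for virtio-net */
            /* ... remaining registers, then wait for the FE to start the
             *     device-status handshake ... */
    }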

> > the front-end and back-end to synchronize, which is required
> > during the device-status initialization. These extra bits would not be 
> > needed in case the hypervisor supports VMExits, e.g., KVM.
> > 
> > Each guest has a memory region that is shared with the backend. 
> > This memory region is used by the frontend to allocate the io-buffers. This region also 
> > maps the virtio-mmio layout that is initialized by the backend. For the moment, this region 
> > is defined when the guest is created. One limitation is that the memory for io-buffers is fixed. 
> 
> So in summary, you have a single memory region that is used
> for virtio-mmio layout and io-buffers (I think they are for payload)
> and you assume that the region will be (at lease for now) statically
> shared between FE and BE so that you can eliminate 'mmap' at every
> time to access the payload.
> Correct?
>

Yes, it is.

> If so, it can be an alternative solution for memory access issue,
> and a similar technique is used in some implementations:
> - (Jailhouse's) ivshmem
> - Arnd's fat virtqueue
>
> In either case, however, you will have to allocate payload from the region
> and so you will see some impact on FE code (at least at some low level).
> (In ivshmem, dma_ops in the kernel is defined for this purpose.)
> Correct?

Yes, it is. The FE implements a sort of malloc() to manage the allocation
of io-buffers from that memory region.
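
Something in the spirit of this sketch (a plain bump allocator, just to
show the idea, not the actual PoC code):

    #include <stddef.h>
    #include <stdint.h>

    struct iobuf_pool {
            uint8_t *base;   /* start of the io-buffer area */
            size_t   size;   /* bytes available */
            size_t   next;   /* bump pointer */
    };

    static void *iobuf_alloc(struct iobuf_pool *p, size_t len)
    {
            len = (len + 63) & ~(size_t)63;   /* keep buffers cache-line aligned */
            if (p->next + len > p->size)
                    return NULL;              /* pool exhausted */

            void *buf = p->base + p->next;
            p->next += len;
            return buf;
    }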

Thinking again about VMExits, I am not sure how this mechanism could be
used when both the FE and the BE are VMs. Using VMExits would require
involving the hypervisor.

Matias
> 
> -Takahiro Akashi
> 
> > At some point, the guest shall be able to balloon this region. Notifications between 
> > the frontend and the backend are implemented by using an hypercall. The hypercall 
> > mechanism and the memory allocation are abstracted away by a platform layer that 
> > exposes an interface that is hypervisor/os agnostic.
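
To give an idea of what that platform layer could look like (purely
illustrative names, not the actual PoC interface), it is the only part
that has to be reimplemented per hypervisor or OS:

    #include <stddef.h>
    #include <stdint.h>

    struct platform_ops {
            /* notify the peer: a hypercall on the hypervisor, something
             * like an eventfd when running as Linux user-space threads */
            void  (*notify)(uint32_t device_id);
            /* block until the peer has notified us */
            void  (*wait)(uint32_t device_id);
            /* return a pointer to the shared region of a device */
            void *(*map_region)(uint32_t device_id, size_t *size);
    };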
> > 
> > I split the backend into a virtio-device driver and a
> > backend driver. The virtio-device driver is the virtqueues and the
> > backend driver gets packets from the virtqueue for
> > post-processing. For example, in the case of virtio-net, the backend
> > driver would decide if the packet goes to the hardware or to another
> > virtio-net device. The virtio-device drivers may be
> > implemented in different ways like by using a single thread, multiple threads, 
> > or one thread for all the virtio-devices.
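
Purely as an illustration of that split (again, not the actual PoC
code), the boundary between the two is a small interface like:

    #include <stddef.h>

    /* implemented by the backend driver, e.g. virtio-net post-processing */
    struct backend_driver {
            /* called by the virtio-device driver for each buffer popped
             * from a virtqueue; decides where the data goes next */
            int   (*process)(void *opaque, void *buf, size_t len);
            void  *opaque;
    };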
> > 
> > In this PoC, I just tackled two very simple use-cases. These
> > use-cases allowed me to extract some requirements for an hypervisor to
> > support virtio.
> > 
> > Matias
> > 
> > On Wed, Aug 04, 2021 at 10:04:30AM +0100, Alex Bennée wrote:
> > > Hi,
> > > 
> > > One of the goals of Project Stratos is to enable hypervisor agnostic
> > > backends so we can enable as much re-use of code as possible and avoid
> > > repeating ourselves. This is the flip side of the front end where
> > > multiple front-end implementations are required - one per OS, assuming
> > > you don't just want Linux guests. The resultant guests are trivially
> > > movable between hypervisors modulo any abstracted paravirt type
> > > interfaces.
> > > 
> > > In my original thumb nail sketch of a solution I envisioned vhost-user
> > > daemons running in a broadly POSIX like environment. The interface to
> > > the daemon is fairly simple requiring only some mapped memory and some
> > > sort of signalling for events (on Linux this is eventfd). The idea was a
> > > stub binary would be responsible for any hypervisor specific setup and
> > > then launch a common binary to deal with the actual virtqueue requests
> > > themselves.
> > > 
> > > Since that original sketch we've seen an expansion in the sort of ways
> > > backends could be created. There is interest in encapsulating backends
> > > in RTOSes or unikernels for solutions like SCMI. There interest in Rust
> > > has prompted ideas of using the trait interface to abstract differences
> > > away as well as the idea of bare-metal Rust backends.
> > > 
> > > We have a card (STR-12) called "Hypercall Standardisation" which
> > > calls for a description of the APIs needed from the hypervisor side to
> > > support VirtIO guests and their backends. However we are some way off
> > > from that at the moment as I think we need to at least demonstrate one
> > > portable backend before we start codifying requirements. To that end I
> > > want to think about what we need for a backend to function.
> > > 
> > > Configuration
> > > =============
> > > 
> > > In the type-2 setup this is typically fairly simple because the host
> > > system can orchestrate the various modules that make up the complete
> > > system. In the type-1 case (or even type-2 with delegated service VMs)
> > > we need some sort of mechanism to inform the backend VM about key
> > > details about the system:
> > > 
> > >   - where virt queue memory is in it's address space
> > >   - how it's going to receive (interrupt) and trigger (kick) events
> > >   - what (if any) resources the backend needs to connect to
> > > 
> > > Obviously you can elide over configuration issues by having static
> > > configurations and baking the assumptions into your guest images however
> > > this isn't scalable in the long term. The obvious solution seems to be
> > > extending a subset of Device Tree data to user space but perhaps there
> > > are other approaches?
> > > 
> > > Before any virtio transactions can take place the appropriate memory
> > > mappings need to be made between the FE guest and the BE guest.
> > > Currently the whole of the FE guests address space needs to be visible
> > > to whatever is serving the virtio requests. I can envision 3 approaches:
> > > 
> > >  * BE guest boots with memory already mapped
> > > 
> > >  This would entail the guest OS knowing where in it's Guest Physical
> > >  Address space is already taken up and avoiding clashing. I would assume
> > >  in this case you would want a standard interface to userspace to then
> > >  make that address space visible to the backend daemon.
> > > 
> > >  * BE guests boots with a hypervisor handle to memory
> > > 
> > >  The BE guest is then free to map the FE's memory to where it wants in
> > >  the BE's guest physical address space. To activate the mapping will
> > >  require some sort of hypercall to the hypervisor. I can see two options
> > >  at this point:
> > > 
> > >   - expose the handle to userspace for daemon/helper to trigger the
> > >     mapping via existing hypercall interfaces. If using a helper you
> > >     would have a hypervisor specific one to avoid the daemon having to
> > >     care too much about the details or push that complexity into a
> > >     compile time option for the daemon which would result in different
> > >     binaries although a common source base.
> > > 
> > >   - expose a new kernel ABI to abstract the hypercall differences away
> > >     in the guest kernel. In this case the userspace would essentially
> > >     ask for an abstract "map guest N memory to userspace ptr" and let
> > >     the kernel deal with the different hypercall interfaces. This of
> > >     course assumes the majority of BE guests would be Linux kernels and
> > >     leaves the bare-metal/unikernel approaches to their own devices.
> > > 
> > > Operation
> > > =========
> > > 
> > > The core of the operation of VirtIO is fairly simple. Once the
> > > vhost-user feature negotiation is done it's a case of receiving update
> > > events and parsing the resultant virt queue for data. The vhost-user
> > > specification handles a bunch of setup before that point, mostly to
> > > detail where the virt queues are set up FD's for memory and event
> > > communication. This is where the envisioned stub process would be
> > > responsible for getting the daemon up and ready to run. This is
> > > currently done inside a big VMM like QEMU but I suspect a modern
> > > approach would be to use the rust-vmm vhost crate. It would then either
> > > communicate with the kernel's abstracted ABI or be re-targeted as a
> > > build option for the various hypervisors.
> > > 
> > > One question is how to best handle notification and kicks. The existing
> > > vhost-user framework uses eventfd to signal the daemon (although QEMU
> > > is quite capable of simulating them when you use TCG). Xen has it's own
> > > IOREQ mechanism. However latency is an important factor and having
> > > events go through the stub would add quite a lot.
> > > 
> > > Could we consider the kernel internally converting IOREQ messages from
> > > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > > hypercall interfaces?
> > > 
> > > So any thoughts on what directions are worth experimenting with?
> > > 
> > > -- 
> > > Alex Bennée
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > 

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-17 10:41     ` [virtio-dev] " Stefan Hajnoczi
  (?)
@ 2021-08-23  6:25     ` AKASHI Takahiro
  2021-08-23  9:58         ` [virtio-dev] " Stefan Hajnoczi
  -1 siblings, 1 reply; 66+ messages in thread
From: AKASHI Takahiro @ 2021-08-23  6:25 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefano Stabellini, Alex Bennée, Stratos Mailing List,
	virtio-dev, Arnd Bergmann, Viresh Kumar, Stefano Stabellini,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel

Hi Stefan,

On Tue, Aug 17, 2021 at 11:41:01AM +0100, Stefan Hajnoczi wrote:
> On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > > Could we consider the kernel internally converting IOREQ messages from
> > > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > > hypercall interfaces?
> > > 
> > > So any thoughts on what directions are worth experimenting with?
> >  
> > One option we should consider is for each backend to connect to Xen via
> > the IOREQ interface. We could generalize the IOREQ interface and make it
> > hypervisor agnostic. The interface is really trivial and easy to add.
> > The only Xen-specific part is the notification mechanism, which is an
> > event channel. If we replaced the event channel with something else the
> > interface would be generic. See:
> > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> 
> There have been experiments with something kind of similar in KVM
> recently (see struct ioregionfd_cmd):
> https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanasova@gmail.com/

Do you know the current status of Elena's work?
It was last February that she posted her latest patch
and it has not been merged upstream yet.

> > There is also another problem. IOREQ is probably not be the only
> > interface needed. Have a look at
> > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> > an interface for the backend to inject interrupts into the frontend? And
> > if the backend requires dynamic memory mappings of frontend pages, then
> > we would also need an interface to map/unmap domU pages.
> > 
> > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> > and self-contained. It is easy to add anywhere. A new interface to
> > inject interrupts or map pages is more difficult to manage because it
> > would require changes scattered across the various emulators.
> 
> Something like ioreq is indeed necessary to implement arbitrary devices,
> but if you are willing to restrict yourself to VIRTIO then other
> interfaces are possible too because the VIRTIO device model is different
> from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.

Can you please elaborate on your thoughts a bit more here?

It seems to me that trapping MMIOs to the configuration space and
forwarding those events to the BE (or device emulation) is quite a
straightforward way to emulate device MMIOs.
Or are you thinking of something like the protocols used in vhost-user?

# In contrast, virtio-ivshmem only requires the driver to explicitly
# forward the "write" side of MMIO accesses to the BE. But I don't think
# that is your point.
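
By "forwarding those events" I mean something as small as the record
below being handed to the BE together with a notification (a made-up
layout, not Xen's ioreq nor KVM's ioregionfd_cmd):

    #include <stdint.h>

    struct mmio_event {
            uint64_t addr;     /* guest-physical address that trapped */
            uint64_t data;     /* value written, or space for the value read */
            uint32_t size;     /* access width in bytes: 1, 2, 4 or 8 */
            uint8_t  is_write; /* 1 = write, 0 = read */
            uint8_t  pad[3];
    };

The BE loop would then be: wait for the notification, pop the event,
emulate the register access, post the completion and, if needed, ask for
an interrupt to be injected into the FE.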

-Takahiro Akashi

> Stefan




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-23  6:25     ` AKASHI Takahiro
@ 2021-08-23  9:58         ` Stefan Hajnoczi
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-08-23  9:58 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Stefano Stabellini, Alex Bennée, Stratos Mailing List,
	virtio-dev, Arnd Bergmann, Viresh Kumar, Stefano Stabellini,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel

[-- Attachment #1: Type: text/plain, Size: 3995 bytes --]

On Mon, Aug 23, 2021 at 03:25:00PM +0900, AKASHI Takahiro wrote:
> Hi Stefan,
> 
> On Tue, Aug 17, 2021 at 11:41:01AM +0100, Stefan Hajnoczi wrote:
> > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > > > Could we consider the kernel internally converting IOREQ messages from
> > > > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > > > hypercall interfaces?
> > > > 
> > > > So any thoughts on what directions are worth experimenting with?
> > >  
> > > One option we should consider is for each backend to connect to Xen via
> > > the IOREQ interface. We could generalize the IOREQ interface and make it
> > > hypervisor agnostic. The interface is really trivial and easy to add.
> > > The only Xen-specific part is the notification mechanism, which is an
> > > event channel. If we replaced the event channel with something else the
> > > interface would be generic. See:
> > > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > 
> > There have been experiments with something kind of similar in KVM
> > recently (see struct ioregionfd_cmd):
> > https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanasova@gmail.com/
> 
> Do you know the current status of Elena's work?
> It was last February that she posted her latest patch
> and it has not been merged upstream yet.

Elena worked on this during her Outreachy internship. At the moment no
one is actively working on the patches.

> > > There is also another problem. IOREQ is probably not be the only
> > > interface needed. Have a look at
> > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> > > an interface for the backend to inject interrupts into the frontend? And
> > > if the backend requires dynamic memory mappings of frontend pages, then
> > > we would also need an interface to map/unmap domU pages.
> > > 
> > > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> > > and self-contained. It is easy to add anywhere. A new interface to
> > > inject interrupts or map pages is more difficult to manage because it
> > > would require changes scattered across the various emulators.
> > 
> > Something like ioreq is indeed necessary to implement arbitrary devices,
> > but if you are willing to restrict yourself to VIRTIO then other
> > interfaces are possible too because the VIRTIO device model is different
> > from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.
> 
> Can you please elaborate your thoughts a bit more here?
> 
> It seems to me that trapping MMIOs to configuration space and
> forwarding those events to BE (or device emulation) is a quite
> straight-forward way to emulate device MMIOs.
> Or do you think of something of protocols used in vhost-user?
> 
> # On the contrary, virtio-ivshmem only requires a driver to explicitly
> # forward a "write" request of MMIO accesses to BE. But I don't think
> # it's your point. 

See my first reply to this email thread about alternative interfaces for
VIRTIO device emulation. The main thing to note was that although the
shared memory vring is used by VIRTIO transports today, the device model
actually allows transports to implement virtqueues differently (e.g.
making it possible to create a VIRTIO over TCP transport without shared
memory in the future).

It's possible to define a hypercall interface as a new VIRTIO transport
that provides higher-level virtqueue operations. Doing this is more work
than using vrings though since existing guest driver and device
emulation code already supports vrings.
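
To sketch what I mean by "higher-level virtqueue operations" (this is
not an existing interface, just the rough shape such a transport could
take):

    #include <stdint.h>

    /* One trap per operation; the hypervisor moves the buffers, so no
     * shared vring layout is exposed to either side. */
    enum vq_op {
            VQ_OP_ADD_BUFFER,    /* guest offers a buffer to the device */
            VQ_OP_GET_USED,      /* collect a completed buffer */
            VQ_OP_NOTIFY,        /* kick the other side */
    };

    struct vq_call {
            uint32_t device_id;
            uint16_t queue_index;
            uint16_t op;          /* enum vq_op */
            uint64_t buffer_gpa;  /* guest-physical address of the buffer */
            uint32_t len;
    };

    /* the single hypervisor-specific entry point; everything above it
     * could be common code shared by backends and guest drivers */
    long virtio_transport_call(struct vq_call *call);

The hypervisor (or a thin shim inside it) would then be the only
component that knows how the buffers actually move.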

I don't know the requirements of Stratos so I can't say if creating a
new hypervisor-independent interface (VIRTIO transport) that doesn't
rely on shared memory vrings makes sense. I just wanted to raise the
idea in case you find that VIRTIO's vrings don't meet your requirements.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-23  9:58         ` [virtio-dev] " Stefan Hajnoczi
  (?)
@ 2021-08-25 10:29         ` AKASHI Takahiro
  2021-08-25 15:02             ` [virtio-dev] " Stefan Hajnoczi
  -1 siblings, 1 reply; 66+ messages in thread
From: AKASHI Takahiro @ 2021-08-25 10:29 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefano Stabellini, Alex Bennée, Stratos Mailing List,
	virtio-dev, Arnd Bergmann, Viresh Kumar, Stefano Stabellini,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel

Hi Stefan,

On Mon, Aug 23, 2021 at 10:58:46AM +0100, Stefan Hajnoczi wrote:
> On Mon, Aug 23, 2021 at 03:25:00PM +0900, AKASHI Takahiro wrote:
> > Hi Stefan,
> > 
> > On Tue, Aug 17, 2021 at 11:41:01AM +0100, Stefan Hajnoczi wrote:
> > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > > > > Could we consider the kernel internally converting IOREQ messages from
> > > > > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > > > > hypercall interfaces?
> > > > > 
> > > > > So any thoughts on what directions are worth experimenting with?
> > > >  
> > > > One option we should consider is for each backend to connect to Xen via
> > > > the IOREQ interface. We could generalize the IOREQ interface and make it
> > > > hypervisor agnostic. The interface is really trivial and easy to add.
> > > > The only Xen-specific part is the notification mechanism, which is an
> > > > event channel. If we replaced the event channel with something else the
> > > > interface would be generic. See:
> > > > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > 
> > > There have been experiments with something kind of similar in KVM
> > > recently (see struct ioregionfd_cmd):
> > > https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanasova@gmail.com/
> > 
> > Do you know the current status of Elena's work?
> > It was last February that she posted her latest patch
> > and it has not been merged upstream yet.
> 
> Elena worked on this during her Outreachy internship. At the moment no
> one is actively working on the patches.

Does Red Hat plan to take over or follow up on her work?
# I'm simply asking out of curiosity.

> > > > There is also another problem. IOREQ is probably not the only
> > > > interface needed. Have a look at
> > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> > > > an interface for the backend to inject interrupts into the frontend? And
> > > > if the backend requires dynamic memory mappings of frontend pages, then
> > > > we would also need an interface to map/unmap domU pages.
> > > > 
> > > > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> > > > and self-contained. It is easy to add anywhere. A new interface to
> > > > inject interrupts or map pages is more difficult to manage because it
> > > > would require changes scattered across the various emulators.
> > > 
> > > Something like ioreq is indeed necessary to implement arbitrary devices,
> > > but if you are willing to restrict yourself to VIRTIO then other
> > > interfaces are possible too because the VIRTIO device model is different
> > > from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.
> > 
> > Can you please elaborate your thoughts a bit more here?
> > 
> > It seems to me that trapping MMIOs to configuration space and
> > forwarding those events to BE (or device emulation) is a quite
> > straight-forward way to emulate device MMIOs.
> > Or do you think of something of protocols used in vhost-user?
> > 
> > # On the contrary, virtio-ivshmem only requires a driver to explicitly
> > # forward a "write" request of MMIO accesses to BE. But I don't think
> > # it's your point. 
> 
> See my first reply to this email thread about alternative interfaces for
> VIRTIO device emulation. The main thing to note was that although the
> shared memory vring is used by VIRTIO transports today, the device model
> actually allows transports to implement virtqueues differently (e.g.
> making it possible to create a VIRTIO over TCP transport without shared
> memory in the future).

Do you have any examples of such use cases or systems?

> It's possible to define a hypercall interface as a new VIRTIO transport
> that provides higher-level virtqueue operations. Doing this is more work
> than using vrings though since existing guest driver and device
> emulation code already supports vrings.

Personally, I'm open to discussing your point, but

> I don't know the requirements of Stratos so I can't say if creating a
> new hypervisor-independent interface (VIRTIO transport) that doesn't
> rely on shared memory vrings makes sense. I just wanted to raise the
> idea in case you find that VIRTIO's vrings don't meet your requirements.

While I cannot represent the project's view, the JIRA task assigned
to me describes the following:
  Deliverables
    * Low-level library allowing:
      * management of virtio rings and buffers
  [and so on]
So supporting the shared-memory-based vring is one of our assumptions.
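
To make that deliverable a bit more concrete, here is a rough sketch of
the kind of API surface such a low-level library could expose. All names
are placeholders of mine; nothing here is agreed or implemented:

  #include <stddef.h>
  #include <stdint.h>
  #include <sys/uio.h>

  /* Hypervisor-agnostic vring helpers: hypervisor-specific code is
   * expected to have mapped the shared memory beforehand and to
   * provide the notification (kick) primitive separately. */
  struct vring_ctx;

  /* Attach to a split vring laid out in already-mapped shared memory. */
  struct vring_ctx *vring_attach(void *shared_mem, unsigned int num_descs);

  /* Fetch the next request chain posted by the frontend as an iovec. */
  int vring_next_request(struct vring_ctx *vr, struct iovec *iov,
                         unsigned int *iov_cnt, uint16_t *head);

  /* Mark a request as used; the caller then kicks the frontend through
   * whatever mechanism the hypervisor provides. */
  int vring_complete(struct vring_ctx *vr, uint16_t head, uint32_t written);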

In my understanding, the goal of the Stratos project is that we would
have several VMs consolidated onto a single SoC, while sharing most of
the physical IP blocks, where shared memory should be, I assume, the
most efficient transport for virtio.
One of the target applications would be automotive, I guess.

Alex and Mike should have more to say here.

-Takahiro Akashi

> Stefan




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-25 10:29         ` AKASHI Takahiro
@ 2021-08-25 15:02             ` Stefan Hajnoczi
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-08-25 15:02 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Stefano Stabellini, Alex Bennée, Stratos Mailing List,
	virtio-dev, Arnd Bergmann, Viresh Kumar, Stefano Stabellini,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel

[-- Attachment #1: Type: text/plain, Size: 5474 bytes --]

On Wed, Aug 25, 2021 at 07:29:45PM +0900, AKASHI Takahiro wrote:
> On Mon, Aug 23, 2021 at 10:58:46AM +0100, Stefan Hajnoczi wrote:
> > On Mon, Aug 23, 2021 at 03:25:00PM +0900, AKASHI Takahiro wrote:
> > > Hi Stefan,
> > > 
> > > On Tue, Aug 17, 2021 at 11:41:01AM +0100, Stefan Hajnoczi wrote:
> > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > > > > > Could we consider the kernel internally converting IOREQ messages from
> > > > > > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > > > > > hypercall interfaces?
> > > > > > 
> > > > > > So any thoughts on what directions are worth experimenting with?
> > > > >  
> > > > > One option we should consider is for each backend to connect to Xen via
> > > > > the IOREQ interface. We could generalize the IOREQ interface and make it
> > > > > hypervisor agnostic. The interface is really trivial and easy to add.
> > > > > The only Xen-specific part is the notification mechanism, which is an
> > > > > event channel. If we replaced the event channel with something else the
> > > > > interface would be generic. See:
> > > > > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > 
> > > > There have been experiments with something kind of similar in KVM
> > > > recently (see struct ioregionfd_cmd):
> > > > https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanasova@gmail.com/
> > > 
> > > Do you know the current status of Elena's work?
> > > It was last February that she posted her latest patch
> > > and it has not been merged upstream yet.
> > 
> > Elena worked on this during her Outreachy internship. At the moment no
> > one is actively working on the patches.
> 
> Does RedHat plan to take over or follow up her work hereafter?
> # I'm simply asking from my curiosity.

At the moment I'm not aware of anyone from Red Hat working on it. If
someone decides they need this KVM API then that could change.

> > > > > There is also another problem. IOREQ is probably not the only
> > > > > interface needed. Have a look at
> > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> > > > > an interface for the backend to inject interrupts into the frontend? And
> > > > > if the backend requires dynamic memory mappings of frontend pages, then
> > > > > we would also need an interface to map/unmap domU pages.
> > > > > 
> > > > > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> > > > > and self-contained. It is easy to add anywhere. A new interface to
> > > > > inject interrupts or map pages is more difficult to manage because it
> > > > > would require changes scattered across the various emulators.
> > > > 
> > > > Something like ioreq is indeed necessary to implement arbitrary devices,
> > > > but if you are willing to restrict yourself to VIRTIO then other
> > > > interfaces are possible too because the VIRTIO device model is different
> > > > from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.
> > > 
> > > Can you please elaborate your thoughts a bit more here?
> > > 
> > > It seems to me that trapping MMIOs to configuration space and
> > > forwarding those events to BE (or device emulation) is a quite
> > > straight-forward way to emulate device MMIOs.
> > > Or do you think of something of protocols used in vhost-user?
> > > 
> > > # On the contrary, virtio-ivshmem only requires a driver to explicitly
> > > # forward a "write" request of MMIO accesses to BE. But I don't think
> > > # it's your point. 
> > 
> > See my first reply to this email thread about alternative interfaces for
> > VIRTIO device emulation. The main thing to note was that although the
> > shared memory vring is used by VIRTIO transports today, the device model
> > actually allows transports to implement virtqueues differently (e.g.
> > making it possible to create a VIRTIO over TCP transport without shared
> > memory in the future).
> 
> Do you have any example of such use cases or systems?

This aspect of VIRTIO isn't being exploited today AFAIK. But the
layering to allow other virtqueue implementations is there. For example,
Linux's virtqueue API is independent of struct vring, so existing
drivers generally aren't tied to vrings.
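
For instance, a simplified (and untested) driver-side sketch that only
goes through the generic API declared in include/linux/virtio.h and
never touches struct vring directly:

  #include <linux/virtio.h>
  #include <linux/scatterlist.h>
  #include <linux/gfp.h>

  /* Queue one outgoing buffer; 'buf' doubles as the completion token. */
  static int send_one(struct virtqueue *vq, void *buf, unsigned int len)
  {
          struct scatterlist sg;
          int err;

          sg_init_one(&sg, buf, len);
          err = virtqueue_add_outbuf(vq, &sg, 1, buf, GFP_KERNEL);
          if (err)
                  return err;
          virtqueue_kick(vq);
          return 0;
  }

  /* Registered as the virtqueue callback when the queue is created. */
  static void send_done(struct virtqueue *vq)
  {
          unsigned int len;
          void *buf;

          while ((buf = virtqueue_get_buf(vq, &len)) != NULL)
                  ; /* hand 'buf' back to the upper layer here */
  }

How the buffers actually reach the device side is entirely up to the
transport sitting behind the virtqueue.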

> > It's possible to define a hypercall interface as a new VIRTIO transport
> > that provides higher-level virtqueue operations. Doing this is more work
> > than using vrings though since existing guest driver and device
> > emulation code already supports vrings.
> 
> Personally, I'm open to discuss about your point, but
> 
> > I don't know the requirements of Stratos so I can't say if creating a
> > new hypervisor-independent interface (VIRTIO transport) that doesn't
> > rely on shared memory vrings makes sense. I just wanted to raise the
> > idea in case you find that VIRTIO's vrings don't meet your requirements.
> 
> While I cannot represent the project's view, what the JIRA task
> that is assigned to me describes:
>   Deliverables
>     * Low level library allowing:
>     * management of virtio rings and buffers
>   [and so on]
> So supporting the shared memory-based vring is one of our assumptions.

If shared memory is allowed then vrings are the natural choice. That way
existing virtio code will work with minimal modifications.
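
As a small illustration of that reuse: the split-ring layout itself comes
straight from the exported <linux/virtio_ring.h> UAPI header, so a backend
mainly has to decide how much shared memory to set aside per queue.
Assuming that header is available to userspace, a quick sketch:

  #include <stdio.h>
  #include <linux/virtio_ring.h>

  /* Print the shared-memory footprint of a standard split vring for a
   * few queue depths (4096-byte used-ring alignment, as in the legacy
   * layout). */
  int main(void)
  {
          unsigned int num;

          for (num = 64; num <= 1024; num *= 4)
                  printf("%4u descriptors -> %u bytes of shared memory\n",
                         num, vring_size(num, 4096));
          return 0;
  }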

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-20  6:41                 ` AKASHI Takahiro
@ 2021-08-26  9:40                   ` AKASHI Takahiro
  2021-08-26 12:10                     ` Wei Chen
  0 siblings, 1 reply; 66+ messages in thread
From: AKASHI Takahiro @ 2021-08-26  9:40 UTC (permalink / raw)
  To: Wei Chen
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Alex Bennée, Kaly Xin,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant,
	Xen Devel

Hi Wei,

On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
> On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
> > Hi Akashi,
> > 
> > > -----Original Message-----
> > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > Sent: 18 August 2021 13:39
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>; Stratos
> > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
> > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > >
> > > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
> > > > Hi Akashi,
> > > >
> > > > > -----Original Message-----
> > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > Sent: 17 August 2021 16:08
> > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > > > > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>;
> > > Stratos
> > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > > dev@lists.oasis-
> > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> > > Julien
> > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > >
> > > > > Hi Wei, Oleksandr,
> > > > >
> > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> > > > > > Hi All,
> > > > > >
> > > > > > Thanks for Stefano to link my kvmtool for Xen proposal here.
> > > > > > This proposal is still discussing in Xen and KVM communities.
> > > > > > The main work is to decouple the kvmtool from KVM and make
> > > > > > other hypervisors can reuse the virtual device implementations.
> > > > > >
> > > > > > In this case, we need to introduce an intermediate hypervisor
> > > > > > layer for VMM abstraction, Which is, I think it's very close
> > > > > > to stratos' virtio hypervisor agnosticism work.
> > > > >
> > > > > # My proposal[1] comes from my own idea and doesn't always represent
> > > > > # Linaro's view on this subject nor reflect Alex's concerns.
> > > Nevertheless,
> > > > >
> > > > > Your idea and my proposal seem to share the same background.
> > > > > Both have the similar goal and currently start with, at first, Xen
> > > > > and are based on kvm-tool. (Actually, my work is derived from
> > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
> > > > >
> > > > > In particular, the abstraction of hypervisor interfaces has a same
> > > > > set of interfaces (for your "struct vmm_impl" and my "RPC interfaces").
> > > > > This is not co-incident as we both share the same origin as I said
> > > above.
> > > > > And so we will also share the same issues. One of them is a way of
> > > > > "sharing/mapping FE's memory". There is some trade-off between
> > > > > the portability and the performance impact.
> > > > > So we can discuss the topic here in this ML, too.
> > > > > (See Alex's original email, too).
> > > > >
> > > > Yes, I agree.
> > > >
> > > > > On the other hand, my approach aims to create a "single-binary"
> > > solution
> > > > > in which the same binary of BE vm could run on any hypervisors.
> > > > > Somehow similar to your "proposal-#2" in [2], but in my solution, all
> > > > > the hypervisor-specific code would be put into another entity (VM),
> > > > > named "virtio-proxy" and the abstracted operations are served via RPC.
> > > > > (In this sense, BE is hypervisor-agnostic but might have OS
> > > dependency.)
> > > > > But I know that we need discuss if this is a requirement even
> > > > > in Stratos project or not. (Maybe not)
> > > > >
> > > >
> > > > Sorry, I haven't had time to finish reading your virtio-proxy completely
> > > > (I will do it ASAP). But from your description, it seems we need a
> > > > 3rd VM between FE and BE? My concern is that, if my assumption is right,
> > > > will it increase the latency in data transport path? Even if we're
> > > > using some lightweight guest like RTOS or Unikernel,
> > >
> > > Yes, you're right. But I'm afraid that it is a matter of degree.
> > > As far as we execute 'mapping' operations at every fetch of payload,
> > > we will see latency issue (even in your case) and if we have some solution
> > > for it, we won't see it neither in my proposal :)
> > >
> > 
> > Oleksandr has sent a proposal to Xen mailing list to reduce this kind
> > of "mapping/unmapping" operations. So the latency caused by this behavior
> > on Xen may eventually be eliminated, and Linux-KVM doesn't have that problem.
> 
> Obviously, I have not yet caught up there in the discussion.
> Which patch specifically?

Can you give me the link to the discussion or patch, please?

Thanks,
-Takahiro Akashi

> -Takahiro Akashi
> 
> > > > > Specifically speaking about kvm-tool, I have a concern about its
> > > > > license term; Targeting different hypervisors and different OSs
> > > > > (which I assume includes RTOS's), the resultant library should be
> > > > > license permissive and GPL for kvm-tool might be an issue.
> > > > > Any thoughts?
> > > > >
> > > >
> > > > Yes. If user want to implement a FreeBSD device model, but the virtio
> > > > library is GPL. Then GPL would be a problem. If we have another good
> > > > candidate, I am open to it.
> > >
> > > I have some candidates, particularly for vq/vring, in my mind:
> > > * Open-AMP, or
> > > * corresponding Free-BSD code
> > >
> > 
> > Interesting, I will look into them : )
> > 
> > Cheers,
> > Wei Chen
> > 
> > > -Takahiro Akashi
> > >
> > >
> > > > > -Takahiro Akashi
> > > > >
> > > > >
> > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > > > > August/000548.html
> > > > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > >
> > > > > >
> > > > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > > > > > > > Sent: 14 August 2021 23:38
> > > > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano
> > > Stabellini
> > > > > <sstabellini@kernel.org>
> > > > > > > > Cc: Alex Bennée <alex.bennee@linaro.org>; Stratos Mailing List
> > > > > <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-open.org;
> > > Arnd
> > > > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>; Oleksandr
> > > > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> > > Julien
> > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > > > >
> > > > > > > Hello, all.
> > > > > > >
> > > > > > > Please see some comments below. And sorry for the possible format
> > > > > issues.
> > > > > > >
> > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> > > > > <mailto:takahiro.akashi@linaro.org> wrote:
> > > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini
> > > wrote:
> > > > > > > > > CCing people working on Xen+VirtIO and IOREQs. Not trimming
> > > the
> > > > > original
> > > > > > > > > email to let them read the full context.
> > > > > > > > >
> > > > > > > > > My comments below are related to a potential Xen
> > > implementation,
> > > > > not
> > > > > > > > > because it is the only implementation that matters, but
> > > because it
> > > > > is
> > > > > > > > > the one I know best.
> > > > > > > >
> > > > > > > > Please note that my proposal (and hence the working prototype)[1]
> > > > > > > > is based on Xen's virtio implementation (i.e. IOREQ) and
> > > > > particularly
> > > > > > > > EPAM's virtio-disk application (backend server).
> > > > > > > > It has been, I believe, well generalized but is still a bit
> > > biased
> > > > > > > > toward this original design.
> > > > > > > >
> > > > > > > > So I hope you like my approach :)
> > > > > > > >
> > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > > > > August/000546.html
> > > > > > > >
> > > > > > > > Let me take this opportunity to explain a bit more about my
> > > approach
> > > > > below.
> > > > > > > >
> > > > > > > > > Also, please see this relevant email thread:
> > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > One of the goals of Project Stratos is to enable hypervisor
> > > > > agnostic
> > > > > > > > > > backends so we can enable as much re-use of code as possible
> > > and
> > > > > avoid
> > > > > > > > > > repeating ourselves. This is the flip side of the front end
> > > > > where
> > > > > > > > > > multiple front-end implementations are required - one per OS,
> > > > > assuming
> > > > > > > > > > you don't just want Linux guests. The resultant guests are
> > > > > trivially
> > > > > > > > > > movable between hypervisors modulo any abstracted paravirt
> > > type
> > > > > > > > > > interfaces.
> > > > > > > > > >
> > > > > > > > > > In my original thumb nail sketch of a solution I envisioned
> > > > > vhost-user
> > > > > > > > > > daemons running in a broadly POSIX like environment. The
> > > > > interface to
> > > > > > > > > > the daemon is fairly simple requiring only some mapped
> > > memory
> > > > > and some
> > > > > > > > > > sort of signalling for events (on Linux this is eventfd).
> > > The
> > > > > idea was a
> > > > > > > > > > stub binary would be responsible for any hypervisor specific
> > > > > setup and
> > > > > > > > > > then launch a common binary to deal with the actual
> > > virtqueue
> > > > > requests
> > > > > > > > > > themselves.
> > > > > > > > > >
> > > > > > > > > > Since that original sketch we've seen an expansion in the
> > > sort
> > > > > of ways
> > > > > > > > > > backends could be created. There is interest in
> > > encapsulating
> > > > > backends
> > > > > > > > > > in RTOSes or unikernels for solutions like SCMI. There
> > > interest
> > > > > in Rust
> > > > > > > > > > has prompted ideas of using the trait interface to abstract
> > > > > differences
> > > > > > > > > > away as well as the idea of bare-metal Rust backends.
> > > > > > > > > >
> > > > > > > > > > We have a card (STR-12) called "Hypercall Standardisation"
> > > which
> > > > > > > > > > calls for a description of the APIs needed from the
> > > hypervisor
> > > > > side to
> > > > > > > > > > support VirtIO guests and their backends. However we are
> > > some
> > > > > way off
> > > > > > > > > > from that at the moment as I think we need to at least
> > > > > demonstrate one
> > > > > > > > > > portable backend before we start codifying requirements. To
> > > that
> > > > > end I
> > > > > > > > > > want to think about what we need for a backend to function.
> > > > > > > > > >
> > > > > > > > > > Configuration
> > > > > > > > > > =============
> > > > > > > > > >
> > > > > > > > > > In the type-2 setup this is typically fairly simple because
> > > the
> > > > > host
> > > > > > > > > > system can orchestrate the various modules that make up the
> > > > > complete
> > > > > > > > > > system. In the type-1 case (or even type-2 with delegated
> > > > > service VMs)
> > > > > > > > > > we need some sort of mechanism to inform the backend VM
> > > about
> > > > > key
> > > > > > > > > > details about the system:
> > > > > > > > > >
> > > > > > > > > >   - where virt queue memory is in it's address space
> > > > > > > > > >   - how it's going to receive (interrupt) and trigger (kick)
> > > > > events
> > > > > > > > > >   - what (if any) resources the backend needs to connect to
> > > > > > > > > >
> > > > > > > > > > Obviously you can elide over configuration issues by having
> > > > > static
> > > > > > > > > > configurations and baking the assumptions into your guest
> > > images
> > > > > however
> > > > > > > > > > this isn't scalable in the long term. The obvious solution
> > > seems
> > > > > to be
> > > > > > > > > > extending a subset of Device Tree data to user space but
> > > perhaps
> > > > > there
> > > > > > > > > > are other approaches?
> > > > > > > > > >
> > > > > > > > > > Before any virtio transactions can take place the
> > > appropriate
> > > > > memory
> > > > > > > > > > mappings need to be made between the FE guest and the BE
> > > guest.
> > > > > > > > >
> > > > > > > > > > Currently the whole of the FE guests address space needs to
> > > be
> > > > > visible
> > > > > > > > > > to whatever is serving the virtio requests. I can envision 3
> > > > > approaches:
> > > > > > > > > >
> > > > > > > > > >  * BE guest boots with memory already mapped
> > > > > > > > > >
> > > > > > > > > >  This would entail the guest OS knowing where in it's Guest
> > > > > Physical
> > > > > > > > > >  Address space is already taken up and avoiding clashing. I
> > > > > would assume
> > > > > > > > > >  in this case you would want a standard interface to
> > > userspace
> > > > > to then
> > > > > > > > > >  make that address space visible to the backend daemon.
> > > > > > > >
> > > > > > > > Yet another way here is that we would have well known "shared
> > > > > memory" between
> > > > > > > > VMs. I think that Jailhouse's ivshmem gives us good insights on
> > > this
> > > > > matter
> > > > > > > > and that it can even be an alternative for hypervisor-agnostic
> > > > > solution.
> > > > > > > >
> > > > > > > > (Please note memory regions in ivshmem appear as a PCI device
> > > and
> > > > > can be
> > > > > > > > mapped locally.)
> > > > > > > >
> > > > > > > > I want to add this shared memory aspect to my virtio-proxy, but
> > > > > > > > the resultant solution would eventually look similar to ivshmem.
> > > > > > > >
> > > > > > > > > >  * BE guests boots with a hypervisor handle to memory
> > > > > > > > > >
> > > > > > > > > >  The BE guest is then free to map the FE's memory to where
> > > it
> > > > > wants in
> > > > > > > > > >  the BE's guest physical address space.
> > > > > > > > >
> > > > > > > > > I cannot see how this could work for Xen. There is no "handle"
> > > to
> > > > > give
> > > > > > > > > to the backend if the backend is not running in dom0. So for
> > > Xen I
> > > > > think
> > > > > > > > > the memory has to be already mapped
> > > > > > > >
> > > > > > > > In Xen's IOREQ solution (virtio-blk), the following information
> > > is
> > > > > expected
> > > > > > > > to be exposed to BE via Xenstore:
> > > > > > > > (I know that this is a tentative approach though.)
> > > > > > > >    - the start address of configuration space
> > > > > > > >    - interrupt number
> > > > > > > >    - file path for backing storage
> > > > > > > >    - read-only flag
> > > > > > > > And the BE server have to call a particular hypervisor interface
> > > to
> > > > > > > > map the configuration space.
> > > > > > >
> > > > > > > Yes, Xenstore was chosen as a simple way to pass configuration
> > > info to
> > > > > the backend running in a non-toolstack domain.
> > > > > > > I remember, there was a wish to avoid using Xenstore in Virtio
> > > backend
> > > > > itself if possible, so for non-toolstack domain, this could done with
> > > > > adjusting devd (daemon that listens for devices and launches backends)
> > > > > > > to read backend configuration from the Xenstore anyway and pass it
> > > to
> > > > > the backend via command line arguments.
> > > > > > >
> > > > > >
> > > > > > Yes, in current PoC code we're using xenstore to pass device
> > > > > configuration.
> > > > > > We also designed a static device configuration parse method for
> > > Dom0less
> > > > > or
> > > > > > other scenarios don't have xentool. yes, it's from device model
> > > command
> > > > > line
> > > > > > or a config file.
> > > > > >
> > > > > > > But, if ...
> > > > > > >
> > > > > > > >
> > > > > > > > In my approach (virtio-proxy), all those Xen (or hypervisor)-
> > > > > specific
> > > > > > > > stuffs are contained in virtio-proxy, yet another VM, to hide
> > > all
> > > > > details.
> > > > > > >
> > > > > > > ... the solution how to overcome that is already found and proven
> > > to
> > > > > work then even better.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > # My point is that a "handle" is not mandatory for executing
> > > mapping.
> > > > > > > >
> > > > > > > > > and the mapping probably done by the
> > > > > > > > > toolstack (also see below.) Or we would have to invent a new
> > > Xen
> > > > > > > > > hypervisor interface and Xen virtual machine privileges to
> > > allow
> > > > > this
> > > > > > > > > kind of mapping.
> > > > > > > >
> > > > > > > > > If we run the backend in Dom0 that we have no problems of
> > > course.
> > > > > > > >
> > > > > > > > One of difficulties on Xen that I found in my approach is that
> > > > > calling
> > > > > > > > such hypervisor intefaces (registering IOREQ, mapping memory) is
> > > > > only
> > > > > > > > allowed on BE servers themselves and so we will have to extend
> > > > > those
> > > > > > > > interfaces.
> > > > > > > > This, however, will raise some concern on security and privilege
> > > > > distribution
> > > > > > > > as Stefan suggested.
> > > > > > >
> > > > > > > We also faced policy related issues with Virtio backend running in
> > > > > other than Dom0 domain in a "dummy" xsm mode. In our target system we
> > > run
> > > > > the backend in a driver
> > > > > > > domain (we call it DomD) where the underlying H/W resides. We
> > > trust it,
> > > > > so we wrote policy rules (to be used in "flask" xsm mode) to provide
> > > it
> > > > > with a little bit more privileges than a simple DomU had.
> > > > > > > Now it is permitted to issue device-model, resource and memory
> > > > > mappings, etc calls.
> > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > To activate the mapping will
> > > > > > > > > >  require some sort of hypercall to the hypervisor. I can see
> > > two
> > > > > options
> > > > > > > > > >  at this point:
> > > > > > > > > >
> > > > > > > > > >   - expose the handle to userspace for daemon/helper to
> > > trigger
> > > > > the
> > > > > > > > > >     mapping via existing hypercall interfaces. If using a
> > > helper
> > > > > you
> > > > > > > > > >     would have a hypervisor specific one to avoid the daemon
> > > > > having to
> > > > > > > > > >     care too much about the details or push that complexity
> > > into
> > > > > a
> > > > > > > > > >     compile time option for the daemon which would result in
> > > > > different
> > > > > > > > > >     binaries although a common source base.
> > > > > > > > > >
> > > > > > > > > >   - expose a new kernel ABI to abstract the hypercall
> > > > > differences away
> > > > > > > > > >     in the guest kernel. In this case the userspace would
> > > > > essentially
> > > > > > > > > >     ask for an abstract "map guest N memory to userspace
> > > ptr"
> > > > > and let
> > > > > > > > > >     the kernel deal with the different hypercall interfaces.
> > > > > This of
> > > > > > > > > >     course assumes the majority of BE guests would be Linux
> > > > > kernels and
> > > > > > > > > >     leaves the bare-metal/unikernel approaches to their own
> > > > > devices.
> > > > > > > > > >
> > > > > > > > > > Operation
> > > > > > > > > > =========
> > > > > > > > > >
> > > > > > > > > > The core of the operation of VirtIO is fairly simple. Once
> > > the
> > > > > > > > > > vhost-user feature negotiation is done it's a case of
> > > receiving
> > > > > update
> > > > > > > > > > events and parsing the resultant virt queue for data. The
> > > vhost-
> > > > > user
> > > > > > > > > > specification handles a bunch of setup before that point,
> > > mostly
> > > > > to
> > > > > > > > > > detail where the virt queues are set up FD's for memory and
> > > > > event
> > > > > > > > > > communication. This is where the envisioned stub process
> > > would
> > > > > be
> > > > > > > > > > responsible for getting the daemon up and ready to run. This
> > > is
> > > > > > > > > > currently done inside a big VMM like QEMU but I suspect a
> > > modern
> > > > > > > > > > approach would be to use the rust-vmm vhost crate. It would
> > > then
> > > > > either
> > > > > > > > > > communicate with the kernel's abstracted ABI or be re-
> > > targeted
> > > > > as a
> > > > > > > > > > build option for the various hypervisors.
> > > > > > > > >
> > > > > > > > > One thing I mentioned before to Alex is that Xen doesn't have
> > > VMMs
> > > > > the
> > > > > > > > > way they are typically envisioned and described in other
> > > > > environments.
> > > > > > > > > Instead, Xen has IOREQ servers. Each of them connects
> > > > > independently to
> > > > > > > > > Xen via the IOREQ interface. E.g. today multiple QEMUs could
> > > be
> > > > > used as
> > > > > > > > > emulators for a single Xen VM, each of them connecting to Xen
> > > > > > > > > independently via the IOREQ interface.
> > > > > > > > >
> > > > > > > > > The component responsible for starting a daemon and/or setting
> > > up
> > > > > shared
> > > > > > > > > interfaces is the toolstack: the xl command and the
> > > libxl/libxc
> > > > > > > > > libraries.
> > > > > > > >
> > > > > > > > I think that VM configuration management (or orchestration in
> > > > > Stratos
> > > > > > > > jargon?) is a subject to debate in parallel.
> > > > > > > > Otherwise, is there any good assumption to avoid it right now?
> > > > > > > >
> > > > > > > > > Oleksandr and others I CCed have been working on ways for the
> > > > > toolstack
> > > > > > > > > to create virtio backends and setup memory mappings. They
> > > might be
> > > > > able
> > > > > > > > > to provide more info on the subject. I do think we miss a way
> > > to
> > > > > provide
> > > > > > > > > the configuration to the backend and anything else that the
> > > > > backend
> > > > > > > > > might require to start doing its job.
> > > > > > >
> > > > > > > Yes, some work has been done for the toolstack to handle Virtio
> > > MMIO
> > > > > devices in
> > > > > > > general and Virtio block devices in particular. However, it has
> > > not
> > > > > been upstreaned yet.
> > > > > > > Updated patches on review now:
> > > > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-
> > > email-
> > > > > olekstysh@gmail.com/
> > > > > > >
> > > > > > > There is an additional (also important) activity to improve/fix
> > > > > foreign memory mapping on Arm which I am also involved in.
> > > > > > > The foreign memory mapping is proposed to be used for Virtio
> > > backends
> > > > > (device emulators) if there is a need to run guest OS completely
> > > > > unmodified.
> > > > > > > Of course, the more secure way would be to use grant memory
> > > mapping.
> > > > > Briefly, the main difference between them is that with foreign mapping
> > > the
> > > > > backend
> > > > > > > can map any guest memory it wants to map, but with grant mapping
> > > it is
> > > > > allowed to map only what was previously granted by the frontend.
> > > > > > >
> > > > > > > So, there might be a problem if we want to pre-map some guest
> > > memory
> > > > > in advance or to cache mappings in the backend in order to improve
> > > > > performance (because the mapping/unmapping guest pages every request
> > > > > requires a lot of back and forth to Xen + P2M updates). In a nutshell,
> > > > > currently, in order to map a guest page into the backend address space
> > > we
> > > > > need to steal a real physical page from the backend domain. So, with
> > > the
> > > > > said optimizations we might end up with no free memory in the backend
> > > > > domain (see XSA-300). And what we try to achieve is to not waste a
> > > real
> > > > > domain memory at all by providing safe non-allocated-yet (so unused)
> > > > > address space for the foreign (and grant) pages to be mapped into,
> > > this
> > > > > enabling work implies Xen and Linux (and likely DTB bindings) changes.
> > > > > However, as it turned out, for this to work in a proper and safe way
> > > some
> > > > > prereq work needs to be done.
> > > > > > > You can find the related Xen discussion at:
> > > > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-
> > > email-
> > > > > olekstysh@gmail.com/
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > One question is how to best handle notification and kicks.
> > > The
> > > > > existing
> > > > > > > > > > vhost-user framework uses eventfd to signal the daemon
> > > (although
> > > > > QEMU
> > > > > > > > > > is quite capable of simulating them when you use TCG). Xen
> > > has
> > > > > it's own
> > > > > > > > > > IOREQ mechanism. However latency is an important factor and
> > > > > having
> > > > > > > > > > events go through the stub would add quite a lot.
> > > > > > > > >
> > > > > > > > > Yeah I think, regardless of anything else, we want the
> > > backends to
> > > > > > > > > connect directly to the Xen hypervisor.
> > > > > > > >
> > > > > > > > In my approach,
> > > > > > > >  a) BE -> FE: interrupts triggered by BE calling a hypervisor
> > > > > interface
> > > > > > > >               via virtio-proxy
> > > > > > > >  b) FE -> BE: MMIO to config raises events (in event channels),
> > > > > which is
> > > > > > > >               converted to a callback to BE via virtio-proxy
> > > > > > > >               (Xen's event channel is internally implemented by
> > > > > interrupts.)
> > > > > > > >
> > > > > > > > I don't know what "connect directly" means here, but sending
> > > > > interrupts
> > > > > > > > to the opposite side would be best efficient.
> > > > > > > > Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x
> > > > > mechanism.
> > > > > > >
> > > > > > > Agree that MSI would be more efficient than SPI...
> > > > > > > At the moment, in order to notify the frontend, the backend issues
> > > a
> > > > > specific device-model call to query Xen to inject a corresponding SPI
> > > to
> > > > > the guest.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > Could we consider the kernel internally converting IOREQ
> > > > > messages from
> > > > > > > > > > the Xen hypervisor to eventfd events? Would this scale with
> > > > > other kernel
> > > > > > > > > > hypercall interfaces?
> > > > > > > > > >
> > > > > > > > > > So any thoughts on what directions are worth experimenting
> > > with?
> > > > > > > > >
> > > > > > > > > One option we should consider is for each backend to connect
> > > to
> > > > > Xen via
> > > > > > > > > the IOREQ interface. We could generalize the IOREQ interface
> > > and
> > > > > make it
> > > > > > > > > hypervisor agnostic. The interface is really trivial and easy
> > > to
> > > > > add.
> > > > > > > >
> > > > > > > > As I said above, my proposal does the same thing that you
> > > mentioned
> > > > > here :)
> > > > > > > > The difference is that I do call hypervisor interfaces via
> > > virtio-
> > > > > proxy.
> > > > > > > >
> > > > > > > > > The only Xen-specific part is the notification mechanism,
> > > which is
> > > > > an
> > > > > > > > > event channel. If we replaced the event channel with something
> > > > > else the
> > > > > > > > > interface would be generic. See:
> > > > > > > > > https://gitlab.com/xen-project/xen/-
> > > > > /blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > > > > > >
> > > > > > > > > I don't think that translating IOREQs to eventfd in the kernel
> > > is
> > > > > a
> > > > > > > > > good idea: if feels like it would be extra complexity and that
> > > the
> > > > > > > > > kernel shouldn't be involved as this is a backend-hypervisor
> > > > > interface.
> > > > > > > >
> > > > > > > > Given that we may want to implement BE as a bare-metal
> > > application
> > > > > > > > as I did on Zephyr, I don't think that the translation would not
> > > be
> > > > > > > > a big issue, especially on RTOS's.
> > > > > > > > It will be some kind of abstraction layer of interrupt handling
> > > > > > > > (or nothing but a callback mechanism).
> > > > > > > >
> > > > > > > > > Also, eventfd is very Linux-centric and we are trying to
> > > design an
> > > > > > > > > interface that could work well for RTOSes too. If we want to
> > > do
> > > > > > > > > something different, both OS-agnostic and hypervisor-agnostic,
> > > > > perhaps
> > > > > > > > > we could design a new interface. One that could be
> > > implementable
> > > > > in the
> > > > > > > > > Xen hypervisor itself (like IOREQ) and of course any other
> > > > > hypervisor
> > > > > > > > > too.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > There is also another problem. IOREQ is probably not be the
> > > only
> > > > > > > > > interface needed. Have a look at
> > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we
> > > > > also need
> > > > > > > > > an interface for the backend to inject interrupts into the
> > > > > frontend? And
> > > > > > > > > if the backend requires dynamic memory mappings of frontend
> > > pages,
> > > > > then
> > > > > > > > > we would also need an interface to map/unmap domU pages.
> > > > > > > >
> > > > > > > > My proposal document might help here; All the interfaces
> > > required
> > > > > for
> > > > > > > > virtio-proxy (or hypervisor-related interfaces) are listed as
> > > > > > > > RPC protocols :)
> > > > > > > >
> > > > > > > > > These interfaces are a lot more problematic than IOREQ: IOREQ
> > > is
> > > > > tiny
> > > > > > > > > and self-contained. It is easy to add anywhere. A new
> > > interface to
> > > > > > > > > inject interrupts or map pages is more difficult to manage
> > > because
> > > > > it
> > > > > > > > > would require changes scattered across the various emulators.
> > > > > > > >
> > > > > > > > Exactly. I have no confidence yet that my approach will also
> > > apply
> > > > > > > > to other hypervisors than Xen.
> > > > > > > > Technically, yes, but whether people can accept it or not is a
> > > > > different
> > > > > > > > matter.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > -Takahiro Akashi
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > >
> > > > > > > Oleksandr Tyshchenko


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-26  9:40                   ` AKASHI Takahiro
@ 2021-08-26 12:10                     ` Wei Chen
  2021-08-30 19:36                       ` Christopher Clark
  2021-08-31  6:18                       ` AKASHI Takahiro
  0 siblings, 2 replies; 66+ messages in thread
From: Wei Chen @ 2021-08-26 12:10 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Alex Bennée, Kaly Xin,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant,
	Xen Devel

Hi Akashi,

> -----Original Message-----
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> Sent: 26 August 2021 17:41
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>; Kaly Xin
> <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-lists.linaro.org>;
> virtio-dev@lists.oasis-open.org; Arnd Bergmann <arnd.bergmann@linaro.org>;
> Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
>
> Hi Wei,
>
> On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
> > On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
> > > Hi Akashi,
> > >
> > > > -----Original Message-----
> > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > Sent: 18 August 2021 13:39
> > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > > > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>;
> Stratos
> > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> dev@lists.oasis-
> > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > <jan.kiszka@siemens.com>; Carl van Schaik
> <cvanscha@qti.qualcomm.com>;
> > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> Jean-
> > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> Julien
> > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> Durrant
> > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > >
> > > > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
> > > > > Hi Akashi,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > > Sent: 17 August 2021 16:08
> > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> Stabellini
> > > > > > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>;
> > > > Stratos
> > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > > > dev@lists.oasis-
> > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> <cvanscha@qti.qualcomm.com>;
> > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> Jean-
> > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> <Artem_Mygaiev@epam.com>;
> > > > Julien
> > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> Durrant
> > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > > >
> > > > > > Hi Wei, Oleksandr,
> > > > > >
> > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> > > > > > > Hi All,
> > > > > > >
> > > > > > > Thanks for Stefano to link my kvmtool for Xen proposal here.
> > > > > > > This proposal is still discussing in Xen and KVM communities.
> > > > > > > The main work is to decouple the kvmtool from KVM and make
> > > > > > > other hypervisors can reuse the virtual device implementations.
> > > > > > >
> > > > > > > In this case, we need to introduce an intermediate hypervisor
> > > > > > > layer for VMM abstraction, Which is, I think it's very close
> > > > > > > to stratos' virtio hypervisor agnosticism work.
> > > > > >
> > > > > > # My proposal[1] comes from my own idea and doesn't always
> represent
> > > > > > # Linaro's view on this subject nor reflect Alex's concerns.
> > > > Nevertheless,
> > > > > >
> > > > > > Your idea and my proposal seem to share the same background.
> > > > > > Both have the similar goal and currently start with, at first,
> Xen
> > > > > > and are based on kvm-tool. (Actually, my work is derived from
> > > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
> > > > > >
> > > > > > In particular, the abstraction of hypervisor interfaces has a
> same
> > > > > > set of interfaces (for your "struct vmm_impl" and my "RPC
> interfaces").
> > > > > > This is not co-incident as we both share the same origin as I
> said
> > > > above.
> > > > > > And so we will also share the same issues. One of them is a way
> of
> > > > > > "sharing/mapping FE's memory". There is some trade-off between
> > > > > > the portability and the performance impact.
> > > > > > So we can discuss the topic here in this ML, too.
> > > > > > (See Alex's original email, too).
> > > > > >
> > > > > Yes, I agree.
> > > > >
> > > > > > On the other hand, my approach aims to create a "single-binary"
> > > > solution
> > > > > > in which the same binary of BE vm could run on any hypervisors.
> > > > > > Somehow similar to your "proposal-#2" in [2], but in my solution,
> all
> > > > > > the hypervisor-specific code would be put into another entity
> (VM),
> > > > > > named "virtio-proxy" and the abstracted operations are served
> via RPC.
> > > > > > (In this sense, BE is hypervisor-agnostic but might have OS
> > > > dependency.)
> > > > > > But I know that we need discuss if this is a requirement even
> > > > > > in Stratos project or not. (Maybe not)
> > > > > >
> > > > >
> > > > > Sorry, I haven't had time to finish reading your virtio-proxy
> completely
> > > > > (I will do it ASAP). But from your description, it seems we need a
> > > > > 3rd VM between FE and BE? My concern is that, if my assumption is
> right,
> > > > > will it increase the latency in data transport path? Even if we're
> > > > > using some lightweight guest like RTOS or Unikernel,
> > > >
> > > > Yes, you're right. But I'm afraid that it is a matter of degree.
> > > > As far as we execute 'mapping' operations at every fetch of payload,
> > > > we will see latency issue (even in your case) and if we have some
> solution
> > > > for it, we won't see it neither in my proposal :)
> > > >
> > >
> > > Oleksandr has sent a proposal to Xen mailing list to reduce this kind
> > > of "mapping/unmapping" operations. So the latency caused by this
> behavior
> > > on Xen may eventually be eliminated, and Linux-KVM doesn't have that
> problem.
> >
> > Obviously, I have not yet caught up there in the discussion.
> > Which patch specifically?
>
> Can you give me the link to the discussion or patch, please?
>

It's an RFC discussion. We have tested this RFC patch internally.
https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html

> Thanks,
> -Takahiro Akashi
>
> > -Takahiro Akashi
> >
> > > > > > Specifically speaking about kvm-tool, I have a concern about its
> > > > > > license term; Targeting different hypervisors and different OSs
> > > > > > (which I assume includes RTOS's), the resultant library should
> be
> > > > > > license permissive and GPL for kvm-tool might be an issue.
> > > > > > Any thoughts?
> > > > > >
> > > > >
> > > > > Yes. If user want to implement a FreeBSD device model, but the
> virtio
> > > > > library is GPL. Then GPL would be a problem. If we have another
> good
> > > > > candidate, I am open to it.
> > > >
> > > > I have some candidates, particularly for vq/vring, in my mind:
> > > > * Open-AMP, or
> > > > * corresponding Free-BSD code
> > > >
> > >
> > > Interesting, I will look into them : )
> > >
> > > Cheers,
> > > Wei Chen
> > >
> > > > -Takahiro Akashi
> > > >
> > > >
> > > > > > -Takahiro Akashi
> > > > > >
> > > > > >
> > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > > > > > August/000548.html
> > > > > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > >
> > > > > > >
> > > > > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > > > > > > > > Sent: August 14, 2021 23:38
> > > > > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano
> > > > Stabellini
> > > > > > <sstabellini@kernel.org>
> > > > > > > > > Cc: Alex Bennée <alex.bennee@linaro.org>; Stratos Mailing
> List
> > > > > > <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
> open.org;
> > > > Arnd
> > > > > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> <cvanscha@qti.qualcomm.com>;
> > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> Jean-
> > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>;
> Oleksandr
> > > > > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> <Artem_Mygaiev@epam.com>;
> > > > Julien
> > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> Durrant
> > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> backends
> > > > > > > >
> > > > > > > > Hello, all.
> > > > > > > >
> > > > > > > > Please see some comments below. And sorry for the possible
> format
> > > > > > issues.
> > > > > > > >
> > > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> > > > > > <mailto:takahiro.akashi@linaro.org> wrote:
> > > > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano
> Stabellini
> > > > wrote:
> > > > > > > > > > CCing people working on Xen+VirtIO and IOREQs. Not
> trimming
> > > > the
> > > > > > original
> > > > > > > > > > email to let them read the full context.
> > > > > > > > > >
> > > > > > > > > > My comments below are related to a potential Xen
> > > > implementation,
> > > > > > not
> > > > > > > > > > because it is the only implementation that matters, but
> > > > because it
> > > > > > is
> > > > > > > > > > the one I know best.
> > > > > > > > >
> > > > > > > > > Please note that my proposal (and hence the working
> prototype)[1]
> > > > > > > > > is based on Xen's virtio implementation (i.e. IOREQ) and
> > > > > > particularly
> > > > > > > > > EPAM's virtio-disk application (backend server).
> > > > > > > > > It has been, I believe, well generalized but is still a
> bit
> > > > biased
> > > > > > > > > toward this original design.
> > > > > > > > >
> > > > > > > > > So I hope you like my approach :)
> > > > > > > > >
> > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
> dev/2021-
> > > > > > August/000546.html
> > > > > > > > >
> > > > > > > > > Let me take this opportunity to explain a bit more about
> my
> > > > approach
> > > > > > below.
> > > > > > > > >
> > > > > > > > > > Also, please see this relevant email thread:
> > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > One of the goals of Project Stratos is to enable
> hypervisor
> > > > > > agnostic
> > > > > > > > > > > backends so we can enable as much re-use of code as
> possible
> > > > and
> > > > > > avoid
> > > > > > > > > > > repeating ourselves. This is the flip side of the
> front end
> > > > > > where
> > > > > > > > > > > multiple front-end implementations are required - one
> per OS,
> > > > > > assuming
> > > > > > > > > > > you don't just want Linux guests. The resultant guests
> are
> > > > > > trivially
> > > > > > > > > > > movable between hypervisors modulo any abstracted
> paravirt
> > > > type
> > > > > > > > > > > interfaces.
> > > > > > > > > > >
> > > > > > > > > > > In my original thumb nail sketch of a solution I
> envisioned
> > > > > > vhost-user
> > > > > > > > > > > daemons running in a broadly POSIX like environment.
> The
> > > > > > interface to
> > > > > > > > > > > the daemon is fairly simple requiring only some mapped
> > > > memory
> > > > > > and some
> > > > > > > > > > > sort of signalling for events (on Linux this is
> eventfd).
> > > > The
> > > > > > idea was a
> > > > > > > > > > > stub binary would be responsible for any hypervisor
> specific
> > > > > > setup and
> > > > > > > > > > > then launch a common binary to deal with the actual
> > > > virtqueue
> > > > > > requests
> > > > > > > > > > > themselves.
> > > > > > > > > > >
> > > > > > > > > > > Since that original sketch we've seen an expansion in
> the
> > > > sort
> > > > > > of ways
> > > > > > > > > > > backends could be created. There is interest in
> > > > encapsulating
> > > > > > backends
> > > > > > > > > > > in RTOSes or unikernels for solutions like SCMI. There
> > > > interest
> > > > > > in Rust
> > > > > > > > > > > has prompted ideas of using the trait interface to
> abstract
> > > > > > differences
> > > > > > > > > > > away as well as the idea of bare-metal Rust backends.
> > > > > > > > > > >
> > > > > > > > > > > We have a card (STR-12) called "Hypercall
> Standardisation"
> > > > which
> > > > > > > > > > > calls for a description of the APIs needed from the
> > > > hypervisor
> > > > > > side to
> > > > > > > > > > > support VirtIO guests and their backends. However we
> are
> > > > some
> > > > > > way off
> > > > > > > > > > > from that at the moment as I think we need to at least
> > > > > > demonstrate one
> > > > > > > > > > > portable backend before we start codifying
> requirements. To
> > > > that
> > > > > > end I
> > > > > > > > > > > want to think about what we need for a backend to
> function.
> > > > > > > > > > >
> > > > > > > > > > > Configuration
> > > > > > > > > > > =============
> > > > > > > > > > >
> > > > > > > > > > > In the type-2 setup this is typically fairly simple
> because
> > > > the
> > > > > > host
> > > > > > > > > > > system can orchestrate the various modules that make
> up the
> > > > > > complete
> > > > > > > > > > > system. In the type-1 case (or even type-2 with
> delegated
> > > > > > service VMs)
> > > > > > > > > > > we need some sort of mechanism to inform the backend
> VM
> > > > about
> > > > > > key
> > > > > > > > > > > details about the system:
> > > > > > > > > > >
> > > > > > > > > > >   - where virt queue memory is in it's address space
> > > > > > > > > > >   - how it's going to receive (interrupt) and trigger
> (kick)
> > > > > > events
> > > > > > > > > > >   - what (if any) resources the backend needs to
> connect to
> > > > > > > > > > >
> > > > > > > > > > > Obviously you can elide over configuration issues by
> having
> > > > > > static
> > > > > > > > > > > configurations and baking the assumptions into your
> guest
> > > > images
> > > > > > however
> > > > > > > > > > > this isn't scalable in the long term. The obvious
> solution
> > > > seems
> > > > > > to be
> > > > > > > > > > > extending a subset of Device Tree data to user space
> but
> > > > perhaps
> > > > > > there
> > > > > > > > > > > are other approaches?
> > > > > > > > > > >
> > > > > > > > > > > Before any virtio transactions can take place the
> > > > appropriate
> > > > > > memory
> > > > > > > > > > > mappings need to be made between the FE guest and the
> BE
> > > > guest.
> > > > > > > > > >
> > > > > > > > > > > Currently the whole of the FE guests address space
> needs to
> > > > be
> > > > > > visible
> > > > > > > > > > > to whatever is serving the virtio requests. I can
> envision 3
> > > > > > approaches:
> > > > > > > > > > >
> > > > > > > > > > >  * BE guest boots with memory already mapped
> > > > > > > > > > >
> > > > > > > > > > >  This would entail the guest OS knowing where in it's
> Guest
> > > > > > Physical
> > > > > > > > > > >  Address space is already taken up and avoiding
> clashing. I
> > > > > > would assume
> > > > > > > > > > >  in this case you would want a standard interface to
> > > > userspace
> > > > > > to then
> > > > > > > > > > >  make that address space visible to the backend daemon.
> > > > > > > > >
> > > > > > > > > Yet another way here is that we would have well known
> "shared
> > > > > > memory" between
> > > > > > > > > VMs. I think that Jailhouse's ivshmem gives us good
> insights on
> > > > this
> > > > > > matter
> > > > > > > > > and that it can even be an alternative for hypervisor-
> agnostic
> > > > > > solution.
> > > > > > > > >
> > > > > > > > > (Please note memory regions in ivshmem appear as a PCI
> device
> > > > and
> > > > > > can be
> > > > > > > > > mapped locally.)
> > > > > > > > >
> > > > > > > > > I want to add this shared memory aspect to my virtio-proxy,
> but
> > > > > > > > > the resultant solution would eventually look similar to
> ivshmem.
> > > > > > > > >
> > > > > > > > > > >  * BE guests boots with a hypervisor handle to memory
> > > > > > > > > > >
> > > > > > > > > > >  The BE guest is then free to map the FE's memory to
> where
> > > > it
> > > > > > wants in
> > > > > > > > > > >  the BE's guest physical address space.
> > > > > > > > > >
> > > > > > > > > > I cannot see how this could work for Xen. There is no
> "handle"
> > > > to
> > > > > > give
> > > > > > > > > > to the backend if the backend is not running in dom0. So
> for
> > > > Xen I
> > > > > > think
> > > > > > > > > > the memory has to be already mapped
> > > > > > > > >
> > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following
> information
> > > > is
> > > > > > expected
> > > > > > > > > to be exposed to BE via Xenstore:
> > > > > > > > > (I know that this is a tentative approach though.)
> > > > > > > > >    - the start address of configuration space
> > > > > > > > >    - interrupt number
> > > > > > > > >    - file path for backing storage
> > > > > > > > >    - read-only flag
> > > > > > > > > And the BE server have to call a particular hypervisor
> interface
> > > > to
> > > > > > > > > map the configuration space.
> > > > > > > >
> > > > > > > > Yes, Xenstore was chosen as a simple way to pass
> configuration
> > > > info to
> > > > > > the backend running in a non-toolstack domain.
> > > > > > > > I remember, there was a wish to avoid using Xenstore in
> Virtio
> > > > backend
> > > > > > itself if possible, so for non-toolstack domain, this could done
> with
> > > > > > adjusting devd (daemon that listens for devices and launches
> backends)
> > > > > > > > to read backend configuration from the Xenstore anyway and
> pass it
> > > > to
> > > > > > the backend via command line arguments.
> > > > > > > >
> > > > > > >
> > > > > > > Yes, in current PoC code we're using xenstore to pass device
> > > > > > configuration.
> > > > > > > We also designed a static device configuration parse method
> for
> > > > Dom0less
> > > > > > or
> > > > > > > other scenarios don't have xentool. yes, it's from device
> model
> > > > command
> > > > > > line
> > > > > > > or a config file.
> > > > > > >
> > > > > > > > But, if ...
> > > > > > > >
> > > > > > > > >
> > > > > > > > > In my approach (virtio-proxy), all those Xen (or
> hypervisor)-
> > > > > > specific
> > > > > > > > > stuffs are contained in virtio-proxy, yet another VM, to
> hide
> > > > all
> > > > > > details.
> > > > > > > >
> > > > > > > > ... the solution how to overcome that is already found and
> proven
> > > > to
> > > > > > work then even better.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > # My point is that a "handle" is not mandatory for
> executing
> > > > mapping.
> > > > > > > > >
> > > > > > > > > > and the mapping probably done by the
> > > > > > > > > > toolstack (also see below.) Or we would have to invent a
> new
> > > > Xen
> > > > > > > > > > hypervisor interface and Xen virtual machine privileges
> to
> > > > allow
> > > > > > this
> > > > > > > > > > kind of mapping.
> > > > > > > > >
> > > > > > > > > > If we run the backend in Dom0 that we have no problems
> of
> > > > course.
> > > > > > > > >
> > > > > > > > > One of difficulties on Xen that I found in my approach is
> that
> > > > > > calling
> > > > > > > > > such hypervisor intefaces (registering IOREQ, mapping
> memory) is
> > > > > > only
> > > > > > > > > allowed on BE servers themselvies and so we will have to
> extend
> > > > > > those
> > > > > > > > > interfaces.
> > > > > > > > > This, however, will raise some concern on security and
> privilege
> > > > > > distribution
> > > > > > > > > as Stefan suggested.
> > > > > > > >
> > > > > > > > We also faced policy related issues with Virtio backend
> running in
> > > > > > other than Dom0 domain in a "dummy" xsm mode. In our target
> system we
> > > > run
> > > > > > the backend in a driver
> > > > > > > > domain (we call it DomD) where the underlying H/W resides.
> We
> > > > trust it,
> > > > > > so we wrote policy rules (to be used in "flask" xsm mode) to
> provide
> > > > it
> > > > > > with a little bit more privileges than a simple DomU had.
> > > > > > > > Now it is permitted to issue device-model, resource and
> memory
> > > > > > mappings, etc calls.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > To activate the mapping will
> > > > > > > > > > >  require some sort of hypercall to the hypervisor. I
> can see
> > > > two
> > > > > > options
> > > > > > > > > > >  at this point:
> > > > > > > > > > >
> > > > > > > > > > >   - expose the handle to userspace for daemon/helper
> to
> > > > trigger
> > > > > > the
> > > > > > > > > > >     mapping via existing hypercall interfaces. If
> using a
> > > > helper
> > > > > > you
> > > > > > > > > > >     would have a hypervisor specific one to avoid the
> daemon
> > > > > > having to
> > > > > > > > > > >     care too much about the details or push that
> complexity
> > > > into
> > > > > > a
> > > > > > > > > > >     compile time option for the daemon which would
> result in
> > > > > > different
> > > > > > > > > > >     binaries although a common source base.
> > > > > > > > > > >
> > > > > > > > > > >   - expose a new kernel ABI to abstract the hypercall
> > > > > > differences away
> > > > > > > > > > >     in the guest kernel. In this case the userspace
> would
> > > > > > essentially
> > > > > > > > > > >     ask for an abstract "map guest N memory to
> userspace
> > > > ptr"
> > > > > > and let
> > > > > > > > > > >     the kernel deal with the different hypercall
> interfaces.
> > > > > > This of
> > > > > > > > > > >     course assumes the majority of BE guests would be
> Linux
> > > > > > kernels and
> > > > > > > > > > >     leaves the bare-metal/unikernel approaches to
> their own
> > > > > > devices.
> > > > > > > > > > >
> > > > > > > > > > > Operation
> > > > > > > > > > > =========
> > > > > > > > > > >
> > > > > > > > > > > The core of the operation of VirtIO is fairly simple.
> Once
> > > > the
> > > > > > > > > > > vhost-user feature negotiation is done it's a case of
> > > > receiving
> > > > > > update
> > > > > > > > > > > events and parsing the resultant virt queue for data.
> The
> > > > vhost-
> > > > > > user
> > > > > > > > > > > specification handles a bunch of setup before that
> point,
> > > > mostly
> > > > > > to
> > > > > > > > > > > detail where the virt queues are set up FD's for
> memory and
> > > > > > event
> > > > > > > > > > > communication. This is where the envisioned stub
> process
> > > > would
> > > > > > be
> > > > > > > > > > > responsible for getting the daemon up and ready to run.
> This
> > > > is
> > > > > > > > > > > currently done inside a big VMM like QEMU but I
> suspect a
> > > > modern
> > > > > > > > > > > approach would be to use the rust-vmm vhost crate. It
> would
> > > > then
> > > > > > either
> > > > > > > > > > > communicate with the kernel's abstracted ABI or be re-
> > > > targeted
> > > > > > as a
> > > > > > > > > > > build option for the various hypervisors.
> > > > > > > > > >
> > > > > > > > > > One thing I mentioned before to Alex is that Xen doesn't
> have
> > > > VMMs
> > > > > > the
> > > > > > > > > > way they are typically envisioned and described in other
> > > > > > environments.
> > > > > > > > > > Instead, Xen has IOREQ servers. Each of them connects
> > > > > > independently to
> > > > > > > > > > Xen via the IOREQ interface. E.g. today multiple QEMUs
> could
> > > > be
> > > > > > used as
> > > > > > > > > > emulators for a single Xen VM, each of them connecting
> to Xen
> > > > > > > > > > independently via the IOREQ interface.
> > > > > > > > > >
> > > > > > > > > > The component responsible for starting a daemon and/or
> setting
> > > > up
> > > > > > shared
> > > > > > > > > > interfaces is the toolstack: the xl command and the
> > > > libxl/libxc
> > > > > > > > > > libraries.
> > > > > > > > >
> > > > > > > > > I think that VM configuration management (or orchestration
> in
> > > > > > Startos
> > > > > > > > > jargon?) is a subject to debate in parallel.
> > > > > > > > > Otherwise, is there any good assumption to avoid it right
> now?
> > > > > > > > >
> > > > > > > > > > Oleksandr and others I CCed have been working on ways
> for the
> > > > > > toolstack
> > > > > > > > > > to create virtio backends and setup memory mappings.
> They
> > > > might be
> > > > > > able
> > > > > > > > > > to provide more info on the subject. I do think we miss
> a way
> > > > to
> > > > > > provide
> > > > > > > > > > the configuration to the backend and anything else that
> the
> > > > > > backend
> > > > > > > > > > might require to start doing its job.
> > > > > > > >
> > > > > > > > Yes, some work has been done for the toolstack to handle
> Virtio
> > > > MMIO
> > > > > > devices in
> > > > > > > > general and Virtio block devices in particular. However, it
> has
> > > > not
> > > > > > been upstreaned yet.
> > > > > > > > Updated patches on review now:
> > > > > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-
> send-
> > > > email-
> > > > > > olekstysh@gmail.com/
> > > > > > > >
> > > > > > > > There is an additional (also important) activity to
> improve/fix
> > > > > > foreign memory mapping on Arm which I am also involved in.
> > > > > > > > The foreign memory mapping is proposed to be used for Virtio
> > > > backends
> > > > > > (device emulators) if there is a need to run guest OS completely
> > > > > > unmodified.
> > > > > > > > Of course, the more secure way would be to use grant memory
> > > > mapping.
> > > > > > Brietly, the main difference between them is that with foreign
> mapping
> > > > the
> > > > > > backend
> > > > > > > > can map any guest memory it wants to map, but with grant
> mapping
> > > > it is
> > > > > > allowed to map only what was previously granted by the frontend.
> > > > > > > >
> > > > > > > > So, there might be a problem if we want to pre-map some
> guest
> > > > memory
> > > > > > in advance or to cache mappings in the backend in order to
> improve
> > > > > > performance (because the mapping/unmapping guest pages every
> request
> > > > > > requires a lot of back and forth to Xen + P2M updates). In a
> nutshell,
> > > > > > currently, in order to map a guest page into the backend address
> space
> > > > we
> > > > > > need to steal a real physical page from the backend domain. So,
> with
> > > > the
> > > > > > said optimizations we might end up with no free memory in the
> backend
> > > > > > domain (see XSA-300). And what we try to achieve is to not waste
> a
> > > > real
> > > > > > domain memory at all by providing safe non-allocated-yet (so
> unused)
> > > > > > address space for the foreign (and grant) pages to be mapped
> into,
> > > > this
> > > > > > enabling work implies Xen and Linux (and likely DTB bindings)
> changes.
> > > > > > However, as it turned out, for this to work in a proper and safe
> way
> > > > some
> > > > > > prereq work needs to be done.
> > > > > > > > You can find the related Xen discussion at:
> > > > > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-
> send-
> > > > email-
> > > > > > olekstysh@gmail.com/
> > > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > One question is how to best handle notification and
> kicks.
> > > > The
> > > > > > existing
> > > > > > > > > > > vhost-user framework uses eventfd to signal the daemon
> > > > (although
> > > > > > QEMU
> > > > > > > > > > > is quite capable of simulating them when you use TCG).
> Xen
> > > > has
> > > > > > it's own
> > > > > > > > > > > IOREQ mechanism. However latency is an important
> factor and
> > > > > > having
> > > > > > > > > > > events go through the stub would add quite a lot.
> > > > > > > > > >
> > > > > > > > > > Yeah I think, regardless of anything else, we want the
> > > > backends to
> > > > > > > > > > connect directly to the Xen hypervisor.
> > > > > > > > >
> > > > > > > > > In my approach,
> > > > > > > > >  a) BE -> FE: interrupts triggered by BE calling a
> hypervisor
> > > > > > interface
> > > > > > > > >               via virtio-proxy
> > > > > > > > >  b) FE -> BE: MMIO to config raises events (in event
> channels),
> > > > > > which is
> > > > > > > > >               converted to a callback to BE via virtio-
> proxy
> > > > > > > > >               (Xen's event channel is internnally
> implemented by
> > > > > > interrupts.)
> > > > > > > > >
> > > > > > > > > I don't know what "connect directly" means here, but
> sending
> > > > > > interrupts
> > > > > > > > > to the opposite side would be best efficient.
> > > > > > > > > Ivshmem, I suppose, takes this approach by utilizing PCI's
> msi-x
> > > > > > mechanism.
> > > > > > > >
> > > > > > > > Agree that MSI would be more efficient than SPI...
> > > > > > > > At the moment, in order to notify the frontend, the backend
> issues
> > > > a
> > > > > > specific device-model call to query Xen to inject a
> corresponding SPI
> > > > to
> > > > > > the guest.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Could we consider the kernel internally converting
> IOREQ
> > > > > > messages from
> > > > > > > > > > > the Xen hypervisor to eventfd events? Would this scale
> with
> > > > > > other kernel
> > > > > > > > > > > hypercall interfaces?
> > > > > > > > > > >
> > > > > > > > > > > So any thoughts on what directions are worth
> experimenting
> > > > with?
> > > > > > > > > >
> > > > > > > > > > One option we should consider is for each backend to
> connect
> > > > to
> > > > > > Xen via
> > > > > > > > > > the IOREQ interface. We could generalize the IOREQ
> interface
> > > > and
> > > > > > make it
> > > > > > > > > > hypervisor agnostic. The interface is really trivial and
> easy
> > > > to
> > > > > > add.
> > > > > > > > >
> > > > > > > > > As I said above, my proposal does the same thing that you
> > > > mentioned
> > > > > > here :)
> > > > > > > > > The difference is that I do call hypervisor interfaces via
> > > > virtio-
> > > > > > proxy.
> > > > > > > > >
> > > > > > > > > > The only Xen-specific part is the notification mechanism,
> > > > which is
> > > > > > an
> > > > > > > > > > event channel. If we replaced the event channel with
> something
> > > > > > else the
> > > > > > > > > > interface would be generic. See:
> > > > > > > > > > https://gitlab.com/xen-project/xen/-
> > > > > > /blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > > > > > > >
> > > > > > > > > > I don't think that translating IOREQs to eventfd in the
> kernel
> > > > is
> > > > > > a
> > > > > > > > > > good idea: if feels like it would be extra complexity
> and that
> > > > the
> > > > > > > > > > kernel shouldn't be involved as this is a backend-
> hypervisor
> > > > > > interface.
> > > > > > > > >
> > > > > > > > > Given that we may want to implement BE as a bare-metal
> > > > application
> > > > > > > > > as I did on Zephyr, I don't think that the translation
> would not
> > > > be
> > > > > > > > > a big issue, especially on RTOS's.
> > > > > > > > > It will be some kind of abstraction layer of interrupt
> handling
> > > > > > > > > (or nothing but a callback mechanism).
> > > > > > > > >
> > > > > > > > > > Also, eventfd is very Linux-centric and we are trying to
> > > > design an
> > > > > > > > > > interface that could work well for RTOSes too. If we
> want to
> > > > do
> > > > > > > > > > something different, both OS-agnostic and hypervisor-
> agnostic,
> > > > > > perhaps
> > > > > > > > > > we could design a new interface. One that could be
> > > > implementable
> > > > > > in the
> > > > > > > > > > Xen hypervisor itself (like IOREQ) and of course any
> other
> > > > > > hypervisor
> > > > > > > > > > too.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > There is also another problem. IOREQ is probably not be
> the
> > > > only
> > > > > > > > > > interface needed. Have a look at
> > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2.
> Don't we
> > > > > > also need
> > > > > > > > > > an interface for the backend to inject interrupts into
> the
> > > > > > frontend? And
> > > > > > > > > > if the backend requires dynamic memory mappings of
> frontend
> > > > pages,
> > > > > > then
> > > > > > > > > > we would also need an interface to map/unmap domU pages.
> > > > > > > > >
> > > > > > > > > My proposal document might help here; All the interfaces
> > > > required
> > > > > > for
> > > > > > > > > virtio-proxy (or hypervisor-related interfaces) are listed
> as
> > > > > > > > > RPC protocols :)
> > > > > > > > >
> > > > > > > > > > These interfaces are a lot more problematic than IOREQ:
> IOREQ
> > > > is
> > > > > > tiny
> > > > > > > > > > and self-contained. It is easy to add anywhere. A new
> > > > interface to
> > > > > > > > > > inject interrupts or map pages is more difficult to
> manage
> > > > because
> > > > > > it
> > > > > > > > > > would require changes scattered across the various
> emulators.
> > > > > > > > >
> > > > > > > > > Exactly. I have no confident yet that my approach will
> also
> > > > apply
> > > > > > > > > to other hypervisors than Xen.
> > > > > > > > > Technically, yes, but whether people can accept it or not
> is a
> > > > > > different
> > > > > > > > > matter.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > -Takahiro Akashi
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Oleksandr Tyshchenko

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-26 12:10                     ` Wei Chen
@ 2021-08-30 19:36                       ` Christopher Clark
  2021-08-30 19:53                           ` [virtio-dev] " Christopher Clark
  2021-08-31  6:18                       ` AKASHI Takahiro
  1 sibling, 1 reply; 66+ messages in thread
From: Christopher Clark @ 2021-08-30 19:36 UTC (permalink / raw)
  To: Wei Chen
  Cc: AKASHI Takahiro, Oleksandr Tyshchenko, Stefano Stabellini,
	Alex Bennée, Kaly Xin, Stratos Mailing List, virtio-dev,
	Arnd Bergmann, Viresh Kumar, Stefano Stabellini, stefanha,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Oleksandr Tyshchenko,
	Bertrand Marquis, Artem Mygaiev, Julien Grall, Juergen Gross,
	Paul Durrant, Xen Devel

[-- Attachment #1: Type: text/plain, Size: 45733 bytes --]

Apologies for being late to this thread, but I hope to be able to contribute
to this discussion in a meaningful way. I am grateful for the level of
interest in this topic. I would like to draw your attention to Argo as a
suitable technology for development of VirtIO's hypervisor-agnostic
interfaces.

* Argo is an interdomain communication mechanism in Xen (on x86 and Arm) that
  can send and receive hypervisor-mediated notifications and messages between
  domains (VMs). [1] The hypervisor can enforce Mandatory Access Control over
  all communication between domains. It is derived from the earlier v4v, which
  has been deployed on millions of machines with the HP/Bromium uXen
  hypervisor and with OpenXT.

* Argo has a simple interface with a small number of operations that was
  designed for ease of integration into OS primitives on both Linux (sockets)
  and Windows (ReadFile/WriteFile) [2].
    - A unikernel example of using it has also been developed for XTF. [3]
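
To make the shape of that small set of operations a little more concrete,
here is a minimal sketch, in C, of how a backend might drive Argo: register
a receive ring, block until notified, and reply with sendv. The operation
names follow the public Xen Argo interface (xen/include/public/argo.h), but
the argo_* wrapper functions, the struct layout and the port number below
are illustrative assumptions for this thread, not the real ABI or an
existing library API.

/*
 * Illustrative sketch only: the argo_* wrappers stand in for the plumbing
 * that would issue the Argo hypercall (see xen/include/public/argo.h for
 * the real interface); they are not an existing library API.
 */
#include <stddef.h>
#include <stdint.h>

typedef struct argo_addr {
    uint32_t aport;       /* Argo port, analogous to a service port        */
    uint16_t domain_id;   /* domain id of the ring owner or of the peer    */
} argo_addr_t;

/* Assumed wrappers over the register_ring / sendv / notify operations.    */
extern int argo_register_ring(const argo_addr_t *me, uint16_t partner_domid,
                              size_t ring_bytes);
extern int argo_sendv(const argo_addr_t *me, const argo_addr_t *peer,
                      const void *buf, size_t len, uint32_t message_type);
extern int argo_wait_and_recv(const argo_addr_t *me, void *buf, size_t len);

#define VIRTIO_BE_PORT 4000u   /* illustrative well-known port */

int backend_loop(uint16_t frontend_domid)
{
    argo_addr_t me   = { .aport = VIRTIO_BE_PORT,
                         .domain_id = 0 /* placeholder for "own domain" */ };
    argo_addr_t peer = { .aport = VIRTIO_BE_PORT,
                         .domain_id = frontend_domid };
    uint8_t msg[512];
    int len;

    /* The hypervisor copies incoming messages into this ring; nothing is
     * mapped from, or shared with, the peer domain.                       */
    if (argo_register_ring(&me, frontend_domid, 16 * 4096) < 0)
        return -1;

    for (;;) {
        len = argo_wait_and_recv(&me, msg, sizeof(msg));
        if (len < 0)
            break;
        /* ...decode the request and service the device here...            */
        if (argo_sendv(&me, &peer, msg, (size_t)len, 0) < 0)
            break;
    }
    return 0;
}

The point of the sketch is the communication model: the same loop should be
able to run unchanged on any hypervisor that implements the Argo operations.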

* There has been recent discussion and support in the Xen community for
  making revisions to the Argo interface to make it hypervisor-agnostic, and
  for supporting implementations of Argo on other hypervisors. This will
  enable a single interface for an OS kernel binary to use for inter-VM
  communication that will work on multiple hypervisors -- this applies
  equally to both backend and frontend implementations. [4]

* Here are the design documents for building VirtIO-over-Argo, to support a
  hypervisor-agnostic frontend VirtIO transport driver using Argo.

The Development Plan to build VirtIO virtual device support over Argo
transport:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1

A design for using VirtIO over Argo, describing how VirtIO data structures
and communication are handled over the Argo transport:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo

Diagram (from the above document) showing how VirtIO rings are synchronized
between domains without using shared memory:
https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob-url=true&id=01f7d0e1-7686-4f0b-88e1-457c1d30df40&collection=contentId-1348763698&contextId=1348763698&mimeType=image%2Fpng&name=device-buffer-access-virtio-argo.png&size=243175&width=1106&height=1241

Please note that the above design documents show that the existing VirtIO
device drivers, and both the vring and virtqueue data structures, can be
preserved, while interdomain communication is performed with no shared memory
required for most drivers. (The exceptions, where further design is required,
are devices such as virtual framebuffers, where shared memory regions are
intentionally added to the communication structure beyond the vrings and
virtqueues.)
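
As a very rough illustration of what "no shared memory for most drivers" can
mean in practice, the sketch below shows a frontend-side kick that copies the
payload of one virtqueue element into an Argo message instead of exposing
guest pages to the backend. The message layout and the argo_sendv() and
argo_addr_t names are carried over from the assumed wrappers in the earlier
sketch; they are not the encoding defined in the VirtIO-Argo design documents
above.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed wrapper and address type from the earlier sketch. */
typedef struct argo_addr { uint32_t aport; uint16_t domain_id; } argo_addr_t;
extern int argo_sendv(const argo_addr_t *me, const argo_addr_t *peer,
                      const void *buf, size_t len, uint32_t message_type);

/* Hypothetical on-the-wire form of one virtqueue element: enough for the
 * backend to service the request without mapping any frontend pages.      */
struct vq_element_msg {
    uint16_t vq_index;      /* which virtqueue the element belongs to      */
    uint16_t desc_head;     /* head index, echoed back in the completion   */
    uint32_t payload_len;   /* number of copied payload bytes that follow  */
    uint8_t  payload[4096];
};

/* Called by the frontend transport where a doorbell write or event-channel
 * kick would otherwise happen.                                            */
int argo_kick(const argo_addr_t *fe, const argo_addr_t *be,
              uint16_t vq, uint16_t head,
              const void *data, uint32_t len)
{
    struct vq_element_msg m;

    if (len > sizeof(m.payload))
        return -1;          /* larger requests would need fragmentation    */

    m.vq_index    = vq;
    m.desc_head   = head;
    m.payload_len = len;
    memcpy(m.payload, data, len);

    /* One hypervisor-mediated copy replaces shared-memory access.         */
    return argo_sendv(fe, be, &m,
                      offsetof(struct vq_element_msg, payload) + len, 1);
}

For devices that genuinely need shared regions (the framebuffer case noted
above), this copy-based path is not sufficient on its own, which is why the
design documents call out further work there.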

An analysis of VirtIO and Argo, informing the design:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Argo+as+a+transport+medium+for+VirtIO

* Argo can be used as a communication path for configuration between the
  backend and the toolstack, avoiding the need for a dependency on XenStore,
  which is an advantage for any hypervisor-agnostic design. It is also
  amenable to a notification mechanism that is not based on Xen event
  channels.
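
To make the configuration point concrete, here is a hedged sketch of what a
toolstack-to-backend "create device" message could look like if it were
carried over Argo rather than read from XenStore. The field set loosely
mirrors the items mentioned earlier in this thread for the virtio-blk PoC
(configuration space address, interrupt number, backing file, read-only
flag), but the structure, opcode and port number are purely illustrative
assumptions, not a defined protocol.

#include <stdint.h>

#define CFG_ARGO_PORT      4001u  /* assumed well-known port for config     */
#define CFG_OP_CREATE_BLK  1u

/* Hypothetical wire format for a "create virtio-blk backend" request the
 * toolstack (or a devd-like daemon) could send to the backend VM.          */
struct virtio_be_config {
    uint32_t op;                /* CFG_OP_CREATE_BLK                        */
    uint16_t frontend_domid;    /* guest to be served                       */
    uint16_t notify_method;     /* e.g. 0 = Argo message, 1 = event channel */
    uint64_t cfg_space_base;    /* guest-physical base of config space      */
    uint32_t irq;               /* interrupt to inject into the frontend    */
    uint32_t flags;             /* bit 0: read-only                         */
    char     backing_path[256]; /* disk image path as seen by the BE VM     */
};

Whether such a message is emitted by libxl, by a devd-style daemon, or by
some other orchestration component is exactly the configuration question
raised earlier in the thread, and is independent of the transport used to
carry it.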

* Argo does not use or require shared memory between VMs and provides an
  alternative to the use of foreign shared memory mappings. It avoids some of
  the complexities involved with using grants (eg. XSA-300).

* Argo supports Mandatory Access Control by the hypervisor, satisfying a
  common certification requirement.

* The Argo headers are BSD-licensed and the Xen hypervisor implementation is
  GPLv2 but accessible via the hypercall interface. The licensing should not
  present an obstacle to adoption of Argo in guest software or implementation
  by other hypervisors.

* Since the interface that Argo presents to a guest VM is similar to DMA, a
  VirtIO-Argo frontend transport driver should be able to operate with a
  physical VirtIO-enabled smart-NIC if the toolstack and an Argo-aware
  backend provide support.

The next Xen Community Call is next week and I would be happy to answer
questions about Argo and on this topic. I will also be following this thread.

Christopher
(Argo maintainer, Xen Community)

--------------------------------------------------------------------------------
[1]
An introduction to Argo:
https://static.sched.com/hosted_files/xensummit19/92/Argo%20and%20HMX%20-%20OpenXT%20-%20Christopher%20Clark%20-%20Xen%20Summit%202019.pdf
https://www.youtube.com/watch?v=cnC0Tg3jqJQ
Xen Wiki page for Argo:
https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_for_Xen

[2]
OpenXT Linux Argo driver and userspace library:
https://github.com/openxt/linux-xen-argo

Windows V4V at OpenXT wiki:
https://openxt.atlassian.net/wiki/spaces/DC/pages/14844007/V4V
Windows v4v driver source:
https://github.com/OpenXT/xc-windows/tree/master/xenv4v

HP/Bromium uXen V4V driver:
https://github.com/uxen-virt/uxen/tree/ascara/windows/uxenv4vlib

[3]
v2 of the Argo test unikernel for XTF:
https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg02234.html

[4]
Argo HMX Transport for VirtIO meeting minutes:
https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html

VirtIO-Argo Development wiki page:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1


On Thu, Aug 26, 2021 at 5:11 AM Wei Chen <Wei.Chen@arm.com> wrote:

> Hi Akashi,
>
> > -----Original Message-----
> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > Sent: August 26, 2021 17:41
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>; Kaly
> Xin
> > <Kaly.Xin@arm.com>; Stratos Mailing List <
> stratos-dev@op-lists.linaro.org>;
> > virtio-dev@lists.oasis-open.org; Arnd Bergmann <arnd.bergmann@linaro.org
> >;
> > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> Julien
> > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> >
> > Hi Wei,
> >
> > On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
> > > On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
> > > > Hi Akashi,
> > > >
> > > > > -----Original Message-----
> > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > Sent: August 18, 2021 13:39
> > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > > > > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>;
> > Stratos
> > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > dev@lists.oasis-
> > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > <cvanscha@qti.qualcomm.com>;
> > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> > Jean-
> > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com
> >;
> > Julien
> > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> > Durrant
> > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > >
> > > > > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
> > > > > > Hi Akashi,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > > > Sent: August 17, 2021 16:08
> > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> > Stabellini
> > > > > > > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org
> >;
> > > > > Stratos
> > > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > > > > dev@lists.oasis-
> > > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh
> Kumar
> > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
> Kiszka
> > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > <cvanscha@qti.qualcomm.com>;
> > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org
> >;
> > Jean-
> > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > <Artem_Mygaiev@epam.com>;
> > > > > Julien
> > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> > Durrant
> > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> backends
> > > > > > >
> > > > > > > Hi Wei, Oleksandr,
> > > > > > >
> > > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > Thanks for Stefano to link my kvmtool for Xen proposal here.
> > > > > > > > This proposal is still discussing in Xen and KVM communities.
> > > > > > > > The main work is to decouple the kvmtool from KVM and make
> > > > > > > > other hypervisors can reuse the virtual device
> implementations.
> > > > > > > >
> > > > > > > > In this case, we need to introduce an intermediate hypervisor
> > > > > > > > layer for VMM abstraction, Which is, I think it's very close
> > > > > > > > to stratos' virtio hypervisor agnosticism work.
> > > > > > >
> > > > > > > # My proposal[1] comes from my own idea and doesn't always
> > represent
> > > > > > > # Linaro's view on this subject nor reflect Alex's concerns.
> > > > > Nevertheless,
> > > > > > >
> > > > > > > Your idea and my proposal seem to share the same background.
> > > > > > > Both have the similar goal and currently start with, at first,
> > Xen
> > > > > > > and are based on kvm-tool. (Actually, my work is derived from
> > > > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
> > > > > > >
> > > > > > > In particular, the abstraction of hypervisor interfaces has a
> > same
> > > > > > > set of interfaces (for your "struct vmm_impl" and my "RPC
> > interfaces").
> > > > > > > This is not co-incident as we both share the same origin as I
> > said
> > > > > above.
> > > > > > > And so we will also share the same issues. One of them is a way
> > of
> > > > > > > "sharing/mapping FE's memory". There is some trade-off between
> > > > > > > the portability and the performance impact.
> > > > > > > So we can discuss the topic here in this ML, too.
> > > > > > > (See Alex's original email, too).
> > > > > > >
> > > > > > Yes, I agree.
> > > > > >
> > > > > > > On the other hand, my approach aims to create a "single-binary"
> > > > > solution
> > > > > > > in which the same binary of BE vm could run on any hypervisors.
> > > > > > > Somehow similar to your "proposal-#2" in [2], but in my
> solution,
> > all
> > > > > > > the hypervisor-specific code would be put into another entity
> > (VM),
> > > > > > > named "virtio-proxy" and the abstracted operations are served
> > via RPC.
> > > > > > > (In this sense, BE is hypervisor-agnostic but might have OS
> > > > > dependency.)
> > > > > > > But I know that we need discuss if this is a requirement even
> > > > > > > in Stratos project or not. (Maybe not)
> > > > > > >
> > > > > >
> > > > > > Sorry, I haven't had time to finish reading your virtio-proxy
> > completely
> > > > > > (I will do it ASAP). But from your description, it seems we need
> a
> > > > > > 3rd VM between FE and BE? My concern is that, if my assumption is
> > right,
> > > > > > will it increase the latency in data transport path? Even if
> we're
> > > > > > using some lightweight guest like RTOS or Unikernel,
> > > > >
> > > > > Yes, you're right. But I'm afraid that it is a matter of degree.
> > > > > As far as we execute 'mapping' operations at every fetch of
> payload,
> > > > > we will see latency issue (even in your case) and if we have some
> > solution
> > > > > for it, we won't see it neither in my proposal :)
> > > > >
> > > >
> > > > Oleksandr has sent a proposal to Xen mailing list to reduce this kind
> > > > of "mapping/unmapping" operations. So the latency caused by this
> > behavior
> > > > on Xen may eventually be eliminated, and Linux-KVM doesn't have that
> > problem.
> > >
> > > Obviously, I have not yet caught up there in the discussion.
> > > Which patch specifically?
> >
> > Can you give me the link to the discussion or patch, please?
> >
>
> It's a RFC discussion. We have tested this RFC patch internally.
> https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html
>
> > Thanks,
> > -Takahiro Akashi
> >
> > > -Takahiro Akashi
> > >
> > > > > > > Specifically speaking about kvm-tool, I have a concern about
> its
> > > > > > > license term; Targeting different hypervisors and different OSs
> > > > > > > (which I assume includes RTOS's), the resultant library should
> > be
> > > > > > > license permissive and GPL for kvm-tool might be an issue.
> > > > > > > Any thoughts?
> > > > > > >
> > > > > >
> > > > > > Yes. If user want to implement a FreeBSD device model, but the
> > virtio
> > > > > > library is GPL. Then GPL would be a problem. If we have another
> > good
> > > > > > candidate, I am open to it.
> > > > >
> > > > > I have some candidates, particularly for vq/vring, in my mind:
> > > > > * Open-AMP, or
> > > > > * corresponding Free-BSD code
> > > > >
> > > >
> > > > Interesting, I will look into them : )
> > > >
> > > > Cheers,
> > > > Wei Chen
> > > >
> > > > > -Takahiro Akashi
> > > > >
> > > > >
> > > > > > > -Takahiro Akashi
> > > > > > >
> > > > > > >
> > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > > > > > > August/000548.html
> > > > > > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > >
> > > > > > > >
> > > > > > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > > > > > > > > Sent: August 14, 2021 23:38
> > > > > > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano
> > > > > Stabellini
> > > > > > > <sstabellini@kernel.org>
> > > > > > > > > Cc: Alex Bennée <alex.bennee@linaro.org>; Stratos Mailing
> > List
> > > > > > > <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
> > open.org;
> > > > > Arnd
> > > > > > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
> Kiszka
> > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > <cvanscha@qti.qualcomm.com>;
> > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org
> >;
> > Jean-
> > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > > <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>;
> > Oleksandr
> > > > > > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > <Artem_Mygaiev@epam.com>;
> > > > > Julien
> > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> > Durrant
> > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> > backends
> > > > > > > > >
> > > > > > > > > Hello, all.
> > > > > > > > >
> > > > > > > > > Please see some comments below. And sorry for the possible
> > format
> > > > > > > issues.
> > > > > > > > >
> > > > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> > > > > > > <mailto:takahiro.akashi@linaro.org> wrote:
> > > > > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano
> > Stabellini
> > > > > wrote:
> > > > > > > > > > > CCing people working on Xen+VirtIO and IOREQs. Not
> > trimming
> > > > > the
> > > > > > > original
> > > > > > > > > > > email to let them read the full context.
> > > > > > > > > > >
> > > > > > > > > > > My comments below are related to a potential Xen
> > > > > implementation,
> > > > > > > not
> > > > > > > > > > > because it is the only implementation that matters, but
> > > > > because it
> > > > > > > is
> > > > > > > > > > > the one I know best.
> > > > > > > > > >
> > > > > > > > > > Please note that my proposal (and hence the working
> > prototype)[1]
> > > > > > > > > > is based on Xen's virtio implementation (i.e. IOREQ) and
> > > > > > > particularly
> > > > > > > > > > EPAM's virtio-disk application (backend server).
> > > > > > > > > > It has been, I believe, well generalized but is still a
> > bit
> > > > > biased
> > > > > > > > > > toward this original design.
> > > > > > > > > >
> > > > > > > > > > So I hope you like my approach :)
> > > > > > > > > >
> > > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
> > dev/2021-
> > > > > > > August/000546.html
> > > > > > > > > >
> > > > > > > > > > Let me take this opportunity to explain a bit more about
> > my
> > > > > approach
> > > > > > > below.
> > > > > > > > > >
> > > > > > > > > > > Also, please see this relevant email thread:
> > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > One of the goals of Project Stratos is to enable
> > hypervisor
> > > > > > > agnostic
> > > > > > > > > > > > backends so we can enable as much re-use of code as
> > possible
> > > > > and
> > > > > > > avoid
> > > > > > > > > > > > repeating ourselves. This is the flip side of the
> > front end
> > > > > > > where
> > > > > > > > > > > > multiple front-end implementations are required - one
> > per OS,
> > > > > > > assuming
> > > > > > > > > > > > you don't just want Linux guests. The resultant
> guests
> > are
> > > > > > > trivially
> > > > > > > > > > > > movable between hypervisors modulo any abstracted
> > paravirt
> > > > > type
> > > > > > > > > > > > interfaces.
> > > > > > > > > > > >
> > > > > > > > > > > > In my original thumb nail sketch of a solution I
> > envisioned
> > > > > > > vhost-user
> > > > > > > > > > > > daemons running in a broadly POSIX like environment.
> > The
> > > > > > > interface to
> > > > > > > > > > > > the daemon is fairly simple requiring only some
> mapped
> > > > > memory
> > > > > > > and some
> > > > > > > > > > > > sort of signalling for events (on Linux this is
> > eventfd).
> > > > > The
> > > > > > > idea was a
> > > > > > > > > > > > stub binary would be responsible for any hypervisor
> > specific
> > > > > > > setup and
> > > > > > > > > > > > then launch a common binary to deal with the actual
> > > > > virtqueue
> > > > > > > requests
> > > > > > > > > > > > themselves.
> > > > > > > > > > > >
> > > > > > > > > > > > Since that original sketch we've seen an expansion in
> > the
> > > > > sort
> > > > > > > of ways
> > > > > > > > > > > > backends could be created. There is interest in
> > > > > encapsulating
> > > > > > > backends
> > > > > > > > > > > > in RTOSes or unikernels for solutions like SCMI.
> There
> > > > > interest
> > > > > > > in Rust
> > > > > > > > > > > > has prompted ideas of using the trait interface to
> > abstract
> > > > > > > differences
> > > > > > > > > > > > away as well as the idea of bare-metal Rust backends.
> > > > > > > > > > > >
> > > > > > > > > > > > We have a card (STR-12) called "Hypercall
> > Standardisation"
> > > > > which
> > > > > > > > > > > > calls for a description of the APIs needed from the
> > > > > hypervisor
> > > > > > > side to
> > > > > > > > > > > > support VirtIO guests and their backends. However we
> > are
> > > > > some
> > > > > > > way off
> > > > > > > > > > > > from that at the moment as I think we need to at
> least
> > > > > > > demonstrate one
> > > > > > > > > > > > portable backend before we start codifying
> > requirements. To
> > > > > that
> > > > > > > end I
> > > > > > > > > > > > want to think about what we need for a backend to
> > function.
> > > > > > > > > > > >
> > > > > > > > > > > > Configuration
> > > > > > > > > > > > =============
> > > > > > > > > > > >
> > > > > > > > > > > > In the type-2 setup this is typically fairly simple
> > because
> > > > > the
> > > > > > > host
> > > > > > > > > > > > system can orchestrate the various modules that make
> > up the
> > > > > > > complete
> > > > > > > > > > > > system. In the type-1 case (or even type-2 with
> > delegated
> > > > > > > service VMs)
> > > > > > > > > > > > we need some sort of mechanism to inform the backend
> > VM
> > > > > about
> > > > > > > key
> > > > > > > > > > > > details about the system:
> > > > > > > > > > > >
> > > > > > > > > > > >   - where virt queue memory is in it's address space
> > > > > > > > > > > >   - how it's going to receive (interrupt) and trigger
> > (kick)
> > > > > > > events
> > > > > > > > > > > >   - what (if any) resources the backend needs to
> > connect to
> > > > > > > > > > > >
> > > > > > > > > > > > Obviously you can elide over configuration issues by
> > having
> > > > > > > static
> > > > > > > > > > > > configurations and baking the assumptions into your
> > guest
> > > > > images
> > > > > > > however
> > > > > > > > > > > > this isn't scalable in the long term. The obvious
> > solution
> > > > > seems
> > > > > > > to be
> > > > > > > > > > > > extending a subset of Device Tree data to user space
> > but
> > > > > perhaps
> > > > > > > there
> > > > > > > > > > > > are other approaches?
> > > > > > > > > > > >
> > > > > > > > > > > > Before any virtio transactions can take place the
> > > > > appropriate
> > > > > > > memory
> > > > > > > > > > > > mappings need to be made between the FE guest and the
> > BE
> > > > > guest.
> > > > > > > > > > >
> > > > > > > > > > > > Currently the whole of the FE guests address space
> > needs to
> > > > > be
> > > > > > > visible
> > > > > > > > > > > > to whatever is serving the virtio requests. I can
> > envision 3
> > > > > > > approaches:
> > > > > > > > > > > >
> > > > > > > > > > > >  * BE guest boots with memory already mapped
> > > > > > > > > > > >
> > > > > > > > > > > >  This would entail the guest OS knowing where in it's
> > Guest
> > > > > > > Physical
> > > > > > > > > > > >  Address space is already taken up and avoiding
> > clashing. I
> > > > > > > would assume
> > > > > > > > > > > >  in this case you would want a standard interface to
> > > > > userspace
> > > > > > > to then
> > > > > > > > > > > >  make that address space visible to the backend
> daemon.
> > > > > > > > > >
> > > > > > > > > > Yet another way here is that we would have well known
> > "shared
> > > > > > > memory" between
> > > > > > > > > > VMs. I think that Jailhouse's ivshmem gives us good
> > insights on
> > > > > this
> > > > > > > matter
> > > > > > > > > > and that it can even be an alternative for a hypervisor-
> > agnostic
> > > > > > > solution.
> > > > > > > > > >
> > > > > > > > > > (Please note memory regions in ivshmem appear as a PCI
> > device
> > > > > and
> > > > > > > can be
> > > > > > > > > > mapped locally.)
> > > > > > > > > >
> > > > > > > > > > I want to add this shared memory aspect to my
> virtio-proxy,
> > but
> > > > > > > > > > the resultant solution would eventually look similar to
> > ivshmem.
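
As a concrete illustration of the ivshmem-style option: from a Linux userspace
backend, reaching such a region is little more than an mmap() of the PCI
resource file. A minimal sketch (the PCI address, BAR number and size below
are made up for illustration):

    /* Sketch: map an ivshmem-style shared-memory BAR into userspace.
     * The PCI address, BAR index and size are illustrative only. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* BAR2 of an ivshmem device usually carries the shared-memory region */
        const char *res = "/sys/bus/pci/devices/0000:00:05.0/resource2";
        int fd = open(res, O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        size_t len = 4 * 1024 * 1024;            /* assumed region size */
        void *shm = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (shm == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* ... place/consume virtqueue data in shm ... */

        munmap(shm, len);
        close(fd);
        return 0;
    }
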
> > > > > > > > > >
> > > > > > > > > > > >  * BE guests boots with a hypervisor handle to memory
> > > > > > > > > > > >
> > > > > > > > > > > >  The BE guest is then free to map the FE's memory to
> > where
> > > > > it
> > > > > > > wants in
> > > > > > > > > > > >  the BE's guest physical address space.
> > > > > > > > > > >
> > > > > > > > > > > I cannot see how this could work for Xen. There is no
> > "handle"
> > > > > to
> > > > > > > give
> > > > > > > > > > > to the backend if the backend is not running in dom0.
> So
> > for
> > > > > Xen I
> > > > > > > think
> > > > > > > > > > > the memory has to be already mapped
> > > > > > > > > >
> > > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following
> > information
> > > > > is
> > > > > > > expected
> > > > > > > > > > to be exposed to BE via Xenstore:
> > > > > > > > > > (I know that this is a tentative approach though.)
> > > > > > > > > >    - the start address of configuration space
> > > > > > > > > >    - interrupt number
> > > > > > > > > >    - file path for backing storage
> > > > > > > > > >    - read-only flag
> > > > > > > > > > And the BE server has to call a particular hypervisor
> > interface
> > > > > to
> > > > > > > > > > map the configuration space.
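
To illustrate the tentative approach above: the xs_open()/xs_read() calls
below are the real libxenstore API, but the key paths are hypothetical and
only sketch how a backend might pull that configuration out of Xenstore:

    /* Sketch: read tentative virtio-blk backend parameters from Xenstore.
     * Only the libxenstore calls are real; the paths are invented. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <xenstore.h>

    static char *read_key(struct xs_handle *xs, const char *path)
    {
        unsigned int len;
        char *val = xs_read(xs, XBT_NULL, path, &len);   /* caller frees */
        if (!val)
            fprintf(stderr, "missing %s\n", path);
        return val;
    }

    int main(void)
    {
        struct xs_handle *xs = xs_open(0);
        if (!xs) { perror("xs_open"); return 1; }

        char *base = read_key(xs, "backend/virtio-blk/1/0/base");      /* config space  */
        char *irq  = read_key(xs, "backend/virtio-blk/1/0/irq");       /* interrupt     */
        char *img  = read_key(xs, "backend/virtio-blk/1/0/params");    /* backing store */
        char *ro   = read_key(xs, "backend/virtio-blk/1/0/read-only"); /* RO flag       */

        /* ... hand these to the device model and map the config space ... */

        free(base); free(irq); free(img); free(ro);
        xs_close(xs);
        return 0;
    }
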
> > > > > > > > >
> > > > > > > > > Yes, Xenstore was chosen as a simple way to pass
> > configuration
> > > > > info to
> > > > > > > the backend running in a non-toolstack domain.
> > > > > > > > > I remember, there was a wish to avoid using Xenstore in
> > Virtio
> > > > > backend
> > > > > > > itself if possible, so for non-toolstack domain, this could
> be done
> > with
> > > > > > > adjusting devd (daemon that listens for devices and launches
> > backends)
> > > > > > > > > to read backend configuration from the Xenstore anyway and
> > pass it
> > > > > to
> > > > > > > the backend via command line arguments.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes, in current PoC code we're using xenstore to pass device
> > > > > > > configuration.
> > > > > > > > We also designed a static device configuration parse method for
> > > > > > > > Dom0less or other scenarios that don't have the Xen toolstack; yes,
> > > > > > > > it comes from the device model command line or a config file.
> > > > > > > >
> > > > > > > > > But, if ...
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > In my approach (virtio-proxy), all those Xen (or
> > hypervisor)-
> > > > > > > specific
> > > > > > > > > > stuffs are contained in virtio-proxy, yet another VM, to
> > hide
> > > > > all
> > > > > > > details.
> > > > > > > > >
> > > > > > > > > ... the solution how to overcome that is already found and
> > proven
> > > > > to
> > > > > > > work then even better.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > # My point is that a "handle" is not mandatory for
> > executing
> > > > > mapping.
> > > > > > > > > >
> > > > > > > > > > > and the mapping probably done by the
> > > > > > > > > > > toolstack (also see below.) Or we would have to invent
> a
> > new
> > > > > Xen
> > > > > > > > > > > hypervisor interface and Xen virtual machine privileges
> > to
> > > > > allow
> > > > > > > this
> > > > > > > > > > > kind of mapping.
> > > > > > > > > >
> > > > > > > > > > > If we run the backend in Dom0 that we have no problems
> > of
> > > > > course.
> > > > > > > > > >
> > > > > > > > > > One of the difficulties on Xen that I found in my approach is
> > that
> > > > > > > calling
> > > > > > > > > > such hypervisor interfaces (registering IOREQ, mapping
> > memory) is
> > > > > > > only
> > > > > > > > > > allowed on BE servers themselves and so we will have to
> > extend
> > > > > > > those
> > > > > > > > > > interfaces.
> > > > > > > > > > This, however, will raise some concern on security and
> > privilege
> > > > > > > distribution
> > > > > > > > > > as Stefan suggested.
> > > > > > > > >
> > > > > > > > > We also faced policy-related issues with a Virtio backend
> > > > > > > > > running in a domain other than Dom0 in a "dummy" xsm mode. In our
> > > > > > > > > target system we run the backend in a driver
> > > > > > > > > domain (we call it DomD) where the underlying H/W resides.
> > We
> > > > > trust it,
> > > > > > > so we wrote policy rules (to be used in "flask" xsm mode) to
> > provide
> > > > > it
> > > > > > > with a little bit more privileges than a simple DomU had.
> > > > > > > > > Now it is permitted to issue device-model, resource and memory
> > > > > > > > > mapping calls, etc.
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > To activate the mapping will
> > > > > > > > > > > >  require some sort of hypercall to the hypervisor. I
> > can see
> > > > > two
> > > > > > > options
> > > > > > > > > > > >  at this point:
> > > > > > > > > > > >
> > > > > > > > > > > >   - expose the handle to userspace for daemon/helper
> > to
> > > > > trigger
> > > > > > > the
> > > > > > > > > > > >     mapping via existing hypercall interfaces. If
> > using a
> > > > > helper
> > > > > > > you
> > > > > > > > > > > >     would have a hypervisor specific one to avoid the
> > daemon
> > > > > > > having to
> > > > > > > > > > > >     care too much about the details or push that
> > complexity
> > > > > into
> > > > > > > a
> > > > > > > > > > > >     compile time option for the daemon which would
> > result in
> > > > > > > different
> > > > > > > > > > > >     binaries although a common source base.
> > > > > > > > > > > >
> > > > > > > > > > > >   - expose a new kernel ABI to abstract the hypercall
> > > > > > > differences away
> > > > > > > > > > > >     in the guest kernel. In this case the userspace
> > would
> > > > > > > essentially
> > > > > > > > > > > >     ask for an abstract "map guest N memory to
> > userspace
> > > > > ptr"
> > > > > > > and let
> > > > > > > > > > > >     the kernel deal with the different hypercall
> > interfaces.
> > > > > > > This of
> > > > > > > > > > > >     course assumes the majority of BE guests would be
> > Linux
> > > > > > > kernels and
> > > > > > > > > > > >     leaves the bare-metal/unikernel approaches to
> > their own
> > > > > > > devices.
> > > > > > > > > > > >
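
For the second option, the userspace side of such an abstracted kernel ABI
might look roughly like the sketch below; the device node, ioctl number and
request structure are entirely hypothetical and exist only to show the shape
of a "map guest N memory to userspace ptr" interface:

    /* Hypothetical, hypervisor-agnostic "map guest memory" ABI.  Nothing here
     * exists today: /dev/vhost-agnostic, the ioctl and the struct are invented. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    struct guest_map_req {
        uint32_t guest_id;   /* which FE guest */
        uint64_t gpa;        /* guest physical address of the region */
        uint64_t size;       /* length to map */
    };

    #define GUEST_MAP_IOCTL _IOW('G', 0x01, struct guest_map_req)

    static void *map_guest_region(uint32_t guest_id, uint64_t gpa, uint64_t size)
    {
        int fd = open("/dev/vhost-agnostic", O_RDWR);
        if (fd < 0)
            return NULL;

        struct guest_map_req req = { .guest_id = guest_id, .gpa = gpa, .size = size };
        if (ioctl(fd, GUEST_MAP_IOCTL, &req) < 0)
            return NULL;

        /* the kernel hides the Xen/KVM/... specific hypercalls behind this fd */
        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }
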
> > > > > > > > > > > > Operation
> > > > > > > > > > > > =========
> > > > > > > > > > > >
> > > > > > > > > > > > The core of the operation of VirtIO is fairly simple.
> > Once
> > > > > the
> > > > > > > > > > > > vhost-user feature negotiation is done it's a case of
> > > > > receiving
> > > > > > > update
> > > > > > > > > > > > events and parsing the resultant virt queue for data.
> > The
> > > > > vhost-
> > > > > > > user
> > > > > > > > > > > > specification handles a bunch of setup before that
> > point,
> > > > > mostly
> > > > > > > to
> > > > > > > > > > > > detail where the virt queues are set up FD's for
> > memory and
> > > > > > > event
> > > > > > > > > > > > communication. This is where the envisioned stub
> > process
> > > > > would
> > > > > > > be
> > > > > > > > > > > > responsible for getting the daemon up and ready to
> run.
> > This
> > > > > is
> > > > > > > > > > > > currently done inside a big VMM like QEMU but I
> > suspect a
> > > > > modern
> > > > > > > > > > > > approach would be to use the rust-vmm vhost crate. It
> > would
> > > > > then
> > > > > > > either
> > > > > > > > > > > > communicate with the kernel's abstracted ABI or be
> re-
> > > > > targeted
> > > > > > > as a
> > > > > > > > > > > > build option for the various hypervisors.
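
The stub-process split described above can be very small; in the sketch below
only the exec-style hand-off of file descriptors is the point, and
hypervisor_setup() and the daemon name are hypothetical placeholders:

    /* Sketch: a per-hypervisor stub performs platform-specific setup, then
     * execs a common, hypervisor-agnostic vhost-user daemon, passing it the
     * resources it prepared.  Names below are invented. */
    #include <stdio.h>
    #include <unistd.h>

    static int hypervisor_setup(int *mem_fd, int *evt_fd)
    {
        /* Replace with Xen/KVM/... specific code: map guest memory, create a
         * notification primitive, etc.  Failing keeps the sketch self-contained. */
        (void)mem_fd; (void)evt_fd;
        return -1;
    }

    int main(void)
    {
        int mem_fd, evt_fd;
        if (hypervisor_setup(&mem_fd, &evt_fd) < 0) {
            fprintf(stderr, "hypervisor specific setup failed\n");
            return 1;
        }

        char mem_arg[16], evt_arg[16];
        snprintf(mem_arg, sizeof(mem_arg), "%d", mem_fd);
        snprintf(evt_arg, sizeof(evt_arg), "%d", evt_fd);

        /* the common daemon only ever sees plain fds and mapped memory */
        execlp("vhost-device-common", "vhost-device-common",
               "--mem-fd", mem_arg, "--event-fd", evt_arg, (char *)NULL);
        perror("execlp");
        return 1;
    }
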
> > > > > > > > > > >
> > > > > > > > > > > One thing I mentioned before to Alex is that Xen
> doesn't
> > have
> > > > > VMMs
> > > > > > > the
> > > > > > > > > > > way they are typically envisioned and described in
> other
> > > > > > > environments.
> > > > > > > > > > > Instead, Xen has IOREQ servers. Each of them connects
> > > > > > > independently to
> > > > > > > > > > > Xen via the IOREQ interface. E.g. today multiple QEMUs
> > could
> > > > > be
> > > > > > > used as
> > > > > > > > > > > emulators for a single Xen VM, each of them connecting
> > to Xen
> > > > > > > > > > > independently via the IOREQ interface.
> > > > > > > > > > >
> > > > > > > > > > > The component responsible for starting a daemon and/or
> > setting
> > > > > up
> > > > > > > shared
> > > > > > > > > > > interfaces is the toolstack: the xl command and the
> > > > > libxl/libxc
> > > > > > > > > > > libraries.
> > > > > > > > > >
> > > > > > > > > > I think that VM configuration management (or
> orchestration
> > in
> > > > > > > Stratos
> > > > > > > > > > jargon?) is a subject to debate in parallel.
> > > > > > > > > > Otherwise, is there any good assumption to avoid it right
> > now?
> > > > > > > > > >
> > > > > > > > > > > Oleksandr and others I CCed have been working on ways
> > for the
> > > > > > > toolstack
> > > > > > > > > > > to create virtio backends and setup memory mappings.
> > They
> > > > > might be
> > > > > > > able
> > > > > > > > > > > to provide more info on the subject. I do think we miss
> > a way
> > > > > to
> > > > > > > provide
> > > > > > > > > > > the configuration to the backend and anything else that
> > the
> > > > > > > backend
> > > > > > > > > > > might require to start doing its job.
> > > > > > > > >
> > > > > > > > > Yes, some work has been done for the toolstack to handle
> > Virtio
> > > > > MMIO
> > > > > > > devices in
> > > > > > > > > general and Virtio block devices in particular. However, it
> > has
> > > > > not
> > > > > > > been upstreamed yet.
> > > > > > > > > Updated patches on review now:
> > > > > > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-
> > send-
> > > > > email-
> > > > > > > olekstysh@gmail.com/
> > > > > > > > >
> > > > > > > > > There is an additional (also important) activity to
> > improve/fix
> > > > > > > foreign memory mapping on Arm which I am also involved in.
> > > > > > > > > The foreign memory mapping is proposed to be used for
> Virtio
> > > > > backends
> > > > > > > (device emulators) if there is a need to run guest OS
> completely
> > > > > > > unmodified.
> > > > > > > > > Of course, the more secure way would be to use grant memory
> > > > > mapping.
> > > > > > > Briefly, the main difference between them is that with foreign
> > mapping
> > > > > the
> > > > > > > backend
> > > > > > > > > can map any guest memory it wants to map, but with grant
> > mapping
> > > > > it is
> > > > > > > allowed to map only what was previously granted by the
> frontend.
> > > > > > > > >
> > > > > > > > > So, there might be a problem if we want to pre-map some
> > guest
> > > > > memory
> > > > > > > in advance or to cache mappings in the backend in order to
> > improve
> > > > > > > performance (because mapping/unmapping guest pages on every
> > request
> > > > > > > requires a lot of back and forth to Xen + P2M updates). In a
> > nutshell,
> > > > > > > currently, in order to map a guest page into the backend
> address
> > space
> > > > > we
> > > > > > > need to steal a real physical page from the backend domain. So,
> > with
> > > > > the
> > > > > > > said optimizations we might end up with no free memory in the
> > backend
> > > > > > > domain (see XSA-300). And what we try to achieve is to not
> waste
> > a
> > > > > real
> > > > > > > domain memory at all by providing safe non-allocated-yet (so
> > unused)
> > > > > > > address space for the foreign (and grant) pages to be mapped into;
> > > > > > > this enabling work implies Xen and Linux (and likely DTB bindings)
> > > > > > > changes.
> > > > > > > However, as it turned out, for this to work in a proper and
> safe
> > way
> > > > > some
> > > > > > > prereq work needs to be done.
> > > > > > > > > You can find the related Xen discussion at:
> > > > > > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-
> > send-
> > > > > email-
> > > > > > > olekstysh@gmail.com/
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > One question is how to best handle notification and
> > kicks.
> > > > > The
> > > > > > > existing
> > > > > > > > > > > > vhost-user framework uses eventfd to signal the
> daemon
> > > > > (although
> > > > > > > QEMU
> > > > > > > > > > > > is quite capable of simulating them when you use
> TCG).
> > Xen
> > > > > has
> > > > > > > its own
> > > > > > > > > > > > IOREQ mechanism. However latency is an important
> > factor and
> > > > > > > having
> > > > > > > > > > > > events go through the stub would add quite a lot.
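
For reference, the eventfd signalling that vhost-user relies on really is
tiny, which is part of why it is tempting to keep the same shape on other
hypervisors; a minimal Linux kick/notify example:

    /* Minimal demonstration of the eventfd "kick" used by vhost-user on Linux:
     * one side writes a counter value, the other reads (and clears) it. */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
        int kick = eventfd(0, 0);
        if (kick < 0) { perror("eventfd"); return 1; }

        uint64_t one = 1;
        if (write(kick, &one, sizeof(one)) != sizeof(one))   /* "kick" the queue */
            perror("write");

        uint64_t n = 0;
        if (read(kick, &n, sizeof(n)) == sizeof(n))          /* backend wakes up */
            printf("received %llu kick(s)\n", (unsigned long long)n);

        close(kick);
        return 0;
    }
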
> > > > > > > > > > >
> > > > > > > > > > > Yeah I think, regardless of anything else, we want the
> > > > > backends to
> > > > > > > > > > > connect directly to the Xen hypervisor.
> > > > > > > > > >
> > > > > > > > > > In my approach,
> > > > > > > > > >  a) BE -> FE: interrupts triggered by BE calling a
> > hypervisor
> > > > > > > interface
> > > > > > > > > >               via virtio-proxy
> > > > > > > > > >  b) FE -> BE: MMIO to config raises events (in event
> > channels),
> > > > > > > which is
> > > > > > > > > >               converted to a callback to BE via virtio-
> > proxy
> > > > > > > > > >               (Xen's event channel is internally
> > implemented by
> > > > > > > interrupts.)
> > > > > > > > > >
> > > > > > > > > > I don't know what "connect directly" means here, but
> > sending
> > > > > > > interrupts
> > > > > > > > > > to the opposite side would be most efficient.
> > > > > > > > > > Ivshmem, I suppose, takes this approach by utilizing
> PCI's
> > msi-x
> > > > > > > mechanism.
> > > > > > > > >
> > > > > > > > > Agree that MSI would be more efficient than SPI...
> > > > > > > > > At the moment, in order to notify the frontend, the backend
> > issues
> > > > > a
> > > > > > > specific device-model call to query Xen to inject a
> > corresponding SPI
> > > > > to
> > > > > > > the guest.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Could we consider the kernel internally converting
> > IOREQ
> > > > > > > messages from
> > > > > > > > > > > > the Xen hypervisor to eventfd events? Would this
> scale
> > with
> > > > > > > other kernel
> > > > > > > > > > > > hypercall interfaces?
> > > > > > > > > > > >
> > > > > > > > > > > > So any thoughts on what directions are worth
> > experimenting
> > > > > with?
> > > > > > > > > > >
> > > > > > > > > > > One option we should consider is for each backend to
> > connect
> > > > > to
> > > > > > > Xen via
> > > > > > > > > > > the IOREQ interface. We could generalize the IOREQ
> > interface
> > > > > and
> > > > > > > make it
> > > > > > > > > > > hypervisor agnostic. The interface is really trivial
> and
> > easy
> > > > > to
> > > > > > > add.
> > > > > > > > > >
> > > > > > > > > > As I said above, my proposal does the same thing that you
> > > > > mentioned
> > > > > > > here :)
> > > > > > > > > > The difference is that I do call hypervisor interfaces
> via
> > > > > virtio-
> > > > > > > proxy.
> > > > > > > > > >
> > > > > > > > > > > The only Xen-specific part is the notification
> mechanism,
> > > > > which is
> > > > > > > an
> > > > > > > > > > > event channel. If we replaced the event channel with
> > something
> > > > > > > else the
> > > > > > > > > > > interface would be generic. See:
> > > > > > > > > > > https://gitlab.com/xen-project/xen/-
> > > > > > > /blob/staging/xen/include/public/hvm/ioreq.h#L52
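
To make the "IOREQ is tiny" point concrete, a hypervisor-agnostic request
record would only need roughly the fields below. This is a sketch loosely
modelled on, not copied from, the Xen header linked above:

    /* Sketch of a generic, hypervisor-agnostic I/O request record.  Field
     * names and sizes are illustrative; the authoritative Xen definition is
     * in the ioreq.h linked above. */
    #include <stdint.h>

    enum io_dir  { IO_READ, IO_WRITE };
    enum io_kind { IO_MMIO, IO_PIO, IO_CONFIG };

    struct io_request {
        uint64_t addr;    /* guest physical (or port) address accessed   */
        uint64_t data;    /* value written, or place for the read result */
        uint32_t size;    /* access width in bytes                       */
        uint32_t vcpu;    /* which vCPU trapped                          */
        uint8_t  dir;     /* enum io_dir                                 */
        uint8_t  kind;    /* enum io_kind                                */
        uint8_t  state;   /* e.g. pending -> in-service -> responded     */
        uint8_t  pad;
    };
    /* The only hypervisor-specific piece left is how "state" transitions are
     * signalled (Xen: event channel; elsewhere: eventfd, doorbell, ...). */
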
> > > > > > > > > > >
> > > > > > > > > > > I don't think that translating IOREQs to eventfd in the
> > kernel
> > > > > is
> > > > > > > a
> > > > > > > > > > > good idea: it feels like it would be extra complexity
> > and that
> > > > > the
> > > > > > > > > > > kernel shouldn't be involved as this is a backend-
> > hypervisor
> > > > > > > interface.
> > > > > > > > > >
> > > > > > > > > > Given that we may want to implement BE as a bare-metal
> > > > > application
> > > > > > > > > > as I did on Zephyr, I don't think that the translation
> > would
> > > > > be
> > > > > > > > > > a big issue, especially on RTOS's.
> > > > > > > > > > It will be some kind of abstraction layer of interrupt
> > handling
> > > > > > > > > > (or nothing but a callback mechanism).
> > > > > > > > > >
> > > > > > > > > > > Also, eventfd is very Linux-centric and we are trying
> to
> > > > > design an
> > > > > > > > > > > interface that could work well for RTOSes too. If we
> > want to
> > > > > do
> > > > > > > > > > > something different, both OS-agnostic and hypervisor-
> > agnostic,
> > > > > > > perhaps
> > > > > > > > > > > we could design a new interface. One that could be
> > > > > implementable
> > > > > > > in the
> > > > > > > > > > > Xen hypervisor itself (like IOREQ) and of course any
> > other
> > > > > > > hypervisor
> > > > > > > > > > > too.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > There is also another problem. IOREQ is probably not
> > the
> > > > > only
> > > > > > > > > > > interface needed. Have a look at
> > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2.
> > Don't we
> > > > > > > also need
> > > > > > > > > > > an interface for the backend to inject interrupts into
> > the
> > > > > > > frontend? And
> > > > > > > > > > > if the backend requires dynamic memory mappings of
> > frontend
> > > > > pages,
> > > > > > > then
> > > > > > > > > > > we would also need an interface to map/unmap domU
> pages.
> > > > > > > > > >
> > > > > > > > > > My proposal document might help here; all the interfaces
> > > > > required
> > > > > > > for
> > > > > > > > > > virtio-proxy (or hypervisor-related interfaces) are
> listed
> > as
> > > > > > > > > > RPC protocols :)
> > > > > > > > > >
> > > > > > > > > > > These interfaces are a lot more problematic than IOREQ:
> > IOREQ
> > > > > is
> > > > > > > tiny
> > > > > > > > > > > and self-contained. It is easy to add anywhere. A new
> > > > > interface to
> > > > > > > > > > > inject interrupts or map pages is more difficult to
> > manage
> > > > > because
> > > > > > > it
> > > > > > > > > > > would require changes scattered across the various
> > emulators.
> > > > > > > > > >
> > > > > > > > > > Exactly. I am not confident yet that my approach will
> > also
> > > > > apply
> > > > > > > > > > to hypervisors other than Xen.
> > > > > > > > > > Technically, yes, but whether people can accept it or not
> > is a
> > > > > > > different
> > > > > > > > > > matter.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > -Takahiro Akashi
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Oleksandr Tyshchenko

[-- Attachment #2: Type: text/html, Size: 74694 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-30 19:36                       ` Christopher Clark
@ 2021-08-30 19:53                           ` Christopher Clark
  0 siblings, 0 replies; 66+ messages in thread
From: Christopher Clark @ 2021-08-30 19:53 UTC (permalink / raw)
  To: Wei Chen
  Cc: AKASHI Takahiro, Oleksandr Tyshchenko, Stefano Stabellini,
	Alex Bennée, Kaly Xin, Stratos Mailing List, virtio-dev,
	Arnd Bergmann, Viresh Kumar, Stefano Stabellini, stefanha,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Oleksandr Tyshchenko,
	Bertrand Marquis, Artem Mygaiev, Julien Grall, Juergen Gross,
	Paul Durrant, Xen Devel, Rich Persaud, Daniel Smith

[-- Attachment #1: Type: text/plain, Size: 47054 bytes --]

[ resending message to ensure delivery to the CCd mailing lists
post-subscription ]

Apologies for being late to this thread, but I hope to be able to contribute
to this discussion in a meaningful way. I am grateful for the level of
interest in this topic. I would like to draw your attention to Argo as a
suitable technology for development of VirtIO's hypervisor-agnostic interfaces.

* Argo is an interdomain communication mechanism in Xen (on x86 and Arm) that
  can send and receive hypervisor-mediated notifications and messages between
  domains (VMs). [1] The hypervisor can enforce Mandatory Access Control over
  all communication between domains. It is derived from the earlier v4v, which
  has been deployed on millions of machines with the HP/Bromium uXen hypervisor
  and with OpenXT.

* Argo has a simple interface with a small number of operations that was
  designed for ease of integration into OS primitives on both Linux (sockets)
  and Windows (ReadFile/WriteFile) [2].
    - A unikernel example of using it has also been developed for XTF. [3]

* There has been recent discussion and support in the Xen community for making
  revisions to the Argo interface to make it hypervisor-agnostic, and to support
  implementations of Argo on other hypervisors. This will enable a single
  interface for an OS kernel binary to use for inter-VM communication that will
  work on multiple hypervisors -- this applies equally to both backend and
  frontend implementations. [4]

* Here are the design documents for building VirtIO-over-Argo, to support a
  hypervisor-agnostic frontend VirtIO transport driver using Argo.

The Development Plan to build VirtIO virtual device support over Argo transport:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1

A design for using VirtIO over Argo, describing how VirtIO data structures
and communication is handled over the Argo transport:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo

Diagram (from the above document) showing how VirtIO rings are synchronized
between domains without using shared memory:
https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob-url=true&id=01f7d0e1-7686-4f0b-88e1-457c1d30df40&collection=contentId-1348763698&contextId=1348763698&mimeType=image%2Fpng&name=device-buffer-access-virtio-argo.png&size=243175&width=1106&height=1241

Please note that the above design documents show that the existing VirtIO
device drivers, and both vring and virtqueue data structures, can be preserved
while interdomain communication is performed with no shared memory required
for most drivers (the exceptions, where further design is required, are those
such as virtual framebuffer devices where shared memory regions are
intentionally added to the communication structure beyond the vrings and
virtqueues).

An analysis of VirtIO and Argo, informing the design:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Argo+as+a+transport+medium+for+VirtIO

* Argo can be used for a communication path for configuration between the
  backend and the toolstack, avoiding the need for a dependency on XenStore,
  which is an advantage for any hypervisor-agnostic design. It is also amenable
  to a notification mechanism that is not based on Xen event channels. (A
  hypothetical sketch of such a configuration path follows after this list.)

* Argo does not use or require shared memory between VMs and provides an
  alternative to the use of foreign shared memory mappings. It avoids some of
  the complexities involved with using grants (eg. XSA-300).

* Argo supports Mandatory Access Control by the hypervisor, satisfying a
  common certification requirement.

* The Argo headers are BSD-licensed and the Xen hypervisor implementation is
  GPLv2 but accessible via the hypercall interface. The licensing should not
  present an obstacle to adoption of Argo in guest software or implementation
  by other hypervisors.

* Since the interface that Argo presents to a guest VM is similar to DMA, a
  VirtIO-Argo frontend transport driver should be able to operate with a
  physical VirtIO-enabled smart-NIC if the toolstack and an Argo-aware backend
  provide support.
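
To give the configuration-path bullet above a concrete shape: the argo_*()
helpers below are hypothetical stand-ins, not the real Argo API (see [1] and
[2] for that). The point is only that a backend's configuration can arrive as
hypervisor-mediated messages, with no guest-to-guest shared memory involved:

    /* Hypothetical illustration only: these helpers stand in for the real
     * ring-registration / send / receive primitives documented in the refs. */
    #include <stddef.h>
    #include <stdint.h>

    #define CFG_PORT 4000   /* assumed well-known port agreed with the toolstack */

    static int argo_register_ring(uint32_t port, void *ring, size_t len)
    { (void)port; (void)ring; (void)len; return -1; /* wire up to real driver */ }
    static int argo_send(uint32_t dst_domid, uint32_t dst_port,
                         const void *buf, size_t len)
    { (void)dst_domid; (void)dst_port; (void)buf; (void)len; return -1; }
    static int argo_recv(uint32_t port, void *buf, size_t len)
    { (void)port; (void)buf; (void)len; return -1; }

    int fetch_backend_config(char *cfg, size_t cfg_len)
    {
        static uint8_t ring[4096];

        if (argo_register_ring(CFG_PORT, ring, sizeof(ring)) < 0)
            return -1;
        if (argo_send(0 /* toolstack domain */, CFG_PORT, "get-config", 11) < 0)
            return -1;
        return argo_recv(CFG_PORT, cfg, cfg_len);  /* reply mediated by the hypervisor */
    }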

The next Xen Community Call is next week and I would be happy to answer
questions about Argo and on this topic. I will also be following this thread.

Christopher
(Argo maintainer, Xen Community)

--------------------------------------------------------------------------------
[1]
An introduction to Argo:
https://static.sched.com/hosted_files/xensummit19/92/Argo%20and%20HMX%20-%20OpenXT%20-%20Christopher%20Clark%20-%20Xen%20Summit%202019.pdf
https://www.youtube.com/watch?v=cnC0Tg3jqJQ
Xen Wiki page for Argo:
https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_for_Xen

[2]
OpenXT Linux Argo driver and userspace library:
https://github.com/openxt/linux-xen-argo

Windows V4V at OpenXT wiki:
https://openxt.atlassian.net/wiki/spaces/DC/pages/14844007/V4V
Windows v4v driver source:
https://github.com/OpenXT/xc-windows/tree/master/xenv4v

HP/Bromium uXen V4V driver:
https://github.com/uxen-virt/uxen/tree/ascara/windows/uxenv4vlib

[3]
v2 of the Argo test unikernel for XTF:
https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg02234.html

[4]
Argo HMX Transport for VirtIO meeting minutes:
https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html

VirtIO-Argo Development wiki page:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1


> On Thu, Aug 26, 2021 at 5:11 AM Wei Chen <Wei.Chen@arm.com> wrote:
>
>> Hi Akashi,
>>
>> > -----Original Message-----
>> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> > Sent: 2021年8月26日 17:41
>> > To: Wei Chen <Wei.Chen@arm.com>
>> > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
>> > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>; Kaly
>> Xin
>> > <Kaly.Xin@arm.com>; Stratos Mailing List <
>> stratos-dev@op-lists.linaro.org>;
>> > virtio-dev@lists.oasis-open.org; Arnd Bergmann <
>> arnd.bergmann@linaro.org>;
>> > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
>> > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
>> > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
>> > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
>> > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
>> > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
>> > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
>> > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
>> Julien
>> > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
>> > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
>> > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
>> >
>> > Hi Wei,
>> >
>> > On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
>> > > On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
>> > > > Hi Akashi,
>> > > >
>> > > > > -----Original Message-----
>> > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> > > > > Sent: 2021年8月18日 13:39
>> > > > > To: Wei Chen <Wei.Chen@arm.com>
>> > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
>> Stabellini
>> > > > > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>;
>> > Stratos
>> > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
>> > dev@lists.oasis-
>> > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
>> > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
>> > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
>> > > > > <jan.kiszka@siemens.com>; Carl van Schaik
>> > <cvanscha@qti.qualcomm.com>;
>> > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
>> > Jean-
>> > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
>> > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
>> > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
>> > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com
>> >;
>> > Julien
>> > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
>> > Durrant
>> > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
>> > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
>> > > > >
>> > > > > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
>> > > > > > Hi Akashi,
>> > > > > >
>> > > > > > > -----Original Message-----
>> > > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> > > > > > > Sent: 2021年8月17日 16:08
>> > > > > > > To: Wei Chen <Wei.Chen@arm.com>
>> > > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
>> > Stabellini
>> > > > > > > <sstabellini@kernel.org>; Alex Bennée <
>> alex.bennee@linaro.org>;
>> > > > > Stratos
>> > > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
>> > > > > dev@lists.oasis-
>> > > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh
>> Kumar
>> > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
>> > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
>> Kiszka
>> > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
>> > <cvanscha@qti.qualcomm.com>;
>> > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org
>> >;
>> > Jean-
>> > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
>> > > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
>> > > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
>> > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
>> > <Artem_Mygaiev@epam.com>;
>> > > > > Julien
>> > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
>> > Durrant
>> > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
>> > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
>> backends
>> > > > > > >
>> > > > > > > Hi Wei, Oleksandr,
>> > > > > > >
>> > > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
>> > > > > > > > Hi All,
>> > > > > > > >
>> > > > > > > > Thanks to Stefano for linking my kvmtool for Xen proposal here.
>> > > > > > > > This proposal is still being discussed in the Xen and KVM
>> communities.
>> > > > > > > > The main work is to decouple the kvmtool from KVM and make it
>> > > > > > > > possible for other hypervisors to reuse the virtual device
>> > > > > > > > implementations.
>> > > > > > > >
>> > > > > > > > In this case, we need to introduce an intermediate hypervisor
>> > > > > > > > layer for VMM abstraction, which, I think, is very close
>> > > > > > > > to Stratos' virtio hypervisor agnosticism work.
>> > > > > > >
>> > > > > > > # My proposal[1] comes from my own idea and doesn't always
>> > represent
>> > > > > > > # Linaro's view on this subject nor reflect Alex's concerns.
>> > > > > Nevertheless,
>> > > > > > >
>> > > > > > > Your idea and my proposal seem to share the same background.
>> > > > > > > Both have the similar goal and currently start with, at first,
>> > Xen
>> > > > > > > and are based on kvm-tool. (Actually, my work is derived from
>> > > > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
>> > > > > > >
>> > > > > > > In particular, the abstraction of hypervisor interfaces has the
>> > same
>> > > > > > > set of interfaces (for your "struct vmm_impl" and my "RPC
>> > interfaces").
>> > > > > > > This is not coincidental, as we both share the same origin as I
>> > said
>> > > > > above.
>> > > > > > > And so we will also share the same issues. One of them is a
>> way
>> > of
>> > > > > > > "sharing/mapping FE's memory". There is some trade-off between
>> > > > > > > the portability and the performance impact.
>> > > > > > > So we can discuss the topic here in this ML, too.
>> > > > > > > (See Alex's original email, too).
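
To make the shared design point concrete, the abstraction both proposals
converge on is essentially an ops table that the generic backend code calls
into; a hypothetical sketch in the spirit of the "struct vmm_impl" / RPC
interfaces mentioned above (all names invented):

    /* Hypothetical hypervisor abstraction: one implementation of this table
     * per hypervisor (Xen, KVM, ...), or one RPC proxy as in virtio-proxy. */
    #include <stddef.h>
    #include <stdint.h>

    struct vmm_ops {
        /* map `size` bytes of frontend guest memory starting at gpa */
        void *(*map_guest)(void *ctx, uint32_t fe_id, uint64_t gpa, size_t size);
        void  (*unmap_guest)(void *ctx, void *va, size_t size);

        /* notifications in both directions */
        int   (*send_irq)(void *ctx, uint32_t fe_id, uint32_t irq);
        int   (*wait_event)(void *ctx, uint32_t *out_queue_idx);

        /* device configuration space accesses trapped from the frontend */
        int   (*read_config)(void *ctx, uint64_t off, void *buf, size_t len);
        int   (*write_config)(void *ctx, uint64_t off, const void *buf, size_t len);
    };
    /* Generic backend code is written only against struct vmm_ops and is then
     * linked (or RPC-connected) to a Xen, KVM or other provider. */
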
>> > > > > > >
>> > > > > > Yes, I agree.
>> > > > > >
>> > > > > > > On the other hand, my approach aims to create a
>> "single-binary"
>> > > > > solution
>> > > > > > > in which the same binary of BE vm could run on any
>> hypervisor.
>> > > > > > > Somehow similar to your "proposal-#2" in [2], but in my
>> solution,
>> > all
>> > > > > > > the hypervisor-specific code would be put into another entity
>> > (VM),
>> > > > > > > named "virtio-proxy" and the abstracted operations are served
>> > via RPC.
>> > > > > > > (In this sense, BE is hypervisor-agnostic but might have OS
>> > > > > dependency.)
>> > > > > > > But I know that we need discuss if this is a requirement even
>> > > > > > > in Stratos project or not. (Maybe not)
>> > > > > > >
>> > > > > >
>> > > > > > Sorry, I haven't had time to finish reading your virtio-proxy
>> > completely
>> > > > > > (I will do it ASAP). But from your description, it seems we
>> need a
>> > > > > > 3rd VM between FE and BE? My concern is that, if my assumption
>> is
>> > right,
>> > > > > > will it increase the latency in data transport path? Even if
>> we're
>> > > > > > using some lightweight guest like RTOS or Unikernel,
>> > > > >
>> > > > > Yes, you're right. But I'm afraid that it is a matter of degree.
>> > > > > As long as we execute 'mapping' operations at every fetch of payload,
>> > > > > we will see a latency issue (even in your case), and if we have some
>> > > > > solution for it, we won't see it in my proposal either :)
>> > > > >
>> > > >
>> > > > Oleksandr has sent a proposal to the Xen mailing list to reduce this
>> kind
>> > > > of "mapping/unmapping" operations. So the latency caused by this
>> > behavior
>> > > > on Xen may eventually be eliminated, and Linux-KVM doesn't have that
>> > problem.
>> > >
>> > > Obviously, I have not yet caught up there in the discussion.
>> > > Which patch specifically?
>> >
>> > Can you give me the link to the discussion or patch, please?
>> >
>>
>> It's a RFC discussion. We have tested this RFC patch internally.
>> https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html
>>
>> > Thanks,
>> > -Takahiro Akashi
>> >
>> > > -Takahiro Akashi
>> > >
>> > > > > > > Specifically speaking about kvm-tool, I have a concern about its
>> > > > > > > license terms; targeting different hypervisors and different OSs
>> > > > > > > (which I assume includes RTOS's), the resultant library should be
>> > > > > > > license-permissive, and GPL for kvm-tool might be an issue.
>> > > > > > > Any thoughts?
>> > > > > > >
>> > > > > >
>> > > > > > Yes. If a user wants to implement a FreeBSD device model but the
>> > > > > > virtio library is GPL, then GPL would be a problem. If we have
>> > > > > > another good candidate, I am open to it.
>> > > > >
>> > > > > I have some candidates, particularly for vq/vring, in my mind:
>> > > > > * Open-AMP, or
>> > > > > * corresponding Free-BSD code
>> > > > >
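
Whichever library is chosen, the on-ring data structures it has to provide
are fixed by the VirtIO specification; for reference, the classic
split-virtqueue layout is only this:

    /* Split-virtqueue layout as defined by the VirtIO spec (v1.x). */
    #include <stdint.h>

    struct vring_desc {            /* descriptor table entry */
        uint64_t addr;             /* guest physical address of the buffer */
        uint32_t len;
        uint16_t flags;            /* NEXT / WRITE / INDIRECT */
        uint16_t next;             /* index of the chained descriptor */
    };

    struct vring_avail {           /* driver (FE) -> device (BE) */
        uint16_t flags;
        uint16_t idx;
        uint16_t ring[];           /* descriptor head indices */
    };

    struct vring_used_elem {
        uint32_t id;               /* head of the completed descriptor chain */
        uint32_t len;              /* bytes written by the device */
    };

    struct vring_used {            /* device (BE) -> driver (FE) */
        uint16_t flags;
        uint16_t idx;
        struct vring_used_elem ring[];
    };
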
>> > > >
>> > > > Interesting, I will look into them : )
>> > > >
>> > > > Cheers,
>> > > > Wei Chen
>> > > >
>> > > > > -Takahiro Akashi
>> > > > >
>> > > > >
>> > > > > > > -Takahiro Akashi
>> > > > > > >
>> > > > > > >
>> > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
>> > > > > > > August/000548.html
>> > > > > > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
>> > > > > > >
>> > > > > > > >
>> > > > > > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
>> > > > > > > > > Sent: 2021年8月14日 23:38
>> > > > > > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano
>> > > > > Stabellini
>> > > > > > > <sstabellini@kernel.org>
>> > > > > > > > > Cc: Alex Benn??e <alex.bennee@linaro.org>; Stratos
>> Mailing
>> > List
>> > > > > > > <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
>> > open.org;
>> > > > > Arnd
>> > > > > > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
>> > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
>> > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
>> Kiszka
>> > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
>> > <cvanscha@qti.qualcomm.com>;
>> > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org
>> >;
>> > Jean-
>> > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
>> > > > > > > <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>;
>> > Oleksandr
>> > > > > > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
>> > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
>> > <Artem_Mygaiev@epam.com>;
>> > > > > Julien
>> > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
>> > Durrant
>> > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
>> > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
>> > backends
>> > > > > > > > >
>> > > > > > > > > Hello, all.
>> > > > > > > > >
>> > > > > > > > > Please see some comments below. And sorry for the possible
>> > format
>> > > > > > > issues.
>> > > > > > > > >
>> > > > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
>> > > > > > > <mailto:takahiro.akashi@linaro.org> wrote:
>> > > > > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano
>> > Stabellini
>> > > > > wrote:
>> > > > > > > > > > > CCing people working on Xen+VirtIO and IOREQs. Not
>> > trimming
>> > > > > the
>> > > > > > > original
>> > > > > > > > > > > email to let them read the full context.
>> > > > > > > > > > >
>> > > > > > > > > > > My comments below are related to a potential Xen
>> > > > > implementation,
>> > > > > > > not
>> > > > > > > > > > > because it is the only implementation that matters,
>> but
>> > > > > because it
>> > > > > > > is
>> > > > > > > > > > > the one I know best.
>> > > > > > > > > >
>> > > > > > > > > > Please note that my proposal (and hence the working
>> > prototype)[1]
>> > > > > > > > > > is based on Xen's virtio implementation (i.e. IOREQ) and
>> > > > > > > particularly
>> > > > > > > > > > EPAM's virtio-disk application (backend server).
>> > > > > > > > > > It has been, I believe, well generalized but is still a
>> > bit
>> > > > > biased
>> > > > > > > > > > toward this original design.
>> > > > > > > > > >
>> > > > > > > > > > So I hope you like my approach :)
>> > > > > > > > > >
>> > > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
>> > dev/2021-
>> > > > > > > August/000546.html
>> > > > > > > > > >
>> > > > > > > > > > Let me take this opportunity to explain a bit more about
>> > my
>> > > > > approach
>> > > > > > > below.
>> > > > > > > > > >
>> > > > > > > > > > > Also, please see this relevant email thread:
>> > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
>> > > > > > > > > > > > Hi,
>> > > > > > > > > > > >
>> > > > > > > > > > > > One of the goals of Project Stratos is to enable
>> > hypervisor
>> > > > > > > agnostic
>> > > > > > > > > > > > backends so we can enable as much re-use of code as
>> > possible
>> > > > > and
>> > > > > > > avoid
>> > > > > > > > > > > > repeating ourselves. This is the flip side of the
>> > front end
>> > > > > > > where
>> > > > > > > > > > > > multiple front-end implementations are required -
>> one
>> > per OS,
>> > > > > > > assuming
>> > > > > > > > > > > > you don't just want Linux guests. The resultant
>> guests
>> > are
>> > > > > > > trivially
>> > > > > > > > > > > > movable between hypervisors modulo any abstracted
>> > paravirt
>> > > > > type
>> > > > > > > > > > > > interfaces.
>> > > > > > > > > > > >
>> > > > > > > > > > > > In my original thumb nail sketch of a solution I
>> > envisioned
>> > > > > > > vhost-user
>> > > > > > > > > > > > daemons running in a broadly POSIX like environment.
>> > The
>> > > > > > > interface to
>> > > > > > > > > > > > the daemon is fairly simple requiring only some
>> mapped
>> > > > > memory
>> > > > > > > and some
>> > > > > > > > > > > > sort of signalling for events (on Linux this is
>> > eventfd).
>> > > > > The
>> > > > > > > idea was a
>> > > > > > > > > > > > stub binary would be responsible for any hypervisor
>> > specific
>> > > > > > > setup and
>> > > > > > > > > > > > then launch a common binary to deal with the actual
>> > > > > virtqueue
>> > > > > > > requests
>> > > > > > > > > > > > themselves.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Since that original sketch we've seen an expansion
>> in
>> > the
>> > > > > sort
>> > > > > > > of ways
>> > > > > > > > > > > > backends could be created. There is interest in
>> > > > > encapsulating
>> > > > > > > backends
>> > > > > > > > > > > > in RTOSes or unikernels for solutions like SCMI.
>> There
>> > > > > interest
>> > > > > > > in Rust
>> > > > > > > > > > > > has prompted ideas of using the trait interface to
>> > abstract
>> > > > > > > differences
>> > > > > > > > > > > > away as well as the idea of bare-metal Rust
>> backends.
>> > > > > > > > > > > >
>> > > > > > > > > > > > We have a card (STR-12) called "Hypercall
>> > Standardisation"
>> > > > > which
>> > > > > > > > > > > > calls for a description of the APIs needed from the
>> > > > > hypervisor
>> > > > > > > side to
>> > > > > > > > > > > > support VirtIO guests and their backends. However we
>> > are
>> > > > > some
>> > > > > > > way off
>> > > > > > > > > > > > from that at the moment as I think we need to at
>> least
>> > > > > > > demonstrate one
>> > > > > > > > > > > > portable backend before we start codifying
>> > requirements. To
>> > > > > that
>> > > > > > > end I
>> > > > > > > > > > > > want to think about what we need for a backend to
>> > function.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Configuration
>> > > > > > > > > > > > =============
>> > > > > > > > > > > >
>> > > > > > > > > > > > In the type-2 setup this is typically fairly simple
>> > because
>> > > > > the
>> > > > > > > host
>> > > > > > > > > > > > system can orchestrate the various modules that make
>> > up the
>> > > > > > > complete
>> > > > > > > > > > > > system. In the type-1 case (or even type-2 with
>> > delegated
>> > > > > > > service VMs)
>> > > > > > > > > > > > we need some sort of mechanism to inform the backend
>> > VM
>> > > > > about
>> > > > > > > key
>> > > > > > > > > > > > details about the system:
>> > > > > > > > > > > >
>> > > > > > > > > > > >   - where virt queue memory is in it's address space
>> > > > > > > > > > > >   - how it's going to receive (interrupt) and
>> trigger
>> > (kick)
>> > > > > > > events
>> > > > > > > > > > > >   - what (if any) resources the backend needs to
>> > connect to
>> > > > > > > > > > > >
>> > > > > > > > > > > > Obviously you can elide over configuration issues by
>> > having
>> > > > > > > static
>> > > > > > > > > > > > configurations and baking the assumptions into your
>> > guest
>> > > > > images
>> > > > > > > however
>> > > > > > > > > > > > this isn't scalable in the long term. The obvious
>> > solution
>> > > > > seems
>> > > > > > > to be
>> > > > > > > > > > > > extending a subset of Device Tree data to user space
>> > but
>> > > > > perhaps
>> > > > > > > there
>> > > > > > > > > > > > are other approaches?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Before any virtio transactions can take place the
>> > > > > appropriate
>> > > > > > > memory
>> > > > > > > > > > > > mappings need to be made between the FE guest and
>> the
>> > BE
>> > > > > guest.
>> > > > > > > > > > >
>> > > > > > > > > > > > Currently the whole of the FE guests address space
>> > needs to
>> > > > > be
>> > > > > > > visible
>> > > > > > > > > > > > to whatever is serving the virtio requests. I can
>> > envision 3
>> > > > > > > approaches:
>> > > > > > > > > > > >
>> > > > > > > > > > > >  * BE guest boots with memory already mapped
>> > > > > > > > > > > >
>> > > > > > > > > > > >  This would entail the guest OS knowing where in
>> it's
>> > Guest
>> > > > > > > Physical
>> > > > > > > > > > > >  Address space is already taken up and avoiding
>> > clashing. I
>> > > > > > > would assume
>> > > > > > > > > > > >  in this case you would want a standard interface to
>> > > > > userspace
>> > > > > > > to then
>> > > > > > > > > > > >  make that address space visible to the backend
>> daemon.
>> > > > > > > > > >
>> > > > > > > > > > Yet another way here is that we would have well known
>> > "shared
>> > > > > > > memory" between
>> > > > > > > > > > VMs. I think that Jailhouse's ivshmem gives us good
>> > insights on
>> > > > > this
>> > > > > > > matter
>> > > > > > > > > > and that it can even be an alternative for hypervisor-
>> > agnostic
>> > > > > > > solution.
>> > > > > > > > > >
>> > > > > > > > > > (Please note memory regions in ivshmem appear as a PCI
>> > device
>> > > > > and
>> > > > > > > can be
>> > > > > > > > > > mapped locally.)
>> > > > > > > > > >
>> > > > > > > > > > I want to add this shared memory aspect to my
>> virtio-proxy,
>> > but
>> > > > > > > > > > the resultant solution would eventually look similar to
>> > ivshmem.
>> > > > > > > > > >
>> > > > > > > > > > > >  * BE guests boots with a hypervisor handle to
>> memory
>> > > > > > > > > > > >
>> > > > > > > > > > > >  The BE guest is then free to map the FE's memory to
>> > where
>> > > > > it
>> > > > > > > wants in
>> > > > > > > > > > > >  the BE's guest physical address space.
>> > > > > > > > > > >
>> > > > > > > > > > > I cannot see how this could work for Xen. There is no
>> > "handle"
>> > > > > to
>> > > > > > > give
>> > > > > > > > > > > to the backend if the backend is not running in dom0.
>> So
>> > for
>> > > > > Xen I
>> > > > > > > think
>> > > > > > > > > > > the memory has to be already mapped
>> > > > > > > > > >
>> > > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following
>> > information
>> > > > > is
>> > > > > > > expected
>> > > > > > > > > > to be exposed to BE via Xenstore:
>> > > > > > > > > > (I know that this is a tentative approach though.)
>> > > > > > > > > >    - the start address of configuration space
>> > > > > > > > > >    - interrupt number
>> > > > > > > > > >    - file path for backing storage
>> > > > > > > > > >    - read-only flag
>> > > > > > > > > > And the BE server has to call a particular hypervisor
>> > interface
>> > > > > to
>> > > > > > > > > > map the configuration space.
>> > > > > > > > >
>> > > > > > > > > Yes, Xenstore was chosen as a simple way to pass
>> > configuration
>> > > > > info to
>> > > > > > > the backend running in a non-toolstack domain.
>> > > > > > > > > I remember, there was a wish to avoid using Xenstore in
>> > Virtio
>> > > > > backend
>> > > > > > > itself if possible, so for non-toolstack domain, this could
>> be done
>> > with
>> > > > > > > adjusting devd (daemon that listens for devices and launches
>> > backends)
>> > > > > > > > > to read backend configuration from the Xenstore anyway and
>> > pass it
>> > > > > to
>> > > > > > > the backend via command line arguments.
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > Yes, in current PoC code we're using xenstore to pass device
>> > > > > > > configuration.
>> > > > > > > > We also designed a static device configuration parse method
>> > for
>> > > > > Dom0less
>> > > > > > > or
>> > > > > > > > other scenarios don't have xentool. yes, it's from device
>> > model
>> > > > > command
>> > > > > > > line
>> > > > > > > > or a config file.
>> > > > > > > >
>> > > > > > > > > But, if ...
>> > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > In my approach (virtio-proxy), all those Xen (or
>> > hypervisor)-
>> > > > > > > specific
>> > > > > > > > > > stuffs are contained in virtio-proxy, yet another VM, to
>> > hide
>> > > > > all
>> > > > > > > details.
>> > > > > > > > >
>> > > > > > > > > ... the solution how to overcome that is already found and
>> > proven
>> > > > > to
>> > > > > > > work then even better.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > # My point is that a "handle" is not mandatory for
>> > executing
>> > > > > mapping.
>> > > > > > > > > >
>> > > > > > > > > > > and the mapping probably done by the
>> > > > > > > > > > > toolstack (also see below.) Or we would have to
>> invent a
>> > new
>> > > > > Xen
>> > > > > > > > > > > hypervisor interface and Xen virtual machine
>> privileges
>> > to
>> > > > > allow
>> > > > > > > this
>> > > > > > > > > > > kind of mapping.
>> > > > > > > > > >
>> > > > > > > > > > > If we run the backend in Dom0 that we have no problems
>> > of
>> > > > > course.
>> > > > > > > > > >
>> > > > > > > > > > One of difficulties on Xen that I found in my approach
>> is
>> > that
>> > > > > > > calling
>> > > > > > > > > > such hypervisor interfaces (registering IOREQ, mapping
>> > memory) is
>> > > > > > > only
>> > > > > > > > > > allowed on BE servers themselves and so we will have to
>> > extend
>> > > > > > > those
>> > > > > > > > > > interfaces.
>> > > > > > > > > > This, however, will raise some concern on security and
>> > privilege
>> > > > > > > distribution
>> > > > > > > > > > as Stefan suggested.
>> > > > > > > > >
>> > > > > > > > > We also faced policy related issues with Virtio backend
>> > running in
>> > > > > > > other than Dom0 domain in a "dummy" xsm mode. In our target
>> > system we
>> > > > > run
>> > > > > > > the backend in a driver
>> > > > > > > > > domain (we call it DomD) where the underlying H/W resides.
>> > We
>> > > > > trust it,
>> > > > > > > so we wrote policy rules (to be used in "flask" xsm mode) to
>> > provide
>> > > > > it
>> > > > > > > with a little bit more privileges than a simple DomU had.
>> > > > > > > > > Now it is permitted to issue device-model, resource and
>> > memory
>> > > > > > > mappings, etc calls.
>> > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > > To activate the mapping will
>> > > > > > > > > > > >  require some sort of hypercall to the hypervisor. I
>> > can see
>> > > > > two
>> > > > > > > options
>> > > > > > > > > > > >  at this point:
>> > > > > > > > > > > >
>> > > > > > > > > > > >   - expose the handle to userspace for daemon/helper
>> > to
>> > > > > trigger
>> > > > > > > the
>> > > > > > > > > > > >     mapping via existing hypercall interfaces. If
>> > using a
>> > > > > helper
>> > > > > > > you
>> > > > > > > > > > > >     would have a hypervisor specific one to avoid
>> the
>> > daemon
>> > > > > > > having to
>> > > > > > > > > > > >     care too much about the details or push that
>> > complexity
>> > > > > into
>> > > > > > > a
>> > > > > > > > > > > >     compile time option for the daemon which would
>> > result in
>> > > > > > > different
>> > > > > > > > > > > >     binaries although a common source base.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   - expose a new kernel ABI to abstract the
>> hypercall
>> > > > > > > differences away
>> > > > > > > > > > > >     in the guest kernel. In this case the userspace
>> > would
>> > > > > > > essentially
>> > > > > > > > > > > >     ask for an abstract "map guest N memory to
>> > userspace
>> > > > > ptr"
>> > > > > > > and let
>> > > > > > > > > > > >     the kernel deal with the different hypercall
>> > interfaces.
>> > > > > > > This of
>> > > > > > > > > > > >     course assumes the majority of BE guests would
>> be
>> > Linux
>> > > > > > > kernels and
>> > > > > > > > > > > >     leaves the bare-metal/unikernel approaches to
>> > their own
>> > > > > > > devices.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Operation
>> > > > > > > > > > > > =========
>> > > > > > > > > > > >
>> > > > > > > > > > > > The core of the operation of VirtIO is fairly
>> simple.
>> > Once
>> > > > > the
>> > > > > > > > > > > > vhost-user feature negotiation is done it's a case
>> of
>> > > > > receiving
>> > > > > > > update
>> > > > > > > > > > > > events and parsing the resultant virt queue for
>> data.
>> > The
>> > > > > vhost-
>> > > > > > > user
>> > > > > > > > > > > > specification handles a bunch of setup before that
>> > point,
>> > > > > mostly
>> > > > > > > to
>> > > > > > > > > > > > detail where the virt queues are set up FD's for
>> > memory and
>> > > > > > > event
>> > > > > > > > > > > > communication. This is where the envisioned stub
>> > process
>> > > > > would
>> > > > > > > be
>> > > > > > > > > > > > responsible for getting the daemon up and ready to
>> run.
>> > This
>> > > > > is
>> > > > > > > > > > > > currently done inside a big VMM like QEMU but I
>> > suspect a
>> > > > > modern
>> > > > > > > > > > > > approach would be to use the rust-vmm vhost crate.
>> It
>> > would
>> > > > > then
>> > > > > > > either
>> > > > > > > > > > > > communicate with the kernel's abstracted ABI or be
>> re-
>> > > > > targeted
>> > > > > > > as a
>> > > > > > > > > > > > build option for the various hypervisors.
>> > > > > > > > > > >
>> > > > > > > > > > > One thing I mentioned before to Alex is that Xen
>> doesn't
>> > have
>> > > > > VMMs
>> > > > > > > the
>> > > > > > > > > > > way they are typically envisioned and described in
>> other
>> > > > > > > environments.
>> > > > > > > > > > > Instead, Xen has IOREQ servers. Each of them connects
>> > > > > > > independently to
>> > > > > > > > > > > Xen via the IOREQ interface. E.g. today multiple QEMUs
>> > could
>> > > > > be
>> > > > > > > used as
>> > > > > > > > > > > emulators for a single Xen VM, each of them connecting
>> > to Xen
>> > > > > > > > > > > independently via the IOREQ interface.
>> > > > > > > > > > >
>> > > > > > > > > > > The component responsible for starting a daemon and/or
>> > setting
>> > > > > up
>> > > > > > > shared
>> > > > > > > > > > > interfaces is the toolstack: the xl command and the
>> > > > > libxl/libxc
>> > > > > > > > > > > libraries.
>> > > > > > > > > >
>> > > > > > > > > > I think that VM configuration management (or
>> orchestration
>> > in
>> > > > > > > Stratos
>> > > > > > > > > > jargon?) is a subject to debate in parallel.
>> > > > > > > > > > Otherwise, is there any good assumption to avoid it
>> right
>> > now?
>> > > > > > > > > >
>> > > > > > > > > > > Oleksandr and others I CCed have been working on ways
>> > for the
>> > > > > > > toolstack
>> > > > > > > > > > > to create virtio backends and setup memory mappings.
>> > They
>> > > > > might be
>> > > > > > > able
>> > > > > > > > > > > to provide more info on the subject. I do think we
>> miss
>> > a way
>> > > > > to
>> > > > > > > provide
>> > > > > > > > > > > the configuration to the backend and anything else
>> that
>> > the
>> > > > > > > backend
>> > > > > > > > > > > might require to start doing its job.
>> > > > > > > > >
>> > > > > > > > > Yes, some work has been done for the toolstack to handle
>> > Virtio
>> > > > > MMIO
>> > > > > > > devices in
>> > > > > > > > > general and Virtio block devices in particular. However,
>> it
>> > has
>> > > > > not
>> > > > > > > been upstreamed yet.
>> > > > > > > > > Updated patches on review now:
>> > > > > > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-
>> > send-
>> > > > > email-
>> > > > > > > olekstysh@gmail.com/
>> > > > > > > > >
>> > > > > > > > > There is an additional (also important) activity to improve/fix
>> > > > > > > > > foreign memory mapping on Arm, which I am also involved in.
>> > > > > > > > > The foreign memory mapping is proposed to be used for Virtio
>> > > > > > > > > backends (device emulators) if there is a need to run the guest
>> > > > > > > > > OS completely unmodified. Of course, the more secure way would be
>> > > > > > > > > to use grant memory mapping. Briefly, the main difference between
>> > > > > > > > > them is that with foreign mapping the backend can map any guest
>> > > > > > > > > memory it wants to map, but with grant mapping it is allowed to
>> > > > > > > > > map only what was previously granted by the frontend.
>> > > > > > > > >
>> > > > > > > > > So, there might be a problem if we want to pre-map some guest
>> > > > > > > > > memory in advance or to cache mappings in the backend in order to
>> > > > > > > > > improve performance (because mapping/unmapping guest pages on
>> > > > > > > > > every request requires a lot of back and forth to Xen + P2M
>> > > > > > > > > updates). In a nutshell, currently, in order to map a guest page
>> > > > > > > > > into the backend address space we need to steal a real physical
>> > > > > > > > > page from the backend domain. So, with the said optimizations we
>> > > > > > > > > might end up with no free memory in the backend domain (see
>> > > > > > > > > XSA-300). And what we try to achieve is to not waste real domain
>> > > > > > > > > memory at all by providing safe, non-allocated-yet (so unused)
>> > > > > > > > > address space for the foreign (and grant) pages to be mapped
>> > > > > > > > > into; this enabling work implies Xen and Linux (and likely DTB
>> > > > > > > > > bindings) changes. However, as it turned out, for this to work in
>> > > > > > > > > a proper and safe way some prereq work needs to be done.
>> > > > > > > > > You can find the related Xen discussion at:
>> > > > > > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/
>> > > > > > > > >
>> > > > > > > > >
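To make the trade-off above concrete, here is a minimal sketch of the kind of
per-backend mapping cache being discussed. The hyp_map/unmap calls are
stand-ins (stubbed with malloc/free here) for whatever privileged mapping
primitive the hypervisor actually provides; each live slot pins exactly the
kind of backend-domain page that the XSA-300 concern is about.

#include <stdint.h>
#include <stdlib.h>

#define CACHE_SLOTS 16
#define PAGE_SIZE   4096UL

/* Stand-ins for hypervisor-specific mapping primitives (illustrative only). */
static void *hyp_map_guest_page(uint64_t gpfn) { (void)gpfn; return malloc(PAGE_SIZE); }
static void  hyp_unmap_guest_page(void *va)    { free(va); }

struct map_slot { uint64_t gpfn; void *va; };
static struct map_slot cache[CACHE_SLOTS];

/* Return a local mapping for a guest pfn, reusing a cached one if present.
 * A cache hit avoids the hypercall and P2M update described above; the price
 * is that every cached slot keeps a real backend page in use until evicted. */
void *map_cached(uint64_t gpfn)
{
    struct map_slot *slot = &cache[gpfn % CACHE_SLOTS];

    if (slot->va && slot->gpfn == gpfn)
        return slot->va;                    /* hit: no round trip to Xen */

    if (slot->va)
        hyp_unmap_guest_page(slot->va);     /* evict to bound memory usage */

    slot->va   = hyp_map_guest_page(gpfn);
    slot->gpfn = gpfn;
    return slot->va;
}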
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > > One question is how to best handle notification and
>> > kicks.
>> > > > > The
>> > > > > > > existing
>> > > > > > > > > > > > vhost-user framework uses eventfd to signal the
>> daemon
>> > > > > (although
>> > > > > > > QEMU
>> > > > > > > > > > > > is quite capable of simulating them when you use
>> TCG).
>> > Xen
>> > > > > has
>> > > > > > > it's own
>> > > > > > > > > > > > IOREQ mechanism. However latency is an important
>> > factor and
>> > > > > > > having
>> > > > > > > > > > > > events go through the stub would add quite a lot.
>> > > > > > > > > > >
>> > > > > > > > > > > Yeah I think, regardless of anything else, we want the
>> > > > > backends to
>> > > > > > > > > > > connect directly to the Xen hypervisor.
>> > > > > > > > > >
>> > > > > > > > > > In my approach,
>> > > > > > > > > >  a) BE -> FE: interrupts triggered by BE calling a
>> > hypervisor
>> > > > > > > interface
>> > > > > > > > > >               via virtio-proxy
>> > > > > > > > > >  b) FE -> BE: MMIO to config raises events (in event
>> > channels),
>> > > > > > > which is
>> > > > > > > > > >               converted to a callback to BE via virtio-
>> > proxy
>> > > > > > > > > >               (Xen's event channel is internally
>> > implemented by
>> > > > > > > interrupts.)
>> > > > > > > > > >
>> > > > > > > > > > I don't know what "connect directly" means here, but
>> > > > > > > > > > sending interrupts to the opposite side would be the most
>> > > > > > > > > > efficient.
>> > > > > > > > > > Ivshmem, I suppose, takes this approach by utilizing
>> PCI's
>> > msi-x
>> > > > > > > mechanism.
>> > > > > > > > >
>> > > > > > > > > Agree that MSI would be more efficient than SPI...
>> > > > > > > > > At the moment, in order to notify the frontend, the
>> backend
>> > issues
>> > > > > a
>> > > > > > > specific device-model call to query Xen to inject a
>> > corresponding SPI
>> > > > > to
>> > > > > > > the guest.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > > Could we consider the kernel internally converting
>> > IOREQ
>> > > > > > > messages from
>> > > > > > > > > > > > the Xen hypervisor to eventfd events? Would this
>> scale
>> > with
>> > > > > > > other kernel
>> > > > > > > > > > > > hypercall interfaces?
>> > > > > > > > > > > >
>> > > > > > > > > > > > So any thoughts on what directions are worth
>> > experimenting
>> > > > > with?
>> > > > > > > > > > >
>> > > > > > > > > > > One option we should consider is for each backend to
>> > connect
>> > > > > to
>> > > > > > > Xen via
>> > > > > > > > > > > the IOREQ interface. We could generalize the IOREQ
>> > interface
>> > > > > and
>> > > > > > > make it
>> > > > > > > > > > > hypervisor agnostic. The interface is really trivial
>> and
>> > easy
>> > > > > to
>> > > > > > > add.
>> > > > > > > > > >
>> > > > > > > > > > As I said above, my proposal does the same thing that
>> you
>> > > > > mentioned
>> > > > > > > here :)
>> > > > > > > > > > The difference is that I do call hypervisor interfaces
>> via
>> > > > > virtio-
>> > > > > > > proxy.
>> > > > > > > > > >
>> > > > > > > > > > > The only Xen-specific part is the notification
>> mechanism,
>> > > > > which is
>> > > > > > > an
>> > > > > > > > > > > event channel. If we replaced the event channel with
>> > something
>> > > > > > > else the
>> > > > > > > > > > > interface would be generic. See:
>> > > > > > > > > > > https://gitlab.com/xen-project/xen/-
>> > > > > > > /blob/staging/xen/include/public/hvm/ioreq.h#L52
>> > > > > > > > > > >
>> > > > > > > > > > > I don't think that translating IOREQs to eventfd in
>> the
>> > kernel
>> > > > > is
>> > > > > > > a
>> > > > > > > > > > > good idea: it feels like it would be extra complexity
>> > and that
>> > > > > the
>> > > > > > > > > > > kernel shouldn't be involved as this is a backend-
>> > hypervisor
>> > > > > > > interface.
>> > > > > > > > > >
>> > > > > > > > > > Given that we may want to implement the BE as a bare-metal
>> > > > > > > > > > application, as I did on Zephyr, I don't think that the
>> > > > > > > > > > translation would be a big issue, especially on RTOSes.
>> > > > > > > > > > It will be some kind of abstraction layer of interrupt
>> > handling
>> > > > > > > > > > (or nothing but a callback mechanism).
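A minimal sketch of such a callback/abstraction layer is shown below, using a
real Linux eventfd as one possible backing; the Xen event-channel or RTOS
semaphore variants would simply provide a different pair of wait/signal
functions and are not shown here.

#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* A tiny "kick" abstraction: the generic backend only sees wait()/signal(),
 * while the hypervisor- or OS-specific stub decides what backs them. */
struct kick_source {
    int (*wait)(struct kick_source *ks);    /* block until the FE kicks */
    int (*signal)(struct kick_source *ks);  /* inject a kick (loopback/tests) */
    int fd;                                 /* backing handle, if any */
};

static int eventfd_wait(struct kick_source *ks)
{
    uint64_t n;
    return read(ks->fd, &n, sizeof(n)) == sizeof(n) ? 0 : -1;
}

static int eventfd_signal(struct kick_source *ks)
{
    uint64_t one = 1;
    return write(ks->fd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}

/* Build an eventfd-backed kick source (Linux only). */
int kick_source_init_eventfd(struct kick_source *ks)
{
    ks->fd = eventfd(0, 0);
    if (ks->fd < 0)
        return -1;
    ks->wait   = eventfd_wait;
    ks->signal = eventfd_signal;
    return 0;
}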
>> > > > > > > > > >
>> > > > > > > > > > > Also, eventfd is very Linux-centric and we are trying
>> to
>> > > > > design an
>> > > > > > > > > > > interface that could work well for RTOSes too. If we
>> > want to
>> > > > > do
>> > > > > > > > > > > something different, both OS-agnostic and hypervisor-
>> > agnostic,
>> > > > > > > perhaps
>> > > > > > > > > > > we could design a new interface. One that could be
>> > > > > implementable
>> > > > > > > in the
>> > > > > > > > > > > Xen hypervisor itself (like IOREQ) and of course any
>> > other
>> > > > > > > hypervisor
>> > > > > > > > > > > too.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > There is also another problem. IOREQ is probably not the
>> > > > > > > > > > > only interface needed. Have a look at
>> > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2.
>> > Don't we
>> > > > > > > also need
>> > > > > > > > > > > an interface for the backend to inject interrupts into
>> > the
>> > > > > > > frontend? And
>> > > > > > > > > > > if the backend requires dynamic memory mappings of
>> > frontend
>> > > > > pages,
>> > > > > > > then
>> > > > > > > > > > > we would also need an interface to map/unmap domU
>> pages.
>> > > > > > > > > >
>> > > > > > > > > > My proposal document might help here; All the interfaces
>> > > > > required
>> > > > > > > for
>> > > > > > > > > > virtio-proxy (or hypervisor-related interfaces) are
>> listed
>> > as
>> > > > > > > > > > RPC protocols :)
>> > > > > > > > > >
>> > > > > > > > > > > These interfaces are a lot more problematic than
>> IOREQ:
>> > IOREQ
>> > > > > is
>> > > > > > > tiny
>> > > > > > > > > > > and self-contained. It is easy to add anywhere. A new
>> > > > > interface to
>> > > > > > > > > > > inject interrupts or map pages is more difficult to
>> > manage
>> > > > > because
>> > > > > > > it
>> > > > > > > > > > > would require changes scattered across the various
>> > emulators.
>> > > > > > > > > >
>> > > > > > > > > > Exactly. I have no confidence yet that my approach will
>> > > > > > > > > > also apply to hypervisors other than Xen.
>> > > > > > > > > > Technically, yes, but whether people can accept it or
>> not
>> > is a
>> > > > > > > different
>> > > > > > > > > > matter.
>> > > > > > > > > >
>> > > > > > > > > > Thanks,
>> > > > > > > > > > -Takahiro Akashi
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Regards,
>> > > > > > > > >
>> > > > > > > > > Oleksandr Tyshchenko
>>
>

[-- Attachment #2: Type: text/html, Size: 75017 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [virtio-dev] Re: Enabling hypervisor agnosticism for VirtIO backends
@ 2021-08-30 19:53                           ` Christopher Clark
  0 siblings, 0 replies; 66+ messages in thread
From: Christopher Clark @ 2021-08-30 19:53 UTC (permalink / raw)
  To: Wei Chen
  Cc: AKASHI Takahiro, Oleksandr Tyshchenko, Stefano Stabellini,
	Alex Bennée, Kaly Xin, Stratos Mailing List, virtio-dev,
	Arnd Bergmann, Viresh Kumar, Stefano Stabellini, stefanha,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Oleksandr Tyshchenko,
	Bertrand Marquis, Artem Mygaiev, Julien Grall, Juergen Gross,
	Paul Durrant, Xen Devel, Rich Persaud, Daniel Smith

[-- Attachment #1: Type: text/plain, Size: 47054 bytes --]

[ resending message to ensure delivery to the CCd mailing lists
post-subscription ]

Apologies for being late to this thread, but I hope to be able to contribute
to this discussion in a meaningful way. I am grateful for the level of
interest in this topic. I would like to draw your attention to Argo as a
suitable technology for development of VirtIO's hypervisor-agnostic interfaces.

* Argo is an interdomain communication mechanism in Xen (on x86 and Arm) that
  can send and receive hypervisor-mediated notifications and messages between
  domains (VMs). [1] The hypervisor can enforce Mandatory Access Control over
  all communication between domains. It is derived from the earlier v4v, which
  has been deployed on millions of machines with the HP/Bromium uXen hypervisor
  and with OpenXT.

* Argo has a simple interface with a small number of operations that was
  designed for ease of integration into OS primitives on both Linux (sockets)
  and Windows (ReadFile/WriteFile) [2].
    - A unikernel example of using it has also been developed for XTF. [3]

* There has been recent discussion and support in the Xen community for making
  revisions to the Argo interface to make it hypervisor-agnostic, and support
  implementations of Argo on other hypervisors. This will enable a single
  interface for an OS kernel binary to use for inter-VM communication that will
  work on multiple hypervisors -- this applies equally to both backend and
  frontend implementations. [4]
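To give a feel for the shape of such an interface, here is a sketch using
simplified, illustrative names (the authoritative ABI lives in Xen's public
argo.h header, and a hypervisor-agnostic revision would define its own
equivalents). The key property is that the receiver registers a ring it owns
and the hypervisor copies messages into it, so nothing is ever mapped from the
peer domain.

#include <stdint.h>
#include <stddef.h>

/* Shape of a hypervisor-mediated (Argo-like) messaging primitive.  All names
 * here are illustrative; see xen/include/public/argo.h for the real Xen ABI. */
struct msg_addr {
    uint32_t port;          /* destination port within the domain */
    uint16_t domain_id;     /* destination domain */
};

/* Register a receive ring with the hypervisor: incoming messages are copied
 * into memory owned by the *receiver*, so no memory is shared with, or mapped
 * from, the sending domain. */
int hyp_register_ring(const struct msg_addr *me, void *ring, size_t ring_len);

/* Send a message: the hypervisor validates the destination (and can apply
 * mandatory access control) before copying the payload into its ring. */
int hyp_sendv(const struct msg_addr *from, const struct msg_addr *to,
              const void *payload, size_t len);

/* A backend could then deliver, say, a used-ring update to the frontend as an
 * ordinary message instead of writing into shared memory. */
static inline int send_used_elem(const struct msg_addr *from,
                                 const struct msg_addr *to,
                                 uint32_t desc_head, uint32_t written)
{
    uint32_t msg[2] = { desc_head, written };
    return hyp_sendv(from, to, msg, sizeof(msg));
}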

* Here are the design documents for building VirtIO-over-Argo, to support a
  hypervisor-agnostic frontend VirtIO transport driver using Argo.

The Development Plan to build VirtIO virtual device support over Argo
transport:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1

A design for using VirtIO over Argo, describing how VirtIO data structures
and communication is handled over the Argo transport:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo

Diagram (from the above document) showing how VirtIO rings are synchronized
between domains without using shared memory:
https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob-url=true&id=01f7d0e1-7686-4f0b-88e1-457c1d30df40&collection=contentId-1348763698&contextId=1348763698&mimeType=image%2Fpng&name=device-buffer-access-virtio-argo.png&size=243175&width=1106&height=1241

Please note that the above design documents show that the existing VirtIO
device drivers, and both vring and virtqueue data structures, can be preserved
while interdomain communication can be performed with no shared memory required
for most drivers (the exceptions where further design is required are those
such as virtual framebuffer devices, where shared memory regions are
intentionally added to the communication structure beyond the vrings and
virtqueues).

An analysis of VirtIO and Argo, informing the design:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Argo+as+a+transport+medium+for+VirtIO

* Argo can be used for a communication path for configuration between the
  backend and the toolstack, avoiding the need for a dependency on XenStore,
  which is an advantage for any hypervisor-agnostic design. It is also amenable
  to a notification mechanism that is not based on Xen event channels.
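One possible shape for such a toolstack-to-backend configuration message,
purely as an illustration (the field names and sizes below are not from any
existing protocol):

#include <stdint.h>

/* Illustrative only: a "here is your device" record that a toolstack could
 * send to a backend over Argo (or any other channel) instead of the backend
 * reading XenStore itself. */
struct backend_device_config {
    uint32_t device_type;       /* e.g. virtio-blk, virtio-net, ... */
    uint16_t frontend_domid;    /* which guest this backend serves */
    uint64_t config_base;       /* guest-physical base of the config/MMIO region */
    uint64_t config_size;
    uint32_t irq;               /* interrupt / notification identifier */
    uint32_t flags;             /* e.g. read-only for block devices */
    char     resource[128];     /* backing resource, e.g. a disk image path */
};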

* Argo does not use or require shared memory between VMs and provides an
  alternative to the use of foreign shared memory mappings. It avoids some of
  the complexities involved with using grants (e.g. XSA-300).

* Argo supports Mandatory Access Control by the hypervisor, satisfying a common
  certification requirement.

* The Argo headers are BSD-licensed and the Xen hypervisor implementation is
  GPLv2 but accessible via the hypercall interface. The licensing should not
  present an obstacle to adoption of Argo in guest software or implementation
  by other hypervisors.

* Since the interface that Argo presents to a guest VM is similar to DMA, a
  VirtIO-Argo frontend transport driver should be able to operate with a
  physical VirtIO-enabled smart-NIC if the toolstack and an Argo-aware backend
  provide support.

The next Xen Community Call is next week and I would be happy to answer
questions about Argo and on this topic. I will also be following this thread.

Christopher
(Argo maintainer, Xen Community)

--------------------------------------------------------------------------------
[1]
An introduction to Argo:
https://static.sched.com/hosted_files/xensummit19/92/Argo%20and%20HMX%20-%20OpenXT%20-%20Christopher%20Clark%20-%20Xen%20Summit%202019.pdf
https://www.youtube.com/watch?v=cnC0Tg3jqJQ
Xen Wiki page for Argo:
https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_for_Xen

[2]
OpenXT Linux Argo driver and userspace library:
https://github.com/openxt/linux-xen-argo

Windows V4V at OpenXT wiki:
https://openxt.atlassian.net/wiki/spaces/DC/pages/14844007/V4V
Windows v4v driver source:
https://github.com/OpenXT/xc-windows/tree/master/xenv4v

HP/Bromium uXen V4V driver:
https://github.com/uxen-virt/uxen/tree/ascara/windows/uxenv4vlib

[3]
v2 of the Argo test unikernel for XTF:
https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg02234.html

[4]
Argo HMX Transport for VirtIO meeting minutes:
https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html

VirtIO-Argo Development wiki page:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1


> On Thu, Aug 26, 2021 at 5:11 AM Wei Chen <Wei.Chen@arm.com> wrote:
>
>> Hi Akashi,
>>
>> > -----Original Message-----
>> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> > Sent: 2021年8月26日 17:41
>> > To: Wei Chen <Wei.Chen@arm.com>
>> > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
>> > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>; Kaly
>> Xin
>> > <Kaly.Xin@arm.com>; Stratos Mailing List <
>> stratos-dev@op-lists.linaro.org>;
>> > virtio-dev@lists.oasis-open.org; Arnd Bergmann <
>> arnd.bergmann@linaro.org>;
>> > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
>> > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
>> > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
>> > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
>> > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
>> > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
>> > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
>> > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
>> Julien
>> > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
>> > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
>> > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
>> >
>> > Hi Wei,
>> >
>> > On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
>> > > On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
>> > > > Hi Akashi,
>> > > >
>> > > > > -----Original Message-----
>> > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> > > > > Sent: 2021年8月18日 13:39
>> > > > > To: Wei Chen <Wei.Chen@arm.com>
>> > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
>> Stabellini
>> > > > > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>;
>> > Stratos
>> > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
>> > dev@lists.oasis-
>> > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
>> > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
>> > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
>> > > > > <jan.kiszka@siemens.com>; Carl van Schaik
>> > <cvanscha@qti.qualcomm.com>;
>> > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
>> > Jean-
>> > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
>> > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
>> > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
>> > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com
>> >;
>> > Julien
>> > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
>> > Durrant
>> > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
>> > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
>> > > > >
>> > > > > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
>> > > > > > Hi Akashi,
>> > > > > >
>> > > > > > > -----Original Message-----
>> > > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> > > > > > > Sent: 2021年8月17日 16:08
>> > > > > > > To: Wei Chen <Wei.Chen@arm.com>
>> > > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
>> > Stabellini
>> > > > > > > <sstabellini@kernel.org>; Alex Benn??e <
>> alex.bennee@linaro.org>;
>> > > > > Stratos
>> > > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
>> > > > > dev@lists.oasis-
>> > > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh
>> Kumar
>> > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
>> > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
>> Kiszka
>> > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
>> > <cvanscha@qti.qualcomm.com>;
>> > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org
>> >;
>> > Jean-
>> > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
>> > > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
>> > > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
>> > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
>> > <Artem_Mygaiev@epam.com>;
>> > > > > Julien
>> > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
>> > Durrant
>> > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
>> > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
>> backends
>> > > > > > >
>> > > > > > > Hi Wei, Oleksandr,
>> > > > > > >
>> > > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
>> > > > > > > > Hi All,
>> > > > > > > >
>> > > > > > > > Thanks for Stefano to link my kvmtool for Xen proposal here.
>> > > > > > > > This proposal is still discussing in Xen and KVM
>> communities.
>> > > > > > > > The main work is to decouple the kvmtool from KVM and make
>> > > > > > > > other hypervisors can reuse the virtual device
>> implementations.
>> > > > > > > >
>> > > > > > > > In this case, we need to introduce an intermediate hypervisor
>> > > > > > > > layer for VMM abstraction, which is, I think, very close
>> > > > > > > > to Stratos' virtio hypervisor agnosticism work.
>> > > > > > >
>> > > > > > > # My proposal[1] comes from my own idea and doesn't always
>> > represent
>> > > > > > > # Linaro's view on this subject nor reflect Alex's concerns.
>> > > > > Nevertheless,
>> > > > > > >
>> > > > > > > Your idea and my proposal seem to share the same background.
>> > > > > > > Both have the similar goal and currently start with, at first,
>> > Xen
>> > > > > > > and are based on kvm-tool. (Actually, my work is derived from
>> > > > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
>> > > > > > >
>> > > > > > > In particular, the abstraction of hypervisor interfaces has a
>> > same
>> > > > > > > set of interfaces (for your "struct vmm_impl" and my "RPC
>> > interfaces").
>> > > > > > > This is not a coincidence, as we both share the same origin, as I
>> > > > > > > said above.
>> > > > > > > And so we will also share the same issues. One of them is a
>> way
>> > of
>> > > > > > > "sharing/mapping FE's memory". There is some trade-off between
>> > > > > > > the portability and the performance impact.
>> > > > > > > So we can discuss the topic here in this ML, too.
>> > > > > > > (See Alex's original email, too).
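A sketch of the kind of operations table both approaches converge on is shown
below; the member names are illustrative and are not the actual "struct
vmm_impl" from the kvmtool proposal nor the virtio-proxy RPC set.

#include <stdint.h>
#include <stddef.h>

/* A sketch of a hypervisor abstraction layer for a portable backend.  Each
 * hypervisor (Xen, KVM, ...) - or a proxy VM speaking RPC - provides one
 * instance of these operations; the generic virtio code calls only these. */
struct hyp_ops {
    /* memory */
    void *(*map_guest)(void *ctx, uint16_t domid, uint64_t gpa, size_t len);
    void  (*unmap_guest)(void *ctx, void *va, size_t len);

    /* events */
    int   (*wait_event)(void *ctx);                 /* block until FE kick/MMIO */
    int   (*notify_guest)(void *ctx, uint32_t irq); /* inject interrupt into FE */

    /* configuration */
    int   (*get_config)(void *ctx, void *buf, size_t len);
};

struct backend {
    const struct hyp_ops *ops;  /* selected at build time or at start-up */
    void *ctx;                  /* hypervisor-specific state */
};

Selecting the ops table at build time gives per-hypervisor binaries from a
common source, while selecting it at start-up (or placing it behind an RPC
proxy) corresponds to the single-binary model discussed in this thread.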
>> > > > > > >
>> > > > > > Yes, I agree.
>> > > > > >
>> > > > > > > On the other hand, my approach aims to create a
>> "single-binary"
>> > > > > solution
>> > > > > > > in which the same binary of BE vm could run on any
>> hypervisors.
>> > > > > > > Somehow similar to your "proposal-#2" in [2], but in my
>> solution,
>> > all
>> > > > > > > the hypervisor-specific code would be put into another entity
>> > (VM),
>> > > > > > > named "virtio-proxy" and the abstracted operations are served
>> > via RPC.
>> > > > > > > (In this sense, BE is hypervisor-agnostic but might have OS
>> > > > > dependency.)
>> > > > > > > But I know that we need discuss if this is a requirement even
>> > > > > > > in Stratos project or not. (Maybe not)
>> > > > > > >
>> > > > > >
>> > > > > > Sorry, I haven't had time to finish reading your virtio-proxy
>> > > > > > completely (I will do it ASAP). But from your description, it seems
>> > > > > > we need a 3rd VM between FE and BE? My concern is that, if my
>> > > > > > assumption is right, won't it increase the latency in the data
>> > > > > > transport path, even if we're using some lightweight guest like an
>> > > > > > RTOS or unikernel?
>> > > > >
>> > > > > Yes, you're right. But I'm afraid that it is a matter of degree.
>> > > > > As long as we execute 'mapping' operations at every fetch of payload,
>> > > > > we will see a latency issue (even in your case), and if we have some
>> > > > > solution for it, we won't see it in my proposal either :)
>> > > > >
>> > > >
>> > > > Oleksandr has sent a proposal to the Xen mailing list to reduce this
>> > > > kind of "mapping/unmapping" operation. So the latency caused by this
>> > > > behavior on Xen may eventually be eliminated, and Linux-KVM doesn't
>> > > > have that problem.
>> > >
>> > > Obviously, I have not yet caught up there in the discussion.
>> > > Which patch specifically?
>> >
>> > Can you give me the link to the discussion or patch, please?
>> >
>>
>> It's a RFC discussion. We have tested this RFC patch internally.
>> https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html
>>
>> > Thanks,
>> > -Takahiro Akashi
>> >
>> > > -Takahiro Akashi
>> > >
>> > > > > > > Specifically speaking about kvm-tool, I have a concern about its
>> > > > > > > license terms; targeting different hypervisors and different OSes
>> > > > > > > (which I assume includes RTOSes), the resultant library should be
>> > > > > > > permissively licensed, and GPL for kvm-tool might be an issue.
>> > > > > > > Any thoughts?
>> > > > > > >
>> > > > > >
>> > > > > > Yes. If a user wants to implement a FreeBSD device model but the
>> > > > > > virtio library is GPL, then GPL would be a problem. If we have
>> > > > > > another good candidate, I am open to it.
>> > > > >
>> > > > > I have some candidates, particularly for vq/vring, in my mind:
>> > > > > * Open-AMP, or
>> > > > > * corresponding Free-BSD code
>> > > > >
>> > > >
>> > > > Interesting, I will look into them : )
>> > > >
>> > > > Cheers,
>> > > > Wei Chen
>> > > >
>> > > > > -Takahiro Akashi
>> > > > >
>> > > > >
>> > > > > > > -Takahiro Akashi
>> > > > > > >
>> > > > > > >
>> > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
>> > > > > > > August/000548.html
>> > > > > > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
>> > > > > > >
>> > > > > > > >
>> > > > > > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
>> > > > > > > > > Sent: 2021年8月14日 23:38
>> > > > > > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano
>> > > > > Stabellini
>> > > > > > > <sstabellini@kernel.org>
>> > > > > > > > > Cc: Alex Benn??e <alex.bennee@linaro.org>; Stratos
>> Mailing
>> > List
>> > > > > > > <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
>> > open.org;
>> > > > > Arnd
>> > > > > > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
>> > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
>> > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
>> Kiszka
>> > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
>> > <cvanscha@qti.qualcomm.com>;
>> > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org
>> >;
>> > Jean-
>> > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
>> > > > > > > <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>;
>> > Oleksandr
>> > > > > > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
>> > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
>> > <Artem_Mygaiev@epam.com>;
>> > > > > Julien
>> > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
>> > Durrant
>> > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
>> > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
>> > backends
>> > > > > > > > >
>> > > > > > > > > Hello, all.
>> > > > > > > > >
>> > > > > > > > > Please see some comments below. And sorry for the possible
>> > format
>> > > > > > > issues.
>> > > > > > > > >
>> > > > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
>> > > > > > > <mailto:takahiro.akashi@linaro.org> wrote:
>> > > > > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano
>> > Stabellini
>> > > > > wrote:
>> > > > > > > > > > > CCing people working on Xen+VirtIO and IOREQs. Not
>> > trimming
>> > > > > the
>> > > > > > > original
>> > > > > > > > > > > email to let them read the full context.
>> > > > > > > > > > >
>> > > > > > > > > > > My comments below are related to a potential Xen
>> > > > > implementation,
>> > > > > > > not
>> > > > > > > > > > > because it is the only implementation that matters,
>> but
>> > > > > because it
>> > > > > > > is
>> > > > > > > > > > > the one I know best.
>> > > > > > > > > >
>> > > > > > > > > > Please note that my proposal (and hence the working
>> > prototype)[1]
>> > > > > > > > > > is based on Xen's virtio implementation (i.e. IOREQ) and
>> > > > > > > particularly
>> > > > > > > > > > EPAM's virtio-disk application (backend server).
>> > > > > > > > > > It has been, I believe, well generalized but is still a
>> > bit
>> > > > > biased
>> > > > > > > > > > toward this original design.
>> > > > > > > > > >
>> > > > > > > > > > So I hope you like my approach :)
>> > > > > > > > > >
>> > > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
>> > dev/2021-
>> > > > > > > August/000546.html
>> > > > > > > > > >
>> > > > > > > > > > Let me take this opportunity to explain a bit more about
>> > my
>> > > > > approach
>> > > > > > > below.
>> > > > > > > > > >
>> > > > > > > > > > > Also, please see this relevant email thread:
>> > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
>> > > > > > > > > > > > Hi,
>> > > > > > > > > > > >
>> > > > > > > > > > > > One of the goals of Project Stratos is to enable
>> > hypervisor
>> > > > > > > agnostic
>> > > > > > > > > > > > backends so we can enable as much re-use of code as
>> > possible
>> > > > > and
>> > > > > > > avoid
>> > > > > > > > > > > > repeating ourselves. This is the flip side of the
>> > front end
>> > > > > > > where
>> > > > > > > > > > > > multiple front-end implementations are required -
>> one
>> > per OS,
>> > > > > > > assuming
>> > > > > > > > > > > > you don't just want Linux guests. The resultant
>> guests
>> > are
>> > > > > > > trivially
>> > > > > > > > > > > > movable between hypervisors modulo any abstracted
>> > paravirt
>> > > > > type
>> > > > > > > > > > > > interfaces.
>> > > > > > > > > > > >
>> > > > > > > > > > > > In my original thumb nail sketch of a solution I
>> > envisioned
>> > > > > > > vhost-user
>> > > > > > > > > > > > daemons running in a broadly POSIX like environment.
>> > The
>> > > > > > > interface to
>> > > > > > > > > > > > the daemon is fairly simple requiring only some
>> mapped
>> > > > > memory
>> > > > > > > and some
>> > > > > > > > > > > > sort of signalling for events (on Linux this is
>> > eventfd).
>> > > > > The
>> > > > > > > idea was a
>> > > > > > > > > > > > stub binary would be responsible for any hypervisor
>> > specific
>> > > > > > > setup and
>> > > > > > > > > > > > then launch a common binary to deal with the actual
>> > > > > virtqueue
>> > > > > > > requests
>> > > > > > > > > > > > themselves.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Since that original sketch we've seen an expansion
>> in
>> > the
>> > > > > sort
>> > > > > > > of ways
>> > > > > > > > > > > > backends could be created. There is interest in
>> > > > > encapsulating
>> > > > > > > backends
>> > > > > > > > > > > > in RTOSes or unikernels for solutions like SCMI.
>> There
>> > > > > interest
>> > > > > > > in Rust
>> > > > > > > > > > > > has prompted ideas of using the trait interface to
>> > abstract
>> > > > > > > differences
>> > > > > > > > > > > > away as well as the idea of bare-metal Rust
>> backends.
>> > > > > > > > > > > >
>> > > > > > > > > > > > We have a card (STR-12) called "Hypercall
>> > Standardisation"
>> > > > > which
>> > > > > > > > > > > > calls for a description of the APIs needed from the
>> > > > > hypervisor
>> > > > > > > side to
>> > > > > > > > > > > > support VirtIO guests and their backends. However we
>> > are
>> > > > > some
>> > > > > > > way off
>> > > > > > > > > > > > from that at the moment as I think we need to at
>> least
>> > > > > > > demonstrate one
>> > > > > > > > > > > > portable backend before we start codifying
>> > requirements. To
>> > > > > that
>> > > > > > > end I
>> > > > > > > > > > > > want to think about what we need for a backend to
>> > function.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Configuration
>> > > > > > > > > > > > =============
>> > > > > > > > > > > >
>> > > > > > > > > > > > In the type-2 setup this is typically fairly simple
>> > because
>> > > > > the
>> > > > > > > host
>> > > > > > > > > > > > system can orchestrate the various modules that make
>> > up the
>> > > > > > > complete
>> > > > > > > > > > > > system. In the type-1 case (or even type-2 with
>> > delegated
>> > > > > > > service VMs)
>> > > > > > > > > > > > we need some sort of mechanism to inform the backend
>> > VM
>> > > > > about
>> > > > > > > key
>> > > > > > > > > > > > details about the system:
>> > > > > > > > > > > >
>> > > > > > > > > > > >   - where virt queue memory is in it's address space
>> > > > > > > > > > > >   - how it's going to receive (interrupt) and
>> trigger
>> > (kick)
>> > > > > > > events
>> > > > > > > > > > > >   - what (if any) resources the backend needs to
>> > connect to
>> > > > > > > > > > > >
>> > > > > > > > > > > > Obviously you can elide over configuration issues by
>> > having
>> > > > > > > static
>> > > > > > > > > > > > configurations and baking the assumptions into your
>> > guest
>> > > > > images
>> > > > > > > however
>> > > > > > > > > > > > this isn't scalable in the long term. The obvious
>> > solution
>> > > > > seems
>> > > > > > > to be
>> > > > > > > > > > > > extending a subset of Device Tree data to user space
>> > but
>> > > > > perhaps
>> > > > > > > there
>> > > > > > > > > > > > are other approaches?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Before any virtio transactions can take place the
>> > > > > appropriate
>> > > > > > > memory
>> > > > > > > > > > > > mappings need to be made between the FE guest and
>> the
>> > BE
>> > > > > guest.
>> > > > > > > > > > >
>> > > > > > > > > > > > Currently the whole of the FE guests address space
>> > needs to
>> > > > > be
>> > > > > > > visible
>> > > > > > > > > > > > to whatever is serving the virtio requests. I can
>> > envision 3
>> > > > > > > approaches:
>> > > > > > > > > > > >
>> > > > > > > > > > > >  * BE guest boots with memory already mapped
>> > > > > > > > > > > >
>> > > > > > > > > > > >  This would entail the guest OS knowing where in
>> it's
>> > Guest
>> > > > > > > Physical
>> > > > > > > > > > > >  Address space is already taken up and avoiding
>> > clashing. I
>> > > > > > > would assume
>> > > > > > > > > > > >  in this case you would want a standard interface to
>> > > > > userspace
>> > > > > > > to then
>> > > > > > > > > > > >  make that address space visible to the backend
>> daemon.
>> > > > > > > > > >
>> > > > > > > > > > Yet another way here is that we would have well-known
>> > > > > > > > > > "shared memory" between VMs. I think that Jailhouse's
>> > > > > > > > > > ivshmem gives us good insights on this matter and that it
>> > > > > > > > > > can even be an alternative for a hypervisor-agnostic
>> > > > > > > > > > solution.
>> > > > > > > > > >
>> > > > > > > > > > (Please note that memory regions in ivshmem appear as a PCI
>> > > > > > > > > > device and can be mapped locally.)
>> > > > > > > > > >
>> > > > > > > > > > I want to add this shared memory aspect to my virtio-proxy,
>> > > > > > > > > > but the resultant solution would eventually look similar to
>> > > > > > > > > > ivshmem.
>> > > > > > > > > >
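For reference, mapping such a PCI-exposed region from Linux userspace can be
as simple as mmap()ing the BAR's sysfs resource file. The PCI device address
below is a placeholder, and BAR2 is only the conventional ivshmem location;
adjust both for the actual device.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    /* Placeholder PCI address; resource2 corresponds to BAR2. */
    const char *res = "/sys/bus/pci/devices/0000:00:04.0/resource2";
    int fd = open(res, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    void *shm = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) { perror("mmap"); return 1; }

    /* The shared region is now directly accessible; vrings could live here. */
    printf("mapped %ld bytes of shared memory at %p\n", (long)st.st_size, shm);

    munmap(shm, st.st_size);
    close(fd);
    return 0;
}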
>> > > > > > > > > > > >  * BE guests boots with a hypervisor handle to
>> memory
>> > > > > > > > > > > >
>> > > > > > > > > > > >  The BE guest is then free to map the FE's memory to
>> > where
>> > > > > it
>> > > > > > > wants in
>> > > > > > > > > > > >  the BE's guest physical address space.
>> > > > > > > > > > >
>> > > > > > > > > > > I cannot see how this could work for Xen. There is no
>> > "handle"
>> > > > > to
>> > > > > > > give
>> > > > > > > > > > > to the backend if the backend is not running in dom0.
>> So
>> > for
>> > > > > Xen I
>> > > > > > > think
>> > > > > > > > > > > the memory has to be already mapped
>> > > > > > > > > >
>> > > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following
>> > > > > > > > > > information is expected to be exposed to the BE via Xenstore
>> > > > > > > > > > (I know that this is a tentative approach, though):
>> > > > > > > > > >    - the start address of configuration space
>> > > > > > > > > >    - interrupt number
>> > > > > > > > > >    - file path for backing storage
>> > > > > > > > > >    - read-only flag
>> > > > > > > > > > And the BE server has to call a particular hypervisor
>> > > > > > > > > > interface to map the configuration space.
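A small sketch of the backend side of that, using libxenstore; the key layout
under the base path is invented here purely for illustration (the real layout
is whatever the toolstack patches above define).

#include <stdio.h>
#include <stdlib.h>
#include <xenstore.h>   /* libxenstore */

/* Read one per-device key. The path layout is hypothetical. */
static char *read_key(struct xs_handle *xs, const char *base, const char *key)
{
    char path[256];
    unsigned int len;
    snprintf(path, sizeof(path), "%s/%s", base, key);
    return xs_read(xs, XBT_NULL, path, &len);   /* caller must free() */
}

int main(void)
{
    const char *base = "backend/virtio-blk/1/0";        /* hypothetical path */
    struct xs_handle *xs = xs_open(0);
    if (!xs)
        return 1;

    char *cfg_base = read_key(xs, base, "base");        /* config space GPA  */
    char *irq      = read_key(xs, base, "irq");         /* interrupt number  */
    char *image    = read_key(xs, base, "path");        /* backing storage   */
    char *ro       = read_key(xs, base, "ro");          /* read-only flag    */

    printf("cfg=%s irq=%s image=%s ro=%s\n",
           cfg_base ? cfg_base : "?", irq ? irq : "?",
           image ? image : "?", ro ? ro : "?");

    free(cfg_base); free(irq); free(image); free(ro);
    xs_close(xs);
    return 0;
}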
>> > > > > > > > >
>> > > > > > > > > Yes, Xenstore was chosen as a simple way to pass configuration
>> > > > > > > > > info to the backend running in a non-toolstack domain.
>> > > > > > > > > I remember there was a wish to avoid using Xenstore in the
>> > > > > > > > > Virtio backend itself if possible, so for a non-toolstack
>> > > > > > > > > domain this could be done by adjusting devd (the daemon that
>> > > > > > > > > listens for devices and launches backends) to read the backend
>> > > > > > > > > configuration from Xenstore anyway and pass it to the backend
>> > > > > > > > > via command line arguments.
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > Yes, in the current PoC code we're using xenstore to pass device
>> > > > > > > > configuration.
>> > > > > > > > We also designed a static device configuration parse method for
>> > > > > > > > Dom0less or other scenarios that don't have the Xen toolstack.
>> > > > > > > > Yes, it's from the device model command line or a config file.
>> > > > > > > >
>> > > > > > > > > But, if ...
>> > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > In my approach (virtio-proxy), all those Xen (or
>> > hypervisor)-
>> > > > > > > specific
>> > > > > > > > > > stuffs are contained in virtio-proxy, yet another VM, to
>> > hide
>> > > > > all
>> > > > > > > details.
>> > > > > > > > >
>> > > > > > > > > ... the solution how to overcome that is already found and
>> > proven
>> > > > > to
>> > > > > > > work then even better.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > # My point is that a "handle" is not mandatory for
>> > executing
>> > > > > mapping.
>> > > > > > > > > >
>> > > > > > > > > > > and the mapping probably done by the
>> > > > > > > > > > > toolstack (also see below.) Or we would have to
>> invent a
>> > new
>> > > > > Xen
>> > > > > > > > > > > hypervisor interface and Xen virtual machine
>> privileges
>> > to
>> > > > > allow
>> > > > > > > this
>> > > > > > > > > > > kind of mapping.
>> > > > > > > > > >
>> > > > > > > > > > > If we run the backend in Dom0 that we have no problems
>> > of
>> > > > > course.
>> > > > > > > > > >
>> > > > > > > > > > One of the difficulties on Xen that I found in my approach
>> > > > > > > > > > is that calling such hypervisor interfaces (registering an
>> > > > > > > > > > IOREQ server, mapping memory) is only allowed on BE servers
>> > > > > > > > > > themselves, and so we will have to extend those interfaces.
>> > > > > > > > > > This, however, will raise some concerns on security and
>> > > > > > > > > > privilege distribution, as Stefan suggested.
>> > > > > > > > >
>> > > > > > > > > We also faced policy-related issues with a Virtio backend
>> > > > > > > > > running in a domain other than Dom0 in the "dummy" xsm mode.
>> > > > > > > > > In our target system we run the backend in a driver domain
>> > > > > > > > > (we call it DomD) where the underlying H/W resides. We trust
>> > > > > > > > > it, so we wrote policy rules (to be used in the "flask" xsm
>> > > > > > > > > mode) to provide it with a little bit more privileges than a
>> > > > > > > > > simple DomU has.
>> > > > > > > > > Now it is permitted to issue device-model, resource and memory
>> > > > > > > > > mapping calls, etc.
>> > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > > To activate the mapping will
>> > > > > > > > > > > >  require some sort of hypercall to the hypervisor. I
>> > can see
>> > > > > two
>> > > > > > > options
>> > > > > > > > > > > >  at this point:
>> > > > > > > > > > > >
>> > > > > > > > > > > >   - expose the handle to userspace for daemon/helper
>> > to
>> > > > > trigger
>> > > > > > > the
>> > > > > > > > > > > >     mapping via existing hypercall interfaces. If
>> > using a
>> > > > > helper
>> > > > > > > you
>> > > > > > > > > > > >     would have a hypervisor specific one to avoid
>> the
>> > daemon
>> > > > > > > having to
>> > > > > > > > > > > >     care too much about the details or push that
>> > complexity
>> > > > > into
>> > > > > > > a
>> > > > > > > > > > > >     compile time option for the daemon which would
>> > result in
>> > > > > > > different
>> > > > > > > > > > > >     binaries although a common source base.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   - expose a new kernel ABI to abstract the
>> hypercall
>> > > > > > > differences away
>> > > > > > > > > > > >     in the guest kernel. In this case the userspace
>> > would
>> > > > > > > essentially
>> > > > > > > > > > > >     ask for an abstract "map guest N memory to
>> > userspace
>> > > > > ptr"
>> > > > > > > and let
>> > > > > > > > > > > >     the kernel deal with the different hypercall
>> > interfaces.
>> > > > > > > This of
>> > > > > > > > > > > >     course assumes the majority of BE guests would
>> be
>> > Linux
>> > > > > > > kernels and
>> > > > > > > > > > > >     leaves the bare-metal/unikernel approaches to
>> > their own
>> > > > > > > devices.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Operation
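A sketch of what that second option could look like from the daemon's point of
view follows; the device node, ioctl number and request structure are entirely
hypothetical and only illustrate the "map guest N memory to a userspace
pointer" idea.

#include <stdint.h>
#include <sys/ioctl.h>

/* Entirely hypothetical ABI: a hypervisor-agnostic character device would
 * accept this request and perform whichever hypercall its hypervisor needs. */
struct abstract_map_req {
    uint32_t guest_id;      /* which frontend guest */
    uint64_t guest_phys;    /* start of the region in the guest */
    uint64_t len;           /* length in bytes */
    uint64_t user_addr;     /* out: where it appears in the daemon's space */
};

#define VHOST_ABSTRACT_MAP _IOWR('V', 0x80, struct abstract_map_req)

/* Usage (hypothetical device node):
 *   int fd = open("/dev/vhost-abstract", O_RDWR);
 *   struct abstract_map_req req = { .guest_id = 1, .guest_phys = gpa, .len = sz };
 *   ioctl(fd, VHOST_ABSTRACT_MAP, &req);
 *   // req.user_addr now points at the mapped guest memory
 */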
>> > > > > > > > > > > > =========
>> > > > > > > > > > > >
>> > > > > > > > > > > > The core of the operation of VirtIO is fairly
>> simple.
>> > Once
>> > > > > the
>> > > > > > > > > > > > vhost-user feature negotiation is done it's a case
>> of
>> > > > > receiving
>> > > > > > > update
>> > > > > > > > > > > > events and parsing the resultant virt queue for
>> data.
>> > The
>> > > > > vhost-
>> > > > > > > user
>> > > > > > > > > > > > specification handles a bunch of setup before that
>> > point,
>> > > > > mostly
>> > > > > > > to
>> > > > > > > > > > > > detail where the virt queues are set up FD's for
>> > memory and
>> > > > > > > event
>> > > > > > > > > > > > communication. This is where the envisioned stub
>> > process
>> > > > > would
>> > > > > > > be
>> > > > > > > > > > > > responsible for getting the daemon up and ready to
>> run.
>> > This
>> > > > > is
>> > > > > > > > > > > > currently done inside a big VMM like QEMU but I
>> > suspect a
>> > > > > modern
>> > > > > > > > > > > > approach would be to use the rust-vmm vhost crate.
>> It
>> > would
>> > > > > then
>> > > > > > > either
>> > > > > > > > > > > > communicate with the kernel's abstracted ABI or be
>> re-
>> > > > > targeted
>> > > > > > > as a
>> > > > > > > > > > > > build option for the various hypervisors.
>> > > > > > > > > > >
>> > > > > > > > > > > One thing I mentioned before to Alex is that Xen
>> doesn't
>> > have
>> > > > > VMMs
>> > > > > > > the
>> > > > > > > > > > > way they are typically envisioned and described in
>> other
>> > > > > > > environments.
>> > > > > > > > > > > Instead, Xen has IOREQ servers. Each of them connects
>> > > > > > > independently to
>> > > > > > > > > > > Xen via the IOREQ interface. E.g. today multiple QEMUs
>> > could
>> > > > > be
>> > > > > > > used as
>> > > > > > > > > > > emulators for a single Xen VM, each of them connecting
>> > to Xen
>> > > > > > > > > > > independently via the IOREQ interface.
>> > > > > > > > > > >
>> > > > > > > > > > > The component responsible for starting a daemon and/or
>> > setting
>> > > > > up
>> > > > > > > shared
>> > > > > > > > > > > interfaces is the toolstack: the xl command and the
>> > > > > libxl/libxc
>> > > > > > > > > > > libraries.
>> > > > > > > > > >
>> > > > > > > > > > I think that VM configuration management (or orchestration
>> > > > > > > > > > in Stratos jargon?) is a subject to be debated in parallel.
>> > > > > > > > > > Otherwise, is there any good assumption to avoid it right
>> > > > > > > > > > now?
>> > > > > > > > > >
>> > > > > > > > > > > Oleksandr and others I CCed have been working on ways
>> > for the
>> > > > > > > toolstack
>> > > > > > > > > > > to create virtio backends and setup memory mappings.
>> > They
>> > > > > might be
>> > > > > > > able
>> > > > > > > > > > > to provide more info on the subject. I do think we
>> miss
>> > a way
>> > > > > to
>> > > > > > > provide
>> > > > > > > > > > > the configuration to the backend and anything else
>> that
>> > the
>> > > > > > > backend
>> > > > > > > > > > > might require to start doing its job.
>> > > > > > > > >
>> > > > > > > > > Yes, some work has been done for the toolstack to handle
>> > Virtio
>> > > > > MMIO
>> > > > > > > devices in
>> > > > > > > > > general and Virtio block devices in particular. However,
>> it
>> > has
>> > > > > not
>> > > > > > > been upstreamed yet.
>> > > > > > > > > Updated patches on review now:
>> > > > > > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-
>> > send-
>> > > > > email-
>> > > > > > > olekstysh@gmail.com/
>> > > > > > > > >
>> > > > > > > > > There is an additional (also important) activity to
>> > improve/fix
>> > > > > > > foreign memory mapping on Arm which I am also involved in.
>> > > > > > > > > The foreign memory mapping is proposed to be used for
>> Virtio
>> > > > > backends
>> > > > > > > (device emulators) if there is a need to run guest OS
>> completely
>> > > > > > > unmodified.
>> > > > > > > > > Of course, the more secure way would be to use grant
>> memory
>> > > > > mapping.
>> > > > > > > Briefly, the main difference between them is that with foreign
>> > mapping
>> > > > > the
>> > > > > > > backend
>> > > > > > > > > can map any guest memory it wants to map, but with grant
>> > mapping
>> > > > > it is
>> > > > > > > allowed to map only what was previously granted by the
>> frontend.
>> > > > > > > > >
>> > > > > > > > > So, there might be a problem if we want to pre-map some
>> > guest
>> > > > > memory
>> > > > > > > in advance or to cache mappings in the backend in order to
>> > improve
>> > > > > > > performance (because the mapping/unmapping guest pages every
>> > request
>> > > > > > > requires a lot of back and forth to Xen + P2M updates). In a
>> > nutshell,
>> > > > > > > currently, in order to map a guest page into the backend
>> address
>> > space
>> > > > > we
>> > > > > > > need to steal a real physical page from the backend domain.
>> So,
>> > with
>> > > > > the
>> > > > > > > said optimizations we might end up with no free memory in the
>> > backend
>> > > > > > > domain (see XSA-300). And what we try to achieve is to not
>> waste
>> > a
>> > > > > real
>> > > > > > > domain memory at all by providing safe non-allocated-yet (so
>> > unused)
>> > > > > > > address space for the foreign (and grant) pages to be mapped
>> > into,
>> > > > > this
>> > > > > > > enabling work implies Xen and Linux (and likely DTB bindings)
>> > changes.
>> > > > > > > However, as it turned out, for this to work in a proper and
>> safe
>> > way
>> > > > > some
>> > > > > > > prereq work needs to be done.
>> > > > > > > > > You can find the related Xen discussion at:
>> > > > > > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-
>> > send-
>> > > > > email-
>> > > > > > > olekstysh@gmail.com/
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > > One question is how to best handle notification and
>> > kicks.
>> > > > > The
>> > > > > > > existing
>> > > > > > > > > > > > vhost-user framework uses eventfd to signal the
>> daemon
>> > > > > (although
>> > > > > > > QEMU
>> > > > > > > > > > > > is quite capable of simulating them when you use
>> TCG).
>> > Xen
>> > > > > has
>> > > > > > > it's own
>> > > > > > > > > > > > IOREQ mechanism. However latency is an important
>> > factor and
>> > > > > > > having
>> > > > > > > > > > > > events go through the stub would add quite a lot.
>> > > > > > > > > > >
>> > > > > > > > > > > Yeah I think, regardless of anything else, we want the
>> > > > > backends to
>> > > > > > > > > > > connect directly to the Xen hypervisor.
>> > > > > > > > > >
>> > > > > > > > > > In my approach,
>> > > > > > > > > >  a) BE -> FE: interrupts triggered by BE calling a
>> > hypervisor
>> > > > > > > interface
>> > > > > > > > > >               via virtio-proxy
>> > > > > > > > > >  b) FE -> BE: MMIO to config raises events (in event
>> > channels),
>> > > > > > > which is
>> > > > > > > > > >               converted to a callback to BE via virtio-
>> > proxy
>> > > > > > > > > >               (Xen's event channel is internally
>> > implemented by
>> > > > > > > interrupts.)
>> > > > > > > > > >
>> > > > > > > > > > I don't know what "connect directly" means here, but
>> > > > > > > > > > sending interrupts to the opposite side would be the most
>> > > > > > > > > > efficient.
>> > > > > > > > > > Ivshmem, I suppose, takes this approach by utilizing
>> PCI's
>> > msi-x
>> > > > > > > mechanism.
>> > > > > > > > >
>> > > > > > > > > Agree that MSI would be more efficient than SPI...
>> > > > > > > > > At the moment, in order to notify the frontend, the
>> backend
>> > issues
>> > > > > a
>> > > > > > > specific device-model call to query Xen to inject a
>> > corresponding SPI
>> > > > > to
>> > > > > > > the guest.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > > Could we consider the kernel internally converting
>> > IOREQ
>> > > > > > > messages from
>> > > > > > > > > > > > the Xen hypervisor to eventfd events? Would this
>> scale
>> > with
>> > > > > > > other kernel
>> > > > > > > > > > > > hypercall interfaces?
>> > > > > > > > > > > >
>> > > > > > > > > > > > So any thoughts on what directions are worth
>> > experimenting
>> > > > > with?
>> > > > > > > > > > >
>> > > > > > > > > > > One option we should consider is for each backend to
>> > connect
>> > > > > to
>> > > > > > > Xen via
>> > > > > > > > > > > the IOREQ interface. We could generalize the IOREQ
>> > interface
>> > > > > and
>> > > > > > > make it
>> > > > > > > > > > > hypervisor agnostic. The interface is really trivial
>> and
>> > easy
>> > > > > to
>> > > > > > > add.
>> > > > > > > > > >
>> > > > > > > > > > As I said above, my proposal does the same thing that
>> you
>> > > > > mentioned
>> > > > > > > here :)
>> > > > > > > > > > The difference is that I do call hypervisor interfaces
>> via
>> > > > > virtio-
>> > > > > > > proxy.
>> > > > > > > > > >
>> > > > > > > > > > > The only Xen-specific part is the notification
>> mechanism,
>> > > > > which is
>> > > > > > > an
>> > > > > > > > > > > event channel. If we replaced the event channel with
>> > something
>> > > > > > > else the
>> > > > > > > > > > > interface would be generic. See:
>> > > > > > > > > > > https://gitlab.com/xen-project/xen/-
>> > > > > > > /blob/staging/xen/include/public/hvm/ioreq.h#L52
>> > > > > > > > > > >
>> > > > > > > > > > > I don't think that translating IOREQs to eventfd in
>> the
>> > kernel
>> > > > > is
>> > > > > > > a
>> > > > > > > > > > > good idea: it feels like it would be extra complexity
>> > and that
>> > > > > the
>> > > > > > > > > > > kernel shouldn't be involved as this is a backend-
>> > hypervisor
>> > > > > > > interface.
>> > > > > > > > > >
>> > > > > > > > > > Given that we may want to implement the BE as a bare-metal
>> > > > > > > > > > application, as I did on Zephyr, I don't think that the
>> > > > > > > > > > translation would be a big issue, especially on RTOSes.
>> > > > > > > > > > It will be some kind of abstraction layer of interrupt
>> > handling
>> > > > > > > > > > (or nothing but a callback mechanism).
>> > > > > > > > > >
>> > > > > > > > > > > Also, eventfd is very Linux-centric and we are trying
>> to
>> > > > > design an
>> > > > > > > > > > > interface that could work well for RTOSes too. If we
>> > want to
>> > > > > do
>> > > > > > > > > > > something different, both OS-agnostic and hypervisor-
>> > agnostic,
>> > > > > > > perhaps
>> > > > > > > > > > > we could design a new interface. One that could be
>> > > > > implementable
>> > > > > > > in the
>> > > > > > > > > > > Xen hypervisor itself (like IOREQ) and of course any
>> > other
>> > > > > > > hypervisor
>> > > > > > > > > > > too.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > There is also another problem. IOREQ is probably not the
>> > > > > > > > > > > only interface needed. Have a look at
>> > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2.
>> > Don't we
>> > > > > > > also need
>> > > > > > > > > > > an interface for the backend to inject interrupts into
>> > the
>> > > > > > > frontend? And
>> > > > > > > > > > > if the backend requires dynamic memory mappings of
>> > frontend
>> > > > > pages,
>> > > > > > > then
>> > > > > > > > > > > we would also need an interface to map/unmap domU
>> pages.
>> > > > > > > > > >
>> > > > > > > > > > My proposal document might help here; All the interfaces
>> > > > > required
>> > > > > > > for
>> > > > > > > > > > virtio-proxy (or hypervisor-related interfaces) are
>> listed
>> > as
>> > > > > > > > > > RPC protocols :)
>> > > > > > > > > >
>> > > > > > > > > > > These interfaces are a lot more problematic than
>> IOREQ:
>> > IOREQ
>> > > > > is
>> > > > > > > tiny
>> > > > > > > > > > > and self-contained. It is easy to add anywhere. A new
>> > > > > interface to
>> > > > > > > > > > > inject interrupts or map pages is more difficult to
>> > manage
>> > > > > because
>> > > > > > > it
>> > > > > > > > > > > would require changes scattered across the various
>> > emulators.
>> > > > > > > > > >
>> > > > > > > > > > Exactly. I have no confidence yet that my approach will
>> > > > > > > > > > also apply to hypervisors other than Xen.
>> > > > > > > > > > Technically, yes, but whether people can accept it or
>> not
>> > is a
>> > > > > > > different
>> > > > > > > > > > matter.
>> > > > > > > > > >
>> > > > > > > > > > Thanks,
>> > > > > > > > > > -Takahiro Akashi
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Regards,
>> > > > > > > > >
>> > > > > > > > > Oleksandr Tyshchenko
>>
>

[-- Attachment #2: Type: text/html, Size: 75017 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-26 12:10                     ` Wei Chen
  2021-08-30 19:36                       ` Christopher Clark
@ 2021-08-31  6:18                       ` AKASHI Takahiro
  2021-09-01 11:12                         ` Wei Chen
  1 sibling, 1 reply; 66+ messages in thread
From: AKASHI Takahiro @ 2021-08-31  6:18 UTC (permalink / raw)
  To: Wei Chen
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Alex Bennée, Kaly Xin,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant,
	Xen Devel

Wei,

On Thu, Aug 26, 2021 at 12:10:19PM +0000, Wei Chen wrote:
> Hi Akashi,
> 
> > -----Original Message-----
> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > Sent: 2021年8月26日 17:41
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>; Kaly Xin
> > <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-lists.linaro.org>;
> > virtio-dev@lists.oasis-open.org; Arnd Bergmann <arnd.bergmann@linaro.org>;
> > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> >
> > Hi Wei,
> >
> > On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
> > > On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
> > > > Hi Akashi,
> > > >
> > > > > -----Original Message-----
> > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > Sent: 2021年8月18日 13:39
> > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > > > > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>;
> > Stratos
> > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > dev@lists.oasis-
> > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > <cvanscha@qti.qualcomm.com>;
> > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> > Jean-
> > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> > Julien
> > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> > Durrant
> > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > >
> > > > > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
> > > > > > Hi Akashi,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > > > Sent: 2021年8月17日 16:08
> > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> > Stabellini
> > > > > > > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>;
> > > > > Stratos
> > > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > > > > dev@lists.oasis-
> > > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > <cvanscha@qti.qualcomm.com>;
> > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> > Jean-
> > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > <Artem_Mygaiev@epam.com>;
> > > > > Julien
> > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> > Durrant
> > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > > > >
> > > > > > > Hi Wei, Oleksandr,
> > > > > > >
> > > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > Thanks for Stefano to link my kvmtool for Xen proposal here.
> > > > > > > > This proposal is still discussing in Xen and KVM communities.
> > > > > > > > The main work is to decouple the kvmtool from KVM and make
> > > > > > > > other hypervisors can reuse the virtual device implementations.
> > > > > > > >
> > > > > > > > In this case, we need to introduce an intermediate hypervisor
> > > > > > > > layer for VMM abstraction, Which is, I think it's very close
> > > > > > > > to stratos' virtio hypervisor agnosticism work.
> > > > > > >
> > > > > > > # My proposal[1] comes from my own idea and doesn't always
> > represent
> > > > > > > # Linaro's view on this subject nor reflect Alex's concerns.
> > > > > Nevertheless,
> > > > > > >
> > > > > > > Your idea and my proposal seem to share the same background.
> > > > > > > Both have the similar goal and currently start with, at first,
> > Xen
> > > > > > > and are based on kvm-tool. (Actually, my work is derived from
> > > > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
> > > > > > >
> > > > > > > In particular, the abstraction of hypervisor interfaces has a
> > same
> > > > > > > set of interfaces (for your "struct vmm_impl" and my "RPC
> > interfaces").
> > > > > > > This is not co-incident as we both share the same origin as I
> > said
> > > > > above.
> > > > > > > And so we will also share the same issues. One of them is a way
> > of
> > > > > > > "sharing/mapping FE's memory". There is some trade-off between
> > > > > > > the portability and the performance impact.
> > > > > > > So we can discuss the topic here in this ML, too.
> > > > > > > (See Alex's original email, too).
> > > > > > >
> > > > > > Yes, I agree.
> > > > > >
> > > > > > > On the other hand, my approach aims to create a "single-binary"
> > > > > solution
> > > > > > > in which the same binary of BE vm could run on any hypervisors.
> > > > > > > Somehow similar to your "proposal-#2" in [2], but in my solution,
> > all
> > > > > > > the hypervisor-specific code would be put into another entity
> > (VM),
> > > > > > > named "virtio-proxy" and the abstracted operations are served
> > via RPC.
> > > > > > > (In this sense, BE is hypervisor-agnostic but might have OS
> > > > > dependency.)
> > > > > > > But I know that we need discuss if this is a requirement even
> > > > > > > in Stratos project or not. (Maybe not)
> > > > > > >
> > > > > >
> > > > > > Sorry, I haven't had time to finish reading your virtio-proxy
> > completely
> > > > > > (I will do it ASAP). But from your description, it seems we need a
> > > > > > 3rd VM between FE and BE? My concern is that, if my assumption is
> > right,
> > > > > > will it increase the latency in data transport path? Even if we're
> > > > > > using some lightweight guest like RTOS or Unikernel,
> > > > >
> > > > > Yes, you're right. But I'm afraid that it is a matter of degree.
> > > > > As far as we execute 'mapping' operations at every fetch of payload,
> > > > > we will see latency issue (even in your case) and if we have some
> > solution
> > > > > for it, we won't see it neither in my proposal :)
> > > > >
> > > >
> > > > Oleksandr has sent a proposal to Xen mailing list to reduce this kind
> > > > of "mapping/unmapping" operations. So the latency caused by this
> > behavior
> > > > on Xen may eventually be eliminated, and Linux-KVM doesn't have that
> > problem.
> > >
> > > Obviously, I have not yet caught up there in the discussion.
> > > Which patch specifically?
> >
> > Can you give me the link to the discussion or patch, please?
> >
> 
> It's a RFC discussion. We have tested this RFC patch internally.
> https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html

I'm afraid I'm missing something here, but why would this proposed API
eliminate the 'mmap' needed to access the queued payload on every
request?
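
Just so we are talking about the same thing, the per-request pattern I
have in mind is roughly the following (a sketch with hypothetical names,
not the actual Xen foreign-mapping calls):

  // Rough sketch, hypothetical names only (not the real Xen interfaces),
  // of the per-request pattern being discussed: map the guest buffer when
  // a descriptor is popped, process it, unmap when the request completes.

  /// Hypervisor-specific foreign-mapping facility, abstracted away.
  trait ForeignMapper {
      type Error;
      /// Map `len` bytes of frontend memory at guest-physical `gpa`
      /// into the backend's address space.
      fn map(&mut self, gpa: u64, len: usize) -> Result<*mut u8, Self::Error>;
      /// Undo a previous `map`.
      fn unmap(&mut self, ptr: *mut u8, len: usize) -> Result<(), Self::Error>;
  }

  /// One virtqueue buffer, in guest-physical terms.
  struct Descriptor {
      gpa: u64,
      len: usize,
  }

  fn handle_request<M: ForeignMapper>(m: &mut M, d: &Descriptor) -> Result<(), M::Error> {
      let ptr = m.map(d.gpa, d.len)?;   // hypercall plus P2M update
      // ... read or fill the payload at `ptr` ...
      m.unmap(ptr, d.len)               // and again on the way out
  }

My question above is really whether the proposed API removes the need
for this map/unmap pair on every request, or just changes how the
mapping is backed.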

-Takahiro Akashi


> > Thanks,
> > -Takahiro Akashi
> >
> > > -Takahiro Akashi
> > >
> > > > > > > Specifically speaking about kvm-tool, I have a concern about its
> > > > > > > license term; Targeting different hypervisors and different OSs
> > > > > > > (which I assume includes RTOS's), the resultant library should
> > be
> > > > > > > license permissive and GPL for kvm-tool might be an issue.
> > > > > > > Any thoughts?
> > > > > > >
> > > > > >
> > > > > > Yes. If user want to implement a FreeBSD device model, but the
> > virtio
> > > > > > library is GPL. Then GPL would be a problem. If we have another
> > good
> > > > > > candidate, I am open to it.
> > > > >
> > > > > I have some candidates, particularly for vq/vring, in my mind:
> > > > > * Open-AMP, or
> > > > > * corresponding Free-BSD code
> > > > >
> > > >
> > > > Interesting, I will look into them : )
> > > >
> > > > Cheers,
> > > > Wei Chen
> > > >
> > > > > -Takahiro Akashi
> > > > >
> > > > >
> > > > > > > -Takahiro Akashi
> > > > > > >
> > > > > > >
> > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > > > > > > August/000548.html
> > > > > > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > >
> > > > > > > >
> > > > > > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > > > > > > > > Sent: 2021年8月14日 23:38
> > > > > > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano
> > > > > Stabellini
> > > > > > > <sstabellini@kernel.org>
> > > > > > > > > Cc: Alex Benn??e <alex.bennee@linaro.org>; Stratos Mailing
> > List
> > > > > > > <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
> > open.org;
> > > > > Arnd
> > > > > > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > <cvanscha@qti.qualcomm.com>;
> > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> > Jean-
> > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > > <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>;
> > Oleksandr
> > > > > > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > <Artem_Mygaiev@epam.com>;
> > > > > Julien
> > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> > Durrant
> > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> > backends
> > > > > > > > >
> > > > > > > > > Hello, all.
> > > > > > > > >
> > > > > > > > > Please see some comments below. And sorry for the possible
> > format
> > > > > > > issues.
> > > > > > > > >
> > > > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> > > > > > > <mailto:takahiro.akashi@linaro.org> wrote:
> > > > > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano
> > Stabellini
> > > > > wrote:
> > > > > > > > > > > CCing people working on Xen+VirtIO and IOREQs. Not
> > trimming
> > > > > the
> > > > > > > original
> > > > > > > > > > > email to let them read the full context.
> > > > > > > > > > >
> > > > > > > > > > > My comments below are related to a potential Xen
> > > > > implementation,
> > > > > > > not
> > > > > > > > > > > because it is the only implementation that matters, but
> > > > > because it
> > > > > > > is
> > > > > > > > > > > the one I know best.
> > > > > > > > > >
> > > > > > > > > > Please note that my proposal (and hence the working
> > prototype)[1]
> > > > > > > > > > is based on Xen's virtio implementation (i.e. IOREQ) and
> > > > > > > particularly
> > > > > > > > > > EPAM's virtio-disk application (backend server).
> > > > > > > > > > It has been, I believe, well generalized but is still a
> > bit
> > > > > biased
> > > > > > > > > > toward this original design.
> > > > > > > > > >
> > > > > > > > > > So I hope you like my approach :)
> > > > > > > > > >
> > > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
> > dev/2021-
> > > > > > > August/000546.html
> > > > > > > > > >
> > > > > > > > > > Let me take this opportunity to explain a bit more about
> > my
> > > > > approach
> > > > > > > below.
> > > > > > > > > >
> > > > > > > > > > > Also, please see this relevant email thread:
> > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > One of the goals of Project Stratos is to enable
> > hypervisor
> > > > > > > agnostic
> > > > > > > > > > > > backends so we can enable as much re-use of code as
> > possible
> > > > > and
> > > > > > > avoid
> > > > > > > > > > > > repeating ourselves. This is the flip side of the
> > front end
> > > > > > > where
> > > > > > > > > > > > multiple front-end implementations are required - one
> > per OS,
> > > > > > > assuming
> > > > > > > > > > > > you don't just want Linux guests. The resultant guests
> > are
> > > > > > > trivially
> > > > > > > > > > > > movable between hypervisors modulo any abstracted
> > paravirt
> > > > > type
> > > > > > > > > > > > interfaces.
> > > > > > > > > > > >
> > > > > > > > > > > > In my original thumb nail sketch of a solution I
> > envisioned
> > > > > > > vhost-user
> > > > > > > > > > > > daemons running in a broadly POSIX like environment.
> > The
> > > > > > > interface to
> > > > > > > > > > > > the daemon is fairly simple requiring only some mapped
> > > > > memory
> > > > > > > and some
> > > > > > > > > > > > sort of signalling for events (on Linux this is
> > eventfd).
> > > > > The
> > > > > > > idea was a
> > > > > > > > > > > > stub binary would be responsible for any hypervisor
> > specific
> > > > > > > setup and
> > > > > > > > > > > > then launch a common binary to deal with the actual
> > > > > virtqueue
> > > > > > > requests
> > > > > > > > > > > > themselves.
> > > > > > > > > > > >
> > > > > > > > > > > > Since that original sketch we've seen an expansion in
> > the
> > > > > sort
> > > > > > > of ways
> > > > > > > > > > > > backends could be created. There is interest in
> > > > > encapsulating
> > > > > > > backends
> > > > > > > > > > > > in RTOSes or unikernels for solutions like SCMI. There
> > > > > interest
> > > > > > > in Rust
> > > > > > > > > > > > has prompted ideas of using the trait interface to
> > abstract
> > > > > > > differences
> > > > > > > > > > > > away as well as the idea of bare-metal Rust backends.
> > > > > > > > > > > >
> > > > > > > > > > > > We have a card (STR-12) called "Hypercall
> > Standardisation"
> > > > > which
> > > > > > > > > > > > calls for a description of the APIs needed from the
> > > > > hypervisor
> > > > > > > side to
> > > > > > > > > > > > support VirtIO guests and their backends. However we
> > are
> > > > > some
> > > > > > > way off
> > > > > > > > > > > > from that at the moment as I think we need to at least
> > > > > > > demonstrate one
> > > > > > > > > > > > portable backend before we start codifying
> > requirements. To
> > > > > that
> > > > > > > end I
> > > > > > > > > > > > want to think about what we need for a backend to
> > function.
> > > > > > > > > > > >
> > > > > > > > > > > > Configuration
> > > > > > > > > > > > =============
> > > > > > > > > > > >
> > > > > > > > > > > > In the type-2 setup this is typically fairly simple
> > because
> > > > > the
> > > > > > > host
> > > > > > > > > > > > system can orchestrate the various modules that make
> > up the
> > > > > > > complete
> > > > > > > > > > > > system. In the type-1 case (or even type-2 with
> > delegated
> > > > > > > service VMs)
> > > > > > > > > > > > we need some sort of mechanism to inform the backend
> > VM
> > > > > about
> > > > > > > key
> > > > > > > > > > > > details about the system:
> > > > > > > > > > > >
> > > > > > > > > > > >   - where virt queue memory is in it's address space
> > > > > > > > > > > >   - how it's going to receive (interrupt) and trigger
> > (kick)
> > > > > > > events
> > > > > > > > > > > >   - what (if any) resources the backend needs to
> > connect to
> > > > > > > > > > > >
> > > > > > > > > > > > Obviously you can elide over configuration issues by
> > having
> > > > > > > static
> > > > > > > > > > > > configurations and baking the assumptions into your
> > guest
> > > > > images
> > > > > > > however
> > > > > > > > > > > > this isn't scalable in the long term. The obvious
> > solution
> > > > > seems
> > > > > > > to be
> > > > > > > > > > > > extending a subset of Device Tree data to user space
> > but
> > > > > perhaps
> > > > > > > there
> > > > > > > > > > > > are other approaches?
> > > > > > > > > > > >
> > > > > > > > > > > > Before any virtio transactions can take place the
> > > > > appropriate
> > > > > > > memory
> > > > > > > > > > > > mappings need to be made between the FE guest and the
> > BE
> > > > > guest.
> > > > > > > > > > >
> > > > > > > > > > > > Currently the whole of the FE guests address space
> > needs to
> > > > > be
> > > > > > > visible
> > > > > > > > > > > > to whatever is serving the virtio requests. I can
> > envision 3
> > > > > > > approaches:
> > > > > > > > > > > >
> > > > > > > > > > > >  * BE guest boots with memory already mapped
> > > > > > > > > > > >
> > > > > > > > > > > >  This would entail the guest OS knowing where in it's
> > Guest
> > > > > > > Physical
> > > > > > > > > > > >  Address space is already taken up and avoiding
> > clashing. I
> > > > > > > would assume
> > > > > > > > > > > >  in this case you would want a standard interface to
> > > > > userspace
> > > > > > > to then
> > > > > > > > > > > >  make that address space visible to the backend daemon.
> > > > > > > > > >
> > > > > > > > > > Yet another way here is that we would have well known
> > "shared
> > > > > > > memory" between
> > > > > > > > > > VMs. I think that Jailhouse's ivshmem gives us good
> > insights on
> > > > > this
> > > > > > > matter
> > > > > > > > > > and that it can even be an alternative for hypervisor-
> > agnostic
> > > > > > > solution.
> > > > > > > > > >
> > > > > > > > > > (Please note memory regions in ivshmem appear as a PCI
> > device
> > > > > and
> > > > > > > can be
> > > > > > > > > > mapped locally.)
> > > > > > > > > >
> > > > > > > > > > I want to add this shared memory aspect to my virtio-proxy,
> > but
> > > > > > > > > > the resultant solution would eventually look similar to
> > ivshmem.
> > > > > > > > > >
> > > > > > > > > > > >  * BE guests boots with a hypervisor handle to memory
> > > > > > > > > > > >
> > > > > > > > > > > >  The BE guest is then free to map the FE's memory to
> > where
> > > > > it
> > > > > > > wants in
> > > > > > > > > > > >  the BE's guest physical address space.
> > > > > > > > > > >
> > > > > > > > > > > I cannot see how this could work for Xen. There is no
> > "handle"
> > > > > to
> > > > > > > give
> > > > > > > > > > > to the backend if the backend is not running in dom0. So
> > for
> > > > > Xen I
> > > > > > > think
> > > > > > > > > > > the memory has to be already mapped
> > > > > > > > > >
> > > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following
> > information
> > > > > is
> > > > > > > expected
> > > > > > > > > > to be exposed to BE via Xenstore:
> > > > > > > > > > (I know that this is a tentative approach though.)
> > > > > > > > > >    - the start address of configuration space
> > > > > > > > > >    - interrupt number
> > > > > > > > > >    - file path for backing storage
> > > > > > > > > >    - read-only flag
> > > > > > > > > > And the BE server have to call a particular hypervisor
> > interface
> > > > > to
> > > > > > > > > > map the configuration space.
> > > > > > > > >
> > > > > > > > > Yes, Xenstore was chosen as a simple way to pass
> > configuration
> > > > > info to
> > > > > > > the backend running in a non-toolstack domain.
> > > > > > > > > I remember, there was a wish to avoid using Xenstore in
> > Virtio
> > > > > backend
> > > > > > > itself if possible, so for non-toolstack domain, this could done
> > with
> > > > > > > adjusting devd (daemon that listens for devices and launches
> > backends)
> > > > > > > > > to read backend configuration from the Xenstore anyway and
> > pass it
> > > > > to
> > > > > > > the backend via command line arguments.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes, in current PoC code we're using xenstore to pass device
> > > > > > > configuration.
> > > > > > > > We also designed a static device configuration parse method
> > for
> > > > > Dom0less
> > > > > > > or
> > > > > > > > other scenarios don't have xentool. yes, it's from device
> > model
> > > > > command
> > > > > > > line
> > > > > > > > or a config file.
> > > > > > > >
> > > > > > > > > But, if ...
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > In my approach (virtio-proxy), all those Xen (or
> > hypervisor)-
> > > > > > > specific
> > > > > > > > > > stuffs are contained in virtio-proxy, yet another VM, to
> > hide
> > > > > all
> > > > > > > details.
> > > > > > > > >
> > > > > > > > > ... the solution how to overcome that is already found and
> > proven
> > > > > to
> > > > > > > work then even better.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > # My point is that a "handle" is not mandatory for
> > executing
> > > > > mapping.
> > > > > > > > > >
> > > > > > > > > > > and the mapping probably done by the
> > > > > > > > > > > toolstack (also see below.) Or we would have to invent a
> > new
> > > > > Xen
> > > > > > > > > > > hypervisor interface and Xen virtual machine privileges
> > to
> > > > > allow
> > > > > > > this
> > > > > > > > > > > kind of mapping.
> > > > > > > > > >
> > > > > > > > > > > If we run the backend in Dom0 that we have no problems
> > of
> > > > > course.
> > > > > > > > > >
> > > > > > > > > > One of difficulties on Xen that I found in my approach is
> > that
> > > > > > > calling
> > > > > > > > > > such hypervisor intefaces (registering IOREQ, mapping
> > memory) is
> > > > > > > only
> > > > > > > > > > allowed on BE servers themselvies and so we will have to
> > extend
> > > > > > > those
> > > > > > > > > > interfaces.
> > > > > > > > > > This, however, will raise some concern on security and
> > privilege
> > > > > > > distribution
> > > > > > > > > > as Stefan suggested.
> > > > > > > > >
> > > > > > > > > We also faced policy related issues with Virtio backend
> > running in
> > > > > > > other than Dom0 domain in a "dummy" xsm mode. In our target
> > system we
> > > > > run
> > > > > > > the backend in a driver
> > > > > > > > > domain (we call it DomD) where the underlying H/W resides.
> > We
> > > > > trust it,
> > > > > > > so we wrote policy rules (to be used in "flask" xsm mode) to
> > provide
> > > > > it
> > > > > > > with a little bit more privileges than a simple DomU had.
> > > > > > > > > Now it is permitted to issue device-model, resource and
> > memory
> > > > > > > mappings, etc calls.
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > To activate the mapping will
> > > > > > > > > > > >  require some sort of hypercall to the hypervisor. I
> > can see
> > > > > two
> > > > > > > options
> > > > > > > > > > > >  at this point:
> > > > > > > > > > > >
> > > > > > > > > > > >   - expose the handle to userspace for daemon/helper
> > to
> > > > > trigger
> > > > > > > the
> > > > > > > > > > > >     mapping via existing hypercall interfaces. If
> > using a
> > > > > helper
> > > > > > > you
> > > > > > > > > > > >     would have a hypervisor specific one to avoid the
> > daemon
> > > > > > > having to
> > > > > > > > > > > >     care too much about the details or push that
> > complexity
> > > > > into
> > > > > > > a
> > > > > > > > > > > >     compile time option for the daemon which would
> > result in
> > > > > > > different
> > > > > > > > > > > >     binaries although a common source base.
> > > > > > > > > > > >
> > > > > > > > > > > >   - expose a new kernel ABI to abstract the hypercall
> > > > > > > differences away
> > > > > > > > > > > >     in the guest kernel. In this case the userspace
> > would
> > > > > > > essentially
> > > > > > > > > > > >     ask for an abstract "map guest N memory to
> > userspace
> > > > > ptr"
> > > > > > > and let
> > > > > > > > > > > >     the kernel deal with the different hypercall
> > interfaces.
> > > > > > > This of
> > > > > > > > > > > >     course assumes the majority of BE guests would be
> > Linux
> > > > > > > kernels and
> > > > > > > > > > > >     leaves the bare-metal/unikernel approaches to
> > their own
> > > > > > > devices.
> > > > > > > > > > > >
> > > > > > > > > > > > Operation
> > > > > > > > > > > > =========
> > > > > > > > > > > >
> > > > > > > > > > > > The core of the operation of VirtIO is fairly simple.
> > Once
> > > > > the
> > > > > > > > > > > > vhost-user feature negotiation is done it's a case of
> > > > > receiving
> > > > > > > update
> > > > > > > > > > > > events and parsing the resultant virt queue for data.
> > The
> > > > > vhost-
> > > > > > > user
> > > > > > > > > > > > specification handles a bunch of setup before that
> > point,
> > > > > mostly
> > > > > > > to
> > > > > > > > > > > > detail where the virt queues are set up FD's for
> > memory and
> > > > > > > event
> > > > > > > > > > > > communication. This is where the envisioned stub
> > process
> > > > > would
> > > > > > > be
> > > > > > > > > > > > responsible for getting the daemon up and ready to run.
> > This
> > > > > is
> > > > > > > > > > > > currently done inside a big VMM like QEMU but I
> > suspect a
> > > > > modern
> > > > > > > > > > > > approach would be to use the rust-vmm vhost crate. It
> > would
> > > > > then
> > > > > > > either
> > > > > > > > > > > > communicate with the kernel's abstracted ABI or be re-
> > > > > targeted
> > > > > > > as a
> > > > > > > > > > > > build option for the various hypervisors.
> > > > > > > > > > >
> > > > > > > > > > > One thing I mentioned before to Alex is that Xen doesn't
> > have
> > > > > VMMs
> > > > > > > the
> > > > > > > > > > > way they are typically envisioned and described in other
> > > > > > > environments.
> > > > > > > > > > > Instead, Xen has IOREQ servers. Each of them connects
> > > > > > > independently to
> > > > > > > > > > > Xen via the IOREQ interface. E.g. today multiple QEMUs
> > could
> > > > > be
> > > > > > > used as
> > > > > > > > > > > emulators for a single Xen VM, each of them connecting
> > to Xen
> > > > > > > > > > > independently via the IOREQ interface.
> > > > > > > > > > >
> > > > > > > > > > > The component responsible for starting a daemon and/or
> > setting
> > > > > up
> > > > > > > shared
> > > > > > > > > > > interfaces is the toolstack: the xl command and the
> > > > > libxl/libxc
> > > > > > > > > > > libraries.
> > > > > > > > > >
> > > > > > > > > > I think that VM configuration management (or orchestration
> > in
> > > > > > > Startos
> > > > > > > > > > jargon?) is a subject to debate in parallel.
> > > > > > > > > > Otherwise, is there any good assumption to avoid it right
> > now?
> > > > > > > > > >
> > > > > > > > > > > Oleksandr and others I CCed have been working on ways
> > for the
> > > > > > > toolstack
> > > > > > > > > > > to create virtio backends and setup memory mappings.
> > They
> > > > > might be
> > > > > > > able
> > > > > > > > > > > to provide more info on the subject. I do think we miss
> > a way
> > > > > to
> > > > > > > provide
> > > > > > > > > > > the configuration to the backend and anything else that
> > the
> > > > > > > backend
> > > > > > > > > > > might require to start doing its job.
> > > > > > > > >
> > > > > > > > > Yes, some work has been done for the toolstack to handle
> > Virtio
> > > > > MMIO
> > > > > > > devices in
> > > > > > > > > general and Virtio block devices in particular. However, it
> > has
> > > > > not
> > > > > > > been upstreaned yet.
> > > > > > > > > Updated patches on review now:
> > > > > > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-
> > send-
> > > > > email-
> > > > > > > olekstysh@gmail.com/
> > > > > > > > >
> > > > > > > > > There is an additional (also important) activity to
> > improve/fix
> > > > > > > foreign memory mapping on Arm which I am also involved in.
> > > > > > > > > The foreign memory mapping is proposed to be used for Virtio
> > > > > backends
> > > > > > > (device emulators) if there is a need to run guest OS completely
> > > > > > > unmodified.
> > > > > > > > > Of course, the more secure way would be to use grant memory
> > > > > mapping.
> > > > > > > Brietly, the main difference between them is that with foreign
> > mapping
> > > > > the
> > > > > > > backend
> > > > > > > > > can map any guest memory it wants to map, but with grant
> > mapping
> > > > > it is
> > > > > > > allowed to map only what was previously granted by the frontend.
> > > > > > > > >
> > > > > > > > > So, there might be a problem if we want to pre-map some
> > guest
> > > > > memory
> > > > > > > in advance or to cache mappings in the backend in order to
> > improve
> > > > > > > performance (because the mapping/unmapping guest pages every
> > request
> > > > > > > requires a lot of back and forth to Xen + P2M updates). In a
> > nutshell,
> > > > > > > currently, in order to map a guest page into the backend address
> > space
> > > > > we
> > > > > > > need to steal a real physical page from the backend domain. So,
> > with
> > > > > the
> > > > > > > said optimizations we might end up with no free memory in the
> > backend
> > > > > > > domain (see XSA-300). And what we try to achieve is to not waste
> > a
> > > > > real
> > > > > > > domain memory at all by providing safe non-allocated-yet (so
> > unused)
> > > > > > > address space for the foreign (and grant) pages to be mapped
> > into,
> > > > > this
> > > > > > > enabling work implies Xen and Linux (and likely DTB bindings)
> > changes.
> > > > > > > However, as it turned out, for this to work in a proper and safe
> > way
> > > > > some
> > > > > > > prereq work needs to be done.
> > > > > > > > > You can find the related Xen discussion at:
> > > > > > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-
> > send-
> > > > > email-
> > > > > > > olekstysh@gmail.com/
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > One question is how to best handle notification and
> > kicks.
> > > > > The
> > > > > > > existing
> > > > > > > > > > > > vhost-user framework uses eventfd to signal the daemon
> > > > > (although
> > > > > > > QEMU
> > > > > > > > > > > > is quite capable of simulating them when you use TCG).
> > Xen
> > > > > has
> > > > > > > it's own
> > > > > > > > > > > > IOREQ mechanism. However latency is an important
> > factor and
> > > > > > > having
> > > > > > > > > > > > events go through the stub would add quite a lot.
> > > > > > > > > > >
> > > > > > > > > > > Yeah I think, regardless of anything else, we want the
> > > > > backends to
> > > > > > > > > > > connect directly to the Xen hypervisor.
> > > > > > > > > >
> > > > > > > > > > In my approach,
> > > > > > > > > >  a) BE -> FE: interrupts triggered by BE calling a
> > hypervisor
> > > > > > > interface
> > > > > > > > > >               via virtio-proxy
> > > > > > > > > >  b) FE -> BE: MMIO to config raises events (in event
> > channels),
> > > > > > > which is
> > > > > > > > > >               converted to a callback to BE via virtio-
> > proxy
> > > > > > > > > >               (Xen's event channel is internnally
> > implemented by
> > > > > > > interrupts.)
> > > > > > > > > >
> > > > > > > > > > I don't know what "connect directly" means here, but
> > sending
> > > > > > > interrupts
> > > > > > > > > > to the opposite side would be best efficient.
> > > > > > > > > > Ivshmem, I suppose, takes this approach by utilizing PCI's
> > msi-x
> > > > > > > mechanism.
> > > > > > > > >
> > > > > > > > > Agree that MSI would be more efficient than SPI...
> > > > > > > > > At the moment, in order to notify the frontend, the backend
> > issues
> > > > > a
> > > > > > > specific device-model call to query Xen to inject a
> > corresponding SPI
> > > > > to
> > > > > > > the guest.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Could we consider the kernel internally converting
> > IOREQ
> > > > > > > messages from
> > > > > > > > > > > > the Xen hypervisor to eventfd events? Would this scale
> > with
> > > > > > > other kernel
> > > > > > > > > > > > hypercall interfaces?
> > > > > > > > > > > >
> > > > > > > > > > > > So any thoughts on what directions are worth
> > experimenting
> > > > > with?
> > > > > > > > > > >
> > > > > > > > > > > One option we should consider is for each backend to
> > connect
> > > > > to
> > > > > > > Xen via
> > > > > > > > > > > the IOREQ interface. We could generalize the IOREQ
> > interface
> > > > > and
> > > > > > > make it
> > > > > > > > > > > hypervisor agnostic. The interface is really trivial and
> > easy
> > > > > to
> > > > > > > add.
> > > > > > > > > >
> > > > > > > > > > As I said above, my proposal does the same thing that you
> > > > > mentioned
> > > > > > > here :)
> > > > > > > > > > The difference is that I do call hypervisor interfaces via
> > > > > virtio-
> > > > > > > proxy.
> > > > > > > > > >
> > > > > > > > > > > The only Xen-specific part is the notification mechanism,
> > > > > which is
> > > > > > > an
> > > > > > > > > > > event channel. If we replaced the event channel with
> > something
> > > > > > > else the
> > > > > > > > > > > interface would be generic. See:
> > > > > > > > > > > https://gitlab.com/xen-project/xen/-
> > > > > > > /blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > > > > > > > >
> > > > > > > > > > > I don't think that translating IOREQs to eventfd in the
> > kernel
> > > > > is
> > > > > > > a
> > > > > > > > > > > good idea: if feels like it would be extra complexity
> > and that
> > > > > the
> > > > > > > > > > > kernel shouldn't be involved as this is a backend-
> > hypervisor
> > > > > > > interface.
> > > > > > > > > >
> > > > > > > > > > Given that we may want to implement BE as a bare-metal
> > > > > application
> > > > > > > > > > as I did on Zephyr, I don't think that the translation
> > would not
> > > > > be
> > > > > > > > > > a big issue, especially on RTOS's.
> > > > > > > > > > It will be some kind of abstraction layer of interrupt
> > handling
> > > > > > > > > > (or nothing but a callback mechanism).
> > > > > > > > > >
> > > > > > > > > > > Also, eventfd is very Linux-centric and we are trying to
> > > > > design an
> > > > > > > > > > > interface that could work well for RTOSes too. If we
> > want to
> > > > > do
> > > > > > > > > > > something different, both OS-agnostic and hypervisor-
> > agnostic,
> > > > > > > perhaps
> > > > > > > > > > > we could design a new interface. One that could be
> > > > > implementable
> > > > > > > in the
> > > > > > > > > > > Xen hypervisor itself (like IOREQ) and of course any
> > other
> > > > > > > hypervisor
> > > > > > > > > > > too.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > There is also another problem. IOREQ is probably not be
> > the
> > > > > only
> > > > > > > > > > > interface needed. Have a look at
> > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2.
> > Don't we
> > > > > > > also need
> > > > > > > > > > > an interface for the backend to inject interrupts into
> > the
> > > > > > > frontend? And
> > > > > > > > > > > if the backend requires dynamic memory mappings of
> > frontend
> > > > > pages,
> > > > > > > then
> > > > > > > > > > > we would also need an interface to map/unmap domU pages.
> > > > > > > > > >
> > > > > > > > > > My proposal document might help here; All the interfaces
> > > > > required
> > > > > > > for
> > > > > > > > > > virtio-proxy (or hypervisor-related interfaces) are listed
> > as
> > > > > > > > > > RPC protocols :)
> > > > > > > > > >
> > > > > > > > > > > These interfaces are a lot more problematic than IOREQ:
> > IOREQ
> > > > > is
> > > > > > > tiny
> > > > > > > > > > > and self-contained. It is easy to add anywhere. A new
> > > > > interface to
> > > > > > > > > > > inject interrupts or map pages is more difficult to
> > manage
> > > > > because
> > > > > > > it
> > > > > > > > > > > would require changes scattered across the various
> > emulators.
> > > > > > > > > >
> > > > > > > > > > Exactly. I have no confident yet that my approach will
> > also
> > > > > apply
> > > > > > > > > > to other hypervisors than Xen.
> > > > > > > > > > Technically, yes, but whether people can accept it or not
> > is a
> > > > > > > different
> > > > > > > > > > matter.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > -Takahiro Akashi
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Oleksandr Tyshchenko


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [virtio-dev] Enabling hypervisor agnosticism for VirtIO backends
  2021-08-19  9:11 ` [virtio-dev] " Matias Ezequiel Vara Larsen
       [not found]   ` <20210820060558.GB13452@laputa>
@ 2021-09-01  8:43   ` Alex Bennée
  1 sibling, 0 replies; 66+ messages in thread
From: Alex Bennée @ 2021-09-01  8:43 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	AKASHI Takahiro, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier


Matias Ezequiel Vara Larsen <matiasevara@gmail.com> writes:

> Hello Alex,
>
> I can tell you my experience from working on a PoC (library) 
> to allow the implementation of virtio-devices that are hypervisor/OS agnostic. 
> I focused on two use cases:
> 1. type-I hypervisor in which the backend is running as a VM. This
> is an in-house hypervisor that does not support VMExits.
> 2. Linux user-space. In this case, the library is just used to
> communicate threads. The goal of this use case is merely testing.
>
> I have chosen virtio-mmio as the way to exchange information
> between the frontend and backend. I found it hard to synchronize the
> access to the virtio-mmio layout without VMExits. I had to add some extra bits to allow 
> the front-end and back-end to synchronize, which is required
> during the device-status initialization. These extra bits would not be 
> needed in case the hypervisor supports VMExits, e.g., KVM.

The support for a vmexit seems rather fundamental to type-2 hypervisors
(like KVM) as the VMM is intrinsically linked to a vCPU's run loop. This
makes handling a condition like a bit of MMIO fairly natural to
implement. For type-1 cases the line of execution between "guest
accesses MMIO" and "something services that request" is a little
trickier to pin down. Ultimately at that point you are relying on the
hypervisor itself to make the scheduling decision to stop executing the
guest and allow the backend to do its thing. We don't really want to
expose the exact details of that as it probably varies a lot between
hypervisors. However, would a backend API semantic that expresses:

  - guest has done some MMIO
  - hypervisor has stopped execution of guest
  - guest will be restarted when response conditions are set by backend

cover the needs of a virtio backend and could the userspace facing
portion of that be agnostic?
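
To make that concrete, a minimal sketch of the shape the userspace
facing portion could take (every name below is hypothetical; this is
not an existing crate or hypervisor interface):

  // Minimal sketch; it expresses exactly the three conditions above as
  // a blocking request/complete channel that the backend drives.

  /// One trapped MMIO access from the frontend guest. While the backend
  /// holds this, the vCPU that issued the access remains stopped.
  pub struct MmioAccess {
      pub addr: u64,      // guest-physical address touched
      pub len: usize,     // access width in bytes
      pub is_write: bool,
      pub data: u64,      // write payload, meaningful only if is_write
  }

  /// "Guest did MMIO, guest is stopped, guest restarts once the backend
  /// sets the response", with the hypervisor details hidden behind it.
  pub trait MmioBackendChannel {
      type Error;

      /// Block until the hypervisor reports a trapped access.
      fn wait_access(&mut self) -> Result<MmioAccess, Self::Error>;

      /// Post the result (read value for reads, ignored for writes) and
      /// ask the hypervisor to resume the stopped vCPU.
      fn complete(&mut self, read_value: u64) -> Result<(), Self::Error>;
  }

The hypervisor-specific part (IOREQ, a vmexit in a vCPU run loop, or a
type-1 scheduling decision) would then live entirely behind the
implementation of that trait.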

>
> Each guest has a memory region that is shared with the backend. 
> This memory region is used by the frontend to allocate the io-buffers. This region also 
> maps the virtio-mmio layout that is initialized by the backend. For the moment, this region 
> is defined when the guest is created. One limitation is that the memory for io-buffers is fixed. 
> At some point, the guest shall be able to balloon this region. Notifications between 
> the frontend and the backend are implemented by using an hypercall. The hypercall 
> mechanism and the memory allocation are abstracted away by a platform layer that 
> exposes an interface that is hypervisor/os agnostic.
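
As an aside, a platform layer like the one described above could be as
small as the following sketch (purely hypothetical names; this is not
the actual interface of the PoC):

  // Hypothetical sketch only (not the PoC's actual interface) of a
  // platform layer hiding the hypercall-based notification and the
  // preallocated shared region behind an OS/hypervisor-agnostic trait.

  /// Region shared between frontend and backend, holding the virtio-mmio
  /// layout and the io-buffers; fixed at guest creation time.
  pub struct SharedRegion {
      pub base: *mut u8, // backend-local mapping of the region
      pub size: usize,
  }

  pub trait PlatformLayer {
      type Error;

      /// Return the shared region set up when the guest was created.
      fn shared_region(&self) -> Result<SharedRegion, Self::Error>;

      /// Kick the peer via the platform's hypercall (or whatever the
      /// thread-to-thread equivalent is in the user-space test case).
      fn notify_peer(&self) -> Result<(), Self::Error>;

      /// Block until the peer kicks us.
      fn wait_notification(&self) -> Result<(), Self::Error>;
  }

Everything hypervisor- or OS-specific (which hypercall is used for the
kick, how the region was wired up at guest creation) would sit behind
an implementation of that trait.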
>
> I split the backend into a virtio-device driver and a
> backend driver. The virtio-device driver is the virtqueues and the
> backend driver gets packets from the virtqueue for
> post-processing. For example, in the case of virtio-net, the backend
> driver would decide if the packet goes to the hardware or to another
> virtio-net device. The virtio-device drivers may be
> implemented in different ways like by using a single thread, multiple threads, 
> or one thread for all the virtio-devices.
>
> In this PoC, I just tackled two very simple use-cases. These
> use-cases allowed me to extract some requirements for an hypervisor to
> support virtio.
>
> Matias
<snip>

-- 
Alex Bennée

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
  2021-08-13  5:10       ` AKASHI Takahiro
@ 2021-09-01  8:57           ` Alex Bennée
  0 siblings, 0 replies; 66+ messages in thread
From: Alex Bennée @ 2021-09-01  8:57 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: François Ozog, Stefano Stabellini, paul, Stratos Mailing List,
	virtio-dev, Jan Kiszka, Arnd Bergmann, jgross, julien,
	Carl van Schaik, Bertrand.Marquis, stefanha, Artem_Mygaiev,
	xen-devel, olekstysh, Oleksandr_Tyshchenko, xen-devel


AKASHI Takahiro <takahiro.akashi@linaro.org> writes:

> Hi François,
>
> On Thu, Aug 12, 2021 at 09:55:52AM +0200, François Ozog wrote:
>> I top post as I find it difficult to identify where to make the comments.
>
> Thank you for the posting. 
> I think that we should first discuss more about the goal/requirements/
> practical use cases for the framework.
>
>> 1) BE acceleration
>> Network and storage backends may actually be executed in SmartNICs. As
>> virtio 1.1 is hardware friendly, there may be SmartNICs with virtio 1.1 PCI
>> VFs. Is it a valid use case for the generic BE framework to be used in this
>> context?
>> DPDK is used in some BE to significantly accelerate switching. DPDK is also
>> used sometimes in guests. In that case, there are no event injection but
>> just high performance memory scheme. Is this considered as a use case?
>
> I'm not quite familiar with DPDK but it seems to be heavily reliant
> on not only virtqueues but also kvm/linux features/functionality, say,
> according to [1].
> I'm afraid that DPDK is not suitable for primary (at least, initial)
> target use.
> # In my proposal, virtio-proxy, I have in mind the assumption that we would
> # create BE VM as a baremetal application on RTOS (and/or unikernel.)
>
> But as far as virtqueue is concerned, I think we can discuss in general
> technical details as Alex suggested, including:
> - sharing or mapping memory regions for data payload
> - efficient notification mechanism
>
> [1] https://www.redhat.com/en/blog/journey-vhost-users-realm
>
>> 2) Virtio as OS HAL
>> Panasonic CTO has been calling for a virtio based HAL and based on the
>> teachings of Google GKI, an internal HAL seem inevitable in the long term.
>> Virtio is then a contender to Google promoted Android HAL. Could the
>> framework be used in that context?
>
> In this case, where will the implementation of "HAL" reside?
> I don't think the portability of "HAL" code (as a set of virtio BEs)
> is a requirement here.

When I hear people referring to VirtIO HALs I think mainly of VirtIO
FEs living in a Linux kernel. There are certainly more devices that
could be added, but the commonality on the guest side is, I think,
pretty much a solved problem (modulo Linux-isms creeping into the
virtio spec).

-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 66+ messages in thread


* RE: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-31  6:18                       ` AKASHI Takahiro
@ 2021-09-01 11:12                         ` Wei Chen
  2021-09-01 12:29                           ` AKASHI Takahiro
  0 siblings, 1 reply; 66+ messages in thread
From: Wei Chen @ 2021-09-01 11:12 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Alex Bennée, Kaly Xin,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant, nd,
	Xen Devel

Hi Akashi,

> -----Original Message-----
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> Sent: 2021年8月31日 14:18
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>; Kaly Xin
> <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-lists.linaro.org>;
> virtio-dev@lists.oasis-open.org; Arnd Bergmann <arnd.bergmann@linaro.org>;
> Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> 
> Wei,
> 
> On Thu, Aug 26, 2021 at 12:10:19PM +0000, Wei Chen wrote:
> > Hi Akashi,
> >
> > > -----Original Message-----
> > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > Sent: 2021年8月26日 17:41
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>; Kaly
> Xin
> > > <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-
> lists.linaro.org>;
> > > virtio-dev@lists.oasis-open.org; Arnd Bergmann
> <arnd.bergmann@linaro.org>;
> > > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> Julien
> > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > >
> > > Hi Wei,
> > >
> > > On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
> > > > On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
> > > > > Hi Akashi,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > > Sent: 2021年8月18日 13:39
> > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> Stabellini
> > > > > > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>;
> > > Stratos
> > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > > dev@lists.oasis-
> > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > > <cvanscha@qti.qualcomm.com>;
> > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> > > Jean-
> > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> <Artem_Mygaiev@epam.com>;
> > > Julien
> > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> > > Durrant
> > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > > >
> > > > > > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
> > > > > > > Hi Akashi,
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > > > > Sent: 2021年8月17日 16:08
> > > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> > > Stabellini
> > > > > > > > <sstabellini@kernel.org>; Alex Benn??e
> <alex.bennee@linaro.org>;
> > > > > > Stratos
> > > > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > > > > > dev@lists.oasis-
> > > > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh
> Kumar
> > > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
> Kiszka
> > > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > > <cvanscha@qti.qualcomm.com>;
> > > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri
> <vatsa@codeaurora.org>;
> > > Jean-
> > > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > > <Artem_Mygaiev@epam.com>;
> > > > > > Julien
> > > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>;
> Paul
> > > Durrant
> > > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> backends
> > > > > > > >
> > > > > > > > Hi Wei, Oleksandr,
> > > > > > > >
> > > > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> > > > > > > > > Hi All,
> > > > > > > > >
> > > > > > > > > Thanks for Stefano to link my kvmtool for Xen proposal
> here.
> > > > > > > > > This proposal is still discussing in Xen and KVM
> communities.
> > > > > > > > > The main work is to decouple the kvmtool from KVM and make
> > > > > > > > > other hypervisors can reuse the virtual device
> implementations.
> > > > > > > > >
> > > > > > > > > In this case, we need to introduce an intermediate
> hypervisor
> > > > > > > > > layer for VMM abstraction, Which is, I think it's very
> close
> > > > > > > > > to stratos' virtio hypervisor agnosticism work.
> > > > > > > >
> > > > > > > > # My proposal[1] comes from my own idea and doesn't always
> > > represent
> > > > > > > > # Linaro's view on this subject nor reflect Alex's concerns.
> > > > > > Nevertheless,
> > > > > > > >
> > > > > > > > Your idea and my proposal seem to share the same background.
> > > > > > > > Both have the similar goal and currently start with, at
> first,
> > > Xen
> > > > > > > > and are based on kvm-tool. (Actually, my work is derived
> from
> > > > > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
> > > > > > > >
> > > > > > > > In particular, the abstraction of hypervisor interfaces has
> a
> > > same
> > > > > > > > set of interfaces (for your "struct vmm_impl" and my "RPC
> > > interfaces").
> > > > > > > > This is not co-incident as we both share the same origin as
> I
> > > said
> > > > > > above.
> > > > > > > > And so we will also share the same issues. One of them is a
> way
> > > of
> > > > > > > > "sharing/mapping FE's memory". There is some trade-off
> between
> > > > > > > > the portability and the performance impact.
> > > > > > > > So we can discuss the topic here in this ML, too.
> > > > > > > > (See Alex's original email, too).
> > > > > > > >
> > > > > > > Yes, I agree.
> > > > > > >
> > > > > > > > On the other hand, my approach aims to create a "single-
> binary"
> > > > > > solution
> > > > > > > > in which the same binary of BE vm could run on any
> hypervisors.
> > > > > > > > Somehow similar to your "proposal-#2" in [2], but in my
> solution,
> > > all
> > > > > > > > the hypervisor-specific code would be put into another
> entity
> > > (VM),
> > > > > > > > named "virtio-proxy" and the abstracted operations are
> served
> > > via RPC.
> > > > > > > > (In this sense, BE is hypervisor-agnostic but might have OS
> > > > > > dependency.)
> > > > > > > > But I know that we need discuss if this is a requirement
> even
> > > > > > > > in Stratos project or not. (Maybe not)
> > > > > > > >
> > > > > > >
> > > > > > > Sorry, I haven't had time to finish reading your virtio-proxy
> > > completely
> > > > > > > (I will do it ASAP). But from your description, it seems we
> need a
> > > > > > > 3rd VM between FE and BE? My concern is that, if my assumption
> is
> > > right,
> > > > > > > will it increase the latency in data transport path? Even if
> we're
> > > > > > > using some lightweight guest like RTOS or Unikernel,
> > > > > >
> > > > > > Yes, you're right. But I'm afraid that it is a matter of degree.
> > > > > > As far as we execute 'mapping' operations at every fetch of
> payload,
> > > > > > we will see latency issue (even in your case) and if we have
> some
> > > solution
> > > > > > for it, we won't see it neither in my proposal :)
> > > > > >
> > > > >
> > > > > Oleksandr has sent a proposal to Xen mailing list to reduce this
> kind
> > > > > of "mapping/unmapping" operations. So the latency caused by this
> > > behavior
> > > > > on Xen may eventually be eliminated, and Linux-KVM doesn't have
> that
> > > problem.
> > > >
> > > > Obviously, I have not yet caught up there in the discussion.
> > > > Which patch specifically?
> > >
> > > Can you give me the link to the discussion or patch, please?
> > >
> >
> > It's a RFC discussion. We have tested this RFC patch internally.
> > https://lists.xenproject.org/archives/html/xen-devel/2021-
> 07/msg01532.html
> 
> I'm afraid that I miss something here, but I don't know
> why this proposed API will lead to eliminating 'mmap' in accessing
> the queued payload at every request?
> 

This API gives the Xen device model (QEMU or kvmtool) the ability to map
the whole of guest RAM into the device model's address space. In that
case the device model doesn't need a dynamic hypercall to map/unmap the
payload memory for every request. It can use a flat offset to access
payload memory in its address space directly, just like the KVM device
model does today.

Before this API, mapping the whole guest memory into the device model
would severely consume the physical pages of Dom-0/Dom-D.
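
As an illustration only (this is not the RFC API itself), a minimal C
sketch of the flat-offset access path, assuming the whole of guest RAM
has already been mapped once into the device model, could look like
this; the structure and helper names are hypothetical:

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>

  /* One contiguous pre-mapped view of guest RAM in the backend. */
  struct guest_ram_mapping {
      uint64_t gpa_base;   /* guest physical base of the mapped region */
      size_t   size;       /* size of the mapped region */
      uint8_t *va_base;    /* where the region is mapped in the backend */
  };

  /* Translate a guest physical address to a backend virtual address. */
  static void *gpa_to_va(const struct guest_ram_mapping *m,
                         uint64_t gpa, size_t len)
  {
      if (gpa < m->gpa_base || gpa + len > m->gpa_base + m->size)
          return NULL;                 /* outside the mapped window */
      return m->va_base + (gpa - m->gpa_base);
  }

  /* Copy a virtqueue payload out of guest memory via the flat offset. */
  static int read_payload(const struct guest_ram_mapping *m,
                          uint64_t payload_gpa, void *dst, size_t len)
  {
      void *src = gpa_to_va(m, payload_gpa, len);
      if (!src)
          return -1;
      memcpy(dst, src, len);
      return 0;
  }

In the per-request model, each read_payload() would instead be bracketed
by a map hypercall and an unmap hypercall, which is where the latency
discussed above comes from.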

> -Takahiro Akashi
> 
> 
> > > Thanks,
> > > -Takahiro Akashi
> > >
> > > > -Takahiro Akashi
> > > >
> > > > > > > > Specifically speaking about kvm-tool, I have a concern about
> its
> > > > > > > > license term; Targeting different hypervisors and different
> OSs
> > > > > > > > (which I assume includes RTOS's), the resultant library
> should
> > > be
> > > > > > > > license permissive and GPL for kvm-tool might be an issue.
> > > > > > > > Any thoughts?
> > > > > > > >
> > > > > > >
> > > > > > > Yes. If user want to implement a FreeBSD device model, but the
> > > virtio
> > > > > > > library is GPL. Then GPL would be a problem. If we have
> another
> > > good
> > > > > > > candidate, I am open to it.
> > > > > >
> > > > > > I have some candidates, particularly for vq/vring, in my mind:
> > > > > > * Open-AMP, or
> > > > > > * corresponding Free-BSD code
> > > > > >
> > > > >
> > > > > Interesting, I will look into them : )
> > > > >
> > > > > Cheers,
> > > > > Wei Chen
> > > > >
> > > > > > -Takahiro Akashi
> > > > > >
> > > > > >
> > > > > > > > -Takahiro Akashi
> > > > > > > >
> > > > > > > >
> > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > > > > > > > August/000548.html
> > > > > > > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > > > > > > > > > Sent: 2021年8月14日 23:38
> > > > > > > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>;
> Stefano
> > > > > > Stabellini
> > > > > > > > <sstabellini@kernel.org>
> > > > > > > > > > Cc: Alex Benn??e <alex.bennee@linaro.org>; Stratos
> Mailing
> > > List
> > > > > > > > <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
> > > open.org;
> > > > > > Arnd
> > > > > > > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
> Kiszka
> > > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > > <cvanscha@qti.qualcomm.com>;
> > > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri
> <vatsa@codeaurora.org>;
> > > Jean-
> > > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > > > <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>;
> > > Oleksandr
> > > > > > > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > > <Artem_Mygaiev@epam.com>;
> > > > > > Julien
> > > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>;
> Paul
> > > Durrant
> > > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> > > backends
> > > > > > > > > >
> > > > > > > > > > Hello, all.
> > > > > > > > > >
> > > > > > > > > > Please see some comments below. And sorry for the
> possible
> > > format
> > > > > > > > issues.
> > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> > > > > > > > <mailto:takahiro.akashi@linaro.org> wrote:
> > > > > > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano
> > > Stabellini
> > > > > > wrote:
> > > > > > > > > > > > CCing people working on Xen+VirtIO and IOREQs. Not
> > > trimming
> > > > > > the
> > > > > > > > original
> > > > > > > > > > > > email to let them read the full context.
> > > > > > > > > > > >
> > > > > > > > > > > > My comments below are related to a potential Xen
> > > > > > implementation,
> > > > > > > > not
> > > > > > > > > > > > because it is the only implementation that matters,
> but
> > > > > > because it
> > > > > > > > is
> > > > > > > > > > > > the one I know best.
> > > > > > > > > > >
> > > > > > > > > > > Please note that my proposal (and hence the working
> > > prototype)[1]
> > > > > > > > > > > is based on Xen's virtio implementation (i.e. IOREQ)
> and
> > > > > > > > particularly
> > > > > > > > > > > EPAM's virtio-disk application (backend server).
> > > > > > > > > > > It has been, I believe, well generalized but is still
> a
> > > bit
> > > > > > biased
> > > > > > > > > > > toward this original design.
> > > > > > > > > > >
> > > > > > > > > > > So I hope you like my approach :)
> > > > > > > > > > >
> > > > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
> > > dev/2021-
> > > > > > > > August/000546.html
> > > > > > > > > > >
> > > > > > > > > > > Let me take this opportunity to explain a bit more
> about
> > > my
> > > > > > approach
> > > > > > > > below.
> > > > > > > > > > >
> > > > > > > > > > > > Also, please see this relevant email thread:
> > > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > One of the goals of Project Stratos is to enable
> > > hypervisor
> > > > > > > > agnostic
> > > > > > > > > > > > > backends so we can enable as much re-use of code
> as
> > > possible
> > > > > > and
> > > > > > > > avoid
> > > > > > > > > > > > > repeating ourselves. This is the flip side of the
> > > front end
> > > > > > > > where
> > > > > > > > > > > > > multiple front-end implementations are required -
> one
> > > per OS,
> > > > > > > > assuming
> > > > > > > > > > > > > you don't just want Linux guests. The resultant
> guests
> > > are
> > > > > > > > trivially
> > > > > > > > > > > > > movable between hypervisors modulo any abstracted
> > > paravirt
> > > > > > type
> > > > > > > > > > > > > interfaces.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In my original thumb nail sketch of a solution I
> > > envisioned
> > > > > > > > vhost-user
> > > > > > > > > > > > > daemons running in a broadly POSIX like
> environment.
> > > The
> > > > > > > > interface to
> > > > > > > > > > > > > the daemon is fairly simple requiring only some
> mapped
> > > > > > memory
> > > > > > > > and some
> > > > > > > > > > > > > sort of signalling for events (on Linux this is
> > > eventfd).
> > > > > > The
> > > > > > > > idea was a
> > > > > > > > > > > > > stub binary would be responsible for any
> hypervisor
> > > specific
> > > > > > > > setup and
> > > > > > > > > > > > > then launch a common binary to deal with the
> actual
> > > > > > virtqueue
> > > > > > > > requests
> > > > > > > > > > > > > themselves.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Since that original sketch we've seen an expansion
> in
> > > the
> > > > > > sort
> > > > > > > > of ways
> > > > > > > > > > > > > backends could be created. There is interest in
> > > > > > encapsulating
> > > > > > > > backends
> > > > > > > > > > > > > in RTOSes or unikernels for solutions like SCMI.
> There
> > > > > > interest
> > > > > > > > in Rust
> > > > > > > > > > > > > has prompted ideas of using the trait interface to
> > > abstract
> > > > > > > > differences
> > > > > > > > > > > > > away as well as the idea of bare-metal Rust
> backends.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We have a card (STR-12) called "Hypercall
> > > Standardisation"
> > > > > > which
> > > > > > > > > > > > > calls for a description of the APIs needed from
> the
> > > > > > hypervisor
> > > > > > > > side to
> > > > > > > > > > > > > support VirtIO guests and their backends. However
> we
> > > are
> > > > > > some
> > > > > > > > way off
> > > > > > > > > > > > > from that at the moment as I think we need to at
> least
> > > > > > > > demonstrate one
> > > > > > > > > > > > > portable backend before we start codifying
> > > requirements. To
> > > > > > that
> > > > > > > > end I
> > > > > > > > > > > > > want to think about what we need for a backend to
> > > function.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Configuration
> > > > > > > > > > > > > =============
> > > > > > > > > > > > >
> > > > > > > > > > > > > In the type-2 setup this is typically fairly
> simple
> > > because
> > > > > > the
> > > > > > > > host
> > > > > > > > > > > > > system can orchestrate the various modules that
> make
> > > up the
> > > > > > > > complete
> > > > > > > > > > > > > system. In the type-1 case (or even type-2 with
> > > delegated
> > > > > > > > service VMs)
> > > > > > > > > > > > > we need some sort of mechanism to inform the
> backend
> > > VM
> > > > > > about
> > > > > > > > key
> > > > > > > > > > > > > details about the system:
> > > > > > > > > > > > >
> > > > > > > > > > > > >   - where virt queue memory is in it's address
> space
> > > > > > > > > > > > >   - how it's going to receive (interrupt) and
> trigger
> > > (kick)
> > > > > > > > events
> > > > > > > > > > > > >   - what (if any) resources the backend needs to
> > > connect to
> > > > > > > > > > > > >
> > > > > > > > > > > > > Obviously you can elide over configuration issues
> by
> > > having
> > > > > > > > static
> > > > > > > > > > > > > configurations and baking the assumptions into
> your
> > > guest
> > > > > > images
> > > > > > > > however
> > > > > > > > > > > > > this isn't scalable in the long term. The obvious
> > > solution
> > > > > > seems
> > > > > > > > to be
> > > > > > > > > > > > > extending a subset of Device Tree data to user
> space
> > > but
> > > > > > perhaps
> > > > > > > > there
> > > > > > > > > > > > > are other approaches?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Before any virtio transactions can take place the
> > > > > > appropriate
> > > > > > > > memory
> > > > > > > > > > > > > mappings need to be made between the FE guest and
> the
> > > BE
> > > > > > guest.
> > > > > > > > > > > >
> > > > > > > > > > > > > Currently the whole of the FE guests address space
> > > needs to
> > > > > > be
> > > > > > > > visible
> > > > > > > > > > > > > to whatever is serving the virtio requests. I can
> > > envision 3
> > > > > > > > approaches:
> > > > > > > > > > > > >
> > > > > > > > > > > > >  * BE guest boots with memory already mapped
> > > > > > > > > > > > >
> > > > > > > > > > > > >  This would entail the guest OS knowing where in
> it's
> > > Guest
> > > > > > > > Physical
> > > > > > > > > > > > >  Address space is already taken up and avoiding
> > > clashing. I
> > > > > > > > would assume
> > > > > > > > > > > > >  in this case you would want a standard interface
> to
> > > > > > userspace
> > > > > > > > to then
> > > > > > > > > > > > >  make that address space visible to the backend
> daemon.
> > > > > > > > > > >
> > > > > > > > > > > Yet another way here is that we would have well known
> > > "shared
> > > > > > > > memory" between
> > > > > > > > > > > VMs. I think that Jailhouse's ivshmem gives us good
> > > insights on
> > > > > > this
> > > > > > > > matter
> > > > > > > > > > > and that it can even be an alternative for hypervisor-
> > > agnostic
> > > > > > > > solution.
> > > > > > > > > > >
> > > > > > > > > > > (Please note memory regions in ivshmem appear as a PCI
> > > device
> > > > > > and
> > > > > > > > can be
> > > > > > > > > > > mapped locally.)
> > > > > > > > > > >
> > > > > > > > > > > I want to add this shared memory aspect to my virtio-
> proxy,
> > > but
> > > > > > > > > > > the resultant solution would eventually look similar
> to
> > > ivshmem.
> > > > > > > > > > >
> > > > > > > > > > > > >  * BE guests boots with a hypervisor handle to
> memory
> > > > > > > > > > > > >
> > > > > > > > > > > > >  The BE guest is then free to map the FE's memory
> to
> > > where
> > > > > > it
> > > > > > > > wants in
> > > > > > > > > > > > >  the BE's guest physical address space.
> > > > > > > > > > > >
> > > > > > > > > > > > I cannot see how this could work for Xen. There is
> no
> > > "handle"
> > > > > > to
> > > > > > > > give
> > > > > > > > > > > > to the backend if the backend is not running in dom0.
> So
> > > for
> > > > > > Xen I
> > > > > > > > think
> > > > > > > > > > > > the memory has to be already mapped
> > > > > > > > > > >
> > > > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following
> > > information
> > > > > > is
> > > > > > > > expected
> > > > > > > > > > > to be exposed to BE via Xenstore:
> > > > > > > > > > > (I know that this is a tentative approach though.)
> > > > > > > > > > >    - the start address of configuration space
> > > > > > > > > > >    - interrupt number
> > > > > > > > > > >    - file path for backing storage
> > > > > > > > > > >    - read-only flag
> > > > > > > > > > > And the BE server have to call a particular hypervisor
> > > interface
> > > > > > to
> > > > > > > > > > > map the configuration space.
> > > > > > > > > >
> > > > > > > > > > Yes, Xenstore was chosen as a simple way to pass
> > > configuration
> > > > > > info to
> > > > > > > > the backend running in a non-toolstack domain.
> > > > > > > > > > I remember, there was a wish to avoid using Xenstore in
> > > Virtio
> > > > > > backend
> > > > > > > > itself if possible, so for non-toolstack domain, this could
> done
> > > with
> > > > > > > > adjusting devd (daemon that listens for devices and launches
> > > backends)
> > > > > > > > > > to read backend configuration from the Xenstore anyway
> and
> > > pass it
> > > > > > to
> > > > > > > > the backend via command line arguments.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes, in current PoC code we're using xenstore to pass
> device
> > > > > > > > configuration.
> > > > > > > > > We also designed a static device configuration parse
> method
> > > for
> > > > > > Dom0less
> > > > > > > > or
> > > > > > > > > other scenarios don't have xentool. yes, it's from device
> > > model
> > > > > > command
> > > > > > > > line
> > > > > > > > > or a config file.
> > > > > > > > >
> > > > > > > > > > But, if ...
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > In my approach (virtio-proxy), all those Xen (or
> > > hypervisor)-
> > > > > > > > specific
> > > > > > > > > > > stuffs are contained in virtio-proxy, yet another VM,
> to
> > > hide
> > > > > > all
> > > > > > > > details.
> > > > > > > > > >
> > > > > > > > > > ... the solution how to overcome that is already found
> and
> > > proven
> > > > > > to
> > > > > > > > work then even better.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > # My point is that a "handle" is not mandatory for
> > > executing
> > > > > > mapping.
> > > > > > > > > > >
> > > > > > > > > > > > and the mapping probably done by the
> > > > > > > > > > > > toolstack (also see below.) Or we would have to
> invent a
> > > new
> > > > > > Xen
> > > > > > > > > > > > hypervisor interface and Xen virtual machine
> privileges
> > > to
> > > > > > allow
> > > > > > > > this
> > > > > > > > > > > > kind of mapping.
> > > > > > > > > > >
> > > > > > > > > > > > If we run the backend in Dom0 that we have no
> problems
> > > of
> > > > > > course.
> > > > > > > > > > >
> > > > > > > > > > > One of difficulties on Xen that I found in my approach
> is
> > > that
> > > > > > > > calling
> > > > > > > > > > > such hypervisor intefaces (registering IOREQ, mapping
> > > memory) is
> > > > > > > > only
> > > > > > > > > > > allowed on BE servers themselvies and so we will have
> to
> > > extend
> > > > > > > > those
> > > > > > > > > > > interfaces.
> > > > > > > > > > > This, however, will raise some concern on security and
> > > privilege
> > > > > > > > distribution
> > > > > > > > > > > as Stefan suggested.
> > > > > > > > > >
> > > > > > > > > > We also faced policy related issues with Virtio backend
> > > running in
> > > > > > > > other than Dom0 domain in a "dummy" xsm mode. In our target
> > > system we
> > > > > > run
> > > > > > > > the backend in a driver
> > > > > > > > > > domain (we call it DomD) where the underlying H/W
> resides.
> > > We
> > > > > > trust it,
> > > > > > > > so we wrote policy rules (to be used in "flask" xsm mode) to
> > > provide
> > > > > > it
> > > > > > > > with a little bit more privileges than a simple DomU had.
> > > > > > > > > > Now it is permitted to issue device-model, resource and
> > > memory
> > > > > > > > mappings, etc calls.
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > To activate the mapping will
> > > > > > > > > > > > >  require some sort of hypercall to the hypervisor.
> I
> > > can see
> > > > > > two
> > > > > > > > options
> > > > > > > > > > > > >  at this point:
> > > > > > > > > > > > >
> > > > > > > > > > > > >   - expose the handle to userspace for
> daemon/helper
> > > to
> > > > > > trigger
> > > > > > > > the
> > > > > > > > > > > > >     mapping via existing hypercall interfaces. If
> > > using a
> > > > > > helper
> > > > > > > > you
> > > > > > > > > > > > >     would have a hypervisor specific one to avoid
> the
> > > daemon
> > > > > > > > having to
> > > > > > > > > > > > >     care too much about the details or push that
> > > complexity
> > > > > > into
> > > > > > > > a
> > > > > > > > > > > > >     compile time option for the daemon which would
> > > result in
> > > > > > > > different
> > > > > > > > > > > > >     binaries although a common source base.
> > > > > > > > > > > > >
> > > > > > > > > > > > >   - expose a new kernel ABI to abstract the
> hypercall
> > > > > > > > differences away
> > > > > > > > > > > > >     in the guest kernel. In this case the
> userspace
> > > would
> > > > > > > > essentially
> > > > > > > > > > > > >     ask for an abstract "map guest N memory to
> > > userspace
> > > > > > ptr"
> > > > > > > > and let
> > > > > > > > > > > > >     the kernel deal with the different hypercall
> > > interfaces.
> > > > > > > > This of
> > > > > > > > > > > > >     course assumes the majority of BE guests would
> be
> > > Linux
> > > > > > > > kernels and
> > > > > > > > > > > > >     leaves the bare-metal/unikernel approaches to
> > > their own
> > > > > > > > devices.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Operation
> > > > > > > > > > > > > =========
> > > > > > > > > > > > >
> > > > > > > > > > > > > The core of the operation of VirtIO is fairly
> simple.
> > > Once
> > > > > > the
> > > > > > > > > > > > > vhost-user feature negotiation is done it's a case
> of
> > > > > > receiving
> > > > > > > > update
> > > > > > > > > > > > > events and parsing the resultant virt queue for
> data.
> > > The
> > > > > > vhost-
> > > > > > > > user
> > > > > > > > > > > > > specification handles a bunch of setup before that
> > > point,
> > > > > > mostly
> > > > > > > > to
> > > > > > > > > > > > > detail where the virt queues are set up FD's for
> > > memory and
> > > > > > > > event
> > > > > > > > > > > > > communication. This is where the envisioned stub
> > > process
> > > > > > would
> > > > > > > > be
> > > > > > > > > > > > > responsible for getting the daemon up and ready to
> run.
> > > This
> > > > > > is
> > > > > > > > > > > > > currently done inside a big VMM like QEMU but I
> > > suspect a
> > > > > > modern
> > > > > > > > > > > > > approach would be to use the rust-vmm vhost crate.
> It
> > > would
> > > > > > then
> > > > > > > > either
> > > > > > > > > > > > > communicate with the kernel's abstracted ABI or be
> re-
> > > > > > targeted
> > > > > > > > as a
> > > > > > > > > > > > > build option for the various hypervisors.
> > > > > > > > > > > >
> > > > > > > > > > > > One thing I mentioned before to Alex is that Xen
> doesn't
> > > have
> > > > > > VMMs
> > > > > > > > the
> > > > > > > > > > > > way they are typically envisioned and described in
> other
> > > > > > > > environments.
> > > > > > > > > > > > Instead, Xen has IOREQ servers. Each of them
> connects
> > > > > > > > independently to
> > > > > > > > > > > > Xen via the IOREQ interface. E.g. today multiple
> QEMUs
> > > could
> > > > > > be
> > > > > > > > used as
> > > > > > > > > > > > emulators for a single Xen VM, each of them
> connecting
> > > to Xen
> > > > > > > > > > > > independently via the IOREQ interface.
> > > > > > > > > > > >
> > > > > > > > > > > > The component responsible for starting a daemon
> and/or
> > > setting
> > > > > > up
> > > > > > > > shared
> > > > > > > > > > > > interfaces is the toolstack: the xl command and the
> > > > > > libxl/libxc
> > > > > > > > > > > > libraries.
> > > > > > > > > > >
> > > > > > > > > > > I think that VM configuration management (or
> orchestration
> > > in
> > > > > > > > Startos
> > > > > > > > > > > jargon?) is a subject to debate in parallel.
> > > > > > > > > > > Otherwise, is there any good assumption to avoid it
> right
> > > now?
> > > > > > > > > > >
> > > > > > > > > > > > Oleksandr and others I CCed have been working on
> ways
> > > for the
> > > > > > > > toolstack
> > > > > > > > > > > > to create virtio backends and setup memory mappings.
> > > They
> > > > > > might be
> > > > > > > > able
> > > > > > > > > > > > to provide more info on the subject. I do think we
> miss
> > > a way
> > > > > > to
> > > > > > > > provide
> > > > > > > > > > > > the configuration to the backend and anything else
> that
> > > the
> > > > > > > > backend
> > > > > > > > > > > > might require to start doing its job.
> > > > > > > > > >
> > > > > > > > > > Yes, some work has been done for the toolstack to handle
> > > Virtio
> > > > > > MMIO
> > > > > > > > devices in
> > > > > > > > > > general and Virtio block devices in particular. However,
> it
> > > has
> > > > > > not
> > > > > > > > been upstreaned yet.
> > > > > > > > > > Updated patches on review now:
> > > > > > > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-
> git-
> > > send-
> > > > > > email-
> > > > > > > > olekstysh@gmail.com/
> > > > > > > > > >
> > > > > > > > > > There is an additional (also important) activity to
> > > improve/fix
> > > > > > > > foreign memory mapping on Arm which I am also involved in.
> > > > > > > > > > The foreign memory mapping is proposed to be used for
> Virtio
> > > > > > backends
> > > > > > > > (device emulators) if there is a need to run guest OS
> completely
> > > > > > > > unmodified.
> > > > > > > > > > Of course, the more secure way would be to use grant
> memory
> > > > > > mapping.
> > > > > > > > Brietly, the main difference between them is that with
> foreign
> > > mapping
> > > > > > the
> > > > > > > > backend
> > > > > > > > > > can map any guest memory it wants to map, but with grant
> > > mapping
> > > > > > it is
> > > > > > > > allowed to map only what was previously granted by the
> frontend.
> > > > > > > > > >
> > > > > > > > > > So, there might be a problem if we want to pre-map some
> > > guest
> > > > > > memory
> > > > > > > > in advance or to cache mappings in the backend in order to
> > > improve
> > > > > > > > performance (because the mapping/unmapping guest pages every
> > > request
> > > > > > > > requires a lot of back and forth to Xen + P2M updates). In a
> > > nutshell,
> > > > > > > > currently, in order to map a guest page into the backend
> address
> > > space
> > > > > > we
> > > > > > > > need to steal a real physical page from the backend domain.
> So,
> > > with
> > > > > > the
> > > > > > > > said optimizations we might end up with no free memory in
> the
> > > backend
> > > > > > > > domain (see XSA-300). And what we try to achieve is to not
> waste
> > > a
> > > > > > real
> > > > > > > > domain memory at all by providing safe non-allocated-yet (so
> > > unused)
> > > > > > > > address space for the foreign (and grant) pages to be mapped
> > > into,
> > > > > > this
> > > > > > > > enabling work implies Xen and Linux (and likely DTB bindings)
> > > changes.
> > > > > > > > However, as it turned out, for this to work in a proper and
> safe
> > > way
> > > > > > some
> > > > > > > > prereq work needs to be done.
> > > > > > > > > > You can find the related Xen discussion at:
> > > > > > > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-
> git-
> > > send-
> > > > > > email-
> > > > > > > > olekstysh@gmail.com/
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > One question is how to best handle notification
> and
> > > kicks.
> > > > > > The
> > > > > > > > existing
> > > > > > > > > > > > > vhost-user framework uses eventfd to signal the
> daemon
> > > > > > (although
> > > > > > > > QEMU
> > > > > > > > > > > > > is quite capable of simulating them when you use
> TCG).
> > > Xen
> > > > > > has
> > > > > > > > it's own
> > > > > > > > > > > > > IOREQ mechanism. However latency is an important
> > > factor and
> > > > > > > > having
> > > > > > > > > > > > > events go through the stub would add quite a lot.
> > > > > > > > > > > >
> > > > > > > > > > > > Yeah I think, regardless of anything else, we want
> the
> > > > > > backends to
> > > > > > > > > > > > connect directly to the Xen hypervisor.
> > > > > > > > > > >
> > > > > > > > > > > In my approach,
> > > > > > > > > > >  a) BE -> FE: interrupts triggered by BE calling a
> > > hypervisor
> > > > > > > > interface
> > > > > > > > > > >               via virtio-proxy
> > > > > > > > > > >  b) FE -> BE: MMIO to config raises events (in event
> > > channels),
> > > > > > > > which is
> > > > > > > > > > >               converted to a callback to BE via
> virtio-
> > > proxy
> > > > > > > > > > >               (Xen's event channel is internnally
> > > implemented by
> > > > > > > > interrupts.)
> > > > > > > > > > >
> > > > > > > > > > > I don't know what "connect directly" means here, but
> > > sending
> > > > > > > > interrupts
> > > > > > > > > > > to the opposite side would be best efficient.
> > > > > > > > > > > Ivshmem, I suppose, takes this approach by utilizing
> PCI's
> > > msi-x
> > > > > > > > mechanism.
> > > > > > > > > >
> > > > > > > > > > Agree that MSI would be more efficient than SPI...
> > > > > > > > > > At the moment, in order to notify the frontend, the
> backend
> > > issues
> > > > > > a
> > > > > > > > specific device-model call to query Xen to inject a
> > > corresponding SPI
> > > > > > to
> > > > > > > > the guest.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > Could we consider the kernel internally converting
> > > IOREQ
> > > > > > > > messages from
> > > > > > > > > > > > > the Xen hypervisor to eventfd events? Would this
> scale
> > > with
> > > > > > > > other kernel
> > > > > > > > > > > > > hypercall interfaces?
> > > > > > > > > > > > >
> > > > > > > > > > > > > So any thoughts on what directions are worth
> > > experimenting
> > > > > > with?
> > > > > > > > > > > >
> > > > > > > > > > > > One option we should consider is for each backend to
> > > connect
> > > > > > to
> > > > > > > > Xen via
> > > > > > > > > > > > the IOREQ interface. We could generalize the IOREQ
> > > interface
> > > > > > and
> > > > > > > > make it
> > > > > > > > > > > > hypervisor agnostic. The interface is really trivial
> and
> > > easy
> > > > > > to
> > > > > > > > add.
> > > > > > > > > > >
> > > > > > > > > > > As I said above, my proposal does the same thing that
> you
> > > > > > mentioned
> > > > > > > > here :)
> > > > > > > > > > > The difference is that I do call hypervisor interfaces
> via
> > > > > > virtio-
> > > > > > > > proxy.
> > > > > > > > > > >
> > > > > > > > > > > > The only Xen-specific part is the notification
> mechanism,
> > > > > > which is
> > > > > > > > an
> > > > > > > > > > > > event channel. If we replaced the event channel with
> > > something
> > > > > > > > else the
> > > > > > > > > > > > interface would be generic. See:
> > > > > > > > > > > > https://gitlab.com/xen-project/xen/-
> > > > > > > > /blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > > > > > > > > >
> > > > > > > > > > > > I don't think that translating IOREQs to eventfd in
> the
> > > kernel
> > > > > > is
> > > > > > > > a
> > > > > > > > > > > > good idea: if feels like it would be extra
> complexity
> > > and that
> > > > > > the
> > > > > > > > > > > > kernel shouldn't be involved as this is a backend-
> > > hypervisor
> > > > > > > > interface.
> > > > > > > > > > >
> > > > > > > > > > > Given that we may want to implement BE as a bare-metal
> > > > > > application
> > > > > > > > > > > as I did on Zephyr, I don't think that the translation
> > > would not
> > > > > > be
> > > > > > > > > > > a big issue, especially on RTOS's.
> > > > > > > > > > > It will be some kind of abstraction layer of interrupt
> > > handling
> > > > > > > > > > > (or nothing but a callback mechanism).
> > > > > > > > > > >
> > > > > > > > > > > > Also, eventfd is very Linux-centric and we are
> trying to
> > > > > > design an
> > > > > > > > > > > > interface that could work well for RTOSes too. If we
> > > want to
> > > > > > do
> > > > > > > > > > > > something different, both OS-agnostic and
> hypervisor-
> > > agnostic,
> > > > > > > > perhaps
> > > > > > > > > > > > we could design a new interface. One that could be
> > > > > > implementable
> > > > > > > > in the
> > > > > > > > > > > > Xen hypervisor itself (like IOREQ) and of course any
> > > other
> > > > > > > > hypervisor
> > > > > > > > > > > > too.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > There is also another problem. IOREQ is probably not
> be
> > > the
> > > > > > only
> > > > > > > > > > > > interface needed. Have a look at
> > > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2.
> > > Don't we
> > > > > > > > also need
> > > > > > > > > > > > an interface for the backend to inject interrupts
> into
> > > the
> > > > > > > > frontend? And
> > > > > > > > > > > > if the backend requires dynamic memory mappings of
> > > frontend
> > > > > > pages,
> > > > > > > > then
> > > > > > > > > > > > we would also need an interface to map/unmap domU
> pages.
> > > > > > > > > > >
> > > > > > > > > > > My proposal document might help here; All the
> interfaces
> > > > > > required
> > > > > > > > for
> > > > > > > > > > > virtio-proxy (or hypervisor-related interfaces) are
> listed
> > > as
> > > > > > > > > > > RPC protocols :)
> > > > > > > > > > >
> > > > > > > > > > > > These interfaces are a lot more problematic than
> IOREQ:
> > > IOREQ
> > > > > > is
> > > > > > > > tiny
> > > > > > > > > > > > and self-contained. It is easy to add anywhere. A
> new
> > > > > > interface to
> > > > > > > > > > > > inject interrupts or map pages is more difficult to
> > > manage
> > > > > > because
> > > > > > > > it
> > > > > > > > > > > > would require changes scattered across the various
> > > emulators.
> > > > > > > > > > >
> > > > > > > > > > > Exactly. I have no confident yet that my approach will
> > > also
> > > > > > apply
> > > > > > > > > > > to other hypervisors than Xen.
> > > > > > > > > > > Technically, yes, but whether people can accept it or
> not
> > > is a
> > > > > > > > different
> > > > > > > > > > > matter.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > -Takahiro Akashi
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > Oleksandr Tyshchenko

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-01 11:12                         ` Wei Chen
@ 2021-09-01 12:29                           ` AKASHI Takahiro
  2021-09-01 16:26                             ` Oleksandr Tyshchenko
  2021-09-02  1:30                             ` Wei Chen
  0 siblings, 2 replies; 66+ messages in thread
From: AKASHI Takahiro @ 2021-09-01 12:29 UTC (permalink / raw)
  To: Wei Chen
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Alex Bennée, Kaly Xin,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant, nd,
	Xen Devel

Hi Wei,

On Wed, Sep 01, 2021 at 11:12:58AM +0000, Wei Chen wrote:
> Hi Akashi,
> 
> > -----Original Message-----
> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > Sent: 2021年8月31日 14:18
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>; Kaly Xin
> > <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-lists.linaro.org>;
> > virtio-dev@lists.oasis-open.org; Arnd Bergmann <arnd.bergmann@linaro.org>;
> > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > 
> > Wei,
> > 
> > On Thu, Aug 26, 2021 at 12:10:19PM +0000, Wei Chen wrote:
> > > Hi Akashi,
> > >
> > > > -----Original Message-----
> > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > Sent: 2021年8月26日 17:41
> > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > > > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>; Kaly
> > Xin
> > > > <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-
> > lists.linaro.org>;
> > > > virtio-dev@lists.oasis-open.org; Arnd Bergmann
> > <arnd.bergmann@linaro.org>;
> > > > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> > Julien
> > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > >
> > > > Hi Wei,
> > > >
> > > > On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
> > > > > On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
> > > > > > Hi Akashi,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > > > Sent: 2021年8月18日 13:39
> > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> > Stabellini
> > > > > > > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>;
> > > > Stratos
> > > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > > > dev@lists.oasis-
> > > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > > > <cvanscha@qti.qualcomm.com>;
> > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> > > > Jean-
> > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > <Artem_Mygaiev@epam.com>;
> > > > Julien
> > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> > > > Durrant
> > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > > > >
> > > > > > > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
> > > > > > > > Hi Akashi,
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > > > > > Sent: 2021年8月17日 16:08
> > > > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> > > > Stabellini
> > > > > > > > > <sstabellini@kernel.org>; Alex Benn??e
> > <alex.bennee@linaro.org>;
> > > > > > > Stratos
> > > > > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > > > > > > dev@lists.oasis-
> > > > > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh
> > Kumar
> > > > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
> > Kiszka
> > > > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > > > <cvanscha@qti.qualcomm.com>;
> > > > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri
> > <vatsa@codeaurora.org>;
> > > > Jean-
> > > > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > > > <Artem_Mygaiev@epam.com>;
> > > > > > > Julien
> > > > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>;
> > Paul
> > > > Durrant
> > > > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> > backends
> > > > > > > > >
> > > > > > > > > Hi Wei, Oleksandr,
> > > > > > > > >
> > > > > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> > > > > > > > > > Hi All,
> > > > > > > > > >
> > > > > > > > > > Thanks for Stefano to link my kvmtool for Xen proposal
> > here.
> > > > > > > > > > This proposal is still discussing in Xen and KVM
> > communities.
> > > > > > > > > > The main work is to decouple the kvmtool from KVM and make
> > > > > > > > > > other hypervisors can reuse the virtual device
> > implementations.
> > > > > > > > > >
> > > > > > > > > > In this case, we need to introduce an intermediate
> > hypervisor
> > > > > > > > > > layer for VMM abstraction, Which is, I think it's very
> > close
> > > > > > > > > > to stratos' virtio hypervisor agnosticism work.
> > > > > > > > >
> > > > > > > > > # My proposal[1] comes from my own idea and doesn't always
> > > > represent
> > > > > > > > > # Linaro's view on this subject nor reflect Alex's concerns.
> > > > > > > Nevertheless,
> > > > > > > > >
> > > > > > > > > Your idea and my proposal seem to share the same background.
> > > > > > > > > Both have the similar goal and currently start with, at
> > first,
> > > > Xen
> > > > > > > > > and are based on kvm-tool. (Actually, my work is derived
> > from
> > > > > > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
> > > > > > > > >
> > > > > > > > > In particular, the abstraction of hypervisor interfaces has
> > a
> > > > same
> > > > > > > > > set of interfaces (for your "struct vmm_impl" and my "RPC
> > > > interfaces").
> > > > > > > > > This is not co-incident as we both share the same origin as
> > I
> > > > said
> > > > > > > above.
> > > > > > > > > And so we will also share the same issues. One of them is a
> > way
> > > > of
> > > > > > > > > "sharing/mapping FE's memory". There is some trade-off
> > between
> > > > > > > > > the portability and the performance impact.
> > > > > > > > > So we can discuss the topic here in this ML, too.
> > > > > > > > > (See Alex's original email, too).
> > > > > > > > >
> > > > > > > > Yes, I agree.
> > > > > > > >
> > > > > > > > > On the other hand, my approach aims to create a "single-
> > binary"
> > > > > > > solution
> > > > > > > > > in which the same binary of BE vm could run on any
> > hypervisors.
> > > > > > > > > Somehow similar to your "proposal-#2" in [2], but in my
> > solution,
> > > > all
> > > > > > > > > the hypervisor-specific code would be put into another
> > entity
> > > > (VM),
> > > > > > > > > named "virtio-proxy" and the abstracted operations are
> > served
> > > > via RPC.
> > > > > > > > > (In this sense, BE is hypervisor-agnostic but might have OS
> > > > > > > dependency.)
> > > > > > > > > But I know that we need discuss if this is a requirement
> > even
> > > > > > > > > in Stratos project or not. (Maybe not)
> > > > > > > > >
> > > > > > > >
> > > > > > > > Sorry, I haven't had time to finish reading your virtio-proxy
> > > > completely
> > > > > > > > (I will do it ASAP). But from your description, it seems we
> > need a
> > > > > > > > 3rd VM between FE and BE? My concern is that, if my assumption
> > is
> > > > right,
> > > > > > > > will it increase the latency in data transport path? Even if
> > we're
> > > > > > > > using some lightweight guest like RTOS or Unikernel,
> > > > > > >
> > > > > > > Yes, you're right. But I'm afraid that it is a matter of degree.
> > > > > > > As far as we execute 'mapping' operations at every fetch of
> > payload,
> > > > > > > we will see latency issue (even in your case) and if we have
> > some
> > > > solution
> > > > > > > for it, we won't see it neither in my proposal :)
> > > > > > >
> > > > > >
> > > > > > Oleksandr has sent a proposal to Xen mailing list to reduce this
> > kind
> > > > > > of "mapping/unmapping" operations. So the latency caused by this
> > > > behavior
> > > > > > on Xen may eventually be eliminated, and Linux-KVM doesn't have
> > that
> > > > problem.
> > > > >
> > > > > Obviously, I have not yet caught up there in the discussion.
> > > > > Which patch specifically?
> > > >
> > > > Can you give me the link to the discussion or patch, please?
> > > >
> > >
> > > It's a RFC discussion. We have tested this RFC patch internally.
> > > https://lists.xenproject.org/archives/html/xen-devel/2021-
> > 07/msg01532.html
> > 
> > I'm afraid that I miss something here, but I don't know
> > why this proposed API will lead to eliminating 'mmap' in accessing
> > the queued payload at every request?
> > 
> 
> This API gives the Xen device model (QEMU or kvmtool) the ability to map
> the whole of guest RAM into the device model's address space. In that
> case the device model doesn't need a dynamic hypercall to map/unmap the
> payload memory for every request. It can use a flat offset to access
> payload memory in its address space directly, just like the KVM device
> model does today.

Thank you. Quickly, let me make sure of one thing:
this API itself doesn't do any mapping operations, right?
So I suppose that the virtio BE guest is responsible for
1) fetching the information about all the memory regions in the FE,
2) calling this API to allocate a big chunk of unused space in the BE, and
3) creating grant/foreign mappings for the FE onto this region(s)
during the initialization/configuration of emulated virtio devices.

Is this the way this API is expected to be used?
Does Xen already have an interface for (1)?
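
To make the assumed flow concrete, here is a minimal C sketch of steps
(1)-(3); every helper in it is a hypothetical placeholder, since the
real interfaces (how the FE layout is fetched, how the foreign/grant
mappings are created) are exactly what is being discussed here:

  #include <stdint.h>
  #include <stddef.h>

  struct fe_region {
      uint64_t gpa;    /* guest physical address of the region in the FE */
      size_t   size;   /* region size */
  };

  /* Hypothetical placeholders, not real Xen calls. */
  int   fetch_fe_memory_regions(uint32_t fe_domid,
                                struct fe_region *regions, int max);
  void *reserve_unused_address_space(size_t size);  /* step (2): the RFC API */
  int   map_fe_region(uint32_t fe_domid,
                      const struct fe_region *region, void *window);

  int be_init_guest_memory(uint32_t fe_domid)
  {
      struct fe_region regions[16];
      size_t total = 0;

      /* 1) fetch the information about the FE memory regions. */
      int n = fetch_fe_memory_regions(fe_domid, regions, 16);
      if (n <= 0)
          return -1;
      for (int i = 0; i < n; i++)
          total += regions[i].size;

      /* 2) reserve a big chunk of unused space in the BE address space. */
      void *window = reserve_unused_address_space(total);
      if (!window)
          return -1;

      /* 3) create grant/foreign mappings for the FE onto this region. */
      for (int i = 0; i < n; i++)
          if (map_fe_region(fe_domid, &regions[i], window) < 0)
              return -1;

      return 0;
  }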

-Takahiro Akashi

> Before this API, mapping the whole guest memory into the device model
> would severely consume the physical pages of Dom-0/Dom-D.
> 
> > -Takahiro Akashi
> > 
> > 
> > > > Thanks,
> > > > -Takahiro Akashi
> > > >
> > > > > -Takahiro Akashi
> > > > >
> > > > > > > > > Specifically speaking about kvm-tool, I have a concern about
> > its
> > > > > > > > > license term; Targeting different hypervisors and different
> > OSs
> > > > > > > > > (which I assume includes RTOS's), the resultant library
> > should
> > > > be
> > > > > > > > > license permissive and GPL for kvm-tool might be an issue.
> > > > > > > > > Any thoughts?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes. If user want to implement a FreeBSD device model, but the
> > > > virtio
> > > > > > > > library is GPL. Then GPL would be a problem. If we have
> > another
> > > > good
> > > > > > > > candidate, I am open to it.
> > > > > > >
> > > > > > > I have some candidates, particularly for vq/vring, in my mind:
> > > > > > > * Open-AMP, or
> > > > > > > * corresponding Free-BSD code
> > > > > > >
> > > > > >
> > > > > > Interesting, I will look into them : )
> > > > > >
> > > > > > Cheers,
> > > > > > Wei Chen
> > > > > >
> > > > > > > -Takahiro Akashi
> > > > > > >
> > > > > > >
> > > > > > > > > -Takahiro Akashi
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> > > > > > > > > August/000548.html
> > > > > > > > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > > > > > > > > > > Sent: 2021年8月14日 23:38
> > > > > > > > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>;
> > Stefano
> > > > > > > Stabellini
> > > > > > > > > <sstabellini@kernel.org>
> > > > > > > > > > > Cc: Alex Benn??e <alex.bennee@linaro.org>; Stratos
> > Mailing
> > > > List
> > > > > > > > > <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
> > > > open.org;
> > > > > > > Arnd
> > > > > > > > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
> > Kiszka
> > > > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > > > <cvanscha@qti.qualcomm.com>;
> > > > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri
> > <vatsa@codeaurora.org>;
> > > > Jean-
> > > > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > > > > <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>;
> > > > Oleksandr
> > > > > > > > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > > > <Artem_Mygaiev@epam.com>;
> > > > > > > Julien
> > > > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>;
> > Paul
> > > > Durrant
> > > > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> > > > backends
> > > > > > > > > > >
> > > > > > > > > > > Hello, all.
> > > > > > > > > > >
> > > > > > > > > > > Please see some comments below. And sorry for the
> > possible
> > > > format
> > > > > > > > > issues.
> > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> > > > > > > > > <mailto:takahiro.akashi@linaro.org> wrote:
> > > > > > > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano
> > > > Stabellini
> > > > > > > wrote:
> > > > > > > > > > > > > CCing people working on Xen+VirtIO and IOREQs. Not
> > > > trimming
> > > > > > > the
> > > > > > > > > original
> > > > > > > > > > > > > email to let them read the full context.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My comments below are related to a potential Xen
> > > > > > > implementation,
> > > > > > > > > not
> > > > > > > > > > > > > because it is the only implementation that matters,
> > but
> > > > > > > because it
> > > > > > > > > is
> > > > > > > > > > > > > the one I know best.
> > > > > > > > > > > >
> > > > > > > > > > > > Please note that my proposal (and hence the working
> > > > prototype)[1]
> > > > > > > > > > > > is based on Xen's virtio implementation (i.e. IOREQ)
> > and
> > > > > > > > > particularly
> > > > > > > > > > > > EPAM's virtio-disk application (backend server).
> > > > > > > > > > > > It has been, I believe, well generalized but is still
> > a
> > > > bit
> > > > > > > biased
> > > > > > > > > > > > toward this original design.
> > > > > > > > > > > >
> > > > > > > > > > > > So I hope you like my approach :)
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
> > > > dev/2021-
> > > > > > > > > August/000546.html
> > > > > > > > > > > >
> > > > > > > > > > > > Let me take this opportunity to explain a bit more
> > about
> > > > my
> > > > > > > approach
> > > > > > > > > below.
> > > > > > > > > > > >
> > > > > > > > > > > > > Also, please see this relevant email thread:
> > > > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > One of the goals of Project Stratos is to enable
> > > > hypervisor
> > > > > > > > > agnostic
> > > > > > > > > > > > > > backends so we can enable as much re-use of code
> > as
> > > > possible
> > > > > > > and
> > > > > > > > > avoid
> > > > > > > > > > > > > > repeating ourselves. This is the flip side of the
> > > > front end
> > > > > > > > > where
> > > > > > > > > > > > > > multiple front-end implementations are required -
> > one
> > > > per OS,
> > > > > > > > > assuming
> > > > > > > > > > > > > > you don't just want Linux guests. The resultant
> > guests
> > > > are
> > > > > > > > > trivially
> > > > > > > > > > > > > > movable between hypervisors modulo any abstracted
> > > > paravirt
> > > > > > > type
> > > > > > > > > > > > > > interfaces.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In my original thumb nail sketch of a solution I
> > > > envisioned
> > > > > > > > > vhost-user
> > > > > > > > > > > > > > daemons running in a broadly POSIX like
> > environment.
> > > > The
> > > > > > > > > interface to
> > > > > > > > > > > > > > the daemon is fairly simple requiring only some
> > mapped
> > > > > > > memory
> > > > > > > > > and some
> > > > > > > > > > > > > > sort of signalling for events (on Linux this is
> > > > eventfd).
> > > > > > > The
> > > > > > > > > idea was a
> > > > > > > > > > > > > > stub binary would be responsible for any
> > hypervisor
> > > > specific
> > > > > > > > > setup and
> > > > > > > > > > > > > > then launch a common binary to deal with the
> > actual
> > > > > > > virtqueue
> > > > > > > > > requests
> > > > > > > > > > > > > > themselves.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Since that original sketch we've seen an expansion
> > in
> > > > the
> > > > > > > sort
> > > > > > > > > of ways
> > > > > > > > > > > > > > backends could be created. There is interest in
> > > > > > > encapsulating
> > > > > > > > > backends
> > > > > > > > > > > > > > in RTOSes or unikernels for solutions like SCMI.
> > There
> > > > > > > interest
> > > > > > > > > in Rust
> > > > > > > > > > > > > > has prompted ideas of using the trait interface to
> > > > abstract
> > > > > > > > > differences
> > > > > > > > > > > > > > away as well as the idea of bare-metal Rust
> > backends.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We have a card (STR-12) called "Hypercall
> > > > Standardisation"
> > > > > > > which
> > > > > > > > > > > > > > calls for a description of the APIs needed from
> > the
> > > > > > > hypervisor
> > > > > > > > > side to
> > > > > > > > > > > > > > support VirtIO guests and their backends. However
> > we
> > > > are
> > > > > > > some
> > > > > > > > > way off
> > > > > > > > > > > > > > from that at the moment as I think we need to at
> > least
> > > > > > > > > demonstrate one
> > > > > > > > > > > > > > portable backend before we start codifying
> > > > requirements. To
> > > > > > > that
> > > > > > > > > end I
> > > > > > > > > > > > > > want to think about what we need for a backend to
> > > > function.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Configuration
> > > > > > > > > > > > > > =============
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In the type-2 setup this is typically fairly
> > simple
> > > > because
> > > > > > > the
> > > > > > > > > host
> > > > > > > > > > > > > > system can orchestrate the various modules that
> > make
> > > > up the
> > > > > > > > > complete
> > > > > > > > > > > > > > system. In the type-1 case (or even type-2 with
> > > > delegated
> > > > > > > > > service VMs)
> > > > > > > > > > > > > > we need some sort of mechanism to inform the
> > backend
> > > > VM
> > > > > > > about
> > > > > > > > > key
> > > > > > > > > > > > > > details about the system:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   - where virt queue memory is in it's address
> > space
> > > > > > > > > > > > > >   - how it's going to receive (interrupt) and
> > trigger
> > > > (kick)
> > > > > > > > > events
> > > > > > > > > > > > > >   - what (if any) resources the backend needs to
> > > > connect to
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Obviously you can elide over configuration issues
> > by
> > > > having
> > > > > > > > > static
> > > > > > > > > > > > > > configurations and baking the assumptions into
> > your
> > > > guest
> > > > > > > images
> > > > > > > > > however
> > > > > > > > > > > > > > this isn't scalable in the long term. The obvious
> > > > solution
> > > > > > > seems
> > > > > > > > > to be
> > > > > > > > > > > > > > extending a subset of Device Tree data to user
> > space
> > > > but
> > > > > > > perhaps
> > > > > > > > > there
> > > > > > > > > > > > > > are other approaches?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Before any virtio transactions can take place the
> > > > > > > appropriate
> > > > > > > > > memory
> > > > > > > > > > > > > > mappings need to be made between the FE guest and
> > the
> > > > BE
> > > > > > > guest.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Currently the whole of the FE guests address space
> > > > needs to
> > > > > > > be
> > > > > > > > > visible
> > > > > > > > > > > > > > to whatever is serving the virtio requests. I can
> > > > envision 3
> > > > > > > > > approaches:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  * BE guest boots with memory already mapped
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  This would entail the guest OS knowing where in
> > it's
> > > > Guest
> > > > > > > > > Physical
> > > > > > > > > > > > > >  Address space is already taken up and avoiding
> > > > clashing. I
> > > > > > > > > would assume
> > > > > > > > > > > > > >  in this case you would want a standard interface
> > to
> > > > > > > userspace
> > > > > > > > > to then
> > > > > > > > > > > > > >  make that address space visible to the backend
> > daemon.
> > > > > > > > > > > >
> > > > > > > > > > > > Yet another way here is that we would have well known
> > > > "shared
> > > > > > > > > memory" between
> > > > > > > > > > > > VMs. I think that Jailhouse's ivshmem gives us good
> > > > insights on
> > > > > > > this
> > > > > > > > > matter
> > > > > > > > > > > > and that it can even be an alternative for hypervisor-
> > > > agnostic
> > > > > > > > > solution.
> > > > > > > > > > > >
> > > > > > > > > > > > (Please note memory regions in ivshmem appear as a PCI
> > > > device
> > > > > > > and
> > > > > > > > > can be
> > > > > > > > > > > > mapped locally.)
> > > > > > > > > > > >
> > > > > > > > > > > > I want to add this shared memory aspect to my virtio-
> > proxy,
> > > > but
> > > > > > > > > > > > the resultant solution would eventually look similar
> > to
> > > > ivshmem.
> > > > > > > > > > > >
> > > > > > > > > > > > > >  * BE guests boots with a hypervisor handle to
> > memory
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  The BE guest is then free to map the FE's memory
> > to
> > > > where
> > > > > > > it
> > > > > > > > > wants in
> > > > > > > > > > > > > >  the BE's guest physical address space.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I cannot see how this could work for Xen. There is
> > no
> > > > "handle"
> > > > > > > to
> > > > > > > > > give
> > > > > > > > > > > > > to the backend if the backend is not running in dom0.
> > So
> > > > for
> > > > > > > Xen I
> > > > > > > > > think
> > > > > > > > > > > > > the memory has to be already mapped
> > > > > > > > > > > >
> > > > > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following
> > > > information
> > > > > > > is
> > > > > > > > > expected
> > > > > > > > > > > > to be exposed to BE via Xenstore:
> > > > > > > > > > > > (I know that this is a tentative approach though.)
> > > > > > > > > > > >    - the start address of configuration space
> > > > > > > > > > > >    - interrupt number
> > > > > > > > > > > >    - file path for backing storage
> > > > > > > > > > > >    - read-only flag
> > > > > > > > > > > > And the BE server have to call a particular hypervisor
> > > > interface
> > > > > > > to
> > > > > > > > > > > > map the configuration space.
> > > > > > > > > > >
> > > > > > > > > > > Yes, Xenstore was chosen as a simple way to pass
> > > > configuration
> > > > > > > info to
> > > > > > > > > the backend running in a non-toolstack domain.
> > > > > > > > > > > I remember, there was a wish to avoid using Xenstore in
> > > > Virtio
> > > > > > > backend
> > > > > > > > > itself if possible, so for non-toolstack domain, this could
> > done
> > > > with
> > > > > > > > > adjusting devd (daemon that listens for devices and launches
> > > > backends)
> > > > > > > > > > > to read backend configuration from the Xenstore anyway
> > and
> > > > pass it
> > > > > > > to
> > > > > > > > > the backend via command line arguments.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Yes, in current PoC code we're using xenstore to pass
> > device
> > > > > > > > > configuration.
> > > > > > > > > > We also designed a static device configuration parse
> > method
> > > > for
> > > > > > > Dom0less
> > > > > > > > > or
> > > > > > > > > > other scenarios don't have xentool. yes, it's from device
> > > > model
> > > > > > > command
> > > > > > > > > line
> > > > > > > > > > or a config file.
> > > > > > > > > >
> > > > > > > > > > > But, if ...
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > In my approach (virtio-proxy), all those Xen (or
> > > > hypervisor)-
> > > > > > > > > specific
> > > > > > > > > > > > stuffs are contained in virtio-proxy, yet another VM,
> > to
> > > > hide
> > > > > > > all
> > > > > > > > > details.
> > > > > > > > > > >
> > > > > > > > > > > ... the solution how to overcome that is already found
> > and
> > > > proven
> > > > > > > to
> > > > > > > > > work then even better.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > # My point is that a "handle" is not mandatory for
> > > > executing
> > > > > > > mapping.
> > > > > > > > > > > >
> > > > > > > > > > > > > and the mapping probably done by the
> > > > > > > > > > > > > toolstack (also see below.) Or we would have to
> > invent a
> > > > new
> > > > > > > Xen
> > > > > > > > > > > > > hypervisor interface and Xen virtual machine
> > privileges
> > > > to
> > > > > > > allow
> > > > > > > > > this
> > > > > > > > > > > > > kind of mapping.
> > > > > > > > > > > >
> > > > > > > > > > > > > If we run the backend in Dom0 that we have no
> > problems
> > > > of
> > > > > > > course.
> > > > > > > > > > > >
> > > > > > > > > > > > One of difficulties on Xen that I found in my approach
> > is
> > > > that
> > > > > > > > > calling
> > > > > > > > > > > > such hypervisor intefaces (registering IOREQ, mapping
> > > > memory) is
> > > > > > > > > only
> > > > > > > > > > > > allowed on BE servers themselvies and so we will have
> > to
> > > > extend
> > > > > > > > > those
> > > > > > > > > > > > interfaces.
> > > > > > > > > > > > This, however, will raise some concern on security and
> > > > privilege
> > > > > > > > > distribution
> > > > > > > > > > > > as Stefan suggested.
> > > > > > > > > > >
> > > > > > > > > > > We also faced policy related issues with Virtio backend
> > > > running in
> > > > > > > > > other than Dom0 domain in a "dummy" xsm mode. In our target
> > > > system we
> > > > > > > run
> > > > > > > > > the backend in a driver
> > > > > > > > > > > domain (we call it DomD) where the underlying H/W
> > resides.
> > > > We
> > > > > > > trust it,
> > > > > > > > > so we wrote policy rules (to be used in "flask" xsm mode) to
> > > > provide
> > > > > > > it
> > > > > > > > > with a little bit more privileges than a simple DomU had.
> > > > > > > > > > > Now it is permitted to issue device-model, resource and
> > > > memory
> > > > > > > > > mappings, etc calls.
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > To activate the mapping will
> > > > > > > > > > > > > >  require some sort of hypercall to the hypervisor.
> > I
> > > > can see
> > > > > > > two
> > > > > > > > > options
> > > > > > > > > > > > > >  at this point:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   - expose the handle to userspace for
> > daemon/helper
> > > > to
> > > > > > > trigger
> > > > > > > > > the
> > > > > > > > > > > > > >     mapping via existing hypercall interfaces. If
> > > > using a
> > > > > > > helper
> > > > > > > > > you
> > > > > > > > > > > > > >     would have a hypervisor specific one to avoid
> > the
> > > > daemon
> > > > > > > > > having to
> > > > > > > > > > > > > >     care too much about the details or push that
> > > > complexity
> > > > > > > into
> > > > > > > > > a
> > > > > > > > > > > > > >     compile time option for the daemon which would
> > > > result in
> > > > > > > > > different
> > > > > > > > > > > > > >     binaries although a common source base.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   - expose a new kernel ABI to abstract the
> > hypercall
> > > > > > > > > differences away
> > > > > > > > > > > > > >     in the guest kernel. In this case the
> > userspace
> > > > would
> > > > > > > > > essentially
> > > > > > > > > > > > > >     ask for an abstract "map guest N memory to
> > > > userspace
> > > > > > > ptr"
> > > > > > > > > and let
> > > > > > > > > > > > > >     the kernel deal with the different hypercall
> > > > interfaces.
> > > > > > > > > This of
> > > > > > > > > > > > > >     course assumes the majority of BE guests would
> > be
> > > > Linux
> > > > > > > > > kernels and
> > > > > > > > > > > > > >     leaves the bare-metal/unikernel approaches to
> > > > their own
> > > > > > > > > devices.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Operation
> > > > > > > > > > > > > > =========
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The core of the operation of VirtIO is fairly
> > simple.
> > > > Once
> > > > > > > the
> > > > > > > > > > > > > > vhost-user feature negotiation is done it's a case
> > of
> > > > > > > receiving
> > > > > > > > > update
> > > > > > > > > > > > > > events and parsing the resultant virt queue for
> > data.
> > > > The
> > > > > > > vhost-
> > > > > > > > > user
> > > > > > > > > > > > > > specification handles a bunch of setup before that
> > > > point,
> > > > > > > mostly
> > > > > > > > > to
> > > > > > > > > > > > > > detail where the virt queues are set up FD's for
> > > > memory and
> > > > > > > > > event
> > > > > > > > > > > > > > communication. This is where the envisioned stub
> > > > process
> > > > > > > would
> > > > > > > > > be
> > > > > > > > > > > > > > responsible for getting the daemon up and ready to
> > run.
> > > > This
> > > > > > > is
> > > > > > > > > > > > > > currently done inside a big VMM like QEMU but I
> > > > suspect a
> > > > > > > modern
> > > > > > > > > > > > > > approach would be to use the rust-vmm vhost crate.
> > It
> > > > would
> > > > > > > then
> > > > > > > > > either
> > > > > > > > > > > > > > communicate with the kernel's abstracted ABI or be
> > re-
> > > > > > > targeted
> > > > > > > > > as a
> > > > > > > > > > > > > > build option for the various hypervisors.
> > > > > > > > > > > > >
> > > > > > > > > > > > > One thing I mentioned before to Alex is that Xen
> > doesn't
> > > > have
> > > > > > > VMMs
> > > > > > > > > the
> > > > > > > > > > > > > way they are typically envisioned and described in
> > other
> > > > > > > > > environments.
> > > > > > > > > > > > > Instead, Xen has IOREQ servers. Each of them
> > connects
> > > > > > > > > independently to
> > > > > > > > > > > > > Xen via the IOREQ interface. E.g. today multiple
> > QEMUs
> > > > could
> > > > > > > be
> > > > > > > > > used as
> > > > > > > > > > > > > emulators for a single Xen VM, each of them
> > connecting
> > > > to Xen
> > > > > > > > > > > > > independently via the IOREQ interface.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The component responsible for starting a daemon
> > and/or
> > > > setting
> > > > > > > up
> > > > > > > > > shared
> > > > > > > > > > > > > interfaces is the toolstack: the xl command and the
> > > > > > > libxl/libxc
> > > > > > > > > > > > > libraries.
> > > > > > > > > > > >
> > > > > > > > > > > > I think that VM configuration management (or
> > orchestration
> > > > in
> > > > > > > > > Startos
> > > > > > > > > > > > jargon?) is a subject to debate in parallel.
> > > > > > > > > > > > Otherwise, is there any good assumption to avoid it
> > right
> > > > now?
> > > > > > > > > > > >
> > > > > > > > > > > > > Oleksandr and others I CCed have been working on
> > ways
> > > > for the
> > > > > > > > > toolstack
> > > > > > > > > > > > > to create virtio backends and setup memory mappings.
> > > > They
> > > > > > > might be
> > > > > > > > > able
> > > > > > > > > > > > > to provide more info on the subject. I do think we
> > miss
> > > > a way
> > > > > > > to
> > > > > > > > > provide
> > > > > > > > > > > > > the configuration to the backend and anything else
> > that
> > > > the
> > > > > > > > > backend
> > > > > > > > > > > > > might require to start doing its job.
> > > > > > > > > > >
> > > > > > > > > > > Yes, some work has been done for the toolstack to handle
> > > > Virtio
> > > > > > > MMIO
> > > > > > > > > devices in
> > > > > > > > > > > general and Virtio block devices in particular. However,
> > it
> > > > has
> > > > > > > not
> > > > > > > > > been upstreaned yet.
> > > > > > > > > > > Updated patches on review now:
> > > > > > > > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-
> > git-
> > > > send-
> > > > > > > email-
> > > > > > > > > olekstysh@gmail.com/
> > > > > > > > > > >
> > > > > > > > > > > There is an additional (also important) activity to
> > > > improve/fix
> > > > > > > > > foreign memory mapping on Arm which I am also involved in.
> > > > > > > > > > > The foreign memory mapping is proposed to be used for
> > Virtio
> > > > > > > backends
> > > > > > > > > (device emulators) if there is a need to run guest OS
> > completely
> > > > > > > > > unmodified.
> > > > > > > > > > > Of course, the more secure way would be to use grant
> > memory
> > > > > > > mapping.
> > > > > > > > > Brietly, the main difference between them is that with
> > foreign
> > > > mapping
> > > > > > > the
> > > > > > > > > backend
> > > > > > > > > > > can map any guest memory it wants to map, but with grant
> > > > mapping
> > > > > > > it is
> > > > > > > > > allowed to map only what was previously granted by the
> > frontend.
> > > > > > > > > > >
> > > > > > > > > > > So, there might be a problem if we want to pre-map some
> > > > guest
> > > > > > > memory
> > > > > > > > > in advance or to cache mappings in the backend in order to
> > > > improve
> > > > > > > > > performance (because the mapping/unmapping guest pages every
> > > > request
> > > > > > > > > requires a lot of back and forth to Xen + P2M updates). In a
> > > > nutshell,
> > > > > > > > > currently, in order to map a guest page into the backend
> > address
> > > > space
> > > > > > > we
> > > > > > > > > need to steal a real physical page from the backend domain.
> > So,
> > > > with
> > > > > > > the
> > > > > > > > > said optimizations we might end up with no free memory in
> > the
> > > > backend
> > > > > > > > > domain (see XSA-300). And what we try to achieve is to not
> > waste
> > > > a
> > > > > > > real
> > > > > > > > > domain memory at all by providing safe non-allocated-yet (so
> > > > unused)
> > > > > > > > > address space for the foreign (and grant) pages to be mapped
> > > > into,
> > > > > > > this
> > > > > > > > > enabling work implies Xen and Linux (and likely DTB bindings)
> > > > changes.
> > > > > > > > > However, as it turned out, for this to work in a proper and
> > safe
> > > > way
> > > > > > > some
> > > > > > > > > prereq work needs to be done.
> > > > > > > > > > > You can find the related Xen discussion at:
> > > > > > > > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-
> > git-
> > > > send-
> > > > > > > email-
> > > > > > > > > olekstysh@gmail.com/
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > One question is how to best handle notification
> > and
> > > > kicks.
> > > > > > > The
> > > > > > > > > existing
> > > > > > > > > > > > > > vhost-user framework uses eventfd to signal the
> > daemon
> > > > > > > (although
> > > > > > > > > QEMU
> > > > > > > > > > > > > > is quite capable of simulating them when you use
> > TCG).
> > > > Xen
> > > > > > > has
> > > > > > > > > it's own
> > > > > > > > > > > > > > IOREQ mechanism. However latency is an important
> > > > factor and
> > > > > > > > > having
> > > > > > > > > > > > > > events go through the stub would add quite a lot.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yeah I think, regardless of anything else, we want
> > the
> > > > > > > backends to
> > > > > > > > > > > > > connect directly to the Xen hypervisor.
> > > > > > > > > > > >
> > > > > > > > > > > > In my approach,
> > > > > > > > > > > >  a) BE -> FE: interrupts triggered by BE calling a
> > > > hypervisor
> > > > > > > > > interface
> > > > > > > > > > > >               via virtio-proxy
> > > > > > > > > > > >  b) FE -> BE: MMIO to config raises events (in event
> > > > channels),
> > > > > > > > > which is
> > > > > > > > > > > >               converted to a callback to BE via
> > virtio-
> > > > proxy
> > > > > > > > > > > >               (Xen's event channel is internnally
> > > > implemented by
> > > > > > > > > interrupts.)
> > > > > > > > > > > >
> > > > > > > > > > > > I don't know what "connect directly" means here, but
> > > > sending
> > > > > > > > > interrupts
> > > > > > > > > > > > to the opposite side would be best efficient.
> > > > > > > > > > > > Ivshmem, I suppose, takes this approach by utilizing
> > PCI's
> > > > msi-x
> > > > > > > > > mechanism.
> > > > > > > > > > >
> > > > > > > > > > > Agree that MSI would be more efficient than SPI...
> > > > > > > > > > > At the moment, in order to notify the frontend, the
> > backend
> > > > issues
> > > > > > > a
> > > > > > > > > specific device-model call to query Xen to inject a
> > > > corresponding SPI
> > > > > > > to
> > > > > > > > > the guest.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Could we consider the kernel internally converting
> > > > IOREQ
> > > > > > > > > messages from
> > > > > > > > > > > > > > the Xen hypervisor to eventfd events? Would this
> > scale
> > > > with
> > > > > > > > > other kernel
> > > > > > > > > > > > > > hypercall interfaces?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So any thoughts on what directions are worth
> > > > experimenting
> > > > > > > with?
> > > > > > > > > > > > >
> > > > > > > > > > > > > One option we should consider is for each backend to
> > > > connect
> > > > > > > to
> > > > > > > > > Xen via
> > > > > > > > > > > > > the IOREQ interface. We could generalize the IOREQ
> > > > interface
> > > > > > > and
> > > > > > > > > make it
> > > > > > > > > > > > > hypervisor agnostic. The interface is really trivial
> > and
> > > > easy
> > > > > > > to
> > > > > > > > > add.
> > > > > > > > > > > >
> > > > > > > > > > > > As I said above, my proposal does the same thing that
> > you
> > > > > > > mentioned
> > > > > > > > > here :)
> > > > > > > > > > > > The difference is that I do call hypervisor interfaces
> > via
> > > > > > > virtio-
> > > > > > > > > proxy.
> > > > > > > > > > > >
> > > > > > > > > > > > > The only Xen-specific part is the notification
> > mechanism,
> > > > > > > which is
> > > > > > > > > an
> > > > > > > > > > > > > event channel. If we replaced the event channel with
> > > > something
> > > > > > > > > else the
> > > > > > > > > > > > > interface would be generic. See:
> > > > > > > > > > > > > https://gitlab.com/xen-project/xen/-
> > > > > > > > > /blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > > > > > > > > > >
> > > > > > > > > > > > > I don't think that translating IOREQs to eventfd in
> > the
> > > > kernel
> > > > > > > is
> > > > > > > > > a
> > > > > > > > > > > > > good idea: if feels like it would be extra
> > complexity
> > > > and that
> > > > > > > the
> > > > > > > > > > > > > kernel shouldn't be involved as this is a backend-
> > > > hypervisor
> > > > > > > > > interface.
> > > > > > > > > > > >
> > > > > > > > > > > > Given that we may want to implement BE as a bare-metal
> > > > > > > application
> > > > > > > > > > > > as I did on Zephyr, I don't think that the translation
> > > > would not
> > > > > > > be
> > > > > > > > > > > > a big issue, especially on RTOS's.
> > > > > > > > > > > > It will be some kind of abstraction layer of interrupt
> > > > handling
> > > > > > > > > > > > (or nothing but a callback mechanism).
> > > > > > > > > > > >
> > > > > > > > > > > > > Also, eventfd is very Linux-centric and we are
> > trying to
> > > > > > > design an
> > > > > > > > > > > > > interface that could work well for RTOSes too. If we
> > > > want to
> > > > > > > do
> > > > > > > > > > > > > something different, both OS-agnostic and
> > hypervisor-
> > > > agnostic,
> > > > > > > > > perhaps
> > > > > > > > > > > > > we could design a new interface. One that could be
> > > > > > > implementable
> > > > > > > > > in the
> > > > > > > > > > > > > Xen hypervisor itself (like IOREQ) and of course any
> > > > other
> > > > > > > > > hypervisor
> > > > > > > > > > > > > too.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > There is also another problem. IOREQ is probably not
> > be
> > > > the
> > > > > > > only
> > > > > > > > > > > > > interface needed. Have a look at
> > > > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2.
> > > > Don't we
> > > > > > > > > also need
> > > > > > > > > > > > > an interface for the backend to inject interrupts
> > into
> > > > the
> > > > > > > > > frontend? And
> > > > > > > > > > > > > if the backend requires dynamic memory mappings of
> > > > frontend
> > > > > > > pages,
> > > > > > > > > then
> > > > > > > > > > > > > we would also need an interface to map/unmap domU
> > pages.
> > > > > > > > > > > >
> > > > > > > > > > > > My proposal document might help here; All the
> > interfaces
> > > > > > > required
> > > > > > > > > for
> > > > > > > > > > > > virtio-proxy (or hypervisor-related interfaces) are
> > listed
> > > > as
> > > > > > > > > > > > RPC protocols :)
> > > > > > > > > > > >
> > > > > > > > > > > > > These interfaces are a lot more problematic than
> > IOREQ:
> > > > IOREQ
> > > > > > > is
> > > > > > > > > tiny
> > > > > > > > > > > > > and self-contained. It is easy to add anywhere. A
> > new
> > > > > > > interface to
> > > > > > > > > > > > > inject interrupts or map pages is more difficult to
> > > > manage
> > > > > > > because
> > > > > > > > > it
> > > > > > > > > > > > > would require changes scattered across the various
> > > > emulators.
> > > > > > > > > > > >
> > > > > > > > > > > > Exactly. I have no confident yet that my approach will
> > > > also
> > > > > > > apply
> > > > > > > > > > > > to other hypervisors than Xen.
> > > > > > > > > > > > Technically, yes, but whether people can accept it or
> > not
> > > > is a
> > > > > > > > > different
> > > > > > > > > > > > matter.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > -Takahiro Akashi
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Regards,
> > > > > > > > > > >
> > > > > > > > > > > Oleksandr Tyshchenko


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-17 10:41     ` [virtio-dev] " Stefan Hajnoczi
@ 2021-09-01 12:53       ` Alex Bennée
  -1 siblings, 0 replies; 66+ messages in thread
From: Alex Bennée @ 2021-09-01 12:53 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefano Stabellini, Stratos Mailing List, virtio-dev,
	Arnd Bergmann, Viresh Kumar, AKASHI Takahiro, Stefano Stabellini,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel, Elena Afanasova


Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
>> > Could we consider the kernel internally converting IOREQ messages from
>> > the Xen hypervisor to eventfd events? Would this scale with other kernel
>> > hypercall interfaces?
>> > 
>> > So any thoughts on what directions are worth experimenting with?
>>  
>> One option we should consider is for each backend to connect to Xen via
>> the IOREQ interface. We could generalize the IOREQ interface and make it
>> hypervisor agnostic. The interface is really trivial and easy to add.
>> The only Xen-specific part is the notification mechanism, which is an
>> event channel. If we replaced the event channel with something else the
>> interface would be generic. See:
>> https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
>
> There have been experiments with something kind of similar in KVM
> recently (see struct ioregionfd_cmd):
> https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanasova@gmail.com/

Reading the cover letter was very useful in showing how this provides a
separate channel for signalling IO events to userspace instead of using
the normal type-2 vmexit event path. I wonder how deeply tied the
userspace-facing side of this is to KVM? Could it provide a common
FD-type interface to IOREQ?
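
To make that concrete, here is a minimal sketch of what a common FD-type
event record could look like if IOREQs (or ioregionfd commands) were
surfaced to backends through one shared interface. Everything below
(struct vio_event, VIO_DIR_*) is hypothetical and purely illustrative -
it is not an existing Xen, KVM or Linux ABI:

/* Hypothetical hypervisor-agnostic IO event, as a backend might read it
 * from a file descriptor. Sketch only - none of these names exist today. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

enum vio_dir { VIO_DIR_READ = 0, VIO_DIR_WRITE = 1 };

struct vio_event {
    uint64_t addr;   /* guest-physical address of the access          */
    uint64_t data;   /* value written, or to be filled in for a read  */
    uint32_t size;   /* access width in bytes: 1, 2, 4 or 8           */
    uint8_t  dir;    /* VIO_DIR_READ or VIO_DIR_WRITE                 */
    uint8_t  pad[3];
};

/* Backend event loop: block on the fd, handle one access, reply. */
static int vio_handle_events(int fd)
{
    struct vio_event ev;

    while (read(fd, &ev, sizeof(ev)) == sizeof(ev)) {
        if (ev.dir == VIO_DIR_WRITE)
            printf("guest wrote 0x%" PRIx64 " to 0x%" PRIx64 "\n",
                   ev.data, ev.addr);
        else
            ev.data = 0;  /* device model fills in the value to return */

        /* completion is signalled back over the same fd */
        if (write(fd, &ev, sizeof(ev)) != sizeof(ev))
            return -1;
    }
    return 0;
}

Whether the translation into such a format lives in the kernel or in a
small per-hypervisor stub is exactly the trade-off discussed below.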

As I understand IOREQ, this is currently a direct communication between
userspace and the hypervisor using the existing Xen message bus. My
worry would be that by adding knowledge of what the underlying
hypervisor is we'd end up with excess complexity in the kernel. For one
thing we certainly wouldn't want the kernel to have an API version
dependency on which version of the Xen hypervisor it was running on.
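
For comparison, the hypervisor-agnostic kernel ABI being talked about
might look something like the sketch below. All of the names here (the
/dev/vhost-agnostic node, the VHA_* ioctls) are invented for
illustration; nothing like this exists upstream:

/* Sketch of a hypothetical, hypervisor-agnostic kernel UAPI. The point
 * is that userspace never needs to know which hypervisor, or which Xen
 * version, sits underneath. */
#include <stdint.h>
#include <sys/ioctl.h>

#define VHA_IOC_MAGIC 0xA6               /* invented ioctl magic */

struct vha_map_guest {
    uint32_t guest_id;  /* opaque guest handle from configuration */
    uint32_t prot;      /* PROT_READ | PROT_WRITE                 */
    uint64_t gpa;       /* guest-physical start address           */
    uint64_t len;       /* length in bytes                        */
    uint64_t uaddr;     /* out: userspace address of the mapping  */
};

/* "map guest N memory to a userspace pointer", plus kick/call eventfds */
#define VHA_MAP_GUEST_MEM _IOWR(VHA_IOC_MAGIC, 1, struct vha_map_guest)
#define VHA_GET_KICK_FD   _IOR(VHA_IOC_MAGIC, 2, int)  /* FE -> BE events */
#define VHA_GET_CALL_FD   _IOR(VHA_IOC_MAGIC, 3, int)  /* BE -> FE notify */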

>> There is also another problem. IOREQ is probably not be the only
>> interface needed. Have a look at
>> https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
>> an interface for the backend to inject interrupts into the frontend? And
>> if the backend requires dynamic memory mappings of frontend pages, then
>> we would also need an interface to map/unmap domU pages.
>> 
>> These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
>> and self-contained. It is easy to add anywhere. A new interface to
>> inject interrupts or map pages is more difficult to manage because it
>> would require changes scattered across the various emulators.
>
> Something like ioreq is indeed necessary to implement arbitrary devices,
> but if you are willing to restrict yourself to VIRTIO then other
> interfaces are possible too because the VIRTIO device model is different
> from the general purpose x86 PIO/MMIO that Xen's ioreq seems to
> support.

It's true our focus is just VirtIO, which does support alternative
transport options; however most implementations seem to be targeting
virtio-mmio for its relative simplicity and understood semantics
(modulo a desire for MSI to reduce the round-trip latency of handling
signalling).
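
For reference, the register window a virtio-mmio backend ends up
servicing is small; however the access is delivered (IOREQ, ioregionfd
or some common interface), the backend side is roughly the dispatch
below. The register offsets are the ones defined by the virtio-mmio
transport in the VirtIO spec; the surrounding code is only a sketch:

#include <stdint.h>

/* A few virtio-mmio register offsets (VirtIO spec, MMIO transport). */
#define VIRTIO_MMIO_MAGIC_VALUE  0x000   /* reads as 0x74726976 ("virt") */
#define VIRTIO_MMIO_DEVICE_ID    0x008
#define VIRTIO_MMIO_QUEUE_NOTIFY 0x050
#define VIRTIO_MMIO_STATUS       0x070

struct virtio_dev {
    uint32_t device_id;
    uint32_t status;
};

static void queue_kicked(struct virtio_dev *dev, uint32_t vq)
{
    /* pop and process virtqueue 'vq' descriptors here */
    (void)dev;
    (void)vq;
}

/* Handle one trapped 32-bit access to the device's MMIO window. */
static uint32_t mmio_access(struct virtio_dev *dev, uint64_t offset,
                            int is_write, uint32_t value)
{
    switch (offset) {
    case VIRTIO_MMIO_MAGIC_VALUE:
        return 0x74726976;
    case VIRTIO_MMIO_DEVICE_ID:
        return dev->device_id;
    case VIRTIO_MMIO_QUEUE_NOTIFY:
        if (is_write)
            queue_kicked(dev, value);    /* the "kick" from the guest */
        return 0;
    case VIRTIO_MMIO_STATUS:
        if (is_write)
            dev->status = value;
        return dev->status;
    default:
        return 0;                        /* remaining registers elided */
    }
}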

>
> Stefan
>


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-01 12:29                           ` AKASHI Takahiro
@ 2021-09-01 16:26                             ` Oleksandr Tyshchenko
  2021-09-02  1:30                             ` Wei Chen
  1 sibling, 0 replies; 66+ messages in thread
From: Oleksandr Tyshchenko @ 2021-09-01 16:26 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Wei Chen, Stefano Stabellini, Alex Benn??e, Kaly Xin,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant, nd,
	Xen Devel


Hi Akashi,

I am sorry for the possible format issues.


>
> > > >
> > > > It's a RFC discussion. We have tested this RFC patch internally.
> > > > https://lists.xenproject.org/archives/html/xen-devel/2021-
> > > 07/msg01532.html
> > >
> > > I'm afraid that I miss something here, but I don't know
> > > why this proposed API will lead to eliminating 'mmap' in accessing
> > > the queued payload at every request?
> > >
> >
> > This API give Xen device model (QEMU or kvmtool) the ability to map
> > whole guest RAM in device model's address space. In this case, device
> > model doesn't need dynamic hypercall to map/unmap payload memory.
> > It can use a flat offset to access payload memory in its address
> > space directly. Just Like KVM device model does now.
>
Yes!


>
> Thank you. Quickly, let me make sure one thing:
> This API itself doesn't do any mapping operations, right?


Right. The only purpose of that "API" is to query the hypervisor for
unallocated address space ranges to map the foreign pages into (instead of
stealing real RAM pages).
In a nutshell, if you try to map the whole guest memory in the backend
address space on Arm (or even cache some mappings) you might end up with
memory exhaustion in the backend domain (XSA-300), and the possibility of
hitting XSA-300 is higher if your backend needs to serve several Guests.
Of course, this depends on the memory assigned to the backend domain and
to the Guest(s) it serves...
We believe that with the proposed solution the backend will be able to
handle Guest(s) without wasting its real RAM. However, please note that
the proposed Xen + Linux changes which are on review now [1] are far from
the final solution and require rework and some prereq work to operate in a
proper and safe way.
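
To illustrate the idea (the interface and names below are purely
hypothetical; the real interface is what is being worked out in [1]):
today each foreign mapping needs a real, ballooned-out RAM page of the
backend domain as its target, whereas the proposal is to get the target
range from the hypervisor instead:

/* Illustrative declarations only - nothing here exists in Xen or Linux.
 * The point: the mapping *target* comes from a hypervisor-provided range
 * of guaranteed-unallocated guest-physical address space, so no real RAM
 * page of the backend domain is consumed per mapping (the XSA-300 worry). */
#include <stdint.h>

struct safe_range {
    uint64_t base_gfn;  /* first frame of the unallocated range */
    uint64_t nr_gfns;   /* number of frames in the range        */
};

/* Ask the hypervisor/kernel for a safe, unallocated range to map into. */
int safe_range_query(struct safe_range *out, uint64_t nr_gfns);

/* Map frontend frame 'fe_gfn' of domain 'fe_domid' at 'be_gfn' inside
 * that range, instead of at a ballooned-out real RAM page.             */
int safe_range_map(uint64_t be_gfn, uint32_t fe_domid, uint64_t fe_gfn);
int safe_range_unmap(uint64_t be_gfn);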


>
> So I suppose that virtio BE guest is responsible to
> 1) fetch the information about all the memory regions in FE,
> 2) call this API to allocate a big chunk of unused space in BE,
> 3) create grant/foreign mappings for FE onto this region(S)
> in the initialization/configuration of emulated virtio devices.
>
> Is this the way this API is expected to be used?
>

Not really, the userspace backend doesn't need to call this API at all;
all the backend still calls is
xenforeignmemory_map()/xenforeignmemory_unmap(), so let's say the "magic"
is done by Linux and Xen internally.
You can take a look at the virtio-disk PoC [2] (last 4 patches) to better
understand what Wei and I are talking about. There we map the Guest memory
at the beginning and just calculate a pointer at runtime. Again, the code
is not in good shape, but it is enough to demonstrate the feasibility of
the improvement.
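
For anyone who has not looked at the PoC, the essential pattern is
sketched below, assuming the standard libxenforeignmemory calls, 4K
guest pages, and a guest RAM base/size known from configuration (which,
as noted, is exactly open question (1)):

#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <xenforeignmemory.h>

#define PAGE_SHIFT 12                    /* assuming 4K guest pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

struct guest_ram {
    void     *base;                      /* mapping in the backend's space */
    uint64_t  base_gfn;                  /* first guest frame mapped       */
    size_t    nr_pages;
};

/* Map a contiguous guest RAM region once, at initialisation time. */
static int guest_ram_map(xenforeignmemory_handle *fmem, uint32_t domid,
                         uint64_t base_gfn, size_t nr_pages,
                         struct guest_ram *ram)
{
    xen_pfn_t *gfns = calloc(nr_pages, sizeof(*gfns));
    int *errs = calloc(nr_pages, sizeof(*errs));

    if (!gfns || !errs) {
        free(gfns);
        free(errs);
        return -1;
    }
    for (size_t i = 0; i < nr_pages; i++)
        gfns[i] = base_gfn + i;

    ram->base = xenforeignmemory_map(fmem, domid, PROT_READ | PROT_WRITE,
                                     nr_pages, gfns, errs);
    ram->base_gfn = base_gfn;
    ram->nr_pages = nr_pages;
    free(gfns);
    free(errs);
    return ram->base ? 0 : -1;
}

/* Turn a guest-physical address from a virtqueue descriptor into a
 * pointer - no per-request hypercall, just an offset calculation. */
static void *guest_ram_ptr(struct guest_ram *ram, uint64_t gpa)
{
    uint64_t off = gpa - (ram->base_gfn << PAGE_SHIFT);

    if (off >= (uint64_t)ram->nr_pages * PAGE_SIZE)
        return NULL;                     /* outside the mapped region */
    return (char *)ram->base + off;
}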



> Does Xen already has an interface for (1)?
>

I am not aware of anything existing. For the PoC I guessed the Guest memory
layout in a really hackish way (I got the total Guest memory size and,
having GUEST_RAMX_BASE/GUEST_RAMX_SIZE in hand, just performed the
calculation). Definitely, it is a no-go, so 1) deserves additional
discussion/design.

[1]
https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/
https://lore.kernel.org/lkml/1627490656-1267-1-git-send-email-olekstysh@gmail.com/
https://lore.kernel.org/lkml/1627490656-1267-2-git-send-email-olekstysh@gmail.com/
[2]
https://github.com/otyshchenko1/virtio-disk/commits/map_opt_next
-- 
Regards,

Oleksandr Tyshchenko


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-01 12:29                           ` AKASHI Takahiro
  2021-09-01 16:26                             ` Oleksandr Tyshchenko
@ 2021-09-02  1:30                             ` Wei Chen
  2021-09-02  1:50                               ` Wei Chen
  1 sibling, 1 reply; 66+ messages in thread
From: Wei Chen @ 2021-09-02  1:30 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Alex Benn??e, Kaly Xin,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant, nd,
	Xen Devel

Hi Akashi,

> -----Original Message-----
> From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> Sent: 1 September 2021 20:29
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>; Kaly Xin
> <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-lists.linaro.org>;
> virtio-dev@lists.oasis-open.org; Arnd Bergmann <arnd.bergmann@linaro.org>;
> Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> <paul@xen.org>; nd <nd@arm.com>; Xen Devel <xen-devel@lists.xen.org>
> Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> 
> Hi Wei,
> 
> On Wed, Sep 01, 2021 at 11:12:58AM +0000, Wei Chen wrote:
> > Hi Akashi,
> >
> > > -----Original Message-----
> > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > Sent: 31 August 2021 14:18
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>; Kaly
> Xin
> > > <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-
> lists.linaro.org>;
> > > virtio-dev@lists.oasis-open.org; Arnd Bergmann
> <arnd.bergmann@linaro.org>;
> > > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> Julien
> > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > >
> > > Wei,
> > >
> > > On Thu, Aug 26, 2021 at 12:10:19PM +0000, Wei Chen wrote:
> > > > Hi Akashi,
> > > >
> > > > > -----Original Message-----
> > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > Sent: 26 August 2021 17:41
> > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > > > > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>;
> Kaly
> > > Xin
> > > > > <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-
> > > lists.linaro.org>;
> > > > > virtio-dev@lists.oasis-open.org; Arnd Bergmann
> > > <arnd.bergmann@linaro.org>;
> > > > > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> <cvanscha@qti.qualcomm.com>;
> > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> Jean-
> > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> > > Julien
> > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> Durrant
> > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > >
> > > > > Hi Wei,
> > > > >
> > > > > On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
> > > > > > On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
> > > > > > > Hi Akashi,
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > > > > Sent: 18 August 2021 13:39
> > > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> > > Stabellini
> > > > > > > > <sstabellini@kernel.org>; Alex Benn??e
> <alex.bennee@linaro.org>;
> > > > > Stratos
> > > > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > > > > dev@lists.oasis-
> > > > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh
> Kumar
> > > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
> Kiszka
> > > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > > > > <cvanscha@qti.qualcomm.com>;
> > > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri
> <vatsa@codeaurora.org>;
> > > > > Jean-
> > > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > > <Artem_Mygaiev@epam.com>;
> > > > > Julien
> > > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>;
> Paul
> > > > > Durrant
> > > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> backends
> > > > > > > >
> > > > > > > > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
> > > > > > > > > Hi Akashi,
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > > > > > > Sent: 17 August 2021 16:08
> > > > > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> > > > > Stabellini
> > > > > > > > > > <sstabellini@kernel.org>; Alex Benn??e
> > > <alex.bennee@linaro.org>;
> > > > > > > > Stratos
> > > > > > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > > > > > > > dev@lists.oasis-
> > > > > > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>;
> Viresh
> > > Kumar
> > > > > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com;
> Jan
> > > Kiszka
> > > > > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > > > > <cvanscha@qti.qualcomm.com>;
> > > > > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri
> > > <vatsa@codeaurora.org>;
> > > > > Jean-
> > > > > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu
> Poirier
> > > > > > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > > > > <Artem_Mygaiev@epam.com>;
> > > > > > > > Julien
> > > > > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>;
> > > Paul
> > > > > Durrant
> > > > > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> > > backends
> > > > > > > > > >
> > > > > > > > > > Hi Wei, Oleksandr,
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> > > > > > > > > > > Hi All,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for Stefano to link my kvmtool for Xen proposal
> > > here.
> > > > > > > > > > > This proposal is still discussing in Xen and KVM
> > > communities.
> > > > > > > > > > > The main work is to decouple the kvmtool from KVM and
> make
> > > > > > > > > > > other hypervisors can reuse the virtual device
> > > implementations.
> > > > > > > > > > >
> > > > > > > > > > > In this case, we need to introduce an intermediate
> > > hypervisor
> > > > > > > > > > > layer for VMM abstraction, Which is, I think it's very
> > > close
> > > > > > > > > > > to stratos' virtio hypervisor agnosticism work.
> > > > > > > > > >
> > > > > > > > > > # My proposal[1] comes from my own idea and doesn't
> always
> > > > > represent
> > > > > > > > > > # Linaro's view on this subject nor reflect Alex's
> concerns.
> > > > > > > > Nevertheless,
> > > > > > > > > >
> > > > > > > > > > Your idea and my proposal seem to share the same
> background.
> > > > > > > > > > Both have the similar goal and currently start with, at
> > > first,
> > > > > Xen
> > > > > > > > > > and are based on kvm-tool. (Actually, my work is derived
> > > from
> > > > > > > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
> > > > > > > > > >
> > > > > > > > > > In particular, the abstraction of hypervisor interfaces
> has
> > > a
> > > > > same
> > > > > > > > > > set of interfaces (for your "struct vmm_impl" and my
> "RPC
> > > > > interfaces").
> > > > > > > > > > This is not co-incident as we both share the same origin
> as
> > > I
> > > > > said
> > > > > > > > above.
> > > > > > > > > > And so we will also share the same issues. One of them
> is a
> > > way
> > > > > of
> > > > > > > > > > "sharing/mapping FE's memory". There is some trade-off
> > > between
> > > > > > > > > > the portability and the performance impact.
> > > > > > > > > > So we can discuss the topic here in this ML, too.
> > > > > > > > > > (See Alex's original email, too).
> > > > > > > > > >
> > > > > > > > > Yes, I agree.
> > > > > > > > >
> > > > > > > > > > On the other hand, my approach aims to create a "single-
> > > binary"
> > > > > > > > solution
> > > > > > > > > > in which the same binary of BE vm could run on any
> > > hypervisors.
> > > > > > > > > > Somehow similar to your "proposal-#2" in [2], but in my
> > > solution,
> > > > > all
> > > > > > > > > > the hypervisor-specific code would be put into another
> > > entity
> > > > > (VM),
> > > > > > > > > > named "virtio-proxy" and the abstracted operations are
> > > served
> > > > > via RPC.
> > > > > > > > > > (In this sense, BE is hypervisor-agnostic but might have
> OS
> > > > > > > > dependency.)
> > > > > > > > > > But I know that we need discuss if this is a requirement
> > > even
> > > > > > > > > > in Stratos project or not. (Maybe not)
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Sorry, I haven't had time to finish reading your virtio-
> proxy
> > > > > completely
> > > > > > > > > (I will do it ASAP). But from your description, it seems
> we
> > > need a
> > > > > > > > > 3rd VM between FE and BE? My concern is that, if my
> assumption
> > > is
> > > > > right,
> > > > > > > > > will it increase the latency in data transport path? Even
> if
> > > we're
> > > > > > > > > using some lightweight guest like RTOS or Unikernel,
> > > > > > > >
> > > > > > > > Yes, you're right. But I'm afraid that it is a matter of
> degree.
> > > > > > > > As far as we execute 'mapping' operations at every fetch of
> > > payload,
> > > > > > > > we will see latency issue (even in your case) and if we have
> > > some
> > > > > solution
> > > > > > > > for it, we won't see it neither in my proposal :)
> > > > > > > >
> > > > > > >
> > > > > > > Oleksandr has sent a proposal to Xen mailing list to reduce
> this
> > > kind
> > > > > > > of "mapping/unmapping" operations. So the latency caused by
> this
> > > > > behavior
> > > > > > > on Xen may eventually be eliminated, and Linux-KVM doesn't
> have
> > > that
> > > > > problem.
> > > > > >
> > > > > > Obviously, I have not yet caught up there in the discussion.
> > > > > > Which patch specifically?
> > > > >
> > > > > Can you give me the link to the discussion or patch, please?
> > > > >
> > > >
> > > > It's a RFC discussion. We have tested this RFC patch internally.
> > > > https://lists.xenproject.org/archives/html/xen-devel/2021-
> > > 07/msg01532.html
> > >
> > > I'm afraid that I miss something here, but I don't know
> > > why this proposed API will lead to eliminating 'mmap' in accessing
> > > the queued payload at every request?
> > >
> >
> > This API give Xen device model (QEMU or kvmtool) the ability to map
> > whole guest RAM in device model's address space. In this case, device
> > model doesn't need dynamic hypercall to map/unmap payload memory.
> > It can use a flat offset to access payload memory in its address
> > space directly. Just Like KVM device model does now.
> 
> Thank you. Quickly, let me make sure one thing:
> This API itself doesn't do any mapping operations, right?
> So I suppose that virtio BE guest is responsible to
> 1) fetch the information about all the memory regions in FE,
> 2) call this API to allocate a big chunk of unused space in BE,
> 3) create grant/foreign mappings for FE onto this region(S)
> in the initialization/configuration of emulated virtio devices.
> 
> Is this the way this API is expected to be used?
> Does Xen already has an interface for (1)?
> 

They are still discussing in that thread how to do it properly.
Because this API is common, both x86 and Arm have to be considered.

> -Takahiro Akashi
> 
> > Before this API, When device model to map whole guest memory, will
> > severely consume the physical pages of Dom-0/Dom-D.
> >
> > > -Takahiro Akashi
> > >
> > >
> > > > > Thanks,
> > > > > -Takahiro Akashi
> > > > >
> > > > > > -Takahiro Akashi
> > > > > >
> > > > > > > > > > Specifically speaking about kvm-tool, I have a concern
> about
> > > its
> > > > > > > > > > license term; Targeting different hypervisors and
> different
> > > OSs
> > > > > > > > > > (which I assume includes RTOS's), the resultant library
> > > should
> > > > > be
> > > > > > > > > > license permissive and GPL for kvm-tool might be an
> issue.
> > > > > > > > > > Any thoughts?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes. If user want to implement a FreeBSD device model, but
> the
> > > > > virtio
> > > > > > > > > library is GPL. Then GPL would be a problem. If we have
> > > another
> > > > > good
> > > > > > > > > candidate, I am open to it.
> > > > > > > >
> > > > > > > > I have some candidates, particularly for vq/vring, in my
> mind:
> > > > > > > > * Open-AMP, or
> > > > > > > > * corresponding Free-BSD code
> > > > > > > >
> > > > > > >
> > > > > > > Interesting, I will look into them : )
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Wei Chen
> > > > > > >
> > > > > > > > -Takahiro Akashi
> > > > > > > >
> > > > > > > >
> > > > > > > > > > -Takahiro Akashi
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
> dev/2021-
> > > > > > > > > > August/000548.html
> > > > > > > > > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > > > > > > > > > > > Sent: 14 August 2021 23:38
> > > > > > > > > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>;
> > > Stefano
> > > > > > > > Stabellini
> > > > > > > > > > <sstabellini@kernel.org>
> > > > > > > > > > > > Cc: Alex Benn??e <alex.bennee@linaro.org>; Stratos
> > > Mailing
> > > > > List
> > > > > > > > > > <stratos-dev@op-lists.linaro.org>; virtio-
> dev@lists.oasis-
> > > > > open.org;
> > > > > > > > Arnd
> > > > > > > > > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com;
> Jan
> > > Kiszka
> > > > > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > > > > <cvanscha@qti.qualcomm.com>;
> > > > > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri
> > > <vatsa@codeaurora.org>;
> > > > > Jean-
> > > > > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu
> Poirier
> > > > > > > > > > <mathieu.poirier@linaro.org>; Wei Chen
> <Wei.Chen@arm.com>;
> > > > > Oleksandr
> > > > > > > > > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand
> Marquis
> > > > > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > > > > <Artem_Mygaiev@epam.com>;
> > > > > > > > Julien
> > > > > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>;
> > > Paul
> > > > > Durrant
> > > > > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for
> VirtIO
> > > > > backends
> > > > > > > > > > > >
> > > > > > > > > > > > Hello, all.
> > > > > > > > > > > >
> > > > > > > > > > > > Please see some comments below. And sorry for the
> > > possible
> > > > > format
> > > > > > > > > > issues.
> > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> > > > > > > > > > <mailto:takahiro.akashi@linaro.org> wrote:
> > > > > > > > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano
> > > > > Stabellini
> > > > > > > > wrote:
> > > > > > > > > > > > > > CCing people working on Xen+VirtIO and IOREQs.
> Not
> > > > > trimming
> > > > > > > > the
> > > > > > > > > > original
> > > > > > > > > > > > > > email to let them read the full context.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > My comments below are related to a potential Xen
> > > > > > > > implementation,
> > > > > > > > > > not
> > > > > > > > > > > > > > because it is the only implementation that
> matters,
> > > but
> > > > > > > > because it
> > > > > > > > > > is
> > > > > > > > > > > > > > the one I know best.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please note that my proposal (and hence the
> working
> > > > > prototype)[1]
> > > > > > > > > > > > > is based on Xen's virtio implementation (i.e.
> IOREQ)
> > > and
> > > > > > > > > > particularly
> > > > > > > > > > > > > EPAM's virtio-disk application (backend server).
> > > > > > > > > > > > > It has been, I believe, well generalized but is
> still
> > > a
> > > > > bit
> > > > > > > > biased
> > > > > > > > > > > > > toward this original design.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So I hope you like my approach :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
> > > > > dev/2021-
> > > > > > > > > > August/000546.html
> > > > > > > > > > > > >
> > > > > > > > > > > > > Let me take this opportunity to explain a bit more
> > > about
> > > > > my
> > > > > > > > approach
> > > > > > > > > > below.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Also, please see this relevant email thread:
> > > > > > > > > > > > > > https://marc.info/?l=xen-
> devel&m=162373754705233&w=2
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > One of the goals of Project Stratos is to
> enable
> > > > > hypervisor
> > > > > > > > > > agnostic
> > > > > > > > > > > > > > > backends so we can enable as much re-use of
> code
> > > as
> > > > > possible
> > > > > > > > and
> > > > > > > > > > avoid
> > > > > > > > > > > > > > > repeating ourselves. This is the flip side of
> the
> > > > > front end
> > > > > > > > > > where
> > > > > > > > > > > > > > > multiple front-end implementations are
> required -
> > > one
> > > > > per OS,
> > > > > > > > > > assuming
> > > > > > > > > > > > > > > you don't just want Linux guests. The
> resultant
> > > guests
> > > > > are
> > > > > > > > > > trivially
> > > > > > > > > > > > > > > movable between hypervisors modulo any
> abstracted
> > > > > paravirt
> > > > > > > > type
> > > > > > > > > > > > > > > interfaces.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In my original thumb nail sketch of a solution
> I
> > > > > envisioned
> > > > > > > > > > vhost-user
> > > > > > > > > > > > > > > daemons running in a broadly POSIX like
> > > environment.
> > > > > The
> > > > > > > > > > interface to
> > > > > > > > > > > > > > > the daemon is fairly simple requiring only
> some
> > > mapped
> > > > > > > > memory
> > > > > > > > > > and some
> > > > > > > > > > > > > > > sort of signalling for events (on Linux this
> is
> > > > > eventfd).
> > > > > > > > The
> > > > > > > > > > idea was a
> > > > > > > > > > > > > > > stub binary would be responsible for any
> > > hypervisor
> > > > > specific
> > > > > > > > > > setup and
> > > > > > > > > > > > > > > then launch a common binary to deal with the
> > > actual
> > > > > > > > virtqueue
> > > > > > > > > > requests
> > > > > > > > > > > > > > > themselves.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Since that original sketch we've seen an
> expansion
> > > in
> > > > > the
> > > > > > > > sort
> > > > > > > > > > of ways
> > > > > > > > > > > > > > > backends could be created. There is interest
> in
> > > > > > > > encapsulating
> > > > > > > > > > backends
> > > > > > > > > > > > > > > in RTOSes or unikernels for solutions like
> SCMI.
> > > There
> > > > > > > > interest
> > > > > > > > > > in Rust
> > > > > > > > > > > > > > > has prompted ideas of using the trait
> interface to
> > > > > abstract
> > > > > > > > > > differences
> > > > > > > > > > > > > > > away as well as the idea of bare-metal Rust
> > > backends.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We have a card (STR-12) called "Hypercall
> > > > > Standardisation"
> > > > > > > > which
> > > > > > > > > > > > > > > calls for a description of the APIs needed
> from
> > > the
> > > > > > > > hypervisor
> > > > > > > > > > side to
> > > > > > > > > > > > > > > support VirtIO guests and their backends.
> However
> > > we
> > > > > are
> > > > > > > > some
> > > > > > > > > > way off
> > > > > > > > > > > > > > > from that at the moment as I think we need to
> at
> > > least
> > > > > > > > > > demonstrate one
> > > > > > > > > > > > > > > portable backend before we start codifying
> > > > > requirements. To
> > > > > > > > that
> > > > > > > > > > end I
> > > > > > > > > > > > > > > want to think about what we need for a backend
> to
> > > > > function.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Configuration
> > > > > > > > > > > > > > > =============
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In the type-2 setup this is typically fairly
> > > simple
> > > > > because
> > > > > > > > the
> > > > > > > > > > host
> > > > > > > > > > > > > > > system can orchestrate the various modules
> that
> > > make
> > > > > up the
> > > > > > > > > > complete
> > > > > > > > > > > > > > > system. In the type-1 case (or even type-2
> with
> > > > > delegated
> > > > > > > > > > service VMs)
> > > > > > > > > > > > > > > we need some sort of mechanism to inform the
> > > backend
> > > > > VM
> > > > > > > > about
> > > > > > > > > > key
> > > > > > > > > > > > > > > details about the system:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   - where virt queue memory is in it's address
> > > space
> > > > > > > > > > > > > > >   - how it's going to receive (interrupt) and
> > > trigger
> > > > > (kick)
> > > > > > > > > > events
> > > > > > > > > > > > > > >   - what (if any) resources the backend needs
> to
> > > > > connect to
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Obviously you can elide over configuration
> issues
> > > by
> > > > > having
> > > > > > > > > > static
> > > > > > > > > > > > > > > configurations and baking the assumptions into
> > > your
> > > > > guest
> > > > > > > > images
> > > > > > > > > > however
> > > > > > > > > > > > > > > this isn't scalable in the long term. The
> obvious
> > > > > solution
> > > > > > > > seems
> > > > > > > > > > to be
> > > > > > > > > > > > > > > extending a subset of Device Tree data to user
> > > space
> > > > > but
> > > > > > > > perhaps
> > > > > > > > > > there
> > > > > > > > > > > > > > > are other approaches?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Before any virtio transactions can take place
> the
> > > > > > > > appropriate
> > > > > > > > > > memory
> > > > > > > > > > > > > > > mappings need to be made between the FE guest
> and
> > > the
> > > > > BE
> > > > > > > > guest.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Currently the whole of the FE guests address
> space
> > > > > needs to
> > > > > > > > be
> > > > > > > > > > visible
> > > > > > > > > > > > > > > to whatever is serving the virtio requests. I
> can
> > > > > envision 3
> > > > > > > > > > approaches:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  * BE guest boots with memory already mapped
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  This would entail the guest OS knowing where
> in
> > > it's
> > > > > Guest
> > > > > > > > > > Physical
> > > > > > > > > > > > > > >  Address space is already taken up and
> avoiding
> > > > > clashing. I
> > > > > > > > > > would assume
> > > > > > > > > > > > > > >  in this case you would want a standard
> interface
> > > to
> > > > > > > > userspace
> > > > > > > > > > to then
> > > > > > > > > > > > > > >  make that address space visible to the
> backend
> > > daemon.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yet another way here is that we would have well known
> > > > > > > > > > > > > "shared memory" between VMs. I think that Jailhouse's
> > > > > > > > > > > > > ivshmem gives us good insights on this matter and that
> > > > > > > > > > > > > it can even be an alternative for a hypervisor-agnostic
> > > > > > > > > > > > > solution.
> > > > > > > > > > > > >
> > > > > > > > > > > > > (Please note memory regions in ivshmem appear as a PCI
> > > > > > > > > > > > > device and can be mapped locally.)
> > > > > > > > > > > > >
> > > > > > > > > > > > > I want to add this shared memory aspect to my
> > > > > > > > > > > > > virtio-proxy, but the resultant solution would
> > > > > > > > > > > > > eventually look similar to ivshmem.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > >  * BE guests boots with a hypervisor handle to
> > > memory
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  The BE guest is then free to map the FE's
> memory
> > > to
> > > > > where
> > > > > > > > it
> > > > > > > > > > wants in
> > > > > > > > > > > > > > >  the BE's guest physical address space.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I cannot see how this could work for Xen. There is no
> > > > > > > > > > > > > > "handle" to give to the backend if the backend is not
> > > > > > > > > > > > > > running in dom0. So for Xen I think the memory has to
> > > > > > > > > > > > > > be already mapped
> > > > > > > > > > > > >
> > > > > > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following
> > > > > > > > > > > > > information is expected to be exposed to the BE via
> > > > > > > > > > > > > Xenstore:
> > > > > > > > > > > > > (I know that this is a tentative approach though.)
> > > > > > > > > > > > >    - the start address of configuration space
> > > > > > > > > > > > >    - interrupt number
> > > > > > > > > > > > >    - file path for backing storage
> > > > > > > > > > > > >    - read-only flag
> > > > > > > > > > > > > And the BE server has to call a particular hypervisor
> > > > > > > > > > > > > interface to map the configuration space.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, Xenstore was chosen as a simple way to pass
> > > > > > > > > > > > configuration info to the backend running in a
> > > > > > > > > > > > non-toolstack domain.
> > > > > > > > > > > > I remember, there was a wish to avoid using Xenstore in
> > > > > > > > > > > > the Virtio backend itself if possible, so for a
> > > > > > > > > > > > non-toolstack domain this could be done by adjusting devd
> > > > > > > > > > > > (the daemon that listens for devices and launches
> > > > > > > > > > > > backends) to read the backend configuration from Xenstore
> > > > > > > > > > > > anyway and pass it to the backend via command line
> > > > > > > > > > > > arguments.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Yes, in the current PoC code we're using xenstore to pass
> > > > > > > > > > > the device configuration. We also designed a static device
> > > > > > > > > > > configuration parse method for Dom0less or other scenarios
> > > > > > > > > > > that don't have the xen toolstack; there the configuration
> > > > > > > > > > > comes from the device model command line or a config file.
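
For what it's worth, reading that kind of per-device configuration from a
backend could look roughly like the sketch below. The xenstore path and key
names are made up for illustration (the real layout is still tentative, as
noted above); only xs_open()/xs_read() are the usual libxenstore calls:

/* Illustration only: the xenstore path and key names are invented for this
 * sketch and are not the (still tentative) layout used by the PoC. */
#include <stdio.h>
#include <stdlib.h>
#include <xenstore.h>

static char *read_be_cfg(struct xs_handle *xs, int fe_domid, const char *key)
{
    char path[128];
    unsigned int len;

    snprintf(path, sizeof(path),
             "/local/domain/%d/device/virtio_disk/0/%s", fe_domid, key);
    return xs_read(xs, XBT_NULL, path, &len);   /* caller must free() */
}

/* Example usage:
 *   struct xs_handle *xs = xs_open(0);
 *   char *base = read_be_cfg(xs, fe_domid, "base");      - config space addr
 *   char *irq  = read_be_cfg(xs, fe_domid, "irq");       - interrupt number
 *   char *img  = read_be_cfg(xs, fe_domid, "filename");  - backing storage
 *   char *ro   = read_be_cfg(xs, fe_domid, "readonly");  - read-only flag
 */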
> > > > > > > > > > >
> > > > > > > > > > > > But, if ...
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > In my approach (virtio-proxy), all the Xen (or
> > > > > > > > > > > > > hypervisor)-specific stuff is contained in virtio-proxy,
> > > > > > > > > > > > > yet another VM, to hide all the details.
> > > > > > > > > > > >
> > > > > > > > > > > > ... the solution for how to overcome that is already
> > > > > > > > > > > > found and proven to work, then even better.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > # My point is that a "handle" is not mandatory for
> > > > > > > > > > > > > executing mapping.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > and the mapping probably done by the toolstack (also
> > > > > > > > > > > > > > see below.) Or we would have to invent a new Xen
> > > > > > > > > > > > > > hypervisor interface and Xen virtual machine
> > > > > > > > > > > > > > privileges to allow this kind of mapping.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > If we run the backend in Dom0 then we have no
> > > > > > > > > > > > > > problems of course.
> > > > > > > > > > > > >
> > > > > > > > > > > > > One of the difficulties on Xen that I found in my
> > > > > > > > > > > > > approach is that calling such hypervisor interfaces
> > > > > > > > > > > > > (registering IOREQ, mapping memory) is only allowed on
> > > > > > > > > > > > > BE servers themselves and so we will have to extend
> > > > > > > > > > > > > those interfaces.
> > > > > > > > > > > > > This, however, will raise some concern on security and
> > > > > > > > > > > > > privilege distribution as Stefan suggested.
> > > > > > > > > > > >
> > > > > > > > > > > > We also faced policy related issues with the Virtio
> > > > > > > > > > > > backend running in a domain other than Dom0 in a "dummy"
> > > > > > > > > > > > xsm mode. In our target system we run the backend in a
> > > > > > > > > > > > driver domain (we call it DomD) where the underlying H/W
> > > > > > > > > > > > resides. We trust it, so we wrote policy rules (to be
> > > > > > > > > > > > used in the "flask" xsm mode) to provide it with a little
> > > > > > > > > > > > bit more privileges than a simple DomU had.
> > > > > > > > > > > > Now it is permitted to issue device-model, resource and
> > > > > > > > > > > > memory mapping calls, etc.
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > To activate the mapping will
> > > > > > > > > > > > > > >  require some sort of hypercall to the
> hypervisor.
> > > I
> > > > > can see
> > > > > > > > two
> > > > > > > > > > options
> > > > > > > > > > > > > > >  at this point:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   - expose the handle to userspace for
> > > daemon/helper
> > > > > to
> > > > > > > > trigger
> > > > > > > > > > the
> > > > > > > > > > > > > > >     mapping via existing hypercall interfaces.
> If
> > > > > using a
> > > > > > > > helper
> > > > > > > > > > you
> > > > > > > > > > > > > > >     would have a hypervisor specific one to
> avoid
> > > the
> > > > > daemon
> > > > > > > > > > having to
> > > > > > > > > > > > > > >     care too much about the details or push
> that
> > > > > complexity
> > > > > > > > into
> > > > > > > > > > a
> > > > > > > > > > > > > > >     compile time option for the daemon which
> would
> > > > > result in
> > > > > > > > > > different
> > > > > > > > > > > > > > >     binaries although a common source base.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   - expose a new kernel ABI to abstract the
> > > hypercall
> > > > > > > > > > differences away
> > > > > > > > > > > > > > >     in the guest kernel. In this case the
> > > userspace
> > > > > would
> > > > > > > > > > essentially
> > > > > > > > > > > > > > >     ask for an abstract "map guest N memory to
> > > > > userspace
> > > > > > > > ptr"
> > > > > > > > > > and let
> > > > > > > > > > > > > > >     the kernel deal with the different
> hypercall
> > > > > interfaces.
> > > > > > > > > > This of
> > > > > > > > > > > > > > >     course assumes the majority of BE guests
> would
> > > be
> > > > > Linux
> > > > > > > > > > kernels and
> > > > > > > > > > > > > > >     leaves the bare-metal/unikernel approaches
> to
> > > > > their own
> > > > > > > > > > devices.
> > > > > > > > > > > > > > >
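
Thinking out loud about what that abstract "map guest N memory to userspace
ptr" ABI could look like, purely as an illustration (no such device or ioctl
exists today; every name below is invented):

/* Hypothetical, for illustration only: a hypervisor-neutral character
 * device whose per-hypervisor kernel code hides the actual hypercalls. */
#include <linux/ioctl.h>
#include <stdint.h>

struct guestmem_map {
    uint32_t guest_id;   /* FE identifier handed to the BE at setup time   */
    uint32_t flags;
    uint64_t guest_addr; /* start of the range in FE guest-physical space  */
    uint64_t size;       /* length in bytes                                */
    uint64_t user_addr;  /* out: where it landed in the daemon's space     */
};

#define GUESTMEM_IOC_MAP    _IOWR('G', 0x01, struct guestmem_map)
#define GUESTMEM_IOC_UNMAP  _IOW('G', 0x02, struct guestmem_map)

/* The daemon would then do something like
 *   int fd = open("/dev/guestmem", O_RDWR);
 *   ioctl(fd, GUESTMEM_IOC_MAP, &req);
 * and the kernel would turn that into foreign/grant mappings on Xen,
 * memfd/mmap handling on Linux/KVM, and so on. */
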
> > > > > > > > > > > > > > > Operation
> > > > > > > > > > > > > > > =========
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The core of the operation of VirtIO is fairly
> > > simple.
> > > > > Once
> > > > > > > > the
> > > > > > > > > > > > > > > vhost-user feature negotiation is done it's a
> case
> > > of
> > > > > > > > receiving
> > > > > > > > > > update
> > > > > > > > > > > > > > > events and parsing the resultant virt queue
> for
> > > data.
> > > > > The
> > > > > > > > vhost-
> > > > > > > > > > user
> > > > > > > > > > > > > > > specification handles a bunch of setup before
> that
> > > > > point,
> > > > > > > > mostly
> > > > > > > > > > to
> > > > > > > > > > > > > > > detail where the virt queues are set up FD's
> for
> > > > > memory and
> > > > > > > > > > event
> > > > > > > > > > > > > > > communication. This is where the envisioned
> stub
> > > > > process
> > > > > > > > would
> > > > > > > > > > be
> > > > > > > > > > > > > > > responsible for getting the daemon up and
> ready to
> > > run.
> > > > > This
> > > > > > > > is
> > > > > > > > > > > > > > > currently done inside a big VMM like QEMU but
> I
> > > > > suspect a
> > > > > > > > modern
> > > > > > > > > > > > > > > approach would be to use the rust-vmm vhost
> crate.
> > > It
> > > > > would
> > > > > > > > then
> > > > > > > > > > either
> > > > > > > > > > > > > > > communicate with the kernel's abstracted ABI
> or be
> > > re-
> > > > > > > > targeted
> > > > > > > > > > as a
> > > > > > > > > > > > > > > build option for the various hypervisors.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > One thing I mentioned before to Alex is that Xen
> > > doesn't
> > > > > have
> > > > > > > > VMMs
> > > > > > > > > > the
> > > > > > > > > > > > > > way they are typically envisioned and described
> in
> > > other
> > > > > > > > > > environments.
> > > > > > > > > > > > > > Instead, Xen has IOREQ servers. Each of them
> > > connects
> > > > > > > > > > independently to
> > > > > > > > > > > > > > Xen via the IOREQ interface. E.g. today multiple
> > > QEMUs
> > > > > could
> > > > > > > > be
> > > > > > > > > > used as
> > > > > > > > > > > > > > emulators for a single Xen VM, each of them
> > > connecting
> > > > > to Xen
> > > > > > > > > > > > > > independently via the IOREQ interface.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The component responsible for starting a daemon
> > > and/or
> > > > > setting
> > > > > > > > up
> > > > > > > > > > shared
> > > > > > > > > > > > > > interfaces is the toolstack: the xl command and
> the
> > > > > > > > libxl/libxc
> > > > > > > > > > > > > > libraries.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think that VM configuration management (or
> > > > > > > > > > > > > orchestration in Stratos jargon?) is a subject to
> > > > > > > > > > > > > debate in parallel.
> > > > > > > > > > > > > Otherwise, is there any good assumption to avoid it
> > > > > > > > > > > > > right now?
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Oleksandr and others I CCed have been working on
> > > ways
> > > > > for the
> > > > > > > > > > toolstack
> > > > > > > > > > > > > > to create virtio backends and setup memory
> mappings.
> > > > > They
> > > > > > > > might be
> > > > > > > > > > able
> > > > > > > > > > > > > > to provide more info on the subject. I do think
> we
> > > miss
> > > > > a way
> > > > > > > > to
> > > > > > > > > > provide
> > > > > > > > > > > > > > the configuration to the backend and anything
> else
> > > that
> > > > > the
> > > > > > > > > > backend
> > > > > > > > > > > > > > might require to start doing its job.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, some work has been done for the toolstack to handle
> > > > > > > > > > > > Virtio MMIO devices in general and Virtio block devices
> > > > > > > > > > > > in particular. However, it has not been upstreamed yet.
> > > > > > > > > > > > Updated patches are on review now:
> > > > > > > > > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-olekstysh@gmail.com/
> > > > > > > > > > > >
> > > > > > > > > > > > There is an additional (also important) activity to
> > > > > > > > > > > > improve/fix foreign memory mapping on Arm which I am also
> > > > > > > > > > > > involved in.
> > > > > > > > > > > > The foreign memory mapping is proposed to be used for
> > > > > > > > > > > > Virtio backends (device emulators) if there is a need to
> > > > > > > > > > > > run the guest OS completely unmodified.
> > > > > > > > > > > > Of course, the more secure way would be to use grant
> > > > > > > > > > > > memory mapping. Briefly, the main difference between them
> > > > > > > > > > > > is that with foreign mapping the backend can map any
> > > > > > > > > > > > guest memory it wants to map, but with grant mapping it
> > > > > > > > > > > > is allowed to map only what was previously granted by the
> > > > > > > > > > > > frontend.
> > > > > > > > > > > >
> > > > > > > > > > > > So, there might be a problem if we want to pre-map some
> > > > > > > > > > > > guest memory in advance or to cache mappings in the
> > > > > > > > > > > > backend in order to improve performance (because
> > > > > > > > > > > > mapping/unmapping guest pages on every request requires a
> > > > > > > > > > > > lot of back and forth to Xen + P2M updates). In a
> > > > > > > > > > > > nutshell, currently, in order to map a guest page into
> > > > > > > > > > > > the backend address space we need to steal a real
> > > > > > > > > > > > physical page from the backend domain. So, with the said
> > > > > > > > > > > > optimizations we might end up with no free memory in the
> > > > > > > > > > > > backend domain (see XSA-300). And what we try to achieve
> > > > > > > > > > > > is to not waste real domain memory at all by providing
> > > > > > > > > > > > safe non-allocated-yet (so unused) address space for the
> > > > > > > > > > > > foreign (and grant) pages to be mapped into; this
> > > > > > > > > > > > enabling work implies Xen and Linux (and likely DTB
> > > > > > > > > > > > bindings) changes. However, as it turned out, for this to
> > > > > > > > > > > > work in a proper and safe way some prereq work needs to
> > > > > > > > > > > > be done.
> > > > > > > > > > > > You can find the related Xen discussion at:
> > > > > > > > > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/
> > > > > > > > > > > >
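
To make that difference concrete from the backend's point of view, here is a
minimal sketch of the two mapping styles using the standard
libxenforeignmemory/libxengnttab calls (error handling omitted; the grant
reference would have to be advertised by the frontend beforehand):

#include <stdint.h>
#include <sys/mman.h>
#include <xenforeignmemory.h>
#include <xengnttab.h>

/* Foreign mapping: the backend picks any FE gfn it likes (this is what
 * needs the extra privileges discussed above). */
static void *map_foreign(xenforeignmemory_handle *fmem, domid_t fe_domid,
                         xen_pfn_t gfn)
{
    int err;
    return xenforeignmemory_map(fmem, fe_domid, PROT_READ | PROT_WRITE,
                                1, &gfn, &err);
}

/* Grant mapping: the backend can only map what the frontend explicitly
 * granted and advertised (e.g. via the virtqueues/configuration). */
static void *map_granted(xengnttab_handle *xgt, domid_t fe_domid,
                         uint32_t gref)
{
    return xengnttab_map_grant_ref(xgt, fe_domid, gref,
                                   PROT_READ | PROT_WRITE);
}
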
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > One question is how to best handle
> notification
> > > and
> > > > > kicks.
> > > > > > > > The
> > > > > > > > > > existing
> > > > > > > > > > > > > > > vhost-user framework uses eventfd to signal
> the
> > > daemon
> > > > > > > > (although
> > > > > > > > > > QEMU
> > > > > > > > > > > > > > > is quite capable of simulating them when you
> use
> > > TCG).
> > > > > Xen
> > > > > > > > has
> > > > > > > > > > it's own
> > > > > > > > > > > > > > > IOREQ mechanism. However latency is an
> important
> > > > > factor and
> > > > > > > > > > having
> > > > > > > > > > > > > > > events go through the stub would add quite a
> lot.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yeah I think, regardless of anything else, we
> want
> > > the
> > > > > > > > backends to
> > > > > > > > > > > > > > connect directly to the Xen hypervisor.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In my approach,
> > > > > > > > > > > > >  a) BE -> FE: interrupts triggered by BE calling a
> > > > > > > > > > > > >               hypervisor interface via virtio-proxy
> > > > > > > > > > > > >  b) FE -> BE: MMIO to config raises events (in event
> > > > > > > > > > > > >               channels), which are converted to a
> > > > > > > > > > > > >               callback to BE via virtio-proxy
> > > > > > > > > > > > >               (Xen's event channel is internally
> > > > > > > > > > > > >               implemented by interrupts.)
> > > > > > > > > > > > >
> > > > > > > > > > > > > I don't know what "connect directly" means here, but
> > > > > > > > > > > > > sending interrupts to the opposite side would be the
> > > > > > > > > > > > > most efficient.
> > > > > > > > > > > > > Ivshmem, I suppose, takes this approach by utilizing
> > > > > > > > > > > > > PCI's MSI-X mechanism.
> > > > > > > > > > > >
> > > > > > > > > > > > Agree that MSI would be more efficient than SPI...
> > > > > > > > > > > > At the moment, in order to notify the frontend, the
> > > > > > > > > > > > backend issues a specific device-model call to ask Xen to
> > > > > > > > > > > > inject a corresponding SPI into the guest.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Could we consider the kernel internally
> converting
> > > > > IOREQ
> > > > > > > > > > messages from
> > > > > > > > > > > > > > > the Xen hypervisor to eventfd events? Would
> this
> > > scale
> > > > > with
> > > > > > > > > > other kernel
> > > > > > > > > > > > > > > hypercall interfaces?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So any thoughts on what directions are worth
> > > > > experimenting
> > > > > > > > with?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > One option we should consider is for each
> backend to
> > > > > connect
> > > > > > > > to
> > > > > > > > > > Xen via
> > > > > > > > > > > > > > the IOREQ interface. We could generalize the
> IOREQ
> > > > > interface
> > > > > > > > and
> > > > > > > > > > make it
> > > > > > > > > > > > > > hypervisor agnostic. The interface is really
> trivial
> > > and
> > > > > easy
> > > > > > > > to
> > > > > > > > > > add.
> > > > > > > > > > > > >
> > > > > > > > > > > > > As I said above, my proposal does the same thing
> that
> > > you
> > > > > > > > mentioned
> > > > > > > > > > here :)
> > > > > > > > > > > > > The difference is that I do call hypervisor
> interfaces
> > > via
> > > > > > > > virtio-
> > > > > > > > > > proxy.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > The only Xen-specific part is the notification
> > > > > > > > > > > > > > mechanism, which is an event channel. If we replaced
> > > > > > > > > > > > > > the event channel with something else the interface
> > > > > > > > > > > > > > would be generic. See:
> > > > > > > > > > > > > > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > > > > > > > > > > >
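
As a sketch of how small the hypervisor-agnostic part could be (this is not
the Xen struct ioreq from the header above, just an illustration):

/* Illustrative only, not the Xen definition: what a generalized
 * "I/O request" message might carry.  Only the doorbell/notification
 * field is transport specific. */
#include <stdint.h>

struct vio_req {
    uint64_t addr;       /* guest-physical address of the access        */
    uint64_t data;       /* value written, or where the read data goes  */
    uint32_t size;       /* access width in bytes                       */
    uint8_t  dir;        /* 0 = read, 1 = write                         */
    uint8_t  state;      /* ready / in-service / done handshake         */
    uint16_t notify_id;  /* event channel, eventfd, MSI vector, ...     */
};
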
> > > > > > > > > > > > > > I don't think that translating IOREQs to eventfd in
> > > > > > > > > > > > > > the kernel is a good idea: it feels like it would be
> > > > > > > > > > > > > > extra complexity and that the kernel shouldn't be
> > > > > > > > > > > > > > involved as this is a backend-hypervisor interface.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Given that we may want to implement the BE as a
> > > > > > > > > > > > > bare-metal application, as I did on Zephyr, I don't
> > > > > > > > > > > > > think that the translation would be a big issue,
> > > > > > > > > > > > > especially on RTOS's.
> > > > > > > > > > > > > It will be some kind of abstraction layer for interrupt
> > > > > > > > > > > > > handling (or nothing but a callback mechanism).
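
Something like the sketch below is all that abstraction layer would really
need to expose; the names are invented here and are not taken from
virtio-proxy:

/* Minimal sketch of a notification abstraction: two callbacks that the
 * hypervisor-specific glue (IOREQ/event channels on Xen, eventfd on
 * Linux/KVM, a bare-metal IRQ handler on an RTOS) fills in. */
#include <stdint.h>

struct be_notify_ops {
    /* FE -> BE: called by the glue when a virtqueue is kicked. */
    void (*on_kick)(void *opaque, uint32_t queue_idx);

    /* BE -> FE: ask the glue to inject the guest-visible interrupt. */
    void (*notify_guest)(void *opaque, uint32_t queue_idx);
};
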
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Also, eventfd is very Linux-centric and we are
> > > trying to
> > > > > > > > design an
> > > > > > > > > > > > > > interface that could work well for RTOSes too.
> If we
> > > > > want to
> > > > > > > > do
> > > > > > > > > > > > > > something different, both OS-agnostic and
> > > hypervisor-
> > > > > agnostic,
> > > > > > > > > > perhaps
> > > > > > > > > > > > > > we could design a new interface. One that could
> be
> > > > > > > > implementable
> > > > > > > > > > in the
> > > > > > > > > > > > > > Xen hypervisor itself (like IOREQ) and of course
> any
> > > > > other
> > > > > > > > > > hypervisor
> > > > > > > > > > > > > > too.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > There is also another problem. IOREQ is probably not
> > > > > > > > > > > > > > the only interface needed. Have a look at
> > > > > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2.
> > > > > > > > > > > > > > Don't we also need an interface for the backend to
> > > > > > > > > > > > > > inject interrupts into the frontend? And if the
> > > > > > > > > > > > > > backend requires dynamic memory mappings of frontend
> > > > > > > > > > > > > > pages, then we would also need an interface to
> > > > > > > > > > > > > > map/unmap domU pages.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My proposal document might help here; All the
> > > interfaces
> > > > > > > > required
> > > > > > > > > > for
> > > > > > > > > > > > > virtio-proxy (or hypervisor-related interfaces)
> are
> > > listed
> > > > > as
> > > > > > > > > > > > > RPC protocols :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > > These interfaces are a lot more problematic than
> > > IOREQ:
> > > > > IOREQ
> > > > > > > > is
> > > > > > > > > > tiny
> > > > > > > > > > > > > > and self-contained. It is easy to add anywhere.
> A
> > > new
> > > > > > > > interface to
> > > > > > > > > > > > > > inject interrupts or map pages is more difficult
> to
> > > > > manage
> > > > > > > > because
> > > > > > > > > > it
> > > > > > > > > > > > > > would require changes scattered across the
> various
> > > > > emulators.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Exactly. I am not confident yet that my approach will
> > > > > > > > > > > > > also apply to hypervisors other than Xen.
> > > > > > > > > > > > > Technically, yes, but whether people can accept it or
> > > > > > > > > > > > > not is a different matter.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > -Takahiro Akashi
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Regards,
> > > > > > > > > > > >
> > > > > > > > > > > > Oleksandr Tyshchenko


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-02  1:30                             ` Wei Chen
@ 2021-09-02  1:50                               ` Wei Chen
  0 siblings, 0 replies; 66+ messages in thread
From: Wei Chen @ 2021-09-02  1:50 UTC (permalink / raw)
  To: Wei Chen, AKASHI Takahiro
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Alex Benn??e, Kaly Xin,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, stefanha, Jan Kiszka, Carl van Schaik,
	pratikp, Srivatsa Vaddagiri, Jean-Philippe Brucker,
	Mathieu Poirier, Oleksandr Tyshchenko, Bertrand Marquis,
	Artem Mygaiev, Julien Grall, Juergen Gross, Paul Durrant, nd,
	Xen Devel

Hi Akashi, Oleksandr,

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Wei
> Chen
> Sent: 2021年9月2日 9:31
> To: AKASHI Takahiro <takahiro.akashi@linaro.org>
> Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>; Kaly Xin
> <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-lists.linaro.org>;
> virtio-dev@lists.oasis-open.org; Arnd Bergmann <arnd.bergmann@linaro.org>;
> Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>; Julien
> Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> <paul@xen.org>; nd <nd@arm.com>; Xen Devel <xen-devel@lists.xen.org>
> Subject: RE: Enabling hypervisor agnosticism for VirtIO backends
> 
> Hi Akashi,
> 
> > -----Original Message-----
> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > Sent: 2021年9月1日 20:29
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>; Kaly
> Xin
> > <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-
> lists.linaro.org>;
> > virtio-dev@lists.oasis-open.org; Arnd Bergmann
> <arnd.bergmann@linaro.org>;
> > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> Julien
> > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> > <paul@xen.org>; nd <nd@arm.com>; Xen Devel <xen-devel@lists.xen.org>
> > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> >
> > Hi Wei,
> >
> > On Wed, Sep 01, 2021 at 11:12:58AM +0000, Wei Chen wrote:
> > > Hi Akashi,
> > >
> > > > -----Original Message-----
> > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > Sent: 2021年8月31日 14:18
> > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> > > > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>;
> Kaly
> > Xin
> > > > <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-
> > lists.linaro.org>;
> > > > virtio-dev@lists.oasis-open.org; Arnd Bergmann
> > <arnd.bergmann@linaro.org>;
> > > > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > <jan.kiszka@siemens.com>; Carl van Schaik
> <cvanscha@qti.qualcomm.com>;
> > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> Jean-
> > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> > Julien
> > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> Durrant
> > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > >
> > > > Wei,
> > > >
> > > > On Thu, Aug 26, 2021 at 12:10:19PM +0000, Wei Chen wrote:
> > > > > Hi Akashi,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > > Sent: 2021年8月26日 17:41
> > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> Stabellini
> > > > > > <sstabellini@kernel.org>; Alex Benn??e <alex.bennee@linaro.org>;
> > Kaly
> > > > Xin
> > > > > > <Kaly.Xin@arm.com>; Stratos Mailing List <stratos-dev@op-
> > > > lists.linaro.org>;
> > > > > > virtio-dev@lists.oasis-open.org; Arnd Bergmann
> > > > <arnd.bergmann@linaro.org>;
> > > > > > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > <cvanscha@qti.qualcomm.com>;
> > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> > Jean-
> > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> <Artem_Mygaiev@epam.com>;
> > > > Julien
> > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> > Durrant
> > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> > > > > >
> > > > > > Hi Wei,
> > > > > >
> > > > > > On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
> > > > > > > On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
> > > > > > > > Hi Akashi,
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > > > > > Sent: 2021年8月18日 13:39
> > > > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> > > > Stabellini
> > > > > > > > > <sstabellini@kernel.org>; Alex Benn??e
> > <alex.bennee@linaro.org>;
> > > > > > Stratos
> > > > > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> > > > > > dev@lists.oasis-
> > > > > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh
> > Kumar
> > > > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
> > Kiszka
> > > > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > > > > > <cvanscha@qti.qualcomm.com>;
> > > > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri
> > <vatsa@codeaurora.org>;
> > > > > > Jean-
> > > > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu
> Poirier
> > > > > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > > > <Artem_Mygaiev@epam.com>;
> > > > > > Julien
> > > > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>;
> > Paul
> > > > > > Durrant
> > > > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> > backends
> > > > > > > > >
> > > > > > > > > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
> > > > > > > > > > Hi Akashi,
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> > > > > > > > > > > Sent: 2021年8月17日 16:08
> > > > > > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>;
> Stefano
> > > > > > Stabellini
> > > > > > > > > > > <sstabellini@kernel.org>; Alex Benn??e
> > > > <alex.bennee@linaro.org>;
> > > > > > > > > Stratos
> > > > > > > > > > > Mailing List <stratos-dev@op-lists.linaro.org>;
> virtio-
> > > > > > > > > dev@lists.oasis-
> > > > > > > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>;
> > Viresh
> > > > Kumar
> > > > > > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com;
> > Jan
> > > > Kiszka
> > > > > > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > > > > > <cvanscha@qti.qualcomm.com>;
> > > > > > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri
> > > > <vatsa@codeaurora.org>;
> > > > > > Jean-
> > > > > > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu
> > Poirier
> > > > > > > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> > > > > > > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> > > > > > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > > > > > <Artem_Mygaiev@epam.com>;
> > > > > > > > > Julien
> > > > > > > > > > > Grall <julien@xen.org>; Juergen Gross
> <jgross@suse.com>;
> > > > Paul
> > > > > > Durrant
> > > > > > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for
> VirtIO
> > > > backends
> > > > > > > > > > >
> > > > > > > > > > > Hi Wei, Oleksandr,
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen
> wrote:
> > > > > > > > > > > > Hi All,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for Stefano to link my kvmtool for Xen
> proposal
> > > > here.
> > > > > > > > > > > > This proposal is still discussing in Xen and KVM
> > > > communities.
> > > > > > > > > > > > The main work is to decouple the kvmtool from KVM
> and
> > make
> > > > > > > > > > > > other hypervisors can reuse the virtual device
> > > > implementations.
> > > > > > > > > > > >
> > > > > > > > > > > > In this case, we need to introduce an intermediate
> > > > hypervisor
> > > > > > > > > > > > layer for VMM abstraction, Which is, I think it's
> very
> > > > close
> > > > > > > > > > > > to stratos' virtio hypervisor agnosticism work.
> > > > > > > > > > >
> > > > > > > > > > > # My proposal[1] comes from my own idea and doesn't
> > always
> > > > > > represent
> > > > > > > > > > > # Linaro's view on this subject nor reflect Alex's
> > concerns.
> > > > > > > > > Nevertheless,
> > > > > > > > > > >
> > > > > > > > > > > Your idea and my proposal seem to share the same
> > background.
> > > > > > > > > > > Both have the similar goal and currently start with,
> at
> > > > first,
> > > > > > Xen
> > > > > > > > > > > and are based on kvm-tool. (Actually, my work is
> derived
> > > > from
> > > > > > > > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
> > > > > > > > > > >
> > > > > > > > > > > In particular, the abstraction of hypervisor
> interfaces
> > has
> > > > a
> > > > > > same
> > > > > > > > > > > set of interfaces (for your "struct vmm_impl" and my
> > "RPC
> > > > > > interfaces").
> > > > > > > > > > > This is not co-incident as we both share the same
> origin
> > as
> > > > I
> > > > > > said
> > > > > > > > > above.
> > > > > > > > > > > And so we will also share the same issues. One of them
> > is a
> > > > way
> > > > > > of
> > > > > > > > > > > "sharing/mapping FE's memory". There is some trade-off
> > > > between
> > > > > > > > > > > the portability and the performance impact.
> > > > > > > > > > > So we can discuss the topic here in this ML, too.
> > > > > > > > > > > (See Alex's original email, too).
> > > > > > > > > > >
> > > > > > > > > > Yes, I agree.
> > > > > > > > > >
> > > > > > > > > > > On the other hand, my approach aims to create a
> "single-
> > > > binary"
> > > > > > > > > solution
> > > > > > > > > > > in which the same binary of BE vm could run on any
> > > > hypervisors.
> > > > > > > > > > > Somehow similar to your "proposal-#2" in [2], but in
> my
> > > > solution,
> > > > > > all
> > > > > > > > > > > the hypervisor-specific code would be put into another
> > > > entity
> > > > > > (VM),
> > > > > > > > > > > named "virtio-proxy" and the abstracted operations are
> > > > served
> > > > > > via RPC.
> > > > > > > > > > > (In this sense, BE is hypervisor-agnostic but might
> have
> > OS
> > > > > > > > > dependency.)
> > > > > > > > > > > But I know that we need discuss if this is a
> requirement
> > > > even
> > > > > > > > > > > in Stratos project or not. (Maybe not)
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Sorry, I haven't had time to finish reading your virtio-
> > proxy
> > > > > > completely
> > > > > > > > > > (I will do it ASAP). But from your description, it seems
> > we
> > > > need a
> > > > > > > > > > 3rd VM between FE and BE? My concern is that, if my
> > assumption
> > > > is
> > > > > > right,
> > > > > > > > > > will it increase the latency in data transport path?
> Even
> > if
> > > > we're
> > > > > > > > > > using some lightweight guest like RTOS or Unikernel,
> > > > > > > > >
> > > > > > > > > Yes, you're right. But I'm afraid that it is a matter of
> > degree.
> > > > > > > > > As far as we execute 'mapping' operations at every fetch
> of
> > > > payload,
> > > > > > > > > we will see latency issue (even in your case) and if we
> have
> > > > some
> > > > > > solution
> > > > > > > > > for it, we won't see it neither in my proposal :)
> > > > > > > > >
> > > > > > > >
> > > > > > > > Oleksandr has sent a proposal to Xen mailing list to reduce
> > this
> > > > kind
> > > > > > > > of "mapping/unmapping" operations. So the latency caused by
> > this
> > > > > > behavior
> > > > > > > > on Xen may eventually be eliminated, and Linux-KVM doesn't
> > have
> > > > that
> > > > > > problem.
> > > > > > >
> > > > > > > Obviously, I have not yet caught up there in the discussion.
> > > > > > > Which patch specifically?
> > > > > >
> > > > > > Can you give me the link to the discussion or patch, please?
> > > > > >
> > > > >
> > > > > It's an RFC discussion. We have tested this RFC patch internally.
> > > > > https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html
> > > >
> > > > I'm afraid that I miss something here, but I don't know
> > > > why this proposed API will lead to eliminating 'mmap' in accessing
> > > > the queued payload at every request?
> > > >
> > >
> > > This API gives the Xen device model (QEMU or kvmtool) the ability to map
> > > the whole guest RAM into the device model's address space. In this case,
> > > the device model doesn't need dynamic hypercalls to map/unmap payload
> > > memory. It can use a flat offset to access payload memory in its address
> > > space directly, just like the KVM device model does now.
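
The "flat offset" access is then essentially the sketch below, assuming the
whole FE RAM has been premapped contiguously (the names are invented for the
sketch):

#include <stddef.h>
#include <stdint.h>

/* ram_hva/ram_gpa/ram_size describe the single premapped FE RAM block. */
static inline void *gpa_to_hva(uint8_t *ram_hva, uint64_t ram_gpa,
                               uint64_t ram_size, uint64_t gpa, size_t len)
{
    if (gpa < ram_gpa || gpa - ram_gpa + len > ram_size)
        return NULL;                 /* descriptor points outside guest RAM */
    return ram_hva + (gpa - ram_gpa);
}
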
> >
> > Thank you. Quickly, let me make sure one thing:
> > This API itself doesn't do any mapping operations, right?
> > So I suppose that virtio BE guest is responsible to
> > 1) fetch the information about all the memory regions in FE,
> > 2) call this API to allocate a big chunk of unused space in BE,
> > 3) create grant/foreign mappings for FE onto this region(S)
> > in the initialization/configuration of emulated virtio devices.
> >
> > Is this the way this API is expected to be used?
> > Does Xen already have an interface for (1)?
> >
> 
> They are discussing in that thread how to do it in a proper way.
> Because this API is common, both x86 and Arm should be considered.
> 

Please ignore my reply above. I hadn't seen that Oleksandr had already
replied to this question. Sorry about that!

> > -Takahiro Akashi
> >
> > > Before this API, when the device model had to map the whole guest memory,
> > > it would severely consume the physical pages of Dom-0/Dom-D.
> > >
> > > > -Takahiro Akashi
> > > >
> > > >
> > > > > > Thanks,
> > > > > > -Takahiro Akashi
> > > > > >
> > > > > > > -Takahiro Akashi
> > > > > > >
> > > > > > > > > > > Specifically speaking about kvm-tool, I have a concern
> > about
> > > > its
> > > > > > > > > > > license term; Targeting different hypervisors and
> > different
> > > > OSs
> > > > > > > > > > > (which I assume includes RTOS's), the resultant
> library
> > > > should
> > > > > > be
> > > > > > > > > > > license permissive and GPL for kvm-tool might be an
> > issue.
> > > > > > > > > > > Any thoughts?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Yes. If user want to implement a FreeBSD device model,
> but
> > the
> > > > > > virtio
> > > > > > > > > > library is GPL. Then GPL would be a problem. If we have
> > > > another
> > > > > > good
> > > > > > > > > > candidate, I am open to it.
> > > > > > > > >
> > > > > > > > > I have some candidates, particularly for vq/vring, in my
> > mind:
> > > > > > > > > * Open-AMP, or
> > > > > > > > > * corresponding Free-BSD code
> > > > > > > > >
> > > > > > > >
> > > > > > > > Interesting, I will look into them : )
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Wei Chen
> > > > > > > >
> > > > > > > > > -Takahiro Akashi
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > -Takahiro Akashi
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
> > dev/2021-
> > > > > > > > > > > August/000548.html
> > > > > > > > > > > [2] https://marc.info/?l=xen-
> devel&m=162373754705233&w=2
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> > > > > > > > > > > > > Sent: 2021年8月14日 23:38
> > > > > > > > > > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>;
> > > > Stefano
> > > > > > > > > Stabellini
> > > > > > > > > > > <sstabellini@kernel.org>
> > > > > > > > > > > > > Cc: Alex Benn??e <alex.bennee@linaro.org>; Stratos
> > > > Mailing
> > > > > > List
> > > > > > > > > > > <stratos-dev@op-lists.linaro.org>; virtio-
> > dev@lists.oasis-
> > > > > > open.org;
> > > > > > > > > Arnd
> > > > > > > > > > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> > > > > > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> > > > > > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com;
> > Jan
> > > > Kiszka
> > > > > > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> > > > > > <cvanscha@qti.qualcomm.com>;
> > > > > > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri
> > > > <vatsa@codeaurora.org>;
> > > > > > Jean-
> > > > > > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu
> > Poirier
> > > > > > > > > > > <mathieu.poirier@linaro.org>; Wei Chen
> > <Wei.Chen@arm.com>;
> > > > > > Oleksandr
> > > > > > > > > > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand
> > Marquis
> > > > > > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> > > > > > <Artem_Mygaiev@epam.com>;
> > > > > > > > > Julien
> > > > > > > > > > > Grall <julien@xen.org>; Juergen Gross
> <jgross@suse.com>;
> > > > Paul
> > > > > > Durrant
> > > > > > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> > > > > > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for
> > VirtIO
> > > > > > backends
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hello, all.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please see some comments below. And sorry for the
> > > > possible
> > > > > > format
> > > > > > > > > > > issues.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> > > > > > > > > > > <mailto:takahiro.akashi@linaro.org> wrote:
> > > > > > > > > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700,
> Stefano
> > > > > > Stabellini
> > > > > > > > > wrote:
> > > > > > > > > > > > > > > CCing people working on Xen+VirtIO and IOREQs.
> > Not
> > > > > > trimming
> > > > > > > > > the
> > > > > > > > > > > original
> > > > > > > > > > > > > > > email to let them read the full context.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > My comments below are related to a potential
> Xen
> > > > > > > > > implementation,
> > > > > > > > > > > not
> > > > > > > > > > > > > > > because it is the only implementation that
> > matters,
> > > > but
> > > > > > > > > because it
> > > > > > > > > > > is
> > > > > > > > > > > > > > > the one I know best.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Please note that my proposal (and hence the
> > working
> > > > > > prototype)[1]
> > > > > > > > > > > > > > is based on Xen's virtio implementation (i.e.
> > IOREQ)
> > > > and
> > > > > > > > > > > particularly
> > > > > > > > > > > > > > EPAM's virtio-disk application (backend server).
> > > > > > > > > > > > > > It has been, I believe, well generalized but is
> > still
> > > > a
> > > > > > bit
> > > > > > > > > biased
> > > > > > > > > > > > > > toward this original design.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So I hope you like my approach :)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1] https://op-
> lists.linaro.org/pipermail/stratos-
> > > > > > dev/2021-
> > > > > > > > > > > August/000546.html
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Let me take this opportunity to explain a bit
> more
> > > > about
> > > > > > my
> > > > > > > > > approach
> > > > > > > > > > > below.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Also, please see this relevant email thread:
> > > > > > > > > > > > > > > https://marc.info/?l=xen-
> > devel&m=162373754705233&w=2
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > One of the goals of Project Stratos is to
> > enable
> > > > > > hypervisor
> > > > > > > > > > > agnostic
> > > > > > > > > > > > > > > > backends so we can enable as much re-use of
> > code
> > > > as
> > > > > > possible
> > > > > > > > > and
> > > > > > > > > > > avoid
> > > > > > > > > > > > > > > > repeating ourselves. This is the flip side
> of
> > the
> > > > > > front end
> > > > > > > > > > > where
> > > > > > > > > > > > > > > > multiple front-end implementations are
> > required -
> > > > one
> > > > > > per OS,
> > > > > > > > > > > assuming
> > > > > > > > > > > > > > > > you don't just want Linux guests. The
> > resultant
> > > > guests
> > > > > > are
> > > > > > > > > > > trivially
> > > > > > > > > > > > > > > > movable between hypervisors modulo any
> > abstracted
> > > > > > paravirt
> > > > > > > > > type
> > > > > > > > > > > > > > > > interfaces.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > In my original thumb nail sketch of a
> solution
> > I
> > > > > > envisioned
> > > > > > > > > > > vhost-user
> > > > > > > > > > > > > > > > daemons running in a broadly POSIX like
> > > > environment.
> > > > > > The
> > > > > > > > > > > interface to
> > > > > > > > > > > > > > > > the daemon is fairly simple requiring only
> > some
> > > > mapped
> > > > > > > > > memory
> > > > > > > > > > > and some
> > > > > > > > > > > > > > > > sort of signalling for events (on Linux this
> > is
> > > > > > eventfd).
> > > > > > > > > The
> > > > > > > > > > > idea was a
> > > > > > > > > > > > > > > > stub binary would be responsible for any
> > > > hypervisor
> > > > > > specific
> > > > > > > > > > > setup and
> > > > > > > > > > > > > > > > then launch a common binary to deal with the
> > > > actual
> > > > > > > > > virtqueue
> > > > > > > > > > > requests
> > > > > > > > > > > > > > > > themselves.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Since that original sketch we've seen an
> > expansion
> > > > in
> > > > > > the
> > > > > > > > > sort
> > > > > > > > > > > of ways
> > > > > > > > > > > > > > > > backends could be created. There is interest
> > in
> > > > > > > > > encapsulating
> > > > > > > > > > > backends
> > > > > > > > > > > > > > > > in RTOSes or unikernels for solutions like
> > SCMI.
> > > > There
> > > > > > > > > interest
> > > > > > > > > > > in Rust
> > > > > > > > > > > > > > > > has prompted ideas of using the trait
> > interface to
> > > > > > abstract
> > > > > > > > > > > differences
> > > > > > > > > > > > > > > > away as well as the idea of bare-metal Rust
> > > > backends.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > We have a card (STR-12) called "Hypercall
> > > > > > Standardisation"
> > > > > > > > > which
> > > > > > > > > > > > > > > > calls for a description of the APIs needed
> > from
> > > > the
> > > > > > > > > hypervisor
> > > > > > > > > > > side to
> > > > > > > > > > > > > > > > support VirtIO guests and their backends.
> > However
> > > > we
> > > > > > are
> > > > > > > > > some
> > > > > > > > > > > way off
> > > > > > > > > > > > > > > > from that at the moment as I think we need
> to
> > at
> > > > least
> > > > > > > > > > > demonstrate one
> > > > > > > > > > > > > > > > portable backend before we start codifying
> > > > > > requirements. To
> > > > > > > > > that
> > > > > > > > > > > end I
> > > > > > > > > > > > > > > > want to think about what we need for a
> backend
> > to
> > > > > > function.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Configuration
> > > > > > > > > > > > > > > > =============
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > In the type-2 setup this is typically fairly
> > > > simple
> > > > > > because
> > > > > > > > > the
> > > > > > > > > > > host
> > > > > > > > > > > > > > > > system can orchestrate the various modules
> > that
> > > > make
> > > > > > up the
> > > > > > > > > > > complete
> > > > > > > > > > > > > > > > system. In the type-1 case (or even type-2
> > with
> > > > > > delegated
> > > > > > > > > > > service VMs)
> > > > > > > > > > > > > > > > we need some sort of mechanism to inform the
> > > > backend
> > > > > > VM
> > > > > > > > > about
> > > > > > > > > > > key
> > > > > > > > > > > > > > > > details about the system:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   - where virt queue memory is in it's
> address
> > > > space
> > > > > > > > > > > > > > > >   - how it's going to receive (interrupt)
> and
> > > > trigger
> > > > > > (kick)
> > > > > > > > > > > events
> > > > > > > > > > > > > > > >   - what (if any) resources the backend
> needs
> > to
> > > > > > connect to
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Obviously you can elide over configuration
> > issues
> > > > by
> > > > > > having
> > > > > > > > > > > static
> > > > > > > > > > > > > > > > configurations and baking the assumptions
> into
> > > > your
> > > > > > guest
> > > > > > > > > images
> > > > > > > > > > > however
> > > > > > > > > > > > > > > > this isn't scalable in the long term. The
> > obvious
> > > > > > solution
> > > > > > > > > seems
> > > > > > > > > > > to be
> > > > > > > > > > > > > > > > extending a subset of Device Tree data to
> user
> > > > space
> > > > > > but
> > > > > > > > > perhaps
> > > > > > > > > > > there
> > > > > > > > > > > > > > > > are other approaches?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Before any virtio transactions can take
> place
> > the
> > > > > > > > > appropriate
> > > > > > > > > > > memory
> > > > > > > > > > > > > > > > mappings need to be made between the FE
> guest
> > and
> > > > the
> > > > > > BE
> > > > > > > > > guest.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Currently the whole of the FE guests address
> > space
> > > > > > needs to
> > > > > > > > > be
> > > > > > > > > > > visible
> > > > > > > > > > > > > > > > to whatever is serving the virtio requests.
> I
> > can
> > > > > > envision 3
> > > > > > > > > > > approaches:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  * BE guest boots with memory already mapped
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  This would entail the guest OS knowing
> where
> > in
> > > > it's
> > > > > > Guest
> > > > > > > > > > > Physical
> > > > > > > > > > > > > > > >  Address space is already taken up and
> > avoiding
> > > > > > clashing. I
> > > > > > > > > > > would assume
> > > > > > > > > > > > > > > >  in this case you would want a standard
> > interface
> > > > to
> > > > > > > > > userspace
> > > > > > > > > > > to then
> > > > > > > > > > > > > > > >  make that address space visible to the
> > backend
> > > > daemon.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yet another way here is that we would have well-known
> > > > > > > > > > > > > > "shared memory" between VMs. I think that Jailhouse's
> > > > > > > > > > > > > > ivshmem gives us good insights on this matter and that
> > > > > > > > > > > > > > it can even be an alternative for a hypervisor-agnostic
> > > > > > > > > > > > > > solution.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > (Please note memory regions in ivshmem appear as a PCI
> > > > > > > > > > > > > > device and can be mapped locally.)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I want to add this shared memory aspect to my
> > > > > > > > > > > > > > virtio-proxy, but the resultant solution would
> > > > > > > > > > > > > > eventually look similar to ivshmem.
> > > > > > > > > > > > > >
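
For a Linux-based BE, accessing an ivshmem-style shared region like the one
described above is essentially just mapping a PCI BAR. A minimal sketch, with
the PCI address and BAR index as placeholders rather than anything defined in
this thread:

    /* Sketch: map the shared-memory BAR of an ivshmem-like PCI device from
     * Linux userspace via sysfs. The device address (0000:00:05.0) and BAR
     * index (resource2) are placeholders. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *bar = "/sys/bus/pci/devices/0000:00:05.0/resource2";
        int fd = open(bar, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        void *shm = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (shm == MAP_FAILED) { perror("mmap"); return 1; }

        /* By prior agreement, FE and BE could place vrings and buffers
         * inside this region. */
        munmap(shm, st.st_size);
        close(fd);
        return 0;
    }

How the FE and BE agree on the layout of that region, and how they signal each
other, is exactly the configuration and notification problem discussed in the
rest of this thread.
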
> > > > > > > > > > > > > > > >  * BE guest boots with a hypervisor handle to memory
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  The BE guest is then free to map the FE's memory to
> > > > > > > > > > > > > > > >  where it wants in the BE's guest physical address space.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I cannot see how this could work for Xen. There is no
> > > > > > > > > > > > > > > "handle" to give to the backend if the backend is not
> > > > > > > > > > > > > > > running in dom0. So for Xen I think the memory has to
> > > > > > > > > > > > > > > be already mapped
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following
> > > > > > > > > > > > > > information is expected to be exposed to the BE via
> > > > > > > > > > > > > > Xenstore:
> > > > > > > > > > > > > > (I know that this is a tentative approach though.)
> > > > > > > > > > > > > >    - the start address of configuration space
> > > > > > > > > > > > > >    - interrupt number
> > > > > > > > > > > > > >    - file path for backing storage
> > > > > > > > > > > > > >    - read-only flag
> > > > > > > > > > > > > > And the BE server has to call a particular hypervisor
> > > > > > > > > > > > > > interface to map the configuration space.
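
As an illustration of the tentative layout above, a minimal sketch of a BE
reading such information with libxenstore could look as follows; the per-device
path and key names are hypothetical, the real layout being whatever the
toolstack/IOREQ work settles on:

    /* Illustrative only: read a hypothetical Xenstore layout for a
     * virtio-blk backend using libxenstore. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <xenstore.h>

    int main(void)
    {
        struct xs_handle *xsh = xs_open(0);
        if (!xsh)
            return 1;

        /* Hypothetical per-device path written by the toolstack. */
        const char *base = "/local/domain/1/device/virtio-disk/0";
        char path[256];
        unsigned int len;

        snprintf(path, sizeof(path), "%s/base-addr", base);
        char *cfg = xs_read(xsh, XBT_NULL, path, &len);  /* config space start */
        snprintf(path, sizeof(path), "%s/irq", base);
        char *irq = xs_read(xsh, XBT_NULL, path, &len);  /* interrupt number */
        snprintf(path, sizeof(path), "%s/backing", base);
        char *img = xs_read(xsh, XBT_NULL, path, &len);  /* backing storage path */
        snprintf(path, sizeof(path), "%s/readonly", base);
        char *ro  = xs_read(xsh, XBT_NULL, path, &len);  /* read-only flag */

        printf("cfg=%s irq=%s image=%s ro=%s\n", cfg, irq, img, ro);

        /* Mapping the configuration space itself then goes through a
         * hypervisor-specific interface, as discussed elsewhere in this
         * thread. */
        free(cfg); free(irq); free(img); free(ro);
        xs_close(xsh);
        return 0;
    }
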
> > > > > > > > > > > > > Yes, Xenstore was chosen as a simple way to pass
> > > > > > > > > > > > > configuration info to the backend running in a
> > > > > > > > > > > > > non-toolstack domain.
> > > > > > > > > > > > > I remember, there was a wish to avoid using Xenstore in
> > > > > > > > > > > > > the Virtio backend itself if possible, so for a
> > > > > > > > > > > > > non-toolstack domain this could be done by adjusting devd
> > > > > > > > > > > > > (the daemon that listens for devices and launches
> > > > > > > > > > > > > backends) to read the backend configuration from Xenstore
> > > > > > > > > > > > > anyway and pass it to the backend via command line
> > > > > > > > > > > > > arguments.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, in the current PoC code we're using xenstore to pass
> > > > > > > > > > > > device configuration.
> > > > > > > > > > > > We also designed a static device configuration parse
> > > > > > > > > > > > method for Dom0less or other scenarios that don't have the
> > > > > > > > > > > > Xen toolstack; yes, it's from the device model command
> > > > > > > > > > > > line or a config file.
> > > > > > > > > > > >
> > > > > > > > > > > > > But, if ...
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In my approach (virtio-proxy), all the Xen (or
> > > > > > > > > > > > > > hypervisor)-specific stuff is contained in
> > > > > > > > > > > > > > virtio-proxy, yet another VM, to hide all details.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ... the solution for how to overcome that has already
> > > > > > > > > > > > > been found and proven to work, then even better.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > # My point is that a "handle" is not mandatory for
> > > > > > > > > > > > > > executing mapping.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > and the mapping probably done by the toolstack (also
> > > > > > > > > > > > > > > see below.) Or we would have to invent a new Xen
> > > > > > > > > > > > > > > hypervisor interface and Xen virtual machine
> > > > > > > > > > > > > > > privileges to allow this kind of mapping.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If we run the backend in Dom0 then we have no
> > > > > > > > > > > > > > > problems of course.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > One of the difficulties on Xen that I found in my
> > > > > > > > > > > > > > approach is that calling such hypervisor interfaces
> > > > > > > > > > > > > > (registering IOREQ, mapping memory) is only allowed on
> > > > > > > > > > > > > > BE servers themselves, so we will have to extend those
> > > > > > > > > > > > > > interfaces.
> > > > > > > > > > > > > > This, however, will raise some concerns about security
> > > > > > > > > > > > > > and privilege distribution, as Stefan suggested.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We also faced policy-related issues with the Virtio
> > > > > > > > > > > > > backend running in a domain other than Dom0 in a "dummy"
> > > > > > > > > > > > > xsm mode. In our target system we run the backend in a
> > > > > > > > > > > > > driver domain (we call it DomD) where the underlying H/W
> > > > > > > > > > > > > resides. We trust it, so we wrote policy rules (to be
> > > > > > > > > > > > > used in "flask" xsm mode) to provide it with a little
> > > > > > > > > > > > > bit more privileges than a simple DomU has.
> > > > > > > > > > > > > Now it is permitted to issue device-model, resource and
> > > > > > > > > > > > > memory mapping calls, etc.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  To activate the mapping will require some sort of
> > > > > > > > > > > > > > > >  hypercall to the hypervisor. I can see two options
> > > > > > > > > > > > > > > >  at this point:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   - expose the handle to userspace for daemon/helper
> > > > > > > > > > > > > > > >     to trigger the mapping via existing hypercall
> > > > > > > > > > > > > > > >     interfaces. If using a helper you would have a
> > > > > > > > > > > > > > > >     hypervisor specific one to avoid the daemon
> > > > > > > > > > > > > > > >     having to care too much about the details or
> > > > > > > > > > > > > > > >     push that complexity into a compile time option
> > > > > > > > > > > > > > > >     for the daemon which would result in different
> > > > > > > > > > > > > > > >     binaries although a common source base.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   - expose a new kernel ABI to abstract the
> > > > > > > > > > > > > > > >     hypercall differences away in the guest kernel.
> > > > > > > > > > > > > > > >     In this case the userspace would essentially
> > > > > > > > > > > > > > > >     ask for an abstract "map guest N memory to
> > > > > > > > > > > > > > > >     userspace ptr" and let the kernel deal with the
> > > > > > > > > > > > > > > >     different hypercall interfaces. This of course
> > > > > > > > > > > > > > > >     assumes the majority of BE guests would be
> > > > > > > > > > > > > > > >     Linux kernels and leaves the bare-metal/unikernel
> > > > > > > > > > > > > > > >     approaches to their own devices.
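
A rough sketch of what the second option's abstract kernel ABI might look like;
the device node, ioctl numbers and structure below are hypothetical, purely to
illustrate the shape of such an interface:

    /* Hypothetical "guest memory mapper" ABI: userspace asks the kernel to
     * map FE guest memory, the kernel hides the hypervisor-specific calls. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    struct guest_map_request {
        uint32_t guest_id;    /* which FE guest to map */
        uint64_t guest_gpa;   /* start of the region in the FE's address space */
        uint64_t length;      /* size of the region in bytes */
        uint64_t user_addr;   /* out: where it landed in the BE daemon */
    };

    /* Hypothetical ioctls on a hypothetical /dev/guest-mapper node. */
    #define GUEST_MAPPER_MAP    _IOWR('G', 0x00, struct guest_map_request)
    #define GUEST_MAPPER_UNMAP  _IOW('G', 0x01, struct guest_map_request)

    static int map_guest_region(uint32_t guest, uint64_t gpa, uint64_t len,
                                void **out)
    {
        struct guest_map_request req = { guest, gpa, len, 0 };
        int fd = open("/dev/guest-mapper", O_RDWR);
        if (fd < 0)
            return -1;
        if (ioctl(fd, GUEST_MAPPER_MAP, &req) < 0) {
            close(fd);
            return -1;
        }
        *out = (void *)(uintptr_t)req.user_addr;
        close(fd);
        return 0;
    }

The daemon would only ever see this ABI, while the kernel translates it into
Xen foreign mappings, KVM memslot handling, or whatever the underlying
hypervisor requires.
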
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Operation
> > > > > > > > > > > > > > > > =========
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The core of the operation of VirtIO is
> fairly
> > > > simple.
> > > > > > Once
> > > > > > > > > the
> > > > > > > > > > > > > > > > vhost-user feature negotiation is done it's
> a
> > case
> > > > of
> > > > > > > > > receiving
> > > > > > > > > > > update
> > > > > > > > > > > > > > > > events and parsing the resultant virt queue
> > for
> > > > data.
> > > > > > The
> > > > > > > > > vhost-
> > > > > > > > > > > user
> > > > > > > > > > > > > > > > specification handles a bunch of setup
> before
> > that
> > > > > > point,
> > > > > > > > > mostly
> > > > > > > > > > > to
> > > > > > > > > > > > > > > > detail where the virt queues are set up FD's
> > for
> > > > > > memory and
> > > > > > > > > > > event
> > > > > > > > > > > > > > > > communication. This is where the envisioned
> > stub
> > > > > > process
> > > > > > > > > would
> > > > > > > > > > > be
> > > > > > > > > > > > > > > > responsible for getting the daemon up and
> > ready to
> > > > run.
> > > > > > This
> > > > > > > > > is
> > > > > > > > > > > > > > > > currently done inside a big VMM like QEMU
> but
> > I
> > > > > > suspect a
> > > > > > > > > modern
> > > > > > > > > > > > > > > > approach would be to use the rust-vmm vhost
> > crate.
> > > > It
> > > > > > would
> > > > > > > > > then
> > > > > > > > > > > either
> > > > > > > > > > > > > > > > communicate with the kernel's abstracted ABI
> > or be
> > > > re-
> > > > > > > > > targeted
> > > > > > > > > > > as a
> > > > > > > > > > > > > > > > build option for the various hypervisors.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > One thing I mentioned before to Alex is that
> Xen
> > > > doesn't
> > > > > > have
> > > > > > > > > VMMs
> > > > > > > > > > > the
> > > > > > > > > > > > > > > way they are typically envisioned and
> described
> > in
> > > > other
> > > > > > > > > > > environments.
> > > > > > > > > > > > > > > Instead, Xen has IOREQ servers. Each of them
> > > > connects
> > > > > > > > > > > independently to
> > > > > > > > > > > > > > > Xen via the IOREQ interface. E.g. today
> multiple
> > > > QEMUs
> > > > > > could
> > > > > > > > > be
> > > > > > > > > > > used as
> > > > > > > > > > > > > > > emulators for a single Xen VM, each of them
> > > > connecting
> > > > > > to Xen
> > > > > > > > > > > > > > > independently via the IOREQ interface.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The component responsible for starting a
> daemon
> > > > and/or
> > > > > > setting
> > > > > > > > > up
> > > > > > > > > > > shared
> > > > > > > > > > > > > > > interfaces is the toolstack: the xl command
> and
> > the
> > > > > > > > > libxl/libxc
> > > > > > > > > > > > > > > libraries.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think that VM configuration management (or
> > > > orchestration
> > > > > > in
> > > > > > > > > > > Startos
> > > > > > > > > > > > > > jargon?) is a subject to debate in parallel.
> > > > > > > > > > > > > > Otherwise, is there any good assumption to avoid
> > it
> > > > right
> > > > > > now?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Oleksandr and others I CCed have been working
> on
> > > > ways
> > > > > > for the
> > > > > > > > > > > toolstack
> > > > > > > > > > > > > > > to create virtio backends and setup memory
> > mappings.
> > > > > > They
> > > > > > > > > might be
> > > > > > > > > > > able
> > > > > > > > > > > > > > > to provide more info on the subject. I do
> think
> > we
> > > > miss
> > > > > > a way
> > > > > > > > > to
> > > > > > > > > > > provide
> > > > > > > > > > > > > > > the configuration to the backend and anything
> > else
> > > > that
> > > > > > the
> > > > > > > > > > > backend
> > > > > > > > > > > > > > > might require to start doing its job.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, some work has been done for the toolstack to
> > > > > > > > > > > > > handle Virtio MMIO devices in general and Virtio block
> > > > > > > > > > > > > devices in particular. However, it has not been
> > > > > > > > > > > > > upstreamed yet.
> > > > > > > > > > > > > Updated patches are on review now:
> > > > > > > > > > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-olekstysh@gmail.com/
> > > > > > > > > > > > >
> > > > > > > > > > > > > There is an additional (also important) activity to
> > > > > > > > > > > > > improve/fix foreign memory mapping on Arm which I am
> > > > > > > > > > > > > also involved in.
> > > > > > > > > > > > > The foreign memory mapping is proposed to be used for
> > > > > > > > > > > > > Virtio backends (device emulators) if there is a need to
> > > > > > > > > > > > > run the guest OS completely unmodified.
> > > > > > > > > > > > > Of course, the more secure way would be to use grant
> > > > > > > > > > > > > memory mapping. Briefly, the main difference between
> > > > > > > > > > > > > them is that with foreign mapping the backend can map
> > > > > > > > > > > > > any guest memory it wants to map, but with grant mapping
> > > > > > > > > > > > > it is allowed to map only what was previously granted by
> > > > > > > > > > > > > the frontend.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So, there might be a problem if we want to pre-map some
> > > > > > > > > > > > > guest memory in advance or to cache mappings in the
> > > > > > > > > > > > > backend in order to improve performance (because
> > > > > > > > > > > > > mapping/unmapping guest pages on every request requires
> > > > > > > > > > > > > a lot of back and forth to Xen + P2M updates). In a
> > > > > > > > > > > > > nutshell, currently, in order to map a guest page into
> > > > > > > > > > > > > the backend address space we need to steal a real
> > > > > > > > > > > > > physical page from the backend domain. So, with the said
> > > > > > > > > > > > > optimizations we might end up with no free memory in the
> > > > > > > > > > > > > backend domain (see XSA-300). What we try to achieve is
> > > > > > > > > > > > > to not waste real domain memory at all by providing
> > > > > > > > > > > > > safe, not-yet-allocated (so unused) address space for
> > > > > > > > > > > > > the foreign (and grant) pages to be mapped into. This
> > > > > > > > > > > > > enabling work implies Xen and Linux (and likely DTB
> > > > > > > > > > > > > bindings) changes. However, as it turned out, for this
> > > > > > > > > > > > > to work in a proper and safe way some prereq work needs
> > > > > > > > > > > > > to be done.
> > > > > > > > > > > > > You can find the related Xen discussion at:
> > > > > > > > > > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/
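
To make the mapping cost described above concrete, this is roughly what a
per-request foreign mapping looks like on the backend side today with
libxenforeignmemory (a simplified sketch; the domid and gfn are placeholders
and the surrounding IOREQ plumbing is omitted):

    /* Simplified sketch: map one guest page into the BE's address space,
     * copy some data out, then unmap it again. Doing this for every request
     * is the back-and-forth to Xen (+ P2M updates) mentioned above. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <xenforeignmemory.h>

    int copy_from_guest_page(uint32_t domid, uint64_t gfn,
                             void *dst, size_t off, size_t len)
    {
        xenforeignmemory_handle *fmem = xenforeignmemory_open(NULL, 0);
        if (!fmem)
            return -1;

        xen_pfn_t pfn = gfn;
        int err = 0;
        void *p = xenforeignmemory_map(fmem, domid, PROT_READ, 1, &pfn, &err);
        if (!p || err) {
            xenforeignmemory_close(fmem);
            return -1;
        }

        memcpy(dst, (const char *)p + off, len);

        xenforeignmemory_unmap(fmem, p, 1);
        xenforeignmemory_close(fmem);
        return 0;
    }
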
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > One question is how to best handle
> > notification
> > > > and
> > > > > > kicks.
> > > > > > > > > The
> > > > > > > > > > > existing
> > > > > > > > > > > > > > > > vhost-user framework uses eventfd to signal
> > the
> > > > daemon
> > > > > > > > > (although
> > > > > > > > > > > QEMU
> > > > > > > > > > > > > > > > is quite capable of simulating them when you
> > use
> > > > TCG).
> > > > > > Xen
> > > > > > > > > has
> > > > > > > > > > > it's own
> > > > > > > > > > > > > > > > IOREQ mechanism. However latency is an
> > important
> > > > > > factor and
> > > > > > > > > > > having
> > > > > > > > > > > > > > > > events go through the stub would add quite a
> > lot.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yeah I think, regardless of anything else, we
> > want
> > > > the
> > > > > > > > > backends to
> > > > > > > > > > > > > > > connect directly to the Xen hypervisor.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In my approach,
> > > > > > > > > > > > > >  a) BE -> FE: interrupts triggered by BE calling a
> > > > > > > > > > > > > >               hypervisor interface via virtio-proxy
> > > > > > > > > > > > > >  b) FE -> BE: MMIO to config raises events (in event
> > > > > > > > > > > > > >               channels), which are converted to a
> > > > > > > > > > > > > >               callback to BE via virtio-proxy
> > > > > > > > > > > > > >               (Xen's event channel is internally
> > > > > > > > > > > > > >               implemented by interrupts.)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I don't know what "connect directly" means here, but
> > > > > > > > > > > > > > sending interrupts to the opposite side would be the
> > > > > > > > > > > > > > most efficient.
> > > > > > > > > > > > > > Ivshmem, I suppose, takes this approach by utilizing
> > > > > > > > > > > > > > PCI's msi-x mechanism.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Agree that MSI would be more efficient than SPI...
> > > > > > > > > > > > > At the moment, in order to notify the frontend,
> the
> > > > backend
> > > > > > issues
> > > > > > > > > a
> > > > > > > > > > > specific device-model call to query Xen to inject a
> > > > > > corresponding SPI
> > > > > > > > > to
> > > > > > > > > > > the guest.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Could we consider the kernel internally
> > converting
> > > > > > IOREQ
> > > > > > > > > > > messages from
> > > > > > > > > > > > > > > > the Xen hypervisor to eventfd events? Would
> > this
> > > > scale
> > > > > > with
> > > > > > > > > > > other kernel
> > > > > > > > > > > > > > > > hypercall interfaces?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > So any thoughts on what directions are worth
> > > > > > experimenting
> > > > > > > > > with?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > One option we should consider is for each
> > backend to
> > > > > > connect
> > > > > > > > > to
> > > > > > > > > > > Xen via
> > > > > > > > > > > > > > > the IOREQ interface. We could generalize the
> > IOREQ
> > > > > > interface
> > > > > > > > > and
> > > > > > > > > > > make it
> > > > > > > > > > > > > > > hypervisor agnostic. The interface is really
> > trivial
> > > > and
> > > > > > easy
> > > > > > > > > to
> > > > > > > > > > > add.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > As I said above, my proposal does the same thing
> > that
> > > > you
> > > > > > > > > mentioned
> > > > > > > > > > > here :)
> > > > > > > > > > > > > > The difference is that I do call hypervisor
> > interfaces
> > > > via
> > > > > > > > > virtio-
> > > > > > > > > > > proxy.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The only Xen-specific part is the notification
> > > > > > > > > > > > > > > mechanism, which is an event channel. If we replaced
> > > > > > > > > > > > > > > the event channel with something else the interface
> > > > > > > > > > > > > > > would be generic. See:
> > > > > > > > > > > > > > > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > > > > > > > > > > > >
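
For readers not following the link: the record exchanged over that interface is
small and mostly hypervisor-neutral. An illustrative, simplified version (not a
copy of the real ioreq.h) might look like:

    /* Illustrative, simplified IOREQ-style request record; see the linked
     * ioreq.h for the real definition. The only hypervisor-specific part is
     * the notification "doorbell" (an event channel on Xen). */
    #include <stdint.h>

    struct generic_ioreq {
        uint64_t addr;         /* guest physical address being accessed */
        uint64_t data;         /* value written, or value to return on read */
        uint32_t size;         /* access width in bytes */
        uint32_t count;        /* repeat count */
        uint8_t  dir;          /* direction: write to or read from the device */
        uint8_t  state;        /* lifecycle: none / ready / in service / done */
        uint8_t  type;         /* MMIO, port I/O, ... */
        uint32_t notify_port;  /* hypervisor-specific doorbell identifier */
    };
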
> > > > > > > > > > > > > > > I don't think that translating IOREQs to eventfd in
> > > > > > > > > > > > > > > the kernel is a good idea: it feels like it would be
> > > > > > > > > > > > > > > extra complexity and that the kernel shouldn't be
> > > > > > > > > > > > > > > involved as this is a backend-hypervisor interface.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Given that we may want to implement the BE as a
> > > > > > > > > > > > > > bare-metal application as I did on Zephyr, I don't
> > > > > > > > > > > > > > think that the translation would be a big issue,
> > > > > > > > > > > > > > especially on RTOS's.
> > > > > > > > > > > > > > It will be some kind of abstraction layer of interrupt
> > > > > > > > > > > > > > handling (or nothing but a callback mechanism).
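
A minimal sketch (the names are invented here) of what such an abstraction
layer could reduce to on the BE side, regardless of whether events ultimately
arrive as eventfds, IOREQ state changes or bare-metal interrupts:

    /* Sketch of a BE-side notification abstraction: hypervisor- or
     * virtio-proxy-specific glue fills in these callbacks, and the generic
     * virtqueue code only ever calls kick/register. */
    #include <stdint.h>

    struct be_notify_ops {
        /* Ring the FE's doorbell for a given virtqueue (BE -> FE). */
        int (*kick_frontend)(void *ctx, uint16_t vq_index);

        /* Register a handler invoked when the FE kicks us (FE -> BE).
         * On Linux this might poll an eventfd; on an RTOS it may simply be
         * called from an interrupt handler or a virtio-proxy RPC callback. */
        int (*register_kick_handler)(void *ctx, uint16_t vq_index,
                                     void (*handler)(void *arg), void *arg);
    };
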
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Also, eventfd is very Linux-centric and we are
> > > > trying to
> > > > > > > > > design an
> > > > > > > > > > > > > > > interface that could work well for RTOSes too.
> > If we
> > > > > > want to
> > > > > > > > > do
> > > > > > > > > > > > > > > something different, both OS-agnostic and
> > > > hypervisor-
> > > > > > agnostic,
> > > > > > > > > > > perhaps
> > > > > > > > > > > > > > > we could design a new interface. One that
> could
> > be
> > > > > > > > > implementable
> > > > > > > > > > > in the
> > > > > > > > > > > > > > > Xen hypervisor itself (like IOREQ) and of
> course
> > any
> > > > > > other
> > > > > > > > > > > hypervisor
> > > > > > > > > > > > > > > too.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > There is also another problem. IOREQ is probably not
> > > > > > > > > > > > > > > the only interface needed. Have a look at
> > > > > > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2.
> > > > > > > > > > > > > > > Don't we also need an interface for the backend to
> > > > > > > > > > > > > > > inject interrupts into the frontend? And if the
> > > > > > > > > > > > > > > backend requires dynamic memory mappings of frontend
> > > > > > > > > > > > > > > pages, then we would also need an interface to
> > > > > > > > > > > > > > > map/unmap domU pages.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > My proposal document might help here; all the
> > > > > > > > > > > > > > interfaces required for virtio-proxy (or
> > > > > > > > > > > > > > hypervisor-related interfaces) are listed as
> > > > > > > > > > > > > > RPC protocols :)
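
To give a flavour of that RPC layer, an illustrative message framing is shown
below; the actual protocol list is the one in the proposal document [1], and
the names here are hypothetical:

    /* Hypothetical illustration of virtio-proxy RPC framing: the BE asks the
     * proxy VM to perform hypervisor-specific operations on its behalf. */
    #include <stdint.h>

    enum vproxy_op {
        VPROXY_OP_MAP_GUEST_MEM = 1,  /* map an FE region into the BE */
        VPROXY_OP_UNMAP_GUEST_MEM,    /* tear the mapping down again */
        VPROXY_OP_INJECT_IRQ,         /* BE -> FE notification */
        VPROXY_OP_WAIT_EVENT,         /* block until an FE -> BE event arrives */
    };

    struct vproxy_msg {
        uint32_t op;      /* one of enum vproxy_op */
        uint32_t status;  /* filled in by the proxy on completion */
        uint64_t arg[4];  /* op-specific arguments (gpa, length, irq, ...) */
    };
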
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > These interfaces are a lot more problematic
> than
> > > > IOREQ:
> > > > > > IOREQ
> > > > > > > > > is
> > > > > > > > > > > tiny
> > > > > > > > > > > > > > > and self-contained. It is easy to add anywhere.
> > A
> > > > new
> > > > > > > > > interface to
> > > > > > > > > > > > > > > inject interrupts or map pages is more
> difficult
> > to
> > > > > > manage
> > > > > > > > > because
> > > > > > > > > > > it
> > > > > > > > > > > > > > > would require changes scattered across the
> > various
> > > > > > emulators.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Exactly. I am not yet confident that my approach will
> > > > > > > > > > > > > > also apply to hypervisors other than Xen.
> > > > > > > > > > > > > > Technically, yes, but whether people can accept it or
> > > > > > > > > > > > > > not is a different matter.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > -Takahiro Akashi
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Oleksandr Tyshchenko


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-08-30 19:53                           ` [virtio-dev] " Christopher Clark
  (?)
@ 2021-09-02  7:19                           ` AKASHI Takahiro
  2021-09-07  0:57                               ` [virtio-dev] " Christopher Clark
  -1 siblings, 1 reply; 66+ messages in thread
From: AKASHI Takahiro @ 2021-09-02  7:19 UTC (permalink / raw)
  To: Christopher Clark
  Cc: Wei Chen, Oleksandr Tyshchenko, Stefano Stabellini, Alex Bennée,
	Kaly Xin, Stratos Mailing List, virtio-dev, Arnd Bergmann,
	Viresh Kumar, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Oleksandr Tyshchenko,
	Bertrand Marquis, Artem Mygaiev, Julien Grall, Juergen Gross,
	Paul Durrant, Xen Devel, Rich Persaud, Daniel Smith

Hi Christopher,

Thank you for your feedback.

On Mon, Aug 30, 2021 at 12:53:00PM -0700, Christopher Clark wrote:
> [ resending message to ensure delivery to the CCd mailing lists
> post-subscription ]
> 
> Apologies for being late to this thread, but I hope to be able to
> contribute to
> this discussion in a meaningful way. I am grateful for the level of
> interest in
> this topic. I would like to draw your attention to Argo as a suitable
> technology for development of VirtIO's hypervisor-agnostic interfaces.
> 
> * Argo is an interdomain communication mechanism in Xen (on x86 and Arm)
> that
>   can send and receive hypervisor-mediated notifications and messages
> between
>   domains (VMs). [1] The hypervisor can enforce Mandatory Access Control
> over
>   all communication between domains. It is derived from the earlier v4v,
> which
>   has been deployed on millions of machines with the HP/Bromium uXen
> hypervisor
>   and with OpenXT.
> 
> * Argo has a simple interface with a small number of operations that was
>   designed for ease of integration into OS primitives on both Linux
> (sockets)
>   and Windows (ReadFile/WriteFile) [2].
>     - A unikernel example of using it has also been developed for XTF. [3]
> 
> * There has been recent discussion and support in the Xen community for
> making
>   revisions to the Argo interface to make it hypervisor-agnostic, and
> support
>   implementations of Argo on other hypervisors. This will enable a single
>   interface for an OS kernel binary to use for inter-VM communication that
> will
>   work on multiple hypervisors -- this applies equally to both backends and
>   frontend implementations. [4]

Regarding virtio-over-Argo, let me ask a few questions:
(In figure "Virtual device buffer access: Virtio+Argo" in [4])
1) How is the configuration managed?
   With either virtio-mmio or virtio-pci, some negotiation always takes
   place between the FE and BE through the "configuration" space.
   How can this be done in virtio-over-Argo? (See also the sketch below.)
2) Do virtio's available/used vrings, as well as descriptors, physically
   exist, or are they virtually emulated over Argo (rings)?
3) The payload in a request will be copied into the receiver's Argo ring.
   What does the address in a descriptor mean?
   An address/offset in a ring buffer?
4) Any estimate of performance or latency?
   It appears that, on the FE side, at least three hypervisor calls (and
   data copies) need to be invoked at every request, right?
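
To make question 1) concrete, one hypothetical framing would be to carry each
configuration-space access as an explicit request/response message over the
transport, for example:

    /* Hypothetical framing of a virtio config-space access carried as a
     * message over a transport such as Argo, instead of a trapped access. */
    #include <stdint.h>

    struct cfg_access_req {
        uint32_t offset;    /* offset into the virtio config/common space */
        uint32_t size;      /* access width: 1, 2, 4 or 8 bytes */
        uint8_t  is_write;  /* 0 = read, 1 = write */
        uint64_t value;     /* value for writes */
    };

    struct cfg_access_resp {
        uint64_t value;     /* value returned for reads */
        uint32_t status;    /* 0 on success */
    };

The question is whether virtio-over-Argo frames configuration accesses in some
such way, or keeps the existing trapped virtio-mmio/virtio-pci config space.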

Thanks,
-Takahiro Akashi


> * Here are the design documents for building VirtIO-over-Argo, to support a
>   hypervisor-agnostic frontend VirtIO transport driver using Argo.
> 
> The Development Plan to build VirtIO virtual device support over Argo
> transport:
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1
> 
> A design for using VirtIO over Argo, describing how VirtIO data structures
> and communication is handled over the Argo transport:
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo
> 
> Diagram (from the above document) showing how VirtIO rings are synchronized
> between domains without using shared memory:
> https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob-url=true&id=01f7d0e1-7686-4f0b-88e1-457c1d30df40&collection=contentId-1348763698&contextId=1348763698&mimeType=image%2Fpng&name=device-buffer-access-virtio-argo.png&size=243175&width=1106&height=1241
> 
> Please note that the above design documents show that the existing VirtIO
> device drivers, and both vring and virtqueue data structures can be
> preserved
> while interdomain communication can be performed with no shared memory
> required
> for most drivers; (the exceptions where further design is required are those
> such as virtual framebuffer devices where shared memory regions are
> intentionally
> added to the communication structure beyond the vrings and virtqueues).
> 
> An analysis of VirtIO and Argo, informing the design:
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Argo+as+a+transport+medium+for+VirtIO
> 
> * Argo can be used for a communication path for configuration between the
> backend
>   and the toolstack, avoiding the need for a dependency on XenStore, which
> is an
>   advantage for any hypervisor-agnostic design. It is also amenable to a
> notification
>   mechanism that is not based on Xen event channels.
> 
> * Argo does not use or require shared memory between VMs and provides an
> alternative
>   to the use of foreign shared memory mappings. It avoids some of the
> complexities
>   involved with using grants (eg. XSA-300).
> 
> * Argo supports Mandatory Access Control by the hypervisor, satisfying a
> common
>   certification requirement.
> 
> * The Argo headers are BSD-licensed and the Xen hypervisor implementation
> is GPLv2 but
>   accessible via the hypercall interface. The licensing should not present
> an obstacle
>   to adoption of Argo in guest software or implementation by other
> hypervisors.
> 
> * Since the interface that Argo presents to a guest VM is similar to DMA, a
> VirtIO-Argo
>   frontend transport driver should be able to operate with a physical
> VirtIO-enabled
>   smart-NIC if the toolstack and an Argo-aware backend provide support.
> 
> The next Xen Community Call is next week and I would be happy to answer
> questions
> about Argo and on this topic. I will also be following this thread.
> 
> Christopher
> (Argo maintainer, Xen Community)
> 
> --------------------------------------------------------------------------------
> [1]
> An introduction to Argo:
> https://static.sched.com/hosted_files/xensummit19/92/Argo%20and%20HMX%20-%20OpenXT%20-%20Christopher%20Clark%20-%20Xen%20Summit%202019.pdf
> https://www.youtube.com/watch?v=cnC0Tg3jqJQ
> Xen Wiki page for Argo:
> https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_for_Xen
> 
> [2]
> OpenXT Linux Argo driver and userspace library:
> https://github.com/openxt/linux-xen-argo
> 
> Windows V4V at OpenXT wiki:
> https://openxt.atlassian.net/wiki/spaces/DC/pages/14844007/V4V
> Windows v4v driver source:
> https://github.com/OpenXT/xc-windows/tree/master/xenv4v
> 
> HP/Bromium uXen V4V driver:
> https://github.com/uxen-virt/uxen/tree/ascara/windows/uxenv4vlib
> 
> [3]
> v2 of the Argo test unikernel for XTF:
> https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg02234.html
> 
> [4]
> Argo HMX Transport for VirtIO meeting minutes:
> https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html
> 
> VirtIO-Argo Development wiki page:
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1
> 
> 
> > On Thu, Aug 26, 2021 at 5:11 AM Wei Chen <Wei.Chen@arm.com> wrote:
> >
> >> Hi Akashi,
> >>
> >> > -----Original Message-----
> >> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >> > Sent: 2021年8月26日 17:41
> >> > To: Wei Chen <Wei.Chen@arm.com>
> >> > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano Stabellini
> >> > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>; Kaly
> >> Xin
> >> > <Kaly.Xin@arm.com>; Stratos Mailing List <
> >> stratos-dev@op-lists.linaro.org>;
> >> > virtio-dev@lists.oasis-open.org; Arnd Bergmann <
> >> arnd.bergmann@linaro.org>;
> >> > Viresh Kumar <viresh.kumar@linaro.org>; Stefano Stabellini
> >> > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> >> > <jan.kiszka@siemens.com>; Carl van Schaik <cvanscha@qti.qualcomm.com>;
> >> > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>; Jean-
> >> > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> >> > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> >> > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> >> > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com>;
> >> Julien
> >> > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul Durrant
> >> > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> >> > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> >> >
> >> > Hi Wei,
> >> >
> >> > On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
> >> > > On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
> >> > > > Hi Akashi,
> >> > > >
> >> > > > > -----Original Message-----
> >> > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >> > > > > Sent: 2021年8月18日 13:39
> >> > > > > To: Wei Chen <Wei.Chen@arm.com>
> >> > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> >> Stabellini
> >> > > > > > <sstabellini@kernel.org>; Alex Bennée <alex.bennee@linaro.org>;
> >> > Stratos
> >> > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> >> > dev@lists.oasis-
> >> > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> >> > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> >> > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan Kiszka
> >> > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> >> > <cvanscha@qti.qualcomm.com>;
> >> > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org>;
> >> > Jean-
> >> > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> >> > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> >> > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> >> > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev <Artem_Mygaiev@epam.com
> >> >;
> >> > Julien
> >> > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> >> > Durrant
> >> > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> >> > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
> >> > > > >
> >> > > > > On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
> >> > > > > > Hi Akashi,
> >> > > > > >
> >> > > > > > > -----Original Message-----
> >> > > > > > > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >> > > > > > > Sent: 2021年8月17日 16:08
> >> > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> >> > > > > > > Cc: Oleksandr Tyshchenko <olekstysh@gmail.com>; Stefano
> >> > Stabellini
> >> > > > > > > > <sstabellini@kernel.org>; Alex Bennée <
> >> alex.bennee@linaro.org>;
> >> > > > > Stratos
> >> > > > > > > Mailing List <stratos-dev@op-lists.linaro.org>; virtio-
> >> > > > > dev@lists.oasis-
> >> > > > > > > open.org; Arnd Bergmann <arnd.bergmann@linaro.org>; Viresh
> >> Kumar
> >> > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> >> > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
> >> Kiszka
> >> > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> >> > <cvanscha@qti.qualcomm.com>;
> >> > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org
> >> >;
> >> > Jean-
> >> > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> >> > > > > > > <mathieu.poirier@linaro.org>; Oleksandr Tyshchenko
> >> > > > > > > <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> >> > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> >> > <Artem_Mygaiev@epam.com>;
> >> > > > > Julien
> >> > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> >> > Durrant
> >> > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> >> > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> >> backends
> >> > > > > > >
> >> > > > > > > Hi Wei, Oleksandr,
> >> > > > > > >
> >> > > > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
> >> > > > > > > > Hi All,
> >> > > > > > > >
> >> > > > > > > > > Thanks to Stefano for linking my kvmtool-for-Xen proposal
> >> > > > > > > > > here.
> >> > > > > > > > > This proposal is still being discussed in the Xen and KVM
> >> > > > > > > > > communities. The main work is to decouple the kvmtool from
> >> > > > > > > > > KVM and make it possible for other hypervisors to reuse the
> >> > > > > > > > > virtual device implementations.
> >> > > > > > > > >
> >> > > > > > > > > In this case, we need to introduce an intermediate hypervisor
> >> > > > > > > > > layer for VMM abstraction, which is, I think, very close to
> >> > > > > > > > > Stratos' virtio hypervisor agnosticism work.
> >> > > > > > >
> >> > > > > > > # My proposal[1] comes from my own idea and doesn't always
> >> > represent
> >> > > > > > > # Linaro's view on this subject nor reflect Alex's concerns.
> >> > > > > Nevertheless,
> >> > > > > > >
> >> > > > > > > Your idea and my proposal seem to share the same background.
> >> > > > > > > Both have the similar goal and currently start with, at first,
> >> > Xen
> >> > > > > > > and are based on kvm-tool. (Actually, my work is derived from
> >> > > > > > > EPAM's virtio-disk, which is also based on kvm-tool.)
> >> > > > > > >
> >> > > > > > > In particular, the abstraction of hypervisor interfaces has a
> >> > same
> >> > > > > > > set of interfaces (for your "struct vmm_impl" and my "RPC
> >> > interfaces").
> >> > > > > > > This is not co-incident as we both share the same origin as I
> >> > said
> >> > > > > above.
> >> > > > > > > And so we will also share the same issues. One of them is a
> >> way
> >> > of
> >> > > > > > > "sharing/mapping FE's memory". There is some trade-off between
> >> > > > > > > the portability and the performance impact.
> >> > > > > > > So we can discuss the topic here in this ML, too.
> >> > > > > > > (See Alex's original email, too).
> >> > > > > > >
> >> > > > > > Yes, I agree.
> >> > > > > >
> >> > > > > > > On the other hand, my approach aims to create a
> >> "single-binary"
> >> > > > > solution
> >> > > > > > > in which the same binary of BE vm could run on any
> >> hypervisors.
> >> > > > > > > Somehow similar to your "proposal-#2" in [2], but in my
> >> solution,
> >> > all
> >> > > > > > > the hypervisor-specific code would be put into another entity
> >> > (VM),
> >> > > > > > > named "virtio-proxy" and the abstracted operations are served
> >> > via RPC.
> >> > > > > > > (In this sense, BE is hypervisor-agnostic but might have OS
> >> > > > > dependency.)
> >> > > > > > > But I know that we need discuss if this is a requirement even
> >> > > > > > > in Stratos project or not. (Maybe not)
> >> > > > > > >
> >> > > > > >
> >> > > > > > Sorry, I haven't had time to finish reading your virtio-proxy
> >> > completely
> >> > > > > > (I will do it ASAP). But from your description, it seems we
> >> need a
> >> > > > > > 3rd VM between FE and BE? My concern is that, if my assumption
> >> is
> >> > right,
> >> > > > > > will it increase the latency in data transport path? Even if
> >> we're
> >> > > > > > using some lightweight guest like RTOS or Unikernel,
> >> > > > >
> >> > > > > Yes, you're right. But I'm afraid that it is a matter of degree.
> >> > > > > As far as we execute 'mapping' operations at every fetch of
> >> payload,
> >> > > > > we will see latency issue (even in your case) and if we have some
> >> > solution
> >> > > > > for it, we won't see it neither in my proposal :)
> >> > > > >
> >> > > >
> >> > > > Oleksandr has sent a proposal to Xen mailing list to reduce this
> >> kind
> >> > > > of "mapping/unmapping" operations. So the latency caused by this
> >> > behavior
> >> > > > on Xen may eventually be eliminated, and Linux-KVM doesn't have that
> >> > problem.
> >> > >
> >> > > Obviously, I have not yet caught up there in the discussion.
> >> > > Which patch specifically?
> >> >
> >> > Can you give me the link to the discussion or patch, please?
> >> >
> >>
> >> It's a RFC discussion. We have tested this RFC patch internally.
> >> https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html
> >>
> >> > Thanks,
> >> > -Takahiro Akashi
> >> >
> >> > > -Takahiro Akashi
> >> > >
> >> > > > > > > Specifically speaking about kvm-tool, I have a concern about
> >> its
> >> > > > > > > license term; Targeting different hypervisors and different
> >> OSs
> >> > > > > > > (which I assume includes RTOS's), the resultant library should
> >> > be
> >> > > > > > > license permissive and GPL for kvm-tool might be an issue.
> >> > > > > > > Any thoughts?
> >> > > > > > >
> >> > > > > >
> >> > > > > > Yes. If a user wants to implement a FreeBSD device model but the
> >> > > > > > virtio library is GPL, then GPL would be a problem. If we have
> >> > > > > > another good candidate, I am open to it.
> >> > > > >
> >> > > > > I have some candidates, particularly for vq/vring, in my mind:
> >> > > > > * Open-AMP, or
> >> > > > > * corresponding Free-BSD code
> >> > > > >
> >> > > >
> >> > > > Interesting, I will look into them : )
> >> > > >
> >> > > > Cheers,
> >> > > > Wei Chen
> >> > > >
> >> > > > > -Takahiro Akashi
> >> > > > >
> >> > > > >
> >> > > > > > > -Takahiro Akashi
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
> >> > > > > > > August/000548.html
> >> > > > > > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > > > From: Oleksandr Tyshchenko <olekstysh@gmail.com>
> >> > > > > > > > > Sent: 2021年8月14日 23:38
> >> > > > > > > > > To: AKASHI Takahiro <takahiro.akashi@linaro.org>; Stefano
> >> > > > > Stabellini
> >> > > > > > > <sstabellini@kernel.org>
> >> > > > > > > > > > Cc: Alex Bennée <alex.bennee@linaro.org>; Stratos
> >> Mailing
> >> > List
> >> > > > > > > <stratos-dev@op-lists.linaro.org>; virtio-dev@lists.oasis-
> >> > open.org;
> >> > > > > Arnd
> >> > > > > > > Bergmann <arnd.bergmann@linaro.org>; Viresh Kumar
> >> > > > > > > <viresh.kumar@linaro.org>; Stefano Stabellini
> >> > > > > > > <stefano.stabellini@xilinx.com>; stefanha@redhat.com; Jan
> >> Kiszka
> >> > > > > > > <jan.kiszka@siemens.com>; Carl van Schaik
> >> > <cvanscha@qti.qualcomm.com>;
> >> > > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org
> >> >;
> >> > Jean-
> >> > > > > > > Philippe Brucker <jean-philippe@linaro.org>; Mathieu Poirier
> >> > > > > > > <mathieu.poirier@linaro.org>; Wei Chen <Wei.Chen@arm.com>;
> >> > Oleksandr
> >> > > > > > > Tyshchenko <Oleksandr_Tyshchenko@epam.com>; Bertrand Marquis
> >> > > > > > > <Bertrand.Marquis@arm.com>; Artem Mygaiev
> >> > <Artem_Mygaiev@epam.com>;
> >> > > > > Julien
> >> > > > > > > Grall <julien@xen.org>; Juergen Gross <jgross@suse.com>; Paul
> >> > Durrant
> >> > > > > > > <paul@xen.org>; Xen Devel <xen-devel@lists.xen.org>
> >> > > > > > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
> >> > backends
> >> > > > > > > > >
> >> > > > > > > > > Hello, all.
> >> > > > > > > > >
> >> > > > > > > > > Please see some comments below. And sorry for the possible
> >> > format
> >> > > > > > > issues.
> >> > > > > > > > >
> >> > > > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
> >> > > > > > > <mailto:takahiro.akashi@linaro.org> wrote:
> >> > > > > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano
> >> > Stabellini
> >> > > > > wrote:
> >> > > > > > > > > > > CCing people working on Xen+VirtIO and IOREQs. Not
> >> > trimming
> >> > > > > the
> >> > > > > > > original
> >> > > > > > > > > > > email to let them read the full context.
> >> > > > > > > > > > >
> >> > > > > > > > > > > My comments below are related to a potential Xen
> >> > > > > implementation,
> >> > > > > > > not
> >> > > > > > > > > > > because it is the only implementation that matters,
> >> but
> >> > > > > because it
> >> > > > > > > is
> >> > > > > > > > > > > the one I know best.
> >> > > > > > > > > >
> >> > > > > > > > > > Please note that my proposal (and hence the working
> >> > prototype)[1]
> >> > > > > > > > > > is based on Xen's virtio implementation (i.e. IOREQ) and
> >> > > > > > > particularly
> >> > > > > > > > > > EPAM's virtio-disk application (backend server).
> >> > > > > > > > > > It has been, I believe, well generalized but is still a
> >> > bit
> >> > > > > biased
> >> > > > > > > > > > toward this original design.
> >> > > > > > > > > >
> >> > > > > > > > > > So I hope you like my approach :)
> >> > > > > > > > > >
> >> > > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
> >> > dev/2021-
> >> > > > > > > August/000546.html
> >> > > > > > > > > >
> >> > > > > > > > > > Let me take this opportunity to explain a bit more about
> >> > my
> >> > > > > approach
> >> > > > > > > below.
> >> > > > > > > > > >
> >> > > > > > > > > > > Also, please see this relevant email thread:
> >> > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote:
> >> > > > > > > > > > > > Hi,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > One of the goals of Project Stratos is to enable
> >> > hypervisor
> >> > > > > > > agnostic
> >> > > > > > > > > > > > backends so we can enable as much re-use of code as
> >> > possible
> >> > > > > and
> >> > > > > > > avoid
> >> > > > > > > > > > > > repeating ourselves. This is the flip side of the
> >> > front end
> >> > > > > > > where
> >> > > > > > > > > > > > multiple front-end implementations are required -
> >> one
> >> > per OS,
> >> > > > > > > assuming
> >> > > > > > > > > > > > you don't just want Linux guests. The resultant
> >> guests
> >> > are
> >> > > > > > > trivially
> >> > > > > > > > > > > > movable between hypervisors modulo any abstracted
> >> > paravirt
> >> > > > > type
> >> > > > > > > > > > > > interfaces.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > In my original thumb nail sketch of a solution I
> >> > envisioned
> >> > > > > > > vhost-user
> >> > > > > > > > > > > > daemons running in a broadly POSIX like environment.
> >> > The
> >> > > > > > > interface to
> >> > > > > > > > > > > > the daemon is fairly simple requiring only some
> >> mapped
> >> > > > > memory
> >> > > > > > > and some
> >> > > > > > > > > > > > sort of signalling for events (on Linux this is
> >> > eventfd).
> >> > > > > The
> >> > > > > > > idea was a
> >> > > > > > > > > > > > stub binary would be responsible for any hypervisor
> >> > specific
> >> > > > > > > setup and
> >> > > > > > > > > > > > then launch a common binary to deal with the actual
> >> > > > > virtqueue
> >> > > > > > > requests
> >> > > > > > > > > > > > themselves.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Since that original sketch we've seen an expansion
> >> in
> >> > the
> >> > > > > sort
> >> > > > > > > of ways
> >> > > > > > > > > > > > backends could be created. There is interest in
> >> > > > > encapsulating
> >> > > > > > > backends
> >> > > > > > > > > > > > in RTOSes or unikernels for solutions like SCMI.
> >> There
> >> > > > > interest
> >> > > > > > > in Rust
> >> > > > > > > > > > > > has prompted ideas of using the trait interface to
> >> > abstract
> >> > > > > > > differences
> >> > > > > > > > > > > > away as well as the idea of bare-metal Rust
> >> backends.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > We have a card (STR-12) called "Hypercall
> >> > Standardisation"
> >> > > > > which
> >> > > > > > > > > > > > calls for a description of the APIs needed from the
> >> > > > > hypervisor
> > > > > > > > > > > > > side to support VirtIO guests and their backends. However we are some
> > > > > > > > > > > > > way off from that at the moment as I think we need to at least
> > > > > > > > > > > > > demonstrate one portable backend before we start codifying
> > > > > > > > > > > > > requirements. To that end I want to think about what we need for a
> > > > > > > > > > > > > backend to function.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Configuration
> > > > > > > > > > > > > =============
> > > > > > > > > > > > >
> > > > > > > > > > > > > In the type-2 setup this is typically fairly simple because the host
> > > > > > > > > > > > > system can orchestrate the various modules that make up the complete
> > > > > > > > > > > > > system. In the type-1 case (or even type-2 with delegated service VMs)
> > > > > > > > > > > > > we need some sort of mechanism to inform the backend VM about key
> > > > > > > > > > > > > details about the system:
> > > > > > > > > > > > >
> > > > > > > > > > > > >   - where virt queue memory is in it's address space
> > > > > > > > > > > > >   - how it's going to receive (interrupt) and trigger (kick) events
> > > > > > > > > > > > >   - what (if any) resources the backend needs to connect to
> > > > > > > > > > > > >
> > > > > > > > > > > > > Obviously you can elide over configuration issues by having static
> > > > > > > > > > > > > configurations and baking the assumptions into your guest images
> > > > > > > > > > > > > however this isn't scalable in the long term. The obvious solution
> > > > > > > > > > > > > seems to be extending a subset of Device Tree data to user space but
> > > > > > > > > > > > > perhaps there are other approaches?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Before any virtio transactions can take place the appropriate memory
> > > > > > > > > > > > > mappings need to be made between the FE guest and the BE guest.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Currently the whole of the FE guests address space needs to be
> > > > > > > > > > > > > visible to whatever is serving the virtio requests. I can envision 3
> > > > > > > > > > > > > approaches:
> > > > > > > > > > > > >
> > > > > > > > > > > > >  * BE guest boots with memory already mapped
> > > > > > > > > > > > >
> > > > > > > > > > > > >  This would entail the guest OS knowing where in it's Guest Physical
> > > > > > > > > > > > >  Address space is already taken up and avoiding clashing. I would
> > > > > > > > > > > > >  assume in this case you would want a standard interface to userspace
> > > > > > > > > > > > >  to then make that address space visible to the backend daemon.
> > > > > > > > > > >
> > > > > > > > > > > Yet another way here is that we would have well known "shared memory"
> > > > > > > > > > > between VMs. I think that Jailhouse's ivshmem gives us good insights
> > > > > > > > > > > on this matter and that it can even be an alternative for
> > > > > > > > > > > hypervisor-agnostic solution.
> > > > > > > > > > >
> > > > > > > > > > > (Please note memory regions in ivshmem appear as a PCI device and can
> > > > > > > > > > > be mapped locally.)
> > > > > > > > > > >
> > > > > > > > > > > I want to add this shared memory aspect to my virtio-proxy, but
> > > > > > > > > > > the resultant solution would eventually look similar to ivshmem.
> > > > > > > > > > >
> > > > > > > > > > > > >  * BE guests boots with a hypervisor handle to memory
> > > > > > > > > > > > >
> > > > > > > > > > > > >  The BE guest is then free to map the FE's memory to where it wants
> > > > > > > > > > > > >  in the BE's guest physical address space.
> > > > > > > > > > > >
> > > > > > > > > > > > I cannot see how this could work for Xen. There is no "handle" to
> > > > > > > > > > > > give to the backend if the backend is not running in dom0. So for
> > > > > > > > > > > > Xen I think the memory has to be already mapped
> > > > > > > > > > >
> > > > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following information is
> > > > > > > > > > > expected to be exposed to BE via Xenstore:
> > > > > > > > > > > (I know that this is a tentative approach though.)
> > > > > > > > > > >    - the start address of configuration space
> > > > > > > > > > >    - interrupt number
> > > > > > > > > > >    - file path for backing storage
> > > > > > > > > > >    - read-only flag
> > > > > > > > > > > And the BE server has to call a particular hypervisor interface to
> > > > > > > > > > > map the configuration space.
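
As an aside, a minimal sketch of how a backend could pull such keys out of
Xenstore with libxenstore is below. The key paths and layout are purely
illustrative (whatever schema the toolstack/PoC actually defines is
authoritative), and error handling is trimmed:

    #include <stdio.h>
    #include <stdlib.h>
    #include <xenstore.h>

    /* Hypothetical per-device base path written by the toolstack. */
    #define CFG_BASE "/local/domain/1/device/virtio-blk/0"

    static char *read_key(struct xs_handle *xs, const char *key)
    {
        char path[256];
        unsigned int len;

        snprintf(path, sizeof(path), CFG_BASE "/%s", key);
        return xs_read(xs, XBT_NULL, path, &len);   /* caller frees */
    }

    int main(void)
    {
        struct xs_handle *xs = xs_open(0);
        if (!xs)
            return 1;

        char *base    = read_key(xs, "base");    /* config space address */
        char *irq     = read_key(xs, "irq");     /* interrupt number     */
        char *backing = read_key(xs, "params");  /* backing file path    */
        char *ro      = read_key(xs, "ro");      /* read-only flag       */

        printf("cfg=%s irq=%s file=%s ro=%s\n",
               base ? base : "?", irq ? irq : "?",
               backing ? backing : "?", ro ? ro : "?");

        free(base); free(irq); free(backing); free(ro);
        xs_close(xs);
        return 0;
    }
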
> >> > > > > > > > >
> >> > > > > > > > > Yes, Xenstore was chosen as a simple way to pass
> >> > configuration
> >> > > > > info to
> >> > > > > > > the backend running in a non-toolstack domain.
> >> > > > > > > > > I remember, there was a wish to avoid using Xenstore in
> >> > Virtio
> >> > > > > backend
> >> > > > > > > itself if possible, so for non-toolstack domain, this could
> >> done
> >> > with
> >> > > > > > > adjusting devd (daemon that listens for devices and launches
> >> > backends)
> >> > > > > > > > > to read backend configuration from the Xenstore anyway and
> >> > pass it
> >> > > > > to
> >> > > > > > > the backend via command line arguments.
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Yes, in current PoC code we're using xenstore to pass device
> >> > > > > > > configuration.
> >> > > > > > > > We also designed a static device configuration parse method
> >> > for
> >> > > > > Dom0less
> >> > > > > > > or
> >> > > > > > > > other scenarios don't have xentool. yes, it's from device
> >> > model
> >> > > > > command
> >> > > > > > > line
> >> > > > > > > > or a config file.
> >> > > > > > > >
> >> > > > > > > > > But, if ...
> >> > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > In my approach (virtio-proxy), all those Xen (or
> >> > hypervisor)-
> >> > > > > > > specific
> >> > > > > > > > > > stuffs are contained in virtio-proxy, yet another VM, to
> >> > hide
> >> > > > > all
> >> > > > > > > details.
> >> > > > > > > > >
> >> > > > > > > > > ... the solution how to overcome that is already found and
> >> > proven
> >> > > > > to
> >> > > > > > > work then even better.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > > # My point is that a "handle" is not mandatory for
> >> > executing
> >> > > > > mapping.
> >> > > > > > > > > >
> >> > > > > > > > > > > and the mapping probably done by the
> >> > > > > > > > > > > toolstack (also see below.) Or we would have to
> >> invent a
> >> > new
> >> > > > > Xen
> >> > > > > > > > > > > hypervisor interface and Xen virtual machine
> >> privileges
> >> > to
> >> > > > > allow
> >> > > > > > > this
> >> > > > > > > > > > > kind of mapping.
> >> > > > > > > > > >
> >> > > > > > > > > > > If we run the backend in Dom0 that we have no problems
> >> > of
> >> > > > > course.
> >> > > > > > > > > >
> >> > > > > > > > > > One of difficulties on Xen that I found in my approach
> >> is
> >> > that
> >> > > > > > > calling
> >> > > > > > > > > > such hypervisor intefaces (registering IOREQ, mapping
> >> > memory) is
> >> > > > > > > only
> >> > > > > > > > > > allowed on BE servers themselvies and so we will have to
> >> > extend
> >> > > > > > > those
> >> > > > > > > > > > interfaces.
> >> > > > > > > > > > This, however, will raise some concern on security and
> >> > privilege
> >> > > > > > > distribution
> >> > > > > > > > > > as Stefan suggested.
> >> > > > > > > > >
> >> > > > > > > > > We also faced policy related issues with Virtio backend
> >> > running in
> >> > > > > > > other than Dom0 domain in a "dummy" xsm mode. In our target
> >> > system we
> >> > > > > run
> >> > > > > > > the backend in a driver
> >> > > > > > > > > domain (we call it DomD) where the underlying H/W resides.
> >> > We
> >> > > > > trust it,
> >> > > > > > > so we wrote policy rules (to be used in "flask" xsm mode) to
> >> > provide
> >> > > > > it
> >> > > > > > > with a little bit more privileges than a simple DomU had.
> >> > > > > > > > > Now it is permitted to issue device-model, resource and
> >> > memory
> >> > > > > > > mappings, etc calls.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > To activate the mapping will require some sort of hypercall to the
> > > > > > > > > > > > > hypervisor. I can see two options at this point:
> > > > > > > > > > > > >
> > > > > > > > > > > > >   - expose the handle to userspace for daemon/helper to trigger the
> > > > > > > > > > > > >     mapping via existing hypercall interfaces. If using a helper you
> > > > > > > > > > > > >     would have a hypervisor specific one to avoid the daemon having
> > > > > > > > > > > > >     to care too much about the details or push that complexity into
> > > > > > > > > > > > >     a compile time option for the daemon which would result in
> > > > > > > > > > > > >     different binaries although a common source base.
> > > > > > > > > > > > >
> > > > > > > > > > > > >   - expose a new kernel ABI to abstract the hypercall differences
> > > > > > > > > > > > >     away in the guest kernel. In this case the userspace would
> > > > > > > > > > > > >     essentially ask for an abstract "map guest N memory to userspace
> > > > > > > > > > > > >     ptr" and let the kernel deal with the different hypercall
> > > > > > > > > > > > >     interfaces. This of course assumes the majority of BE guests
> > > > > > > > > > > > >     would be Linux kernels and leaves the bare-metal/unikernel
> > > > > > > > > > > > >     approaches to their own devices.
> > > > > > > > > > > > >
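
To make the second option more concrete, here is a sketch of what such an
abstracted kernel ABI might look like. Everything below is hypothetical:
the device node, ioctl numbers, structure and field names are invented for
illustration and do not correspond to any existing interface.

    /*
     * Hypothetical UAPI for an abstract "map guest N memory to userspace
     * ptr" interface; a kernel driver behind it would translate to
     * Xen/KVM/... specific hypercalls.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    struct guest_mem_map {
        uint32_t guest_id;    /* which FE guest (domid, VM handle, ...)  */
        uint32_t flags;       /* e.g. read-only                          */
        uint64_t guest_addr;  /* start of region in guest physical space */
        uint64_t size;        /* length in bytes                         */
        uint64_t user_addr;   /* OUT: address of the mapping in the BE   */
    };

    #define GMEM_IOC_MAP   _IOWR('G', 0x00, struct guest_mem_map)
    #define GMEM_IOC_UNMAP _IOW('G', 0x01, struct guest_mem_map)

    /* The daemon side stays hypervisor-agnostic: */
    static void *map_guest(int fd, uint32_t guest, uint64_t gpa, uint64_t len)
    {
        struct guest_mem_map m = {
            .guest_id = guest, .guest_addr = gpa, .size = len,
        };
        if (ioctl(fd, GMEM_IOC_MAP, &m) < 0)
            return NULL;
        return (void *)(uintptr_t)m.user_addr;
    }
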
> > > > > > > > > > > > > Operation
> > > > > > > > > > > > > =========
> > > > > > > > > > > > >
> > > > > > > > > > > > > The core of the operation of VirtIO is fairly simple. Once the
> > > > > > > > > > > > > vhost-user feature negotiation is done it's a case of receiving
> > > > > > > > > > > > > update events and parsing the resultant virt queue for data. The
> > > > > > > > > > > > > vhost-user specification handles a bunch of setup before that point,
> > > > > > > > > > > > > mostly to detail where the virt queues are set up FD's for memory
> > > > > > > > > > > > > and event communication. This is where the envisioned stub process
> > > > > > > > > > > > > would be responsible for getting the daemon up and ready to run.
> > > > > > > > > > > > > This is currently done inside a big VMM like QEMU but I suspect a
> > > > > > > > > > > > > modern approach would be to use the rust-vmm vhost crate. It would
> > > > > > > > > > > > > then either communicate with the kernel's abstracted ABI or be
> > > > > > > > > > > > > re-targeted as a build option for the various hypervisors.
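
For reference, the hot path of such a daemon on a POSIX host is tiny; a
minimal sketch of the eventfd kick handling is below. process_vring() is a
placeholder for the actual virtqueue parsing, and in a real vhost-user
backend the kick fd would arrive via VHOST_USER_SET_VRING_KICK rather than
being created locally:

    #include <poll.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/eventfd.h>

    static void process_vring(void) { /* placeholder: walk the virtqueue */ }

    int main(void)
    {
        int kick_fd = eventfd(0, 0);       /* stands in for the FE's kick */
        struct pollfd pfd = { .fd = kick_fd, .events = POLLIN };

        for (;;) {
            if (poll(&pfd, 1, -1) < 0)
                break;
            uint64_t n;
            if (read(kick_fd, &n, sizeof(n)) == sizeof(n)) {
                /* n kicks were coalesced since the last read */
                process_vring();
            }
        }
        close(kick_fd);
        return 0;
    }
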
> > > > > > > > > > > >
> > > > > > > > > > > > One thing I mentioned before to Alex is that Xen doesn't have VMMs
> > > > > > > > > > > > the way they are typically envisioned and described in other
> > > > > > > > > > > > environments. Instead, Xen has IOREQ servers. Each of them connects
> > > > > > > > > > > > independently to Xen via the IOREQ interface. E.g. today multiple
> > > > > > > > > > > > QEMUs could be used as emulators for a single Xen VM, each of them
> > > > > > > > > > > > connecting to Xen independently via the IOREQ interface.
> > > > > > > > > > > >
> > > > > > > > > > > > The component responsible for starting a daemon and/or setting up
> > > > > > > > > > > > shared interfaces is the toolstack: the xl command and the
> > > > > > > > > > > > libxl/libxc libraries.
> > > > > > > > > > >
> > > > > > > > > > > I think that VM configuration management (or orchestration in Stratos
> > > > > > > > > > > jargon?) is a subject to debate in parallel.
> > > > > > > > > > > Otherwise, is there any good assumption to avoid it right now?
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Oleksandr and others I CCed have been working on ways for the
> > > > > > > > > > > > toolstack to create virtio backends and setup memory mappings. They
> > > > > > > > > > > > might be able to provide more info on the subject. I do think we
> > > > > > > > > > > > miss a way to provide the configuration to the backend and anything
> > > > > > > > > > > > else that the backend might require to start doing its job.
> > > > > > > > > >
> > > > > > > > > > Yes, some work has been done for the toolstack to handle Virtio MMIO
> > > > > > > > > > devices in general and Virtio block devices in particular. However,
> > > > > > > > > > it has not been upstreamed yet.
> > > > > > > > > > Updated patches on review now:
> > > > > > > > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-olekstysh@gmail.com/
> > > > > > > > > >
> > > > > > > > > > There is an additional (also important) activity to improve/fix
> > > > > > > > > > foreign memory mapping on Arm which I am also involved in.
> > > > > > > > > > The foreign memory mapping is proposed to be used for Virtio backends
> > > > > > > > > > (device emulators) if there is a need to run the guest OS completely
> > > > > > > > > > unmodified. Of course, the more secure way would be to use grant
> > > > > > > > > > memory mapping. Briefly, the main difference between them is that
> > > > > > > > > > with foreign mapping the backend can map any guest memory it wants to
> > > > > > > > > > map, but with grant mapping it is allowed to map only what was
> > > > > > > > > > previously granted by the frontend.
> > > > > > > > > >
> > > > > > > > > > So, there might be a problem if we want to pre-map some guest memory
> > > > > > > > > > in advance or to cache mappings in the backend in order to improve
> > > > > > > > > > performance (because mapping/unmapping guest pages on every request
> > > > > > > > > > requires a lot of back and forth to Xen + P2M updates). In a
> > > > > > > > > > nutshell, currently, in order to map a guest page into the backend
> > > > > > > > > > address space we need to steal a real physical page from the backend
> > > > > > > > > > domain. So, with the said optimizations we might end up with no free
> > > > > > > > > > memory in the backend domain (see XSA-300). What we try to achieve is
> > > > > > > > > > to not waste real domain memory at all by providing safe,
> > > > > > > > > > not-yet-allocated (so unused) address space for the foreign (and
> > > > > > > > > > grant) pages to be mapped into. This enabling work implies Xen and
> > > > > > > > > > Linux (and likely DTB bindings) changes. However, as it turned out,
> > > > > > > > > > for this to work in a proper and safe way some prereq work needs to
> > > > > > > > > > be done.
> > > > > > > > > > You can find the related Xen discussion at:
> > > > > > > > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/
> > > > > > > > > >
> > > > > > > > > >
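
To illustrate the difference between the two mapping styles in userspace
terms, a rough sketch using libxenforeignmemory and libxengnttab follows.
Treat the exact call signatures as assumptions to be checked against the
installed headers; the domid/frame/grant values are of course made up:

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <xenforeignmemory.h>
    #include <xengnttab.h>

    int main(void)
    {
        uint32_t  domid = 1;         /* FE guest, made up            */
        xen_pfn_t gfn   = 0x80000;   /* arbitrary guest frame number */
        uint32_t  gref  = 42;        /* arbitrary grant reference    */
        int err;

        /* Foreign mapping: the BE can reach any frame of the FE guest. */
        xenforeignmemory_handle *fmem = xenforeignmemory_open(NULL, 0);
        void *f = xenforeignmemory_map(fmem, domid, PROT_READ | PROT_WRITE,
                                       1, &gfn, &err);

        /* Grant mapping: the BE can only reach what the FE granted. */
        xengnttab_handle *xgt = xengnttab_open(NULL, 0);
        void *g = xengnttab_map_grant_ref(xgt, domid, gref,
                                          PROT_READ | PROT_WRITE);

        printf("foreign=%p grant=%p\n", f, g);
        return 0;
    }
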
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > One question is how to best handle notification and kicks. The
> > > > > > > > > > > > > existing vhost-user framework uses eventfd to signal the daemon
> > > > > > > > > > > > > (although QEMU is quite capable of simulating them when you use
> > > > > > > > > > > > > TCG). Xen has it's own IOREQ mechanism. However latency is an
> > > > > > > > > > > > > important factor and having events go through the stub would add
> > > > > > > > > > > > > quite a lot.
> > > > > > > > > > > >
> > > > > > > > > > > > Yeah I think, regardless of anything else, we want the backends to
> > > > > > > > > > > > connect directly to the Xen hypervisor.
> > > > > > > > > > >
> > > > > > > > > > > In my approach,
> > > > > > > > > > >  a) BE -> FE: interrupts triggered by BE calling a hypervisor
> > > > > > > > > > >               interface via virtio-proxy
> > > > > > > > > > >  b) FE -> BE: MMIO to config raises events (in event channels), which
> > > > > > > > > > >               is converted to a callback to BE via virtio-proxy
> > > > > > > > > > >               (Xen's event channel is internally implemented by
> > > > > > > > > > >               interrupts.)
> > > > > > > > > > >
> > > > > > > > > > > I don't know what "connect directly" means here, but sending
> > > > > > > > > > > interrupts to the opposite side would be most efficient.
> > > > > > > > > > > Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x
> > > > > > > > > > > mechanism.
> > > > > > > > > >
> > > > > > > > > > Agree that MSI would be more efficient than SPI...
> > > > > > > > > > At the moment, in order to notify the frontend, the backend issues a
> > > > > > > > > > specific device-model call to query Xen to inject a corresponding SPI
> > > > > > > > > > to the guest.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Could we consider the kernel internally converting IOREQ messages
> > > > > > > > > > > > > from the Xen hypervisor to eventfd events? Would this scale with
> > > > > > > > > > > > > other kernel hypercall interfaces?
> > > > > > > > > > > > >
> > > > > > > > > > > > > So any thoughts on what directions are worth experimenting with?
> > > > > > > > > > > >
> > > > > > > > > > > > One option we should consider is for each backend to connect to Xen
> > > > > > > > > > > > via the IOREQ interface. We could generalize the IOREQ interface and
> > > > > > > > > > > > make it hypervisor agnostic. The interface is really trivial and
> > > > > > > > > > > > easy to add.
> > > > > > > > > > >
> > > > > > > > > > > As I said above, my proposal does the same thing that you mentioned
> > > > > > > > > > > here :)
> > > > > > > > > > > The difference is that I do call hypervisor interfaces via
> > > > > > > > > > > virtio-proxy.
> > > > > > > > > > >
> > > > > > > > > > > > The only Xen-specific part is the notification mechanism, which is
> > > > > > > > > > > > an event channel. If we replaced the event channel with something
> > > > > > > > > > > > else the interface would be generic. See:
> > > > > > > > > > > > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > > > > > > > > >
> > > > > > > > > > > > I don't think that translating IOREQs to eventfd in the kernel is a
> > > > > > > > > > > > good idea: it feels like it would be extra complexity and that the
> > > > > > > > > > > > kernel shouldn't be involved as this is a backend-hypervisor
> > > > > > > > > > > > interface.
> > > > > > > > > > >
> > > > > > > > > > > Given that we may want to implement BE as a bare-metal application
> > > > > > > > > > > as I did on Zephyr, I don't think that the translation would be
> > > > > > > > > > > a big issue, especially on RTOS's.
> > > > > > > > > > > It will be some kind of abstraction layer of interrupt handling
> > > > > > > > > > > (or nothing but a callback mechanism).
> > > > > > > > > > >
> > > > > > > > > > > > Also, eventfd is very Linux-centric and we are trying to design an
> > > > > > > > > > > > interface that could work well for RTOSes too. If we want to do
> > > > > > > > > > > > something different, both OS-agnostic and hypervisor-agnostic,
> > > > > > > > > > > > perhaps we could design a new interface. One that could be
> > > > > > > > > > > > implementable in the Xen hypervisor itself (like IOREQ) and of
> > > > > > > > > > > > course any other hypervisor too.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > There is also another problem. IOREQ is probably not be the only
> > > > > > > > > > > > interface needed. Have a look at
> > > > > > > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also
> > > > > > > > > > > > need an interface for the backend to inject interrupts into the
> > > > > > > > > > > > frontend? And if the backend requires dynamic memory mappings of
> > > > > > > > > > > > frontend pages, then we would also need an interface to map/unmap
> > > > > > > > > > > > domU pages.
> > > > > > > > > > >
> > > > > > > > > > > My proposal document might help here; all the interfaces required
> > > > > > > > > > > for virtio-proxy (or hypervisor-related interfaces) are listed as
> > > > > > > > > > > RPC protocols :)
> > > > > > > > > > >
> > > > > > > > > > > > These interfaces are a lot more problematic than IOREQ: IOREQ is
> > > > > > > > > > > > tiny and self-contained. It is easy to add anywhere. A new interface
> > > > > > > > > > > > to inject interrupts or map pages is more difficult to manage
> > > > > > > > > > > > because it would require changes scattered across the various
> > > > > > > > > > > > emulators.
> > > > > > > > > > >
> > > > > > > > > > > Exactly. I have no confidence yet that my approach will also apply
> > > > > > > > > > > to other hypervisors than Xen.
> > > > > > > > > > > Technically, yes, but whether people can accept it or not is a
> > > > > > > > > > > different matter.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > -Takahiro Akashi
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > Oleksandr Tyshchenko


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-01 12:53       ` [virtio-dev] " Alex Bennée
@ 2021-09-02  9:12         ` Stefan Hajnoczi
  -1 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-09-02  9:12 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Stefano Stabellini, Stratos Mailing List, virtio-dev,
	Arnd Bergmann, Viresh Kumar, AKASHI Takahiro, Stefano Stabellini,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel, Elena Afanasova

[-- Attachment #1: Type: text/plain, Size: 3717 bytes --]

On Wed, Sep 01, 2021 at 01:53:34PM +0100, Alex Bennée wrote:
> 
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> 
> > [[PGP Signed Part:Undecided]]
> > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> >> > Could we consider the kernel internally converting IOREQ messages from
> >> > the Xen hypervisor to eventfd events? Would this scale with other kernel
> >> > hypercall interfaces?
> >> > 
> >> > So any thoughts on what directions are worth experimenting with?
> >>  
> >> One option we should consider is for each backend to connect to Xen via
> >> the IOREQ interface. We could generalize the IOREQ interface and make it
> >> hypervisor agnostic. The interface is really trivial and easy to add.
> >> The only Xen-specific part is the notification mechanism, which is an
> >> event channel. If we replaced the event channel with something else the
> >> interface would be generic. See:
> >> https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> >
> > There have been experiments with something kind of similar in KVM
> > recently (see struct ioregionfd_cmd):
> > https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanasova@gmail.com/
> 
> Reading the cover letter was very useful in showing how this provides a
> separate channel for signalling IO events to userspace instead of using
> the normal type-2 vmexit type event. I wonder how deeply tied the
> userspace facing side of this is to KVM? Could it provide a common FD
> type interface to IOREQ?

I wondered this too after reading Stefano's link to Xen's ioreq. They
seem to be quite similar. ioregionfd is closer to how PIO/MMIO vmexits
are handled in KVM, while I guess ioreq is closer to how Xen handles
them, but those are small details.

It may be possible to use the ioreq struct instead of ioregionfd in KVM,
but I haven't checked each field.

> As I understand IOREQ this is currently a direct communication between
> userspace and the hypervisor using the existing Xen message bus. My
> worry would be that by adding knowledge of what the underlying
> hypervisor is we'd end up with excess complexity in the kernel. For one
> thing we certainly wouldn't want an API version dependency on the kernel
> to understand which version of the Xen hypervisor it was running on.
> 
> >> There is also another problem. IOREQ is probably not be the only
> >> interface needed. Have a look at
> >> https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> >> an interface for the backend to inject interrupts into the frontend? And
> >> if the backend requires dynamic memory mappings of frontend pages, then
> >> we would also need an interface to map/unmap domU pages.
> >> 
> >> These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> >> and self-contained. It is easy to add anywhere. A new interface to
> >> inject interrupts or map pages is more difficult to manage because it
> >> would require changes scattered across the various emulators.
> >
> > Something like ioreq is indeed necessary to implement arbitrary devices,
> > but if you are willing to restrict yourself to VIRTIO then other
> > interfaces are possible too because the VIRTIO device model is different
> > from the general purpose x86 PIO/MMIO that Xen's ioreq seems to
> > support.
> 
> It's true our focus is just VirtIO which does support alternative
> transport options however most implementations seem to be targeting
> virtio-mmio for it's relative simplicity and understood semantics
> (modulo a desire for MSI to reduce round trip latency handling
> signalling).

Okay.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-01 12:53       ` [virtio-dev] " Alex Bennée
  (?)
  (?)
@ 2021-09-03  8:06       ` AKASHI Takahiro
  2021-09-03  9:28           ` [virtio-dev] " Alex Bennée
  -1 siblings, 1 reply; 66+ messages in thread
From: AKASHI Takahiro @ 2021-09-03  8:06 UTC (permalink / raw)
  To: Alex Benn??e
  Cc: Stefan Hajnoczi, Stefano Stabellini, Stratos Mailing List,
	virtio-dev, Arnd Bergmann, Viresh Kumar, Stefano Stabellini,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel, Elena Afanasova

Alex,

On Wed, Sep 01, 2021 at 01:53:34PM +0100, Alex Benn??e wrote:
> 
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> 
> > [[PGP Signed Part:Undecided]]
> > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> >> > Could we consider the kernel internally converting IOREQ messages from
> >> > the Xen hypervisor to eventfd events? Would this scale with other kernel
> >> > hypercall interfaces?
> >> > 
> >> > So any thoughts on what directions are worth experimenting with?
> >>  
> >> One option we should consider is for each backend to connect to Xen via
> >> the IOREQ interface. We could generalize the IOREQ interface and make it
> >> hypervisor agnostic. The interface is really trivial and easy to add.
> >> The only Xen-specific part is the notification mechanism, which is an
> >> event channel. If we replaced the event channel with something else the
> >> interface would be generic. See:
> >> https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> >
> > There have been experiments with something kind of similar in KVM
> > recently (see struct ioregionfd_cmd):
> > https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanasova@gmail.com/
> 
> Reading the cover letter was very useful in showing how this provides a
> separate channel for signalling IO events to userspace instead of using
> the normal type-2 vmexit type event. I wonder how deeply tied the
> userspace facing side of this is to KVM? Could it provide a common FD
> type interface to IOREQ?

Why do you stick to a "FD" type interface?

> As I understand IOREQ this is currently a direct communication between
> userspace and the hypervisor using the existing Xen message bus. My

With IOREQ server, IO event occurrences are notified to BE via Xen's event
channel, while the actual contexts of IO events (see struct ioreq in ioreq.h)
are put in a queue on a single shared memory page which is to be assigned
beforehand with xenforeignmemory_map_resource hypervisor call.
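
For reference, the backend's receive path then looks roughly like the
sketch below. The event-channel calls are from libxenevtchn as I remember
them (treat the exact signatures as assumptions and check the headers),
and handle_pending_ioreqs() stands in for walking the struct ioreq slots
on that shared page:

    #include <xenevtchn.h>

    /* Would really point at the page obtained via
     * xenforeignmemory_map_resource(); placeholder here. */
    static void *shared_ioreq_page;
    static void handle_pending_ioreqs(void *page) { (void)page; }

    int main(void)
    {
        xenevtchn_handle *xce = xenevtchn_open(NULL, 0);
        if (!xce)
            return 1;

        for (;;) {
            /* Block until Xen signals one of our bound event channels. */
            evtchn_port_or_error_t port = xenevtchn_pending(xce);
            if (port < 0)
                break;
            handle_pending_ioreqs(shared_ioreq_page);
            xenevtchn_unmask(xce, port);
        }
        xenevtchn_close(xce);
        return 0;
    }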

> worry would be that by adding knowledge of what the underlying
> hypervisor is we'd end up with excess complexity in the kernel. For one
> thing we certainly wouldn't want an API version dependency on the kernel
> to understand which version of the Xen hypervisor it was running on.

That's exactly what virtio-proxy in my proposal[1] does; All the hypervisor-
specific details of IO event handlings are contained in virtio-proxy
and virtio BE will communicate with virtio-proxy through a virtqueue
(yes, virtio-proxy is seen as yet another virtio device on BE) and will
get IO event-related *RPC* callbacks, either MMIO read or write, from
virtio-proxy.

See page 8 (protocol flow) and 10 (interfaces) in [1].
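
To give a feel for what those RPCs might carry, here is a purely
hypothetical record layout for the MMIO trap callbacks carried over the
BE<->virtio-proxy virtqueue; none of these names are from the actual
proposal, they just illustrate the idea:

    #include <stdint.h>

    /* Hypothetical virtio-proxy RPC, one record per trapped access. */
    enum vproxy_op {
        VPROXY_MMIO_READ  = 1,   /* FE read of a device register      */
        VPROXY_MMIO_WRITE = 2,   /* FE write of a device register     */
        VPROXY_IRQ_INJECT = 3,   /* BE asks the proxy to raise an IRQ */
    };

    struct vproxy_msg {
        uint32_t op;       /* enum vproxy_op                             */
        uint32_t size;     /* access width in bytes: 1, 2, 4 or 8        */
        uint64_t offset;   /* offset into the device's MMIO region       */
        uint64_t value;    /* write payload, or space for the read reply */
        uint64_t cookie;   /* lets the BE pair replies with requests     */
    };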

If kvm's ioregionfd can fit into this protocol, virtio-proxy for kvm
will hopefully be implemented using ioregionfd.

-Takahiro Akashi

[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000548.html

> >> There is also another problem. IOREQ is probably not be the only
> >> interface needed. Have a look at
> >> https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need
> >> an interface for the backend to inject interrupts into the frontend? And
> >> if the backend requires dynamic memory mappings of frontend pages, then
> >> we would also need an interface to map/unmap domU pages.
> >> 
> >> These interfaces are a lot more problematic than IOREQ: IOREQ is tiny
> >> and self-contained. It is easy to add anywhere. A new interface to
> >> inject interrupts or map pages is more difficult to manage because it
> >> would require changes scattered across the various emulators.
> >
> > Something like ioreq is indeed necessary to implement arbitrary devices,
> > but if you are willing to restrict yourself to VIRTIO then other
> > interfaces are possible too because the VIRTIO device model is different
> > from the general purpose x86 PIO/MMIO that Xen's ioreq seems to
> > support.
> 
> It's true our focus is just VirtIO which does support alternative
> transport options however most implementations seem to be targeting
> virtio-mmio for it's relative simplicity and understood semantics
> (modulo a desire for MSI to reduce round trip latency handling
> signalling).
> 
> >
> > Stefan
> >
> > [[End of PGP Signed Part]]
> 
> 
> -- 
> Alex Bennée


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-03  8:06       ` AKASHI Takahiro
@ 2021-09-03  9:28           ` Alex Bennée
  0 siblings, 0 replies; 66+ messages in thread
From: Alex Bennée @ 2021-09-03  9:28 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Stefan Hajnoczi, Stefano Stabellini, Stratos Mailing List,
	virtio-dev, Arnd Bergmann, Viresh Kumar, Stefano Stabellini,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel, Elena Afanasova


AKASHI Takahiro <takahiro.akashi@linaro.org> writes:

> Alex,
>
> On Wed, Sep 01, 2021 at 01:53:34PM +0100, Alex Benn??e wrote:
>> 
>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>> 
>> > [[PGP Signed Part:Undecided]]
>> > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
>> >> > Could we consider the kernel internally converting IOREQ messages from
>> >> > the Xen hypervisor to eventfd events? Would this scale with other kernel
>> >> > hypercall interfaces?
>> >> > 
>> >> > So any thoughts on what directions are worth experimenting with?
>> >>  
>> >> One option we should consider is for each backend to connect to Xen via
>> >> the IOREQ interface. We could generalize the IOREQ interface and make it
>> >> hypervisor agnostic. The interface is really trivial and easy to add.
>> >> The only Xen-specific part is the notification mechanism, which is an
>> >> event channel. If we replaced the event channel with something else the
>> >> interface would be generic. See:
>> >> https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
>> >
>> > There have been experiments with something kind of similar in KVM
>> > recently (see struct ioregionfd_cmd):
>> > https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanasova@gmail.com/
>> 
>> Reading the cover letter was very useful in showing how this provides a
>> separate channel for signalling IO events to userspace instead of using
>> the normal type-2 vmexit type event. I wonder how deeply tied the
>> userspace facing side of this is to KVM? Could it provide a common FD
>> type interface to IOREQ?
>
> Why do you stick to a "FD" type interface?

I mean most user space interfaces on POSIX start with a file descriptor
and the usual read/write semantics or a series of ioctls.

>> As I understand IOREQ this is currently a direct communication between
>> userspace and the hypervisor using the existing Xen message bus. My
>
> With IOREQ server, IO event occurrences are notified to BE via Xen's event
> channel, while the actual contexts of IO events (see struct ioreq in ioreq.h)
> are put in a queue on a single shared memory page which is to be assigned
> beforehand with xenforeignmemory_map_resource hypervisor call.

If we abstracted the IOREQ via the kernel interface you would probably
just want to put the ioreq structure on a queue rather than expose the
shared page to userspace. 
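
Something like the following sketch is what I have in mind: the daemon
just read()s fixed-size, hypervisor-neutral records from a character
device and completes them with a write(), never seeing the shared page.
The device node and record layout are invented for the sake of argument:

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Hypothetical hypervisor-neutral ioreq record, one per access. */
    struct generic_ioreq {
        uint64_t addr;     /* guest physical address of the access  */
        uint64_t data;     /* value written, or filled in for reads */
        uint32_t size;     /* access width in bytes                 */
        uint8_t  dir;      /* 0 = read, 1 = write                   */
        uint8_t  pad[3];
    };

    int main(void)
    {
        int fd = open("/dev/ioreq0", O_RDWR);   /* invented device node */
        if (fd < 0)
            return 1;

        struct generic_ioreq req;
        while (read(fd, &req, sizeof(req)) == sizeof(req)) {
            if (req.dir == 0)
                req.data = 0;                   /* emulated read value  */
            write(fd, &req, sizeof(req));       /* complete the request */
        }
        close(fd);
        return 0;
    }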

>> worry would be that by adding knowledge of what the underlying
>> hypervisor is we'd end up with excess complexity in the kernel. For one
>> thing we certainly wouldn't want an API version dependency on the kernel
>> to understand which version of the Xen hypervisor it was running on.
>
> That's exactly what virtio-proxy in my proposal[1] does; All the hypervisor-
> specific details of IO event handlings are contained in virtio-proxy
> and virtio BE will communicate with virtio-proxy through a virtqueue
> (yes, virtio-proxy is seen as yet another virtio device on BE) and will
> get IO event-related *RPC* callbacks, either MMIO read or write, from
> virtio-proxy.
>
> See page 8 (protocol flow) and 10 (interfaces) in [1].

There are two areas of concern with the proxy approach at the moment.
The first is how the bootstrap of the virtio-proxy channel happens and
the second is how many context switches are involved in a transaction.
Of course with all things there is a trade off. Things involving the
very tightest latency would probably opt for a bare metal backend which
I think would imply hypervisor knowledge in the backend binary.

>
> If kvm's ioregionfd can fit into this protocol, virtio-proxy for kvm
> will hopefully be implemented using ioregionfd.
>
> -Takahiro Akashi
>
> [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000548.html

-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-03  9:28           ` [virtio-dev] " Alex Bennée
  (?)
@ 2021-09-06  2:23           ` AKASHI Takahiro
  2021-09-07  2:41               ` [virtio-dev] " Christopher Clark
  2021-09-13 23:51             ` Stefano Stabellini
  -1 siblings, 2 replies; 66+ messages in thread
From: AKASHI Takahiro @ 2021-09-06  2:23 UTC (permalink / raw)
  To: Alex Benn??e
  Cc: Stefan Hajnoczi, Stefano Stabellini, Stratos Mailing List,
	virtio-dev, Arnd Bergmann, Viresh Kumar, Stefano Stabellini,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel, Elena Afanasova

Alex,

On Fri, Sep 03, 2021 at 10:28:06AM +0100, Alex Benn??e wrote:
> 
> AKASHI Takahiro <takahiro.akashi@linaro.org> writes:
> 
> > Alex,
> >
> > On Wed, Sep 01, 2021 at 01:53:34PM +0100, Alex Benn??e wrote:
> >> 
> >> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >> 
> >> > [[PGP Signed Part:Undecided]]
> >> > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> >> >> > Could we consider the kernel internally converting IOREQ messages from
> >> >> > the Xen hypervisor to eventfd events? Would this scale with other kernel
> >> >> > hypercall interfaces?
> >> >> > 
> >> >> > So any thoughts on what directions are worth experimenting with?
> >> >>  
> >> >> One option we should consider is for each backend to connect to Xen via
> >> >> the IOREQ interface. We could generalize the IOREQ interface and make it
> >> >> hypervisor agnostic. The interface is really trivial and easy to add.
> >> >> The only Xen-specific part is the notification mechanism, which is an
> >> >> event channel. If we replaced the event channel with something else the
> >> >> interface would be generic. See:
> >> >> https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> >> >
> >> > There have been experiments with something kind of similar in KVM
> >> > recently (see struct ioregionfd_cmd):
> >> > https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanasova@gmail.com/
> >> 
> >> Reading the cover letter was very useful in showing how this provides a
> >> separate channel for signalling IO events to userspace instead of using
> >> the normal type-2 vmexit type event. I wonder how deeply tied the
> >> userspace facing side of this is to KVM? Could it provide a common FD
> >> type interface to IOREQ?
> >
> > Why do you stick to a "FD" type interface?
> 
> I mean most user space interfaces on POSIX start with a file descriptor
> and the usual read/write semantics or a series of ioctls.

Who do you assume is responsible for implementing this kind of
fd semantics, the OS on the BE side or the hypervisor itself?

I think such interfaces can only be easily implemented on type-2
hypervisors.

# In this sense, I don't think rust-vmm, as it is, can be
# a general solution.

> >> As I understand IOREQ this is currently a direct communication between
> >> userspace and the hypervisor using the existing Xen message bus. My
> >
> > With IOREQ server, IO event occurrences are notified to BE via Xen's event
> > channel, while the actual contexts of IO events (see struct ioreq in ioreq.h)
> > are put in a queue on a single shared memory page which is to be assigned
> > beforehand with xenforeignmemory_map_resource hypervisor call.
> 
> If we abstracted the IOREQ via the kernel interface you would probably
> just want to put the ioreq structure on a queue rather than expose the
> shared page to userspace. 

Where is that queue?

> >> worry would be that by adding knowledge of what the underlying
> >> hypervisor is we'd end up with excess complexity in the kernel. For one
> >> thing we certainly wouldn't want an API version dependency on the kernel
> >> to understand which version of the Xen hypervisor it was running on.
> >
> > That's exactly what virtio-proxy in my proposal[1] does; All the hypervisor-
> > specific details of IO event handlings are contained in virtio-proxy
> > and virtio BE will communicate with virtio-proxy through a virtqueue
> > (yes, virtio-proxy is seen as yet another virtio device on BE) and will
> > get IO event-related *RPC* callbacks, either MMIO read or write, from
> > virtio-proxy.
> >
> > See page 8 (protocol flow) and 10 (interfaces) in [1].
> 
> There are two areas of concern with the proxy approach at the moment.
> The first is how the bootstrap of the virtio-proxy channel happens and

As I said, from the BE's point of view, virtio-proxy would be seen
as yet another virtio device by which the BE could talk to the
"virtio-proxy" VM or whatever else.

This way we guarantee the BE's hypervisor-agnosticism instead of having
"common" hypervisor interfaces. That is the basis of my idea.

> the second is how many context switches are involved in a transaction.
> Of course with all things there is a trade off. Things involving the
> very tightest latency would probably opt for a bare metal backend which
> I think would imply hypervisor knowledge in the backend binary.

In the configuration phase of a virtio device, the latency won't be a
big matter. In device operations (i.e. read/write to block devices), if
we can resolve the 'mmap' issue, as Oleksandr is proposing right now,
the only remaining issue is how efficiently we can deliver notifications
to the opposite side. Right? And this is a very common problem whatever
approach we take.

Anyhow, if we do care about latency in my approach, most of the
virtio-proxy-related code can be re-implemented just as a stub (or
shim?) library since the protocols are defined as RPCs. In this case,
however, we would lose the benefit of providing a "single binary" BE.
(I know this is an arguable requirement, though.)

# Perhaps we should first discuss what "hypervisor-agnosticism" means?

-Takahiro Akashi

> >
> > If kvm's ioregionfd can fit into this protocol, virtio-proxy for kvm
> > will hopefully be implemented using ioregionfd.
> >
> > -Takahiro Akashi
> >
> > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000548.html
> 
> -- 
> Alex Bennée


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-02  7:19                           ` AKASHI Takahiro
@ 2021-09-07  0:57                               ` Christopher Clark
  0 siblings, 0 replies; 66+ messages in thread
From: Christopher Clark @ 2021-09-07  0:57 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Wei Chen, Oleksandr Tyshchenko, Stefano Stabellini, Alex Benn??e,
	Kaly Xin, Stratos Mailing List, virtio-dev, Arnd Bergmann,
	Viresh Kumar, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Oleksandr Tyshchenko,
	Bertrand Marquis, Artem Mygaiev, Julien Grall, Juergen Gross,
	Paul Durrant, Xen Devel, Rich Persaud, Daniel Smith,
	James McKenzie, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 12671 bytes --]

On Thu, Sep 2, 2021 at 12:19 AM AKASHI Takahiro <takahiro.akashi@linaro.org>
wrote:

> Hi Christopher,
>
> Thank you for your feedback.
>
> On Mon, Aug 30, 2021 at 12:53:00PM -0700, Christopher Clark wrote:
> > [ resending message to ensure delivery to the CCd mailing lists
> > post-subscription ]
> >
> > Apologies for being late to this thread, but I hope to be able to
> > contribute to this discussion in a meaningful way. I am grateful for
> > the level of interest in this topic. I would like to draw your
> > attention to Argo as a suitable technology for development of
> > VirtIO's hypervisor-agnostic interfaces.
> >
> > * Argo is an interdomain communication mechanism in Xen (on x86 and
> >   Arm) that can send and receive hypervisor-mediated notifications
> >   and messages between domains (VMs). [1] The hypervisor can enforce
> >   Mandatory Access Control over all communication between domains.
> >   It is derived from the earlier v4v, which has been deployed on
> >   millions of machines with the HP/Bromium uXen hypervisor and with
> >   OpenXT.
> >
> > * Argo has a simple interface with a small number of operations that
> >   was designed for ease of integration into OS primitives on both
> >   Linux (sockets) and Windows (ReadFile/WriteFile) [2].
> >     - A unikernel example of using it has also been developed for
> >       XTF. [3]
> >
> > * There has been recent discussion and support in the Xen community
> >   for making revisions to the Argo interface to make it
> >   hypervisor-agnostic, and support implementations of Argo on other
> >   hypervisors. This will enable a single interface for an OS kernel
> >   binary to use for inter-VM communication that will work on multiple
> >   hypervisors -- this applies equally to both backends and frontend
> >   implementations. [4]
>
> Regarding virtio-over-Argo, let me ask a few questions:
> (In figure "Virtual device buffer access:Virtio+Argo" in [4])
>

(for ref, this diagram is from this document:
 https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698 )

Takahiro, thanks for reading the Virtio-Argo materials.

Some relevant context before answering your questions below: the Argo
request interface from the hypervisor to a guest, which is currently
exposed only via a dedicated hypercall op, has been discussed within the
Xen community and is open to being changed in order to better enable
support for guest VM access to Argo functions in a hypervisor-agnostic
way.

The proposal is to allow hypervisors the option to implement and expose
any of multiple access mechanisms for Argo, and then enable a guest
device driver to probe the hypervisor for methods that it is aware of
and able to use. The hypercall op is likely to be retained (in some
form), and complemented at least on x86 with another interface via MSRs
presented to the guests.



> 1) How the configuration is managed?
>    On either virtio-mmio or virtio-pci, there always takes place
>    some negotiation between the FE and BE through the "configuration"
>    space. How can this be done in virtio-over-Argo?
>

Just to be clear about my understanding: your question, in the context of a
Linux kernel virtio device driver implementation, is about how a virtio-argo
transport driver would implement the get_features function of the
virtio_config_ops, as a parallel to the work that vp_get_features does for
virtio-pci, and vm_get_features does for virtio-mmio.

The design is still open on this and options have been discussed, including:

* an extension to Argo to allow the system toolstack (which is
  responsible for managing guest VMs and enabling connections from
  front-to-backends) to manage a table of "implicit destinations", so a
  guest can transmit Argo messages to eg. "my storage service" port and
  the hypervisor will deliver it based on a destination table
  pre-programmed by the toolstack for the VM. [1]
     - ref: Notes from the December 2019 Xen F2F meeting in Cambridge, UK:
       [1] https://lists.archive.carbon60.com/xen/devel/577800#577800

  So within that feature negotiation function, communication with the
  backend via that Argo channel will occur.

* IOREQ
The Xen IOREQ implementation is not currently appropriate for virtio-argo
since it requires the use of foreign memory mappings of frontend memory in
the backend guest. However, a new HMX interface from the hypervisor could
support a new DMA Device Model Op to allow the backend to request the
hypervisor to retrieve specified bytes from the frontend guest, which would
enable plumbing for device configuration between an IOREQ server (device
model backend implementation) and the guest driver. [2]
(A rough sketch of the shape such an op could take follows this list.)

Feature negotiation in the front end in this case would look very similar to
the virtio-mmio implementation.

ref: Argo HMX Transport for VirtIO meeting minutes, from January 2021:
[2]
https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html

* guest ACPI tables that surface the address of a remote Argo endpoint on
  behalf of the toolstack, and Argo communication can then negotiate
  features

* emulation of a basic PCI device by the hypervisor (though details not
  determined)
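
As a purely illustrative sketch of the IOREQ option above (none of these
names exist in Xen today), the new DMA Device Model Op could carry a
payload along these lines:

#include <stdint.h>

/* Hypothetical shape of a "DMA" Device Model Op: the backend (IOREQ
 * server) asks the hypervisor to copy bytes out of the frontend guest
 * rather than foreign-mapping the frontend's memory. Illustrative only;
 * no such op exists in Xen today. */
struct dm_op_hmx_copy_from_guest {
        uint64_t src_gpa;       /* guest physical address in the frontend */
        uint64_t dst_vaddr;     /* destination buffer in the backend      */
        uint64_t len;           /* number of bytes to copy                */
        uint32_t flags;
        uint32_t pad;
};

The hypervisor would perform the copy on the backend's behalf, subject to
whatever access control it enforces, so no foreign mapping of frontend
memory would be needed.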



> 2) Do there physically exist virtio's available/used vrings as well as
>    descriptors, or are they virtually emulated over Argo (rings)?
>

In short: the latter.

In the analysis that I did when looking at this, my observation was that
each side (front and backend) should be able to accurately maintain their
own local copy of the available/used vrings as well as descriptors, and
both be kept synchronized by ensuring that updates are transmitted to the
other side when they are written to. eg. as part of this, in the Linux
front end implementation the virtqueue_notify function uses a function
pointer in the virtqueue that is populated by the transport driver, ie.
the virtio-argo driver in this case, which can implement the necessary
logic to coordinate with the backend.


> 3) The payload in a request will be copied into the receiver's Argo ring.
>    What does the address in a descriptor mean?
>    Address/offset in a ring buffer?
>

Effectively yes. I would treat it as a handle that is used to identify and
retrieve data from messages exchanged between the frontend transport
driver and the backend via Argo rings established for moving data for the
data path. In the diagram, those are "Argo ring for reads" and "Argo ring
for writes".


> 4) Estimate of performance or latency?
>

Different access methods to Argo (ie. related to my answer to your
question '1)' above) will have different performance characteristics.

Data copying will necessarily be involved for any Hypervisor-Mediated data
eXchange (HMX) mechanism[1], such as Argo, where there is no shared memory
between guest VMs, but the performance profile on modern CPUs with sizable
caches has been demonstrated to be acceptable for the guest virtual device
drivers use case in the HP/Bromium vSentry uXen product. The VirtIO
structure is somewhat different though.

Further performance profiling and measurement will be valuable for
enabling tuning of the implementation and development of additional
interfaces (eg. an asynchronous send primitive); some of this has been
discussed and described on the VirtIO-Argo-Development-Phase-1 wiki
page[2].

[1]
https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_for_Xen

[2]
https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development%3A+Phase+1


>    It appears that, on FE side, at least three hypervisor calls (and data
>    copying) need to be invoked at every request, right?
>

For a write, counting FE sendv ops:
1: the write data payload is sent via the "Argo ring for writes"
2: the descriptor is sent via a sync of the available/descriptor ring
  -- is there a third one that I am missing?

Christopher


>
> Thanks,
> -Takahiro Akashi
>
>
> > * Here are the design documents for building VirtIO-over-Argo, to
> support a
> >   hypervisor-agnostic frontend VirtIO transport driver using Argo.
> >
> > The Development Plan to build VirtIO virtual device support over Argo
> > transport:
> >
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1
> >
> > A design for using VirtIO over Argo, describing how VirtIO data
> structures
> > and communication is handled over the Argo transport:
> > https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo
> >
> > Diagram (from the above document) showing how VirtIO rings are
> synchronized
> > between domains without using shared memory:
> >
> https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob-url=true&id=01f7d0e1-7686-4f0b-88e1-457c1d30df40&collection=contentId-1348763698&contextId=1348763698&mimeType=image%2Fpng&name=device-buffer-access-virtio-argo.png&size=243175&width=1106&height=1241
> >
> > Please note that the above design documents show that the existing VirtIO
> > device drivers, and both vring and virtqueue data structures can be
> > preserved
> > while interdomain communication can be performed with no shared memory
> > required
> > for most drivers; (the exceptions where further design is required are
> those
> > such as virtual framebuffer devices where shared memory regions are
> > intentionally
> > added to the communication structure beyond the vrings and virtqueues).
> >
> > An analysis of VirtIO and Argo, informing the design:
> >
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Argo+as+a+transport+medium+for+VirtIO
> >
> > * Argo can be used for a communication path for configuration between the
> > backend
> >   and the toolstack, avoiding the need for a dependency on XenStore,
> which
> > is an
> >   advantage for any hypervisor-agnostic design. It is also amenable to a
> > notification
> >   mechanism that is not based on Xen event channels.
> >
> > * Argo does not use or require shared memory between VMs and provides an
> > alternative
> >   to the use of foreign shared memory mappings. It avoids some of the
> > complexities
> >   involved with using grants (eg. XSA-300).
> >
> > * Argo supports Mandatory Access Control by the hypervisor, satisfying a
> > common
> >   certification requirement.
> >
> > * The Argo headers are BSD-licensed and the Xen hypervisor implementation
> > is GPLv2 but
> >   accessible via the hypercall interface. The licensing should not
> present
> > an obstacle
> >   to adoption of Argo in guest software or implementation by other
> > hypervisors.
> >
> > * Since the interface that Argo presents to a guest VM is similar to
> DMA, a
> > VirtIO-Argo
> >   frontend transport driver should be able to operate with a physical
> > VirtIO-enabled
> >   smart-NIC if the toolstack and an Argo-aware backend provide support.
> >
> > The next Xen Community Call is next week and I would be happy to answer
> > questions
> > about Argo and on this topic. I will also be following this thread.
> >
> > Christopher
> > (Argo maintainer, Xen Community)
> >
> >
> --------------------------------------------------------------------------------
> > [1]
> > An introduction to Argo:
> >
> https://static.sched.com/hosted_files/xensummit19/92/Argo%20and%20HMX%20-%20OpenXT%20-%20Christopher%20Clark%20-%20Xen%20Summit%202019.pdf
> > https://www.youtube.com/watch?v=cnC0Tg3jqJQ
> > Xen Wiki page for Argo:
> >
> https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_for_Xen
> >
> > [2]
> > OpenXT Linux Argo driver and userspace library:
> > https://github.com/openxt/linux-xen-argo
> >
> > Windows V4V at OpenXT wiki:
> > https://openxt.atlassian.net/wiki/spaces/DC/pages/14844007/V4V
> > Windows v4v driver source:
> > https://github.com/OpenXT/xc-windows/tree/master/xenv4v
> >
> > HP/Bromium uXen V4V driver:
> > https://github.com/uxen-virt/uxen/tree/ascara/windows/uxenv4vlib
> >
> > [3]
> > v2 of the Argo test unikernel for XTF:
> >
> https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg02234.html
> >
> > [4]
> > Argo HMX Transport for VirtIO meeting minutes:
> >
> https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html
> >
> > VirtIO-Argo Development wiki page:
> >
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1
> >
>
>

[-- Attachment #2: Type: text/html, Size: 17467 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
  2021-09-06  2:23           ` AKASHI Takahiro
@ 2021-09-07  2:41               ` Christopher Clark
  2021-09-13 23:51             ` Stefano Stabellini
  1 sibling, 0 replies; 66+ messages in thread
From: Christopher Clark @ 2021-09-07  2:41 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Alex Benn??e, Wei Chen, Paul Durrant, Stratos Mailing List,
	virtio-dev, Stefano Stabellini, Jan Kiszka, Arnd Bergmann,
	Juergen Gross, Julien Grall, Carl van Schaik, Bertrand Marquis,
	Stefan Hajnoczi, Artem Mygaiev, Xen-devel, Oleksandr Tyshchenko,
	Oleksandr Tyshchenko, Elena Afanasova, James McKenzie,
	Andrew Cooper, Rich Persaud, Daniel Smith, Jason Andryuk,
	eric chanudet, Roger Pau Monné

[-- Attachment #1: Type: text/plain, Size: 6131 bytes --]

On Sun, Sep 5, 2021 at 7:24 PM AKASHI Takahiro via Stratos-dev <
stratos-dev@op-lists.linaro.org> wrote:

> Alex,
>
> On Fri, Sep 03, 2021 at 10:28:06AM +0100, Alex Benn??e wrote:
> >
> > AKASHI Takahiro <takahiro.akashi@linaro.org> writes:
> >
> > > Alex,
> > >
> > > On Wed, Sep 01, 2021 at 01:53:34PM +0100, Alex Benn??e wrote:
> > >>
> > >> Stefan Hajnoczi <stefanha@redhat.com> writes:
> > >>
> > >> > [[PGP Signed Part:Undecided]]
> > >> > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > >> >> > Could we consider the kernel internally converting IOREQ
> messages from
> > >> >> > the Xen hypervisor to eventfd events? Would this scale with
> other kernel
> > >> >> > hypercall interfaces?
> > >> >> >
> > >> >> > So any thoughts on what directions are worth experimenting with?
> > >> >>
> > >> >> One option we should consider is for each backend to connect to
> Xen via
> > >> >> the IOREQ interface. We could generalize the IOREQ interface and
> make it
> > >> >> hypervisor agnostic. The interface is really trivial and easy to
> add.
> > >> >> The only Xen-specific part is the notification mechanism, which is
> an
> > >> >> event channel. If we replaced the event channel with something
> else the
> > >> >> interface would be generic. See:
> > >> >>
> https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > >> >
> > >> > There have been experiments with something kind of similar in KVM
> > >> > recently (see struct ioregionfd_cmd):
> > >> >
> https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanasova@gmail.com/
> > >>
> > >> Reading the cover letter was very useful in showing how this provides
> a
> > >> separate channel for signalling IO events to userspace instead of
> using
> > >> the normal type-2 vmexit type event. I wonder how deeply tied the
> > >> userspace facing side of this is to KVM? Could it provide a common FD
> > >> type interface to IOREQ?
> > >
> > > Why do you stick to a "FD" type interface?
> >
> > I mean most user space interfaces on POSIX start with a file descriptor
> > and the usual read/write semantics or a series of ioctls.
>
> Who do you assume is responsible for implementing this kind of
> fd semantics, OSs on BE or hypervisor itself?
>
> I think such interfaces can only be easily implemented on type-2
> hypervisors.
>
> # In this sense, I don't think rust-vmm, as it is, cannot be
> # a general solution.
>
> > >> As I understand IOREQ this is currently a direct communication between
> > >> userspace and the hypervisor using the existing Xen message bus. My
> > >
> > > With IOREQ server, IO event occurrences are notified to BE via Xen's
> event
> > > channel, while the actual contexts of IO events (see struct ioreq in
> ioreq.h)
> > > are put in a queue on a single shared memory page which is to be
> assigned
> > > beforehand with xenforeignmemory_map_resource hypervisor call.
> >
> > If we abstracted the IOREQ via the kernel interface you would probably
> > just want to put the ioreq structure on a queue rather than expose the
> > shared page to userspace.
>
> Where is that queue?
>
> > >> worry would be that by adding knowledge of what the underlying
> > >> hypervisor is we'd end up with excess complexity in the kernel. For
> one
> > >> thing we certainly wouldn't want an API version dependency on the
> kernel
> > >> to understand which version of the Xen hypervisor it was running on.
> > >
> > > That's exactly what virtio-proxy in my proposal[1] does; All the
> hypervisor-
> > > specific details of IO event handlings are contained in virtio-proxy
> > > and virtio BE will communicate with virtio-proxy through a virtqueue
> > > (yes, virtio-proxy is seen as yet another virtio device on BE) and will
> > > get IO event-related *RPC* callbacks, either MMIO read or write, from
> > > virtio-proxy.
> > >
> > > See page 8 (protocol flow) and 10 (interfaces) in [1].
> >
> > There are two areas of concern with the proxy approach at the moment.
> > The first is how the bootstrap of the virtio-proxy channel happens and
>
> As I said, from BE point of view, virtio-proxy would be seen
> as yet another virtio device by which BE could talk to "virtio
> proxy" vm or whatever else.
>
> This way we guarantee BE's hypervisor-agnosticism instead of having
> "common" hypervisor interfaces. That is the base of my idea.
>
> > the second is how many context switches are involved in a transaction.
> > Of course with all things there is a trade off. Things involving the
> > very tightest latency would probably opt for a bare metal backend which
> > I think would imply hypervisor knowledge in the backend binary.
>
> In configuration phase of virtio device, the latency won't be a big matter.
> In device operations (i.e. read/write to block devices), if we can
> resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue
> is
> how efficiently we can deliver notification to the opposite side. Right?
> And this is a very common problem whatever approach we would take.
>
> Anyhow, if we do care the latency in my approach, most of virtio-proxy-
> related code can be re-implemented just as a stub (or shim?) library
> since the protocols are defined as RPCs.
> In this case, however, we would lose the benefit of providing "single
> binary"
> BE.
> (I know this is is an arguable requirement, though.)
>
> # Would we better discuss what "hypervisor-agnosticism" means?
>

Is there a call that you could recommend that we join to discuss this and
the topics of this thread?
There is definitely interest in pursuing a new interface for Argo that can
be implemented in other hypervisors and enable guest binary portability
between them, at least on the same hardware architecture, with VirtIO
transport as a primary use case.

The notes from the Xen Summit Design Session on VirtIO Cross-Project BoF
for Xen and Guest OS, which include context about the several separate
approaches to VirtIO on Xen, have now been posted here:
https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg00472.html

Christopher



> -Takahiro Akashi
>
>
>

[-- Attachment #2: Type: text/html, Size: 8183 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-07  0:57                               ` [virtio-dev] " Christopher Clark
  (?)
@ 2021-09-07 11:55                               ` AKASHI Takahiro
  2021-09-07 18:09                                   ` [virtio-dev] " Christopher Clark
  -1 siblings, 1 reply; 66+ messages in thread
From: AKASHI Takahiro @ 2021-09-07 11:55 UTC (permalink / raw)
  To: Christopher Clark
  Cc: Wei Chen, Oleksandr Tyshchenko, Stefano Stabellini, Alex Benn??e,
	Kaly Xin, Stratos Mailing List, virtio-dev, Arnd Bergmann,
	Viresh Kumar, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Oleksandr Tyshchenko,
	Bertrand Marquis, Artem Mygaiev, Julien Grall, Juergen Gross,
	Paul Durrant, Xen Devel, Rich Persaud, Daniel Smith,
	James McKenzie, Andrew Cooper

Hi,

I have not covered all your comments below yet.
So just one comment:

On Mon, Sep 06, 2021 at 05:57:43PM -0700, Christopher Clark wrote:
> On Thu, Sep 2, 2021 at 12:19 AM AKASHI Takahiro <takahiro.akashi@linaro.org>
> wrote:

(snip)

> >    It appears that, on FE side, at least three hypervisor calls (and data
> >    copying) need to be invoked at every request, right?
> >
> 
> For a write, counting FE sendv ops:
> 1: the write data payload is sent via the "Argo ring for writes"
> 2: the descriptor is sent via a sync of the available/descriptor ring
>   -- is there a third one that I am missing?

In the picture, I can see
a) Data transmitted by Argo sendv
b) Descriptor written after data sendv
c) VirtIO ring sync'd to back-end via separate sendv

Oops, (b) is not a hypervisor call, is it?
(But I guess that you will have to have yet another call for notification
since there is no config register of QueueNotify?)

Thanks,
-Takahiro Akashi


> Christopher
> 
> 
> >
> > Thanks,
> > -Takahiro Akashi
> >
> >
> > > * Here are the design documents for building VirtIO-over-Argo, to
> > support a
> > >   hypervisor-agnostic frontend VirtIO transport driver using Argo.
> > >
> > > The Development Plan to build VirtIO virtual device support over Argo
> > > transport:
> > >
> > https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1
> > >
> > > A design for using VirtIO over Argo, describing how VirtIO data
> > structures
> > > and communication is handled over the Argo transport:
> > > https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo
> > >
> > > Diagram (from the above document) showing how VirtIO rings are
> > synchronized
> > > between domains without using shared memory:
> > >
> > https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob-url=true&id=01f7d0e1-7686-4f0b-88e1-457c1d30df40&collection=contentId-1348763698&contextId=1348763698&mimeType=image%2Fpng&name=device-buffer-access-virtio-argo.png&size=243175&width=1106&height=1241
> > >
> > > Please note that the above design documents show that the existing VirtIO
> > > device drivers, and both vring and virtqueue data structures can be
> > > preserved
> > > while interdomain communication can be performed with no shared memory
> > > required
> > > for most drivers; (the exceptions where further design is required are
> > those
> > > such as virtual framebuffer devices where shared memory regions are
> > > intentionally
> > > added to the communication structure beyond the vrings and virtqueues).
> > >
> > > An analysis of VirtIO and Argo, informing the design:
> > >
> > https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Argo+as+a+transport+medium+for+VirtIO
> > >
> > > * Argo can be used for a communication path for configuration between the
> > > backend
> > >   and the toolstack, avoiding the need for a dependency on XenStore,
> > which
> > > is an
> > >   advantage for any hypervisor-agnostic design. It is also amenable to a
> > > notification
> > >   mechanism that is not based on Xen event channels.
> > >
> > > * Argo does not use or require shared memory between VMs and provides an
> > > alternative
> > >   to the use of foreign shared memory mappings. It avoids some of the
> > > complexities
> > >   involved with using grants (eg. XSA-300).
> > >
> > > * Argo supports Mandatory Access Control by the hypervisor, satisfying a
> > > common
> > >   certification requirement.
> > >
> > > * The Argo headers are BSD-licensed and the Xen hypervisor implementation
> > > is GPLv2 but
> > >   accessible via the hypercall interface. The licensing should not
> > present
> > > an obstacle
> > >   to adoption of Argo in guest software or implementation by other
> > > hypervisors.
> > >
> > > * Since the interface that Argo presents to a guest VM is similar to
> > DMA, a
> > > VirtIO-Argo
> > >   frontend transport driver should be able to operate with a physical
> > > VirtIO-enabled
> > >   smart-NIC if the toolstack and an Argo-aware backend provide support.
> > >
> > > The next Xen Community Call is next week and I would be happy to answer
> > > questions
> > > about Argo and on this topic. I will also be following this thread.
> > >
> > > Christopher
> > > (Argo maintainer, Xen Community)
> > >
> > >
> > --------------------------------------------------------------------------------
> > > [1]
> > > An introduction to Argo:
> > >
> > https://static.sched.com/hosted_files/xensummit19/92/Argo%20and%20HMX%20-%20OpenXT%20-%20Christopher%20Clark%20-%20Xen%20Summit%202019.pdf
> > > https://www.youtube.com/watch?v=cnC0Tg3jqJQ
> > > Xen Wiki page for Argo:
> > >
> > https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_for_Xen
> > >
> > > [2]
> > > OpenXT Linux Argo driver and userspace library:
> > > https://github.com/openxt/linux-xen-argo
> > >
> > > Windows V4V at OpenXT wiki:
> > > https://openxt.atlassian.net/wiki/spaces/DC/pages/14844007/V4V
> > > Windows v4v driver source:
> > > https://github.com/OpenXT/xc-windows/tree/master/xenv4v
> > >
> > > HP/Bromium uXen V4V driver:
> > > https://github.com/uxen-virt/uxen/tree/ascara/windows/uxenv4vlib
> > >
> > > [3]
> > > v2 of the Argo test unikernel for XTF:
> > >
> > https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg02234.html
> > >
> > > [4]
> > > Argo HMX Transport for VirtIO meeting minutes:
> > >
> > https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html
> > >
> > > VirtIO-Argo Development wiki page:
> > >
> > https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1
> > >
> >
> >


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-07 11:55                               ` AKASHI Takahiro
@ 2021-09-07 18:09                                   ` Christopher Clark
  0 siblings, 0 replies; 66+ messages in thread
From: Christopher Clark @ 2021-09-07 18:09 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Wei Chen, Oleksandr Tyshchenko, Stefano Stabellini, Alex Benn??e,
	Kaly Xin, Stratos Mailing List, virtio-dev, Arnd Bergmann,
	Viresh Kumar, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Oleksandr Tyshchenko,
	Bertrand Marquis, Artem Mygaiev, Julien Grall, Juergen Gross,
	Paul Durrant, Xen Devel, Rich Persaud, Daniel Smith,
	James McKenzie, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 7922 bytes --]

On Tue, Sep 7, 2021 at 4:55 AM AKASHI Takahiro <takahiro.akashi@linaro.org>
wrote:

> Hi,
>
> I have not covered all your comments below yet.
> So just one comment:
>
> On Mon, Sep 06, 2021 at 05:57:43PM -0700, Christopher Clark wrote:
> > On Thu, Sep 2, 2021 at 12:19 AM AKASHI Takahiro <
> takahiro.akashi@linaro.org>
> > wrote:
>
> (snip)
>
> > >    It appears that, on FE side, at least three hypervisor calls (and
> data
> > >    copying) need to be invoked at every request, right?
> > >
> >
> > For a write, counting FE sendv ops:
> > 1: the write data payload is sent via the "Argo ring for writes"
> > 2: the descriptor is sent via a sync of the available/descriptor ring
> >   -- is there a third one that I am missing?
>
> In the picture, I can see
> a) Data transmitted by Argo sendv
> b) Descriptor written after data sendv
> c) VirtIO ring sync'd to back-end via separate sendv
>
> Oops, (b) is not a hypervisor call, is it?
>

That's correct, it is not. The blue arrows in the diagram are not
hypercalls; they are intended to show data movement or action in the flow
of performing the operation, and (b) is a data write within the guest's
address space into the descriptor ring.



> (But I guess that you will have to have yet another call for notification
> since there is no config register of QueueNotify?)
>

Reasoning about hypercalls necessary for data movement:

VirtIO transport drivers are responsible for instantiating virtqueues
(setup_vq) and are able to populate the notify function pointer in the
virtqueue that they supply. The virtio-argo transport driver can provide a
suitable notify function implementation that will issue the Argo sendv
hypercall(s) for sending data from the guest frontend to the backend. By
issuing the sendv at the time of the queuenotify, rather than as each
buffer is added to the virtqueue, the cost of the sendv hypercall can be
amortized over multiple buffer additions to the virtqueue.
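
As a rough sketch of that kick path (argo_sendv_sync_ring() below is a
placeholder for the code that actually issues the sendv hypercall(s), and
VIRTIO_ARGO_VRING_ALIGN is likewise illustrative, not an existing API):

#include <linux/virtio.h>
#include <linux/virtio_ring.h>

/* Placeholder: issue the Argo sendv hypercall(s) that push the updated
 * available/descriptor ring state (and any pending data payloads) for
 * virtqueue 'index' of this device to the backend. */
int argo_sendv_sync_ring(struct virtio_device *vdev, unsigned int index);

/* The virtqueue "kick": called by virtqueue_notify() through the function
 * pointer that the transport supplied at queue creation time. */
static bool virtio_argo_notify(struct virtqueue *vq)
{
        /* One sendv at kick time covers every buffer added since the
         * last kick, amortizing the hypercall cost over the batch. */
        return argo_sendv_sync_ring(vq->vdev, vq->index) == 0;
}

/* Installed when the transport creates the queue, eg.:
 *
 *      vq = vring_create_virtqueue(index, num, VIRTIO_ARGO_VRING_ALIGN,
 *                                  vdev, true, true, ctx,
 *                                  virtio_argo_notify, callback, name);
 */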

I also understand that there has been some recent work in the Linaro
Project Stratos on "Fat Virtqueues", where the data to be transmitted is
included within an expanded virtqueue, which could further reduce the
number of hypercalls required, since the data can be transmitted inline
with the descriptors.
Reference here:
https://linaro.atlassian.net/wiki/spaces/STR/pages/25626313982/2021-01-21+Project+Stratos+Sync+Meeting+notes
https://linaro.atlassian.net/browse/STR-25

As a result of the above, I think that a single hypercall could be
sufficient for communicating data for multiple requests, and that a
two-hypercall-per-request (worst case) upper bound could also be
established.

Christopher



>
> Thanks,
> -Takahiro Akashi
>
>
> > Christopher
> >
> >
> > >
> > > Thanks,
> > > -Takahiro Akashi
> > >
> > >
> > > > * Here are the design documents for building VirtIO-over-Argo, to
> > > support a
> > > >   hypervisor-agnostic frontend VirtIO transport driver using Argo.
> > > >
> > > > The Development Plan to build VirtIO virtual device support over Argo
> > > > transport:
> > > >
> > >
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1
> > > >
> > > > A design for using VirtIO over Argo, describing how VirtIO data
> > > structures
> > > > and communication is handled over the Argo transport:
> > > >
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo
> > > >
> > > > Diagram (from the above document) showing how VirtIO rings are
> > > synchronized
> > > > between domains without using shared memory:
> > > >
> > >
> https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob-url=true&id=01f7d0e1-7686-4f0b-88e1-457c1d30df40&collection=contentId-1348763698&contextId=1348763698&mimeType=image%2Fpng&name=device-buffer-access-virtio-argo.png&size=243175&width=1106&height=1241
> > > >
> > > > Please note that the above design documents show that the existing
> VirtIO
> > > > device drivers, and both vring and virtqueue data structures can be
> > > > preserved
> > > > while interdomain communication can be performed with no shared
> memory
> > > > required
> > > > for most drivers; (the exceptions where further design is required
> are
> > > those
> > > > such as virtual framebuffer devices where shared memory regions are
> > > > intentionally
> > > > added to the communication structure beyond the vrings and
> virtqueues).
> > > >
> > > > An analysis of VirtIO and Argo, informing the design:
> > > >
> > >
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Argo+as+a+transport+medium+for+VirtIO
> > > >
> > > > * Argo can be used for a communication path for configuration
> between the
> > > > backend
> > > >   and the toolstack, avoiding the need for a dependency on XenStore,
> > > which
> > > > is an
> > > >   advantage for any hypervisor-agnostic design. It is also amenable
> to a
> > > > notification
> > > >   mechanism that is not based on Xen event channels.
> > > >
> > > > * Argo does not use or require shared memory between VMs and
> provides an
> > > > alternative
> > > >   to the use of foreign shared memory mappings. It avoids some of the
> > > > complexities
> > > >   involved with using grants (eg. XSA-300).
> > > >
> > > > * Argo supports Mandatory Access Control by the hypervisor,
> satisfying a
> > > > common
> > > >   certification requirement.
> > > >
> > > > * The Argo headers are BSD-licensed and the Xen hypervisor
> implementation
> > > > is GPLv2 but
> > > >   accessible via the hypercall interface. The licensing should not
> > > present
> > > > an obstacle
> > > >   to adoption of Argo in guest software or implementation by other
> > > > hypervisors.
> > > >
> > > > * Since the interface that Argo presents to a guest VM is similar to
> > > DMA, a
> > > > VirtIO-Argo
> > > >   frontend transport driver should be able to operate with a physical
> > > > VirtIO-enabled
> > > >   smart-NIC if the toolstack and an Argo-aware backend provide
> support.
> > > >
> > > > The next Xen Community Call is next week and I would be happy to
> answer
> > > > questions
> > > > about Argo and on this topic. I will also be following this thread.
> > > >
> > > > Christopher
> > > > (Argo maintainer, Xen Community)
> > > >
> > > >
> > >
> --------------------------------------------------------------------------------
> > > > [1]
> > > > An introduction to Argo:
> > > >
> > >
> https://static.sched.com/hosted_files/xensummit19/92/Argo%20and%20HMX%20-%20OpenXT%20-%20Christopher%20Clark%20-%20Xen%20Summit%202019.pdf
> > > > https://www.youtube.com/watch?v=cnC0Tg3jqJQ
> > > > Xen Wiki page for Argo:
> > > >
> > >
> https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_for_Xen
> > > >
> > > > [2]
> > > > OpenXT Linux Argo driver and userspace library:
> > > > https://github.com/openxt/linux-xen-argo
> > > >
> > > > Windows V4V at OpenXT wiki:
> > > > https://openxt.atlassian.net/wiki/spaces/DC/pages/14844007/V4V
> > > > Windows v4v driver source:
> > > > https://github.com/OpenXT/xc-windows/tree/master/xenv4v
> > > >
> > > > HP/Bromium uXen V4V driver:
> > > > https://github.com/uxen-virt/uxen/tree/ascara/windows/uxenv4vlib
> > > >
> > > > [3]
> > > > v2 of the Argo test unikernel for XTF:
> > > >
> > >
> https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg02234.html
> > > >
> > > > [4]
> > > > Argo HMX Transport for VirtIO meeting minutes:
> > > >
> > >
> https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html
> > > >
> > > > VirtIO-Argo Development wiki page:
> > > >
> > >
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1
> > > >
> > >
> > >
>

[-- Attachment #2: Type: text/html, Size: 12365 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [virtio-dev] Re: Enabling hypervisor agnosticism for VirtIO backends
@ 2021-09-07 18:09                                   ` Christopher Clark
  0 siblings, 0 replies; 66+ messages in thread
From: Christopher Clark @ 2021-09-07 18:09 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Wei Chen, Oleksandr Tyshchenko, Stefano Stabellini, Alex Benn??e,
	Kaly Xin, Stratos Mailing List, virtio-dev, Arnd Bergmann,
	Viresh Kumar, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Oleksandr Tyshchenko,
	Bertrand Marquis, Artem Mygaiev, Julien Grall, Juergen Gross,
	Paul Durrant, Xen Devel, Rich Persaud, Daniel Smith,
	James McKenzie, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 7922 bytes --]

On Tue, Sep 7, 2021 at 4:55 AM AKASHI Takahiro <takahiro.akashi@linaro.org>
wrote:

> Hi,
>
> I have not covered all your comments below yet.
> So just one comment:
>
> On Mon, Sep 06, 2021 at 05:57:43PM -0700, Christopher Clark wrote:
> > On Thu, Sep 2, 2021 at 12:19 AM AKASHI Takahiro <
> takahiro.akashi@linaro.org>
> > wrote:
>
> (snip)
>
> > >    It appears that, on FE side, at least three hypervisor calls (and
> data
> > >    copying) need to be invoked at every request, right?
> > >
> >
> > For a write, counting FE sendv ops:
> > 1: the write data payload is sent via the "Argo ring for writes"
> > 2: the descriptor is sent via a sync of the available/descriptor ring
> >   -- is there a third one that I am missing?
>
> In the picture, I can see
> a) Data transmitted by Argo sendv
> b) Descriptor written after data sendv
> c) VirtIO ring sync'd to back-end via separate sendv
>
> Oops, (b) is not a hypervisor call, is it?
>

That's correct, it is not - the blue arrows in the diagram are not
hypercalls, they are intended to show data movement or action in the flow
of performing the operation, and (b) is a data write within the guest's
address space into the descriptor ring.



> (But I guess that you will have to have yet another call for notification
> since there is no config register of QueueNotify?)
>

Reasoning about hypercalls necessary for data movement:

VirtIO transport drivers are responsible for instantiating virtqueues
(setup_vq) and are able to populate the notify function pointer in the
virtqueue that they supply. The virtio-argo transport driver can provide a
suitable notify function implementation that will issue the Argo hypercall
sendv hypercall(s) for sending data from the guest frontend to the backend.
By issuing the sendv at the time of the queuenotify, rather than as each
buffer is added to the virtqueue, the cost of the sendv hypercall can be
amortized over multiple buffer additions to the virtqueue.

I also understand that there has been some recent work in the Linaro
Project Stratos on "Fat Virtqueues", where the data to be transmitted is
included within an expanded virtqueue, which could further reduce the
number of hypercalls required, since the data can be transmitted inline
with the descriptors.
Reference here:
https://linaro.atlassian.net/wiki/spaces/STR/pages/25626313982/2021-01-21+Project+Stratos+Sync+Meeting+notes
https://linaro.atlassian.net/browse/STR-25

As a result of the above, I think that a single hypercall could be
sufficient for communicating data for multiple requests, and that a
two-hypercall-per-request (worst case) upper bound could also be
established.

Christopher



>
> Thanks,
> -Takahiro Akashi
>
>
> > Christopher
> >
> >
> > >
> > > Thanks,
> > > -Takahiro Akashi
> > >
> > >
> > > > * Here are the design documents for building VirtIO-over-Argo, to
> > > support a
> > > >   hypervisor-agnostic frontend VirtIO transport driver using Argo.
> > > >
> > > > The Development Plan to build VirtIO virtual device support over Argo
> > > > transport:
> > > >
> > >
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1
> > > >
> > > > A design for using VirtIO over Argo, describing how VirtIO data
> > > structures
> > > > and communication is handled over the Argo transport:
> > > >
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo
> > > >
> > > > Diagram (from the above document) showing how VirtIO rings are
> > > synchronized
> > > > between domains without using shared memory:
> > > >
> > >
> https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob-url=true&id=01f7d0e1-7686-4f0b-88e1-457c1d30df40&collection=contentId-1348763698&contextId=1348763698&mimeType=image%2Fpng&name=device-buffer-access-virtio-argo.png&size=243175&width=1106&height=1241
> > > >
> > > > Please note that the above design documents show that the existing
> VirtIO
> > > > device drivers, and both vring and virtqueue data structures can be
> > > > preserved
> > > > while interdomain communication can be performed with no shared
> memory
> > > > required
> > > > for most drivers; (the exceptions where further design is required
> are
> > > those
> > > > such as virtual framebuffer devices where shared memory regions are
> > > > intentionally
> > > > added to the communication structure beyond the vrings and
> virtqueues).
> > > >
> > > > An analysis of VirtIO and Argo, informing the design:
> > > >
> > >
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Argo+as+a+transport+medium+for+VirtIO
> > > >
> > > > * Argo can be used for a communication path for configuration
> between the
> > > > backend
> > > >   and the toolstack, avoiding the need for a dependency on XenStore,
> > > which
> > > > is an
> > > >   advantage for any hypervisor-agnostic design. It is also amenable
> to a
> > > > notification
> > > >   mechanism that is not based on Xen event channels.
> > > >
> > > > * Argo does not use or require shared memory between VMs and
> provides an
> > > > alternative
> > > >   to the use of foreign shared memory mappings. It avoids some of the
> > > > complexities
> > > >   involved with using grants (eg. XSA-300).
> > > >
> > > > * Argo supports Mandatory Access Control by the hypervisor,
> satisfying a
> > > > common
> > > >   certification requirement.
> > > >
> > > > * The Argo headers are BSD-licensed and the Xen hypervisor
> implementation
> > > > is GPLv2 but
> > > >   accessible via the hypercall interface. The licensing should not
> > > present
> > > > an obstacle
> > > >   to adoption of Argo in guest software or implementation by other
> > > > hypervisors.
> > > >
> > > > * Since the interface that Argo presents to a guest VM is similar to
> > > DMA, a
> > > > VirtIO-Argo
> > > >   frontend transport driver should be able to operate with a physical
> > > > VirtIO-enabled
> > > >   smart-NIC if the toolstack and an Argo-aware backend provide
> support.
> > > >
> > > > The next Xen Community Call is next week and I would be happy to
> answer
> > > > questions
> > > > about Argo and on this topic. I will also be following this thread.
> > > >
> > > > Christopher
> > > > (Argo maintainer, Xen Community)
> > > >
> > > >
> > >
> --------------------------------------------------------------------------------
> > > > [1]
> > > > An introduction to Argo:
> > > >
> > >
> https://static.sched.com/hosted_files/xensummit19/92/Argo%20and%20HMX%20-%20OpenXT%20-%20Christopher%20Clark%20-%20Xen%20Summit%202019.pdf
> > > > https://www.youtube.com/watch?v=cnC0Tg3jqJQ
> > > > Xen Wiki page for Argo:
> > > >
> > >
> https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_for_Xen
> > > >
> > > > [2]
> > > > OpenXT Linux Argo driver and userspace library:
> > > > https://github.com/openxt/linux-xen-argo
> > > >
> > > > Windows V4V at OpenXT wiki:
> > > > https://openxt.atlassian.net/wiki/spaces/DC/pages/14844007/V4V
> > > > Windows v4v driver source:
> > > > https://github.com/OpenXT/xc-windows/tree/master/xenv4v
> > > >
> > > > HP/Bromium uXen V4V driver:
> > > > https://github.com/uxen-virt/uxen/tree/ascara/windows/uxenv4vlib
> > > >
> > > > [3]
> > > > v2 of the Argo test unikernel for XTF:
> > > >
> > >
> https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg02234.html
> > > >
> > > > [4]
> > > > Argo HMX Transport for VirtIO meeting minutes:
> > > >
> > >
> https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html
> > > >
> > > > VirtIO-Argo Development wiki page:
> > > >
> > >
> https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1
> > > >
> > >
> > >
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
  2021-09-07  2:41               ` [virtio-dev] " Christopher Clark
  (?)
@ 2021-09-10  2:50               ` AKASHI Takahiro
  -1 siblings, 0 replies; 66+ messages in thread
From: AKASHI Takahiro @ 2021-09-10  2:50 UTC (permalink / raw)
  To: Christopher Clark
  Cc: Alex Bennée, Wei Chen, Paul Durrant, Stratos Mailing List,
	virtio-dev, Stefano Stabellini, Jan Kiszka, Arnd Bergmann,
	Juergen Gross, Julien Grall, Carl van Schaik, Bertrand Marquis,
	Stefan Hajnoczi, Artem Mygaiev, Xen-devel, Oleksandr Tyshchenko,
	Oleksandr Tyshchenko, Elena Afanasova, James McKenzie,
	Andrew Cooper, Rich Persaud, Daniel Smith, Jason Andryuk,
	eric chanudet, Roger Pau Monné

On Mon, Sep 06, 2021 at 07:41:48PM -0700, Christopher Clark wrote:
> On Sun, Sep 5, 2021 at 7:24 PM AKASHI Takahiro via Stratos-dev <
> stratos-dev@op-lists.linaro.org> wrote:
> 
> > Alex,
> >
> > On Fri, Sep 03, 2021 at 10:28:06AM +0100, Alex Bennée wrote:
> > >
> > > AKASHI Takahiro <takahiro.akashi@linaro.org> writes:
> > >
> > > > Alex,
> > > >
> > > > On Wed, Sep 01, 2021 at 01:53:34PM +0100, Alex Bennée wrote:
> > > >>
> > > >> Stefan Hajnoczi <stefanha@redhat.com> writes:
> > > >>
> > > >> > [[PGP Signed Part:Undecided]]
> > > >> > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
> > > >> >> > Could we consider the kernel internally converting IOREQ
> > messages from
> > > >> >> > the Xen hypervisor to eventfd events? Would this scale with
> > other kernel
> > > >> >> > hypercall interfaces?
> > > >> >> >
> > > >> >> > So any thoughts on what directions are worth experimenting with?
> > > >> >>
> > > >> >> One option we should consider is for each backend to connect to
> > Xen via
> > > >> >> the IOREQ interface. We could generalize the IOREQ interface and
> > make it
> > > >> >> hypervisor agnostic. The interface is really trivial and easy to
> > add.
> > > >> >> The only Xen-specific part is the notification mechanism, which is
> > an
> > > >> >> event channel. If we replaced the event channel with something
> > else the
> > > >> >> interface would be generic. See:
> > > >> >>
> > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > >> >
> > > >> > There have been experiments with something kind of similar in KVM
> > > >> > recently (see struct ioregionfd_cmd):
> > > >> >
> > https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanasova@gmail.com/
> > > >>
> > > >> Reading the cover letter was very useful in showing how this provides
> > a
> > > >> separate channel for signalling IO events to userspace instead of
> > using
> > > >> the normal type-2 vmexit type event. I wonder how deeply tied the
> > > >> userspace facing side of this is to KVM? Could it provide a common FD
> > > >> type interface to IOREQ?
> > > >
> > > > Why do you stick to a "FD" type interface?
> > >
> > > I mean most user space interfaces on POSIX start with a file descriptor
> > > and the usual read/write semantics or a series of ioctls.
> >
> > Who do you assume is responsible for implementing this kind of
> > fd semantics, OSs on BE or hypervisor itself?
> >
> > I think such interfaces can only be easily implemented on type-2
> > hypervisors.
> >
> > # In this sense, I don't think rust-vmm, as it is, can be
> > # a general solution.
> >
> > > >> As I understand IOREQ this is currently a direct communication between
> > > >> userspace and the hypervisor using the existing Xen message bus. My
> > > >
> > > > With IOREQ server, IO event occurrences are notified to BE via Xen's
> > event
> > > > channel, while the actual contexts of IO events (see struct ioreq in
> > ioreq.h)
> > > > are put in a queue on a single shared memory page which is to be
> > assigned
> > > > beforehand with xenforeignmemory_map_resource hypervisor call.
> > >
> > > If we abstracted the IOREQ via the kernel interface you would probably
> > > just want to put the ioreq structure on a queue rather than expose the
> > > shared page to userspace.
> >
> > Where is that queue?
> >
> > > >> worry would be that by adding knowledge of what the underlying
> > > >> hypervisor is we'd end up with excess complexity in the kernel. For
> > one
> > > >> thing we certainly wouldn't want an API version dependency on the
> > kernel
> > > >> to understand which version of the Xen hypervisor it was running on.
> > > >
> > > > That's exactly what virtio-proxy in my proposal[1] does; All the
> > hypervisor-
> > > > specific details of IO event handlings are contained in virtio-proxy
> > > > and virtio BE will communicate with virtio-proxy through a virtqueue
> > > > (yes, virtio-proxy is seen as yet another virtio device on BE) and will
> > > > get IO event-related *RPC* callbacks, either MMIO read or write, from
> > > > virtio-proxy.
> > > >
> > > > See page 8 (protocol flow) and 10 (interfaces) in [1].
> > >
> > > There are two areas of concern with the proxy approach at the moment.
> > > The first is how the bootstrap of the virtio-proxy channel happens and
> >
> > As I said, from BE point of view, virtio-proxy would be seen
> > as yet another virtio device by which BE could talk to "virtio
> > proxy" vm or whatever else.
> >
> > This way we guarantee BE's hypervisor-agnosticism instead of having
> > "common" hypervisor interfaces. That is the base of my idea.
> >
> > > the second is how many context switches are involved in a transaction.
> > > Of course with all things there is a trade off. Things involving the
> > > very tightest latency would probably opt for a bare metal backend which
> > > I think would imply hypervisor knowledge in the backend binary.
> >
> > In configuration phase of virtio device, the latency won't be a big matter.
> > In device operations (i.e. read/write to block devices), if we can
> > resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue
> > is
> > how efficiently we can deliver notification to the opposite side. Right?
> > And this is a very common problem whatever approach we would take.
> >
> > Anyhow, if we do care the latency in my approach, most of virtio-proxy-
> > related code can be re-implemented just as a stub (or shim?) library
> > since the protocols are defined as RPCs.
> > In this case, however, we would lose the benefit of providing "single
> > binary"
> > BE.
> > (I know this is is an arguable requirement, though.)
> >
> > # Would we better discuss what "hypervisor-agnosticism" means?
> >
> Is there a call that you could recommend that we join to discuss this and
> the topics of this thread?

Stratos call?
Alex should have more to say.

-Takahiro Akashi


> There is definitely interest in pursuing a new interface for Argo that can
> be implemented in other hypervisors and enable guest binary portability
> between them, at least on the same hardware architecture, with VirtIO
> transport as a primary use case.
> 
> The notes from the Xen Summit Design Session on VirtIO Cross-Project BoF
> for Xen and Guest OS, which include context about the several separate
> approaches to VirtIO on Xen, have now been posted here:
> https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg00472.html
> 
> Christopher
> 
> 
> 
> > -Takahiro Akashi
> >
> >
> >


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-07 18:09                                   ` [virtio-dev] " Christopher Clark
  (?)
@ 2021-09-10  3:12                                   ` AKASHI Takahiro
  -1 siblings, 0 replies; 66+ messages in thread
From: AKASHI Takahiro @ 2021-09-10  3:12 UTC (permalink / raw)
  To: Christopher Clark
  Cc: Wei Chen, Oleksandr Tyshchenko, Stefano Stabellini, Alex Bennée,
	Kaly Xin, Stratos Mailing List, virtio-dev, Arnd Bergmann,
	Viresh Kumar, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Oleksandr Tyshchenko,
	Bertrand Marquis, Artem Mygaiev, Julien Grall, Juergen Gross,
	Paul Durrant, Xen Devel, Rich Persaud, Daniel Smith,
	James McKenzie, Andrew Cooper

Hi Christopher,

On Tue, Sep 07, 2021 at 11:09:34AM -0700, Christopher Clark wrote:
> On Tue, Sep 7, 2021 at 4:55 AM AKASHI Takahiro <takahiro.akashi@linaro.org>
> wrote:
> 
> > Hi,
> >
> > I have not covered all your comments below yet.
> > So just one comment:
> >
> > On Mon, Sep 06, 2021 at 05:57:43PM -0700, Christopher Clark wrote:
> > > On Thu, Sep 2, 2021 at 12:19 AM AKASHI Takahiro <
> > takahiro.akashi@linaro.org>
> > > wrote:
> >
> > (snip)
> >
> > > >    It appears that, on FE side, at least three hypervisor calls (and
> > data
> > > >    copying) need to be invoked at every request, right?
> > > >
> > >
> > > For a write, counting FE sendv ops:
> > > 1: the write data payload is sent via the "Argo ring for writes"
> > > 2: the descriptor is sent via a sync of the available/descriptor ring
> > >   -- is there a third one that I am missing?
> >
> > In the picture, I can see
> > a) Data transmitted by Argo sendv
> > b) Descriptor written after data sendv
> > c) VirtIO ring sync'd to back-end via separate sendv
> >
> > Oops, (b) is not a hypervisor call, is it?
> >
> 
> That's correct, it is not - the blue arrows in the diagram are not
> hypercalls, they are intended to show data movement or action in the flow
> of performing the operation, and (b) is a data write within the guest's
> address space into the descriptor ring.
> 
> 
> 
> > (But I guess that you will have to have yet another call for notification
> > since there is no config register of QueueNotify?)
> >
> 
> Reasoning about hypercalls necessary for data movement:
> 
> VirtIO transport drivers are responsible for instantiating virtqueues
> (setup_vq) and are able to populate the notify function pointer in the
> virtqueue that they supply. The virtio-argo transport driver can provide a
> suitable notify function implementation that will issue the Argo hypercall
> sendv hypercall(s) for sending data from the guest frontend to the backend.
> By issuing the sendv at the time of the queuenotify, rather than as each
> buffer is added to the virtqueue, the cost of the sendv hypercall can be
> amortized over multiple buffer additions to the virtqueue.
> 
> I also understand that there has been some recent work in the Linaro
> Project Stratos on "Fat Virtqueues", where the data to be transmitted is
> included within an expanded virtqueue, which could further reduce the
> number of hypercalls required, since the data can be transmitted inline
> with the descriptors.
> Reference here:
> https://linaro.atlassian.net/wiki/spaces/STR/pages/25626313982/2021-01-21+Project+Stratos+Sync+Meeting+notes
> https://linaro.atlassian.net/browse/STR-25

Ah, yes. Obviously, "fatvirtqueue" has pros and cons.
One of the cons is that it won't be suitable for larger payloads,
given the limited space in the descriptors.

> As a result of the above, I think that a single hypercall could be
> sufficient for communicating data for multiple requests, and that a
> two-hypercall-per-request (worst case) upper bound could also be
> established.

When it comes to the payload or data plane, "fatvirtqueue" as well as
Argo relies on copying. You dub it "DMA operations".
A similar approach can also be seen in virtio-over-ivshmem, where
a limited amount of memory is shared and the FE allocates some space
in this buffer and copies the payload into it. Those allocations are
done via the dma_ops of the virtio_ivshmem driver. The BE, on the other
hand, fetches the data from the shared memory by using the "offset"
described in a descriptor.
The shared memory is divided into a couple of different groups;
one is read/write for all, the others have one writer and many readers.
(I hope I'm right here :)
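
To illustrate the bounce-buffer pattern I am describing, here is a rough
C sketch. The names and the trivial bump allocator are mine, not the
actual virtio-ivshmem dma_ops implementation.

  /*
   * Rough sketch only: the FE copies the payload into a shared window and
   * publishes its offset in a descriptor; the BE reads it back using that
   * offset. Not the real virtio-ivshmem code.
   */
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  struct shm_region {
      uint8_t *base;       /* shared window mapped by both FE and BE */
      size_t   size;
      size_t   next_free;  /* trivial bump allocator, FE-writable part */
  };

  struct desc {
      uint64_t offset;     /* payload location relative to the window */
      uint32_t len;
  };

  /* FE side: the "DMA" is really a copy into the shared window. */
  static int fe_publish(struct shm_region *shm, struct desc *d,
                        const void *payload, uint32_t len)
  {
      if (shm->next_free + len > shm->size)
          return -1;                       /* window exhausted */
      memcpy(shm->base + shm->next_free, payload, len);
      d->offset = shm->next_free;
      d->len = len;
      shm->next_free += len;
      return 0;
  }

  /* BE side: fetch the payload using only the descriptor's offset. */
  static const void *be_fetch(const struct shm_region *shm,
                              const struct desc *d)
  {
      return shm->base + d->offset;
  }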

That looks close to Argo, doesn't it? What differs is who is responsible
for copying the data: the kernel or the hypervisor.
(Yeah, I know that Argo has more crucial aspects, like access control.)

In this sense, ivshmem can also be a candidate for a hypervisor-agnostic
framework. Jailhouse doesn't say so explicitly, AFAIK.
Jan may have more to say.

Thanks,
-Takahiro Akashi


> Christopher
> 
> 
> 
> >
> > Thanks,
> > -Takahiro Akashi
> >
> >
> > > Christopher
> > >
> > >
> > > >
> > > > Thanks,
> > > > -Takahiro Akashi
> > > >
> > > >
> > > > > * Here are the design documents for building VirtIO-over-Argo, to
> > > > support a
> > > > >   hypervisor-agnostic frontend VirtIO transport driver using Argo.
> > > > >
> > > > > The Development Plan to build VirtIO virtual device support over Argo
> > > > > transport:
> > > > >
> > > >
> > https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1
> > > > >
> > > > > A design for using VirtIO over Argo, describing how VirtIO data
> > > > structures
> > > > > and communication is handled over the Argo transport:
> > > > >
> > https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo
> > > > >
> > > > > Diagram (from the above document) showing how VirtIO rings are
> > > > synchronized
> > > > > between domains without using shared memory:
> > > > >
> > > >
> > https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob-url=true&id=01f7d0e1-7686-4f0b-88e1-457c1d30df40&collection=contentId-1348763698&contextId=1348763698&mimeType=image%2Fpng&name=device-buffer-access-virtio-argo.png&size=243175&width=1106&height=1241
> > > > >
> > > > > Please note that the above design documents show that the existing
> > VirtIO
> > > > > device drivers, and both vring and virtqueue data structures can be
> > > > > preserved
> > > > > while interdomain communication can be performed with no shared
> > memory
> > > > > required
> > > > > for most drivers; (the exceptions where further design is required
> > are
> > > > those
> > > > > such as virtual framebuffer devices where shared memory regions are
> > > > > intentionally
> > > > > added to the communication structure beyond the vrings and
> > virtqueues).
> > > > >
> > > > > An analysis of VirtIO and Argo, informing the design:
> > > > >
> > > >
> > https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Argo+as+a+transport+medium+for+VirtIO
> > > > >
> > > > > * Argo can be used for a communication path for configuration
> > between the
> > > > > backend
> > > > >   and the toolstack, avoiding the need for a dependency on XenStore,
> > > > which
> > > > > is an
> > > > >   advantage for any hypervisor-agnostic design. It is also amenable
> > to a
> > > > > notification
> > > > >   mechanism that is not based on Xen event channels.
> > > > >
> > > > > * Argo does not use or require shared memory between VMs and
> > provides an
> > > > > alternative
> > > > >   to the use of foreign shared memory mappings. It avoids some of the
> > > > > complexities
> > > > >   involved with using grants (eg. XSA-300).
> > > > >
> > > > > * Argo supports Mandatory Access Control by the hypervisor,
> > satisfying a
> > > > > common
> > > > >   certification requirement.
> > > > >
> > > > > * The Argo headers are BSD-licensed and the Xen hypervisor
> > implementation
> > > > > is GPLv2 but
> > > > >   accessible via the hypercall interface. The licensing should not
> > > > present
> > > > > an obstacle
> > > > >   to adoption of Argo in guest software or implementation by other
> > > > > hypervisors.
> > > > >
> > > > > * Since the interface that Argo presents to a guest VM is similar to
> > > > DMA, a
> > > > > VirtIO-Argo
> > > > >   frontend transport driver should be able to operate with a physical
> > > > > VirtIO-enabled
> > > > >   smart-NIC if the toolstack and an Argo-aware backend provide
> > support.
> > > > >
> > > > > The next Xen Community Call is next week and I would be happy to
> > answer
> > > > > questions
> > > > > about Argo and on this topic. I will also be following this thread.
> > > > >
> > > > > Christopher
> > > > > (Argo maintainer, Xen Community)
> > > > >
> > > > >
> > > >
> > --------------------------------------------------------------------------------
> > > > > [1]
> > > > > An introduction to Argo:
> > > > >
> > > >
> > https://static.sched.com/hosted_files/xensummit19/92/Argo%20and%20HMX%20-%20OpenXT%20-%20Christopher%20Clark%20-%20Xen%20Summit%202019.pdf
> > > > > https://www.youtube.com/watch?v=cnC0Tg3jqJQ
> > > > > Xen Wiki page for Argo:
> > > > >
> > > >
> > https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_for_Xen
> > > > >
> > > > > [2]
> > > > > OpenXT Linux Argo driver and userspace library:
> > > > > https://github.com/openxt/linux-xen-argo
> > > > >
> > > > > Windows V4V at OpenXT wiki:
> > > > > https://openxt.atlassian.net/wiki/spaces/DC/pages/14844007/V4V
> > > > > Windows v4v driver source:
> > > > > https://github.com/OpenXT/xc-windows/tree/master/xenv4v
> > > > >
> > > > > HP/Bromium uXen V4V driver:
> > > > > https://github.com/uxen-virt/uxen/tree/ascara/windows/uxenv4vlib
> > > > >
> > > > > [3]
> > > > > v2 of the Argo test unikernel for XTF:
> > > > >
> > > >
> > https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg02234.html
> > > > >
> > > > > [4]
> > > > > Argo HMX Transport for VirtIO meeting minutes:
> > > > >
> > > >
> > https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html
> > > > >
> > > > > VirtIO-Argo Development wiki page:
> > > > >
> > > >
> > https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Development+Phase+1
> > > > >
> > > >
> > > >
> >


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
  2021-09-07  2:41               ` [virtio-dev] " Christopher Clark
@ 2021-09-10  9:35                 ` Alex Bennée
  -1 siblings, 0 replies; 66+ messages in thread
From: Alex Bennée @ 2021-09-10  9:35 UTC (permalink / raw)
  To: Christopher Clark
  Cc: AKASHI Takahiro, Wei Chen, Paul Durrant, Stratos Mailing List,
	virtio-dev, Stefano Stabellini, Jan Kiszka, Arnd Bergmann,
	Juergen Gross, Julien Grall, Carl van Schaik, Bertrand Marquis,
	Stefan Hajnoczi, Artem Mygaiev, Xen-devel, Oleksandr Tyshchenko,
	Oleksandr Tyshchenko, Elena Afanasova, James McKenzie,
	Andrew Cooper, Rich Persaud, Daniel Smith, Jason Andryuk,
	eric chanudet, Roger Pau Monné


Christopher Clark <christopher.w.clark@gmail.com> writes:

> On Sun, Sep 5, 2021 at 7:24 PM AKASHI Takahiro via Stratos-dev <stratos-dev@op-lists.linaro.org> wrote:
>
>  Alex,
>
>  On Fri, Sep 03, 2021 at 10:28:06AM +0100, Alex Bennée wrote:
<snip>
>
>  In configuration phase of virtio device, the latency won't be a big matter.
>  In device operations (i.e. read/write to block devices), if we can
>  resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is
>  how efficiently we can deliver notification to the opposite side. Right?
>  And this is a very common problem whatever approach we would take.
>
>  Anyhow, if we do care the latency in my approach, most of virtio-proxy-
>  related code can be re-implemented just as a stub (or shim?) library
>  since the protocols are defined as RPCs.
>  In this case, however, we would lose the benefit of providing "single binary"
>  BE.
>  (I know this is is an arguable requirement, though.)

The proposal for a single binary would always require something to shim
between hypervisors. This is still an area of discussion. A compile-time
selectable approach is practically unavoidable for "bare metal" backends,
though, because there are no other processes or layers to which
communication with the hypervisor can be delegated.
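
As a strawman of what "compile time selectable" could look like: the
generic servicing code only sees a tiny ops table, and each hypervisor
supplies one implementation chosen at build time. The ops and CONFIG_*
names below are made up for illustration, with a stub fallback so the
sketch stands alone.

  #include <stdint.h>

  struct hyp_ops {
      int (*map_guest_mem)(uint64_t guest_pa, uint64_t len, void **va);
      int (*wait_for_kick)(uint32_t queue_id);    /* returns 0 on a kick */
      int (*notify_frontend)(uint32_t queue_id);  /* signal completion */
  };

  #if defined(CONFIG_BACKEND_XEN)
  extern const struct hyp_ops xen_hyp_ops;
  #define HYP_OPS (&xen_hyp_ops)
  #elif defined(CONFIG_BACKEND_KVM)
  extern const struct hyp_ops kvm_hyp_ops;
  #define HYP_OPS (&kvm_hyp_ops)
  #else
  /* Stub fallback so this sketch compiles on its own. */
  static int stub_map(uint64_t pa, uint64_t len, void **va)
  { (void)pa; (void)len; *va = 0; return -1; }
  static int stub_wait(uint32_t q)   { (void)q; return 1; } /* stop loop */
  static int stub_notify(uint32_t q) { (void)q; return 0; }
  static const struct hyp_ops stub_ops = { stub_map, stub_wait, stub_notify };
  #define HYP_OPS (&stub_ops)
  #endif

  /* Generic servicing loop: nothing hypervisor-specific beyond HYP_OPS. */
  static void backend_serve(uint32_t queue_id)
  {
      while (HYP_OPS->wait_for_kick(queue_id) == 0) {
          /* ... process available descriptors here ... */
          HYP_OPS->notify_frontend(queue_id);
      }
  }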

>
>  # Would we better discuss what "hypervisor-agnosticism" means?
>
> Is there a call that you could recommend that we join to discuss this and the topics of this thread?
> There is definitely interest in pursuing a new interface for Argo that can be implemented in other hypervisors and enable guest binary
> portability between them, at least on the same hardware architecture,
> with VirtIO transport as a primary use case.

There is indeed ;-)

We have a regular open call every two weeks for the Stratos project, which
you are welcome to attend. You can find the details on the project
overview page:

  https://linaro.atlassian.net/wiki/spaces/STR/overview

We regularly have teams from outside the project present their work as well.

> The notes from the Xen Summit Design Session on VirtIO Cross-Project BoF for Xen and Guest OS, which include context about the
> several separate approaches to VirtIO on Xen, have now been posted here:
> https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg00472.html

Thanks for the link - looks like a very detailed summary.

>
> Christopher
>
>  
>  -Takahiro Akashi


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-06  2:23           ` AKASHI Takahiro
  2021-09-07  2:41               ` [virtio-dev] " Christopher Clark
@ 2021-09-13 23:51             ` Stefano Stabellini
  2021-09-14  6:08               ` [Stratos-dev] " François Ozog
                                 ` (2 more replies)
  1 sibling, 3 replies; 66+ messages in thread
From: Stefano Stabellini @ 2021-09-13 23:51 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Alex Benn??e, Stefan Hajnoczi, Stefano Stabellini,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Stefano Stabellini, Jan Kiszka, Carl van Schaik, pratikp,
	Srivatsa Vaddagiri, Jean-Philippe Brucker, Mathieu Poirier,
	Wei.Chen, olekstysh, Oleksandr_Tyshchenko, Bertrand.Marquis,
	Artem_Mygaiev, julien, jgross, paul, xen-devel, Elena Afanasova

On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
> > the second is how many context switches are involved in a transaction.
> > Of course with all things there is a trade off. Things involving the
> > very tightest latency would probably opt for a bare metal backend which
> > I think would imply hypervisor knowledge in the backend binary.
> 
> In configuration phase of virtio device, the latency won't be a big matter.
> In device operations (i.e. read/write to block devices), if we can
> resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is
> how efficiently we can deliver notification to the opposite side. Right?
> And this is a very common problem whatever approach we would take.
> 
> Anyhow, if we do care the latency in my approach, most of virtio-proxy-
> related code can be re-implemented just as a stub (or shim?) library
> since the protocols are defined as RPCs.
> In this case, however, we would lose the benefit of providing "single binary"
> BE.
> (I know this is is an arguable requirement, though.)

In my experience, latency, performance, and security are far more
important than providing a single binary.

In my opinion, we should optimize for the best performance and security,
then be practical on the topic of hypervisor agnosticism. For instance,
a shared source with a small hypervisor-specific component, with one
implementation of the small component for each hypervisor, would provide
a good enough hypervisor abstraction. It is good to be hypervisor
agnostic, but I wouldn't go to extra lengths to have a single binary. I
cannot picture a case where a BE binary needs to be moved between
different hypervisors and a recompilation is impossible (BE, not FE).
Instead, I can definitely imagine detailed requirements on IRQ latency
having to be lower than 10us or bandwidth higher than 500 MB/sec.

Instead of virtio-proxy, my suggestion is to work together on a common
project and common source with others interested in the same problem.

I would pick something like kvmtool as a basis. It doesn't have to be
kvmtool, and kvmtool specifically is GPL-licensed, which is
unfortunate because it would help if the license were BSD-style for ease
of integration with Zephyr and other RTOSes.

As long as the project is open to working together on multiple
hypervisors and deployment models then it is fine. For instance, the
shared source could be based on OpenAMP kvmtool [1] (the original
kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP
kvmtool was created to add support for hypervisor-less virtio but they
are very open to hypervisors too. It could be a good place to add a Xen
implementation, a KVM fatqueue implementation, a Jailhouse
implementation, etc. -- work together toward the common goal of a single
BE source (not binary) supporting multiple different deployment models.


[1] https://github.com/OpenAMP/kvmtool


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
  2021-09-13 23:51             ` Stefano Stabellini
@ 2021-09-14  6:08               ` François Ozog
  2021-09-14 14:25                 ` [virtio-dev] " Alex Bennée
  2021-09-14 17:38               ` [Stratos-dev] " Trilok Soni
  2 siblings, 0 replies; 66+ messages in thread
From: François Ozog @ 2021-09-14  6:08 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: AKASHI Takahiro, Arnd Bergmann, Artem_Mygaiev, Bertrand.Marquis,
	Carl van Schaik, Elena Afanasova, Jan Kiszka,
	Oleksandr_Tyshchenko, Stefan Hajnoczi, Stefano Stabellini,
	Stratos Mailing List, jgross, julien, olekstysh, paul,
	virtio-dev, xen-devel


Hi

On Tue, 14 Sept 2021 at 01:51, Stefano Stabellini via Stratos-dev <
stratos-dev@op-lists.linaro.org> wrote:

> On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
> > > the second is how many context switches are involved in a transaction.
> > > Of course with all things there is a trade off. Things involving the
> > > very tightest latency would probably opt for a bare metal backend which
> > > I think would imply hypervisor knowledge in the backend binary.
> >
> > In configuration phase of virtio device, the latency won't be a big
> matter.
> > In device operations (i.e. read/write to block devices), if we can
> > resolve 'mmap' issue, as Oleksandr is proposing right now, the only
> issue is
> > how efficiently we can deliver notification to the opposite side. Right?
> > And this is a very common problem whatever approach we would take.
> >
> > Anyhow, if we do care the latency in my approach, most of virtio-proxy-
> > related code can be re-implemented just as a stub (or shim?) library
> > since the protocols are defined as RPCs.
> > In this case, however, we would lose the benefit of providing "single
> binary"
> > BE.
> > (I know this is is an arguable requirement, though.)
>
> In my experience, latency, performance, and security are far more
> important than providing a single binary.
>
> In my opinion, we should optimize for the best performance and security,
> then be practical on the topic of hypervisor agnosticism. For instance,
> a shared source with a small hypervisor-specific component, with one
> implementation of the small component for each hypervisor, would provide
> a good enough hypervisor abstraction. It is good to be hypervisor
> agnostic, but I wouldn't go extra lengths to have a single binary. I
> cannot picture a case where a BE binary needs to be moved between
> different hypervisors and a recompilation is impossible (BE, not FE).
> Instead, I can definitely imagine detailed requirements on IRQ latency
> having to be lower than 10us or bandwidth higher than 500 MB/sec.
>
> Instead of virtio-proxy, my suggestion is to work together on a common
> project and common source with others interested in the same problem.
>
> I would pick something like kvmtool as a basis. It doesn't have to be
> kvmtools, and kvmtools specifically is GPL-licensed, which is
> unfortunate because it would help if the license was BSD-style for ease
> of integration with Zephyr and other RTOSes.
>
> As long as the project is open to working together on multiple
> hypervisors and deployment models then it is fine. For instance, the
> shared source could be based on OpenAMP kvmtool [1] (the original
> kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP
> kvmtool was created to add support for hypervisor-less virtio but they
> are very open to hypervisors too. It could be a good place to add a Xen
> implementation, a KVM fatqueue implementation, a Jailhouse
> implementation, etc. -- work together toward the common goal of a single
> BE source (not binary) supporting multiple different deployment models.
>
I like the hypervisor-less approach described in the link below. It can
also be used to define an abstract HAL between the normal world and
TrustZone to implement confidential workloads in the TZ. Virtio-vsock is
of particular interest.
In addition, this can define a HAL that can be re-used in many contexts:
could we use this to implement something similar to the Android Generic
Kernel Image approach?

>
>
> [1] https://github.com/OpenAMP/kvmtool
> --
> Stratos-dev mailing list
> Stratos-dev@op-lists.linaro.org
> https://op-lists.linaro.org/mailman/listinfo/stratos-dev
>
-- 
François-Frédéric Ozog | Director Business Development
T: +33.67221.6485
francois.ozog@linaro.org | Skype: ffozog


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Enabling hypervisor agnosticism for VirtIO backends
  2021-09-13 23:51             ` Stefano Stabellini
@ 2021-09-14 14:25                 ` Alex Bennée
  2021-09-14 14:25                 ` [virtio-dev] " Alex Bennée
  2021-09-14 17:38               ` [Stratos-dev] " Trilok Soni
  2 siblings, 0 replies; 66+ messages in thread
From: Alex Bennée @ 2021-09-14 14:25 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: AKASHI Takahiro, Stefan Hajnoczi, Stefano Stabellini,
	Stratos Mailing List, virtio-dev, Arnd Bergmann, Viresh Kumar,
	Jan Kiszka, Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier, Wei.Chen, olekstysh,
	Oleksandr_Tyshchenko, Bertrand.Marquis, Artem_Mygaiev, julien,
	jgross, paul, xen-devel, Elena Afanasova


Stefano Stabellini <stefano.stabellini@xilinx.com> writes:

> On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
>> > the second is how many context switches are involved in a transaction.
>> > Of course with all things there is a trade off. Things involving the
>> > very tightest latency would probably opt for a bare metal backend which
>> > I think would imply hypervisor knowledge in the backend binary.
>> 
>> In configuration phase of virtio device, the latency won't be a big matter.
>> In device operations (i.e. read/write to block devices), if we can
>> resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is
>> how efficiently we can deliver notification to the opposite side. Right?
>> And this is a very common problem whatever approach we would take.
>> 
>> Anyhow, if we do care the latency in my approach, most of virtio-proxy-
>> related code can be re-implemented just as a stub (or shim?) library
>> since the protocols are defined as RPCs.
>> In this case, however, we would lose the benefit of providing "single binary"
>> BE.
>> (I know this is is an arguable requirement, though.)
>
> In my experience, latency, performance, and security are far more
> important than providing a single binary.
>
> In my opinion, we should optimize for the best performance and security,
> then be practical on the topic of hypervisor agnosticism. For instance,
> a shared source with a small hypervisor-specific component, with one
> implementation of the small component for each hypervisor, would provide
> a good enough hypervisor abstraction. It is good to be hypervisor
> agnostic, but I wouldn't go extra lengths to have a single binary.

I agree it shouldn't be a primary goal, although a single binary working
with helpers to bridge the gap would make a cool demo. The real aim of
agnosticism is to avoid having multiple implementations of the backend
itself for no other reason than a change in hypervisor.

> I cannot picture a case where a BE binary needs to be moved between
> different hypervisors and a recompilation is impossible (BE, not FE).
> Instead, I can definitely imagine detailed requirements on IRQ latency
> having to be lower than 10us or bandwidth higher than 500 MB/sec.
>
> Instead of virtio-proxy, my suggestion is to work together on a common
> project and common source with others interested in the same problem.
>
> I would pick something like kvmtool as a basis. It doesn't have to be
> kvmtools, and kvmtools specifically is GPL-licensed, which is
> unfortunate because it would help if the license was BSD-style for ease
> of integration with Zephyr and other RTOSes.

This does imply making some choices, especially the implementation
language. However, I feel that C is really the lowest common denominator
here, and I get the sense that people would rather avoid it if they
could, given the potential security implications of a bug-prone backend.
This is what is prompting interest in Rust.

> As long as the project is open to working together on multiple
> hypervisors and deployment models then it is fine. For instance, the
> shared source could be based on OpenAMP kvmtool [1] (the original
> kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP
> kvmtool was created to add support for hypervisor-less virtio but they
> are very open to hypervisors too. It could be a good place to add a Xen
> implementation, a KVM fatqueue implementation, a Jailhouse
> implementation, etc. -- work together toward the common goal of a single
> BE source (not binary) supporting multiple different deployment models.
>
>
> [1] https://github.com/OpenAMP/kvmtool


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
  2021-09-13 23:51             ` Stefano Stabellini
  2021-09-14  6:08               ` [Stratos-dev] " François Ozog
  2021-09-14 14:25                 ` [virtio-dev] " Alex Bennée
@ 2021-09-14 17:38               ` Trilok Soni
  2021-09-15  3:29                 ` Stefano Stabellini
  2 siblings, 1 reply; 66+ messages in thread
From: Trilok Soni @ 2021-09-14 17:38 UTC (permalink / raw)
  To: Stefano Stabellini, AKASHI Takahiro
  Cc: paul, Stratos Mailing List, virtio-dev, Stefano Stabellini,
	Jan Kiszka, Arnd Bergmann, jgross, julien, Carl van Schaik,
	Bertrand.Marquis, Stefan Hajnoczi, Artem_Mygaiev, xen-devel,
	olekstysh, Oleksandr_Tyshchenko, Elena Afanasova


Hello,

On 9/13/2021 4:51 PM, Stefano Stabellini via Stratos-dev wrote:
> On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
>>> the second is how many context switches are involved in a transaction.
>>> Of course with all things there is a trade off. Things involving the
>>> very tightest latency would probably opt for a bare metal backend which
>>> I think would imply hypervisor knowledge in the backend binary.
>>
>> In configuration phase of virtio device, the latency won't be a big matter.
>> In device operations (i.e. read/write to block devices), if we can
>> resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is
>> how efficiently we can deliver notification to the opposite side. Right?
>> And this is a very common problem whatever approach we would take.
>>
>> Anyhow, if we do care the latency in my approach, most of virtio-proxy-
>> related code can be re-implemented just as a stub (or shim?) library
>> since the protocols are defined as RPCs.
>> In this case, however, we would lose the benefit of providing "single binary"
>> BE.
>> (I know this is is an arguable requirement, though.)
> 
> In my experience, latency, performance, and security are far more
> important than providing a single binary.
> 
> In my opinion, we should optimize for the best performance and security,
> then be practical on the topic of hypervisor agnosticism. For instance,
> a shared source with a small hypervisor-specific component, with one
> implementation of the small component for each hypervisor, would provide
> a good enough hypervisor abstraction. It is good to be hypervisor
> agnostic, but I wouldn't go extra lengths to have a single binary. I
> cannot picture a case where a BE binary needs to be moved between
> different hypervisors and a recompilation is impossible (BE, not FE).
> Instead, I can definitely imagine detailed requirements on IRQ latency
> having to be lower than 10us or bandwidth higher than 500 MB/sec.
> 
> Instead of virtio-proxy, my suggestion is to work together on a common
> project and common source with others interested in the same problem.
> 
> I would pick something like kvmtool as a basis. It doesn't have to be
> kvmtools, and kvmtools specifically is GPL-licensed, which is
> unfortunate because it would help if the license was BSD-style for ease
> of integration with Zephyr and other RTOSes.
> 
> As long as the project is open to working together on multiple
> hypervisors and deployment models then it is fine. For instance, the
> shared source could be based on OpenAMP kvmtool [1] (the original
> kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP
> kvmtool was created to add support for hypervisor-less virtio but they
> are very open to hypervisors too. It could be a good place to add a Xen
> implementation, a KVM fatqueue implementation, a Jailhouse
> implementation, etc. -- work together toward the common goal of a single
> BE source (not binary) supporting multiple different deployment models.

I have my reservations about using "kvmtool" for any development here.
"kvmtool" can't be used in products; it is just a tool for developers.

The benefit of solving the problem with rust-vmm is that some of the
crates from this project can be used in a real product. Alex has
mentioned that "rust-vmm" today has some KVM-specific bits, but the
rust-vmm community is already discussing removing or reorganizing them
in such a way that other hypervisors can fit in.

Microsoft has a Hyper-V implementation with cloud-hypervisor, which uses
some of the rust-vmm components, and they have shown interest in adding
Hyper-V support to the "rust-vmm" project as well. I don't know the
current progress, but they have proven it in the "cloud-hypervisor"
project.

The "rust-vmm" project's license will also work for most of the project
developments, and I see that "CrosVM" is shipping in products as well.


---Trilok Soni



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
  2021-09-14 17:38               ` [Stratos-dev] " Trilok Soni
@ 2021-09-15  3:29                 ` Stefano Stabellini
  2021-09-15 23:50                   ` Trilok Soni
  0 siblings, 1 reply; 66+ messages in thread
From: Stefano Stabellini @ 2021-09-15  3:29 UTC (permalink / raw)
  To: Trilok Soni
  Cc: Stefano Stabellini, AKASHI Takahiro, paul, Stratos Mailing List,
	virtio-dev, Stefano Stabellini, Jan Kiszka, Arnd Bergmann,
	jgross, julien, Carl van Schaik, Bertrand.Marquis,
	Stefan Hajnoczi, Artem_Mygaiev, xen-devel, olekstysh,
	Oleksandr_Tyshchenko, Elena Afanasova

On Tue, 14 Sep 2021, Trilok Soni wrote:
> On 9/13/2021 4:51 PM, Stefano Stabellini via Stratos-dev wrote:
> > On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
> > > > the second is how many context switches are involved in a transaction.
> > > > Of course with all things there is a trade off. Things involving the
> > > > very tightest latency would probably opt for a bare metal backend which
> > > > I think would imply hypervisor knowledge in the backend binary.
> > > 
> > > In configuration phase of virtio device, the latency won't be a big
> > > matter.
> > > In device operations (i.e. read/write to block devices), if we can
> > > resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue
> > > is
> > > how efficiently we can deliver notification to the opposite side. Right?
> > > And this is a very common problem whatever approach we would take.
> > > 
> > > Anyhow, if we do care the latency in my approach, most of virtio-proxy-
> > > related code can be re-implemented just as a stub (or shim?) library
> > > since the protocols are defined as RPCs.
> > > In this case, however, we would lose the benefit of providing "single
> > > binary"
> > > BE.
> > > (I know this is is an arguable requirement, though.)
> > 
> > In my experience, latency, performance, and security are far more
> > important than providing a single binary.
> > 
> > In my opinion, we should optimize for the best performance and security,
> > then be practical on the topic of hypervisor agnosticism. For instance,
> > a shared source with a small hypervisor-specific component, with one
> > implementation of the small component for each hypervisor, would provide
> > a good enough hypervisor abstraction. It is good to be hypervisor
> > agnostic, but I wouldn't go extra lengths to have a single binary. I
> > cannot picture a case where a BE binary needs to be moved between
> > different hypervisors and a recompilation is impossible (BE, not FE).
> > Instead, I can definitely imagine detailed requirements on IRQ latency
> > having to be lower than 10us or bandwidth higher than 500 MB/sec.
> > 
> > Instead of virtio-proxy, my suggestion is to work together on a common
> > project and common source with others interested in the same problem.
> > 
> > I would pick something like kvmtool as a basis. It doesn't have to be
> > kvmtools, and kvmtools specifically is GPL-licensed, which is
> > unfortunate because it would help if the license was BSD-style for ease
> > of integration with Zephyr and other RTOSes.
> > 
> > As long as the project is open to working together on multiple
> > hypervisors and deployment models then it is fine. For instance, the
> > shared source could be based on OpenAMP kvmtool [1] (the original
> > kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP
> > kvmtool was created to add support for hypervisor-less virtio but they
> > are very open to hypervisors too. It could be a good place to add a Xen
> > implementation, a KVM fatqueue implementation, a Jailhouse
> > implementation, etc. -- work together toward the common goal of a single
> > BE source (not binary) supporting multiple different deployment models.
> 
> I have my reservations on using "kvmtool" to do any development here.
> "kvmtool" can't be used on the products and it is just a tool for the
> developers.
>
> The benefit of the solving problem w/ rust-vmm is that some of the crates from
> this project can be utilized for the real product. Alex has mentioned that
> "rust-vmm" today has some KVM specific bits but the rust-vmm community is
> already discussing to remove or re-org them in such a way that other
> Hypervisors can fit in.
> 
> Microsoft has Hyper-V implementation w/ cloud-hypervisor which uses some of
> the rust-vmm components as well and they had shown interest to add the Hyper-V
> support in the "rust-vmm" project as well. I don't know the current progress
> but they had proven it it "cloud-hypervisor" project.
> 
> "rust-vmm" project's license will work as well for most of the project
> developments and I see that "CrosVM" is shipping in the products as well.

Most things in open source start as a developer's tool before they become
part of a product :)

I am concerned about how "embeddable" rust-vmm is going to be. Do you
think it would be possible to run it against an RTOS together with other
apps written in C?

Let me make a realistic example. You can imagine a Zephyr instance with
simple toolstack functionalities written in C (starting/stopping VMs).
One might want to add a virtio backend to it. I am not familiar enough
with Rust and rust-vmm to know if it would be feasible and "easy" to run
a rust-vmm backend as a Zephyr app.

A C project of the size of kvmtool, but BSD-licensed, could run on
Zephyr with only a little porting effort using the POSIX compatibility
layer. I think that would be ideal. Anybody aware of a project
fulfilling these requirements?


If we have to give up the ability to integrate with an RTOS, then I
think QEMU could be the leading choice because it is still the main
reference implementation for virtio.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
  2021-09-15  3:29                 ` Stefano Stabellini
@ 2021-09-15 23:50                   ` Trilok Soni
  2021-09-16  2:11                     ` Stefano Stabellini
  0 siblings, 1 reply; 66+ messages in thread
From: Trilok Soni @ 2021-09-15 23:50 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: AKASHI Takahiro, paul, Stratos Mailing List, virtio-dev,
	Stefano Stabellini, Jan Kiszka, Arnd Bergmann, jgross, julien,
	Carl van Schaik, Bertrand.Marquis, Stefan Hajnoczi,
	Artem_Mygaiev, xen-devel, olekstysh, Oleksandr_Tyshchenko,
	Elena Afanasova

Hi Stefano,

On 9/14/2021 8:29 PM, Stefano Stabellini wrote:
> On Tue, 14 Sep 2021, Trilok Soni wrote:
>> On 9/13/2021 4:51 PM, Stefano Stabellini via Stratos-dev wrote:
>>> On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
>>>>> the second is how many context switches are involved in a transaction.
>>>>> Of course with all things there is a trade off. Things involving the
>>>>> very tightest latency would probably opt for a bare metal backend which
>>>>> I think would imply hypervisor knowledge in the backend binary.
>>>>
>>>> In configuration phase of virtio device, the latency won't be a big
>>>> matter.
>>>> In device operations (i.e. read/write to block devices), if we can
>>>> resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue
>>>> is
>>>> how efficiently we can deliver notification to the opposite side. Right?
>>>> And this is a very common problem whatever approach we would take.
>>>>
>>>> Anyhow, if we do care the latency in my approach, most of virtio-proxy-
>>>> related code can be re-implemented just as a stub (or shim?) library
>>>> since the protocols are defined as RPCs.
>>>> In this case, however, we would lose the benefit of providing "single
>>>> binary"
>>>> BE.
>>>> (I know this is an arguable requirement, though.)
>>>
>>> In my experience, latency, performance, and security are far more
>>> important than providing a single binary.
>>>
>>> In my opinion, we should optimize for the best performance and security,
>>> then be practical on the topic of hypervisor agnosticism. For instance,
>>> a shared source with a small hypervisor-specific component, with one
>>> implementation of the small component for each hypervisor, would provide
>>> a good enough hypervisor abstraction. It is good to be hypervisor
>>> agnostic, but I wouldn't go extra lengths to have a single binary. I
>>> cannot picture a case where a BE binary needs to be moved between
>>> different hypervisors and a recompilation is impossible (BE, not FE).
>>> Instead, I can definitely imagine detailed requirements on IRQ latency
>>> having to be lower than 10us or bandwidth higher than 500 MB/sec.
>>>
>>> Instead of virtio-proxy, my suggestion is to work together on a common
>>> project and common source with others interested in the same problem.
>>>
>>> I would pick something like kvmtool as a basis. It doesn't have to be
>>> kvmtools, and kvmtools specifically is GPL-licensed, which is
>>> unfortunate because it would help if the license was BSD-style for ease
>>> of integration with Zephyr and other RTOSes.
>>>
>>> As long as the project is open to working together on multiple
>>> hypervisors and deployment models then it is fine. For instance, the
>>> shared source could be based on OpenAMP kvmtool [1] (the original
>>> kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP
>>> kvmtool was created to add support for hypervisor-less virtio but they
>>> are very open to hypervisors too. It could be a good place to add a Xen
>>> implementation, a KVM fatqueue implementation, a Jailhouse
>>> implementation, etc. -- work together toward the common goal of a single
>>> BE source (not binary) supporting multiple different deployment models.
>>
>> I have my reservations on using "kvmtool" to do any development here.
>> "kvmtool" can't be used on the products and it is just a tool for the
>> developers.
>>
>> The benefit of solving the problem w/ rust-vmm is that some of the crates from
>> this project can be utilized for the real product. Alex has mentioned that
>> "rust-vmm" today has some KVM specific bits but the rust-vmm community is
>> already discussing to remove or re-org them in such a way that other
>> Hypervisors can fit in.
>>
>> Microsoft has Hyper-V implementation w/ cloud-hypervisor which uses some of
>> the rust-vmm components as well and they had shown interest to add the Hyper-V
>> support in the "rust-vmm" project as well. I don't know the current progress
>> but they had proven it in the "cloud-hypervisor" project.
>>
>> "rust-vmm" project's license will work as well for most of the project
>> developments and I see that "CrosVM" is shipping in the products as well.
> 
> Most things in open source start as a developer's tool before they become
> part of a product :)

Agree, but I had an offline discussion with one of the active developers
of kvmtool, and the confidence in using it in a product was nowhere near
what we expected during our evaluation. The same goes for QEMU, where one
of the biggest problems was the number of security issues filed against
its huge codebase.

> 
> I am concerned about how "embeddable" rust-vmm is going to be. Do you
> think it would be possible to run it against an RTOS together with other
> apps written in C?

I don't see any fundamental limitations in rust-vmm. For example, I am
confident that we can port a rust-vmm based backend to QNX as the host
OS, and the same goes for Zephyr. Some work is needed, but nothing
fundamentally blocks it. We should be able to run it on Fuchsia as well
with some effort.

---Trilok Soni


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
  2021-09-15 23:50                   ` Trilok Soni
@ 2021-09-16  2:11                     ` Stefano Stabellini
  0 siblings, 0 replies; 66+ messages in thread
From: Stefano Stabellini @ 2021-09-16  2:11 UTC (permalink / raw)
  To: Trilok Soni
  Cc: Stefano Stabellini, AKASHI Takahiro, paul, Stratos Mailing List,
	virtio-dev, Stefano Stabellini, Jan Kiszka, Arnd Bergmann,
	jgross, julien, Carl van Schaik, Bertrand.Marquis,
	Stefan Hajnoczi, Artem_Mygaiev, xen-devel, olekstysh,
	Oleksandr_Tyshchenko, Elena Afanasova

On Wed, 15 Sep 2021, Trilok Soni wrote:
> On 9/14/2021 8:29 PM, Stefano Stabellini wrote:
> > On Tue, 14 Sep 2021, Trilok Soni wrote:
> > > On 9/13/2021 4:51 PM, Stefano Stabellini via Stratos-dev wrote:
> > > > On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
> > > > > > the second is how many context switches are involved in a
> > > > > > transaction.
> > > > > > Of course with all things there is a trade off. Things involving the
> > > > > > very tightest latency would probably opt for a bare metal backend
> > > > > > which
> > > > > > I think would imply hypervisor knowledge in the backend binary.
> > > > > 
> > > > > In configuration phase of virtio device, the latency won't be a big
> > > > > matter.
> > > > > In device operations (i.e. read/write to block devices), if we can
> > > > > resolve 'mmap' issue, as Oleksandr is proposing right now, the only
> > > > > issue
> > > > > is
> > > > > how efficiently we can deliver notification to the opposite side.
> > > > > Right?
> > > > > And this is a very common problem whatever approach we would take.
> > > > > 
> > > > > Anyhow, if we do care the latency in my approach, most of
> > > > > virtio-proxy-
> > > > > related code can be re-implemented just as a stub (or shim?) library
> > > > > since the protocols are defined as RPCs.
> > > > > In this case, however, we would lose the benefit of providing "single
> > > > > binary"
> > > > > BE.
> > > > > (I know this is an arguable requirement, though.)
> > > > 
> > > > In my experience, latency, performance, and security are far more
> > > > important than providing a single binary.
> > > > 
> > > > In my opinion, we should optimize for the best performance and security,
> > > > then be practical on the topic of hypervisor agnosticism. For instance,
> > > > a shared source with a small hypervisor-specific component, with one
> > > > implementation of the small component for each hypervisor, would provide
> > > > a good enough hypervisor abstraction. It is good to be hypervisor
> > > > agnostic, but I wouldn't go extra lengths to have a single binary. I
> > > > cannot picture a case where a BE binary needs to be moved between
> > > > different hypervisors and a recompilation is impossible (BE, not FE).
> > > > Instead, I can definitely imagine detailed requirements on IRQ latency
> > > > having to be lower than 10us or bandwidth higher than 500 MB/sec.
> > > > 
> > > > Instead of virtio-proxy, my suggestion is to work together on a common
> > > > project and common source with others interested in the same problem.
> > > > 
> > > > I would pick something like kvmtool as a basis. It doesn't have to be
> > > > kvmtools, and kvmtools specifically is GPL-licensed, which is
> > > > unfortunate because it would help if the license was BSD-style for ease
> > > > of integration with Zephyr and other RTOSes.
> > > > 
> > > > As long as the project is open to working together on multiple
> > > > hypervisors and deployment models then it is fine. For instance, the
> > > > shared source could be based on OpenAMP kvmtool [1] (the original
> > > > kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP
> > > > kvmtool was created to add support for hypervisor-less virtio but they
> > > > are very open to hypervisors too. It could be a good place to add a Xen
> > > > implementation, a KVM fatqueue implementation, a Jailhouse
> > > > implementation, etc. -- work together toward the common goal of a single
> > > > BE source (not binary) supporting multiple different deployment models.
> > > 
> > > I have my reservations on using "kvmtool" to do any development here.
> > > "kvmtool" can't be used on the products and it is just a tool for the
> > > developers.
> > > 
> > > The benefit of solving the problem w/ rust-vmm is that some of the crates
> > > from
> > > this project can be utilized for the real product. Alex has mentioned that
> > > "rust-vmm" today has some KVM specific bits but the rust-vmm community is
> > > already discussing to remove or re-org them in such a way that other
> > > Hypervisors can fit in.
> > > 
> > > Microsoft has Hyper-V implementation w/ cloud-hypervisor which uses some
> > > of
> > > the rust-vmm components as well and they had shown interest to add the
> > > Hyper-V
> > > support in the "rust-vmm" project as well. I don't know the current
> > > progress
> > > but they had proven it in the "cloud-hypervisor" project.
> > > 
> > > "rust-vmm" project's license will work as well for most of the project
> > > developments and I see that "CrosVM" is shipping in the products as well.
> > 
> > Most things in open source start as a developer's tool before they become
> > part of a product :)
> 
> Agree, but I had an offline discussion with one of the active developers of
> kvmtool, and the confidence in using it in a product was nowhere near what we
> expected during our evaluation. The same goes for QEMU, where one of the
> biggest problems was the number of security issues filed against its huge
> codebase.

That is fair, but it is important to recognize that these are *known*
security issues.

Does rust-vmm have a security process and a security response team? I
tried googling for it but couldn't find relevant info.

QEMU is a very widely used and very well inspected codebase. It has a
mailing list for reporting security issues and a security process. As a
consequence we know of many vulnerabilities affecting the codebase.
As far as I am aware, rust-vmm has not yet been inspected with the same
level of attention or by the same number of security researchers.

That said, it is of course undeniable that the larger size of QEMU
implies a higher number of security issues. But for this project we
wouldn't be using the whole of QEMU; we would be narrowing it down to a
build with only a few relevant pieces. I imagine that the total LOC count
would still be higher, but the number of relevant security
vulnerabilities would only be a small fraction of the QEMU total.

 
> > I am concerned about how "embeddable" rust-vmm is going to be. Do you
> > think it would be possible to run it against an RTOS together with other
> > apps written in C?
> 
> I don't see any fundamental limitations in rust-vmm. For example, I am
> confident that we can port a rust-vmm based backend to QNX as the host OS,
> and the same goes for Zephyr. Some work is needed, but nothing fundamentally
> blocks it. We should be able to run it on Fuchsia as well with some effort.
 
That's good to hear.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [virtio-dev] Enabling hypervisor agnosticism for VirtIO backends
       [not found]       ` <20210823012029.GB40863@laputa>
@ 2021-10-04 11:33         ` Matias Ezequiel Vara Larsen
  0 siblings, 0 replies; 66+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2021-10-04 11:33 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Alex Bennée, Stratos Mailing List, virtio-dev, Arnd Bergmann,
	Viresh Kumar, Stefano Stabellini, stefanha, Jan Kiszka,
	Carl van Schaik, pratikp, Srivatsa Vaddagiri,
	Jean-Philippe Brucker, Mathieu Poirier

On Mon, Aug 23, 2021 at 10:20:29AM +0900, AKASHI Takahiro wrote:
> Hi Matias,
> 
> On Sat, Aug 21, 2021 at 04:08:20PM +0200, Matias Ezequiel Vara Larsen wrote:
> > Hello,
> > 
> > On Fri, Aug 20, 2021 at 03:05:58PM +0900, AKASHI Takahiro wrote:
> > > Hi Matias,
> > > 
> > > On Thu, Aug 19, 2021 at 11:11:55AM +0200, Matias Ezequiel Vara Larsen wrote:
> > > > Hello Alex,
> > > > 
> > > > I can tell you my experience from working on a PoC (library) 
> > > > to allow the implementation of virtio-devices that are hypervisor/OS agnostic. 
> > > 
> > > What hypervisor are you using for your PoC here?
> > > 
> > 
> > I am using an in-house hypervisor, which is similar to Jailhouse.
> > 
> > > > I focused on two use cases:
> > > > 1. type-I hypervisor in which the backend is running as a VM. This
> > > > is an in-house hypervisor that does not support VMExits.
> > > > 2. Linux user-space. In this case, the library is just used to
> > > > communicate threads. The goal of this use case is merely testing.
> > > > 
> > > > I have chosen virtio-mmio as the way to exchange information
> > > > between the frontend and backend. I found it hard to synchronize the
> > > > access to the virtio-mmio layout without VMExits. I had to add some extra bits to allow 
> > > 
> > > Can you explain how MMIOs to registers in virito-mmio layout
> > > (which I think means a configuration space?) will be propagated to BE?
> > > 
> > 
> > In this PoC, the BE guest is created with a fixed number of regions
> > of memory that represents each device. The BE initializes these regions, and then, waits
> > for the FEs to begin the initialization. 
> 
> Let me ask you in another way; When FE tries to write a register
> in configuration space, say QueueSel, how is BE notified of this event?
> 
In my PoC, the BE is never notified when the FE writes to a register. For
example, QueueSel is only used in one of the steps of the device status
configuration, and the BE is only notified when the FE reaches that step.
When the FE is setting up the vrings, it sets the address, sets QueueSel,
and then blocks until the BE has read the values. The BE reads the values
and resumes the FE, which moves on to the next step.
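
If it helps, a rough sketch of the FE side of that handshake looks like
this; the names are made up for illustration (the struct mirrors only a
subset of the virtio-mmio registers plus the extra sync bit I mentioned,
and hyp_block_until_be_ack() stands in for our in-house hypercall that
parks the FE until the BE resumes it):

  #include <stdint.h>

  /* Subset of the shared virtio-mmio layout, plus the extra sync field
   * added because this hypervisor has no VMExit/trap mechanism. */
  struct virtio_mmio_regs {
      uint32_t queue_sel;
      uint32_t queue_num;
      uint32_t queue_desc_lo;
      uint32_t queue_desc_hi;
      uint32_t sync;
  };

  /* Hypothetical wrapper around the blocking hypercall. */
  extern void hyp_block_until_be_ack(void);

  void fe_setup_queue(volatile struct virtio_mmio_regs *mmio,
                      uint32_t index, uint64_t desc_pa, uint32_t num)
  {
      mmio->queue_num     = num;
      mmio->queue_desc_lo = (uint32_t)desc_pa;          /* vring address */
      mmio->queue_desc_hi = (uint32_t)(desc_pa >> 32);
      mmio->queue_sel     = index;                      /* queue being set up */

      mmio->sync = 1;              /* values are ready for the BE */
      hyp_block_until_be_ack();    /* FE blocks; BE reads, then resumes FE */
  }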

> > > > the front-end and back-end to synchronize, which is required
> > > > during the device-status initialization. These extra bits would not be 
> > > > needed in case the hypervisor supports VMExits, e.g., KVM.
> > > > 
> > > > Each guest has a memory region that is shared with the backend. 
> > > > This memory region is used by the frontend to allocate the io-buffers. This region also 
> > > > maps the virtio-mmio layout that is initialized by the backend. For the moment, this region 
> > > > is defined when the guest is created. One limitation is that the memory for io-buffers is fixed. 
> > > 
> > > So in summary, you have a single memory region that is used
> > > for virtio-mmio layout and io-buffers (I think they are for payload)
> > > and you assume that the region will be (at lease for now) statically
> > > shared between FE and BE so that you can eliminate 'mmap' at every
> > > time to access the payload.
> > > Correct?
> > >
> > 
> > Yes, It is. 
> > 
> > > If so, it can be an alternative solution for memory access issue,
> > > and a similar technique is used in some implementations:
> > > - (Jailhouse's) ivshmem
> > > - Arnd's fat virtqueue
> > >
> > > In either case, however, you will have to allocate payload from the region
> > > and so you will see some impact on FE code (at least at some low level).
> > > (In ivshmem, dma_ops in the kernel is defined for this purpose.)
> > > Correct?
> > 
> > Yes, It is. The FE implements a sort of malloc() to organize the allocation of io-buffers from that
> > memory region.
> > 
> > Rethinking about the VMExits, I am not sure how this mechanism may be used when both the FE and 
> > the BE are VMs. The use of VMExits may require to involve the hypervisor.
> 
> Maybe I misunderstand something. Are FE/BE not VMs in your PoC?
> 

Yes, both are VMs. I meant that, in the case where both are VMs AND a
VMExit mechanism is used, such a mechanism would require the hypervisor
to forward the traps. In my PoC, both are VMs BUT there is no VMExit
mechanism.

Matias
> -Takahiro Akashi
> 
> > Matias
> > > 
> > > -Takahiro Akashi
> > > 
> > > > At some point, the guest shall be able to balloon this region. Notifications between 
> > > > the frontend and the backend are implemented by using an hypercall. The hypercall 
> > > > mechanism and the memory allocation are abstracted away by a platform layer that 
> > > > exposes an interface that is hypervisor/os agnostic.
> > > > 
> > > > I split the backend into a virtio-device driver and a
> > > > backend driver. The virtio-device driver is the virtqueues and the
> > > > backend driver gets packets from the virtqueue for
> > > > post-processing. For example, in the case of virtio-net, the backend
> > > > driver would decide if the packet goes to the hardware or to another
> > > > virtio-net device. The virtio-device drivers may be
> > > > implemented in different ways like by using a single thread, multiple threads, 
> > > > or one thread for all the virtio-devices.
> > > > 
> > > > In this PoC, I just tackled two very simple use-cases. These
> > > > use-cases allowed me to extract some requirements for an hypervisor to
> > > > support virtio.
> > > > 
> > > > Matias
> > > > 
> > > > On Wed, Aug 04, 2021 at 10:04:30AM +0100, Alex Bennée wrote:
> > > > > Hi,
> > > > > 
> > > > > One of the goals of Project Stratos is to enable hypervisor agnostic
> > > > > backends so we can enable as much re-use of code as possible and avoid
> > > > > repeating ourselves. This is the flip side of the front end where
> > > > > multiple front-end implementations are required - one per OS, assuming
> > > > > you don't just want Linux guests. The resultant guests are trivially
> > > > > movable between hypervisors modulo any abstracted paravirt type
> > > > > interfaces.
> > > > > 
> > > > > In my original thumb nail sketch of a solution I envisioned vhost-user
> > > > > daemons running in a broadly POSIX like environment. The interface to
> > > > > the daemon is fairly simple requiring only some mapped memory and some
> > > > > sort of signalling for events (on Linux this is eventfd). The idea was a
> > > > > stub binary would be responsible for any hypervisor specific setup and
> > > > > then launch a common binary to deal with the actual virtqueue requests
> > > > > themselves.
> > > > > 
> > > > > Since that original sketch we've seen an expansion in the sort of ways
> > > > > backends could be created. There is interest in encapsulating backends
> > > > > in RTOSes or unikernels for solutions like SCMI. There interest in Rust
> > > > > has prompted ideas of using the trait interface to abstract differences
> > > > > away as well as the idea of bare-metal Rust backends.
> > > > > 
> > > > > We have a card (STR-12) called "Hypercall Standardisation" which
> > > > > calls for a description of the APIs needed from the hypervisor side to
> > > > > support VirtIO guests and their backends. However we are some way off
> > > > > from that at the moment as I think we need to at least demonstrate one
> > > > > portable backend before we start codifying requirements. To that end I
> > > > > want to think about what we need for a backend to function.
> > > > > 
> > > > > Configuration
> > > > > =============
> > > > > 
> > > > > In the type-2 setup this is typically fairly simple because the host
> > > > > system can orchestrate the various modules that make up the complete
> > > > > system. In the type-1 case (or even type-2 with delegated service VMs)
> > > > > we need some sort of mechanism to inform the backend VM about key
> > > > > details about the system:
> > > > > 
> > > > >   - where virt queue memory is in it's address space
> > > > >   - how it's going to receive (interrupt) and trigger (kick) events
> > > > >   - what (if any) resources the backend needs to connect to
> > > > > 
> > > > > Obviously you can elide over configuration issues by having static
> > > > > configurations and baking the assumptions into your guest images however
> > > > > this isn't scalable in the long term. The obvious solution seems to be
> > > > > extending a subset of Device Tree data to user space but perhaps there
> > > > > are other approaches?
> > > > > 
> > > > > Before any virtio transactions can take place the appropriate memory
> > > > > mappings need to be made between the FE guest and the BE guest.
> > > > > Currently the whole of the FE guests address space needs to be visible
> > > > > to whatever is serving the virtio requests. I can envision 3 approaches:
> > > > > 
> > > > >  * BE guest boots with memory already mapped
> > > > > 
> > > > >  This would entail the guest OS knowing where in it's Guest Physical
> > > > >  Address space is already taken up and avoiding clashing. I would assume
> > > > >  in this case you would want a standard interface to userspace to then
> > > > >  make that address space visible to the backend daemon.
> > > > > 
> > > > >  * BE guests boots with a hypervisor handle to memory
> > > > > 
> > > > >  The BE guest is then free to map the FE's memory to where it wants in
> > > > >  the BE's guest physical address space. To activate the mapping will
> > > > >  require some sort of hypercall to the hypervisor. I can see two options
> > > > >  at this point:
> > > > > 
> > > > >   - expose the handle to userspace for daemon/helper to trigger the
> > > > >     mapping via existing hypercall interfaces. If using a helper you
> > > > >     would have a hypervisor specific one to avoid the daemon having to
> > > > >     care too much about the details or push that complexity into a
> > > > >     compile time option for the daemon which would result in different
> > > > >     binaries although a common source base.
> > > > > 
> > > > >   - expose a new kernel ABI to abstract the hypercall differences away
> > > > >     in the guest kernel. In this case the userspace would essentially
> > > > >     ask for an abstract "map guest N memory to userspace ptr" and let
> > > > >     the kernel deal with the different hypercall interfaces. This of
> > > > >     course assumes the majority of BE guests would be Linux kernels and
> > > > >     leaves the bare-metal/unikernel approaches to their own devices.
> > > > > 
> > > > > Operation
> > > > > =========
> > > > > 
> > > > > The core of the operation of VirtIO is fairly simple. Once the
> > > > > vhost-user feature negotiation is done it's a case of receiving update
> > > > > events and parsing the resultant virt queue for data. The vhost-user
> > > > > specification handles a bunch of setup before that point, mostly to
> > > > > detail where the virt queues are set up FD's for memory and event
> > > > > communication. This is where the envisioned stub process would be
> > > > > responsible for getting the daemon up and ready to run. This is
> > > > > currently done inside a big VMM like QEMU but I suspect a modern
> > > > > approach would be to use the rust-vmm vhost crate. It would then either
> > > > > communicate with the kernel's abstracted ABI or be re-targeted as a
> > > > > build option for the various hypervisors.
> > > > > 
> > > > > One question is how to best handle notification and kicks. The existing
> > > > > vhost-user framework uses eventfd to signal the daemon (although QEMU
> > > > > is quite capable of simulating them when you use TCG). Xen has it's own
> > > > > IOREQ mechanism. However latency is an important factor and having
> > > > > events go through the stub would add quite a lot.
> > > > > 
> > > > > Could we consider the kernel internally converting IOREQ messages from
> > > > > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > > > > hypercall interfaces?
> > > > > 
> > > > > So any thoughts on what directions are worth experimenting with?
> > > > > 
> > > > > -- 
> > > > > Alex Bennée
> > > > > 
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > > 

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2021-10-04 11:35 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-04  9:04 [virtio-dev] Enabling hypervisor agnosticism for VirtIO backends Alex Bennée
2021-08-04 19:20 ` Stefano Stabellini
2021-08-11  6:27   ` AKASHI Takahiro
2021-08-14 15:37     ` Oleksandr Tyshchenko
2021-08-16 10:04       ` Wei Chen
2021-08-17  8:07         ` AKASHI Takahiro
2021-08-17  8:39           ` Wei Chen
2021-08-18  5:38             ` AKASHI Takahiro
2021-08-18  8:35               ` Wei Chen
2021-08-20  6:41                 ` AKASHI Takahiro
2021-08-26  9:40                   ` AKASHI Takahiro
2021-08-26 12:10                     ` Wei Chen
2021-08-30 19:36                       ` Christopher Clark
2021-08-30 19:53                         ` Christopher Clark
2021-08-30 19:53                           ` [virtio-dev] " Christopher Clark
2021-09-02  7:19                           ` AKASHI Takahiro
2021-09-07  0:57                             ` Christopher Clark
2021-09-07  0:57                               ` [virtio-dev] " Christopher Clark
2021-09-07 11:55                               ` AKASHI Takahiro
2021-09-07 18:09                                 ` Christopher Clark
2021-09-07 18:09                                   ` [virtio-dev] " Christopher Clark
2021-09-10  3:12                                   ` AKASHI Takahiro
2021-08-31  6:18                       ` AKASHI Takahiro
2021-09-01 11:12                         ` Wei Chen
2021-09-01 12:29                           ` AKASHI Takahiro
2021-09-01 16:26                             ` Oleksandr Tyshchenko
2021-09-02  1:30                             ` Wei Chen
2021-09-02  1:50                               ` Wei Chen
     [not found]   ` <0100017b33e585a5-06d4248e-b1a7-485e-800c-7ead89e5f916-000000@email.amazonses.com>
2021-08-12  7:55     ` [Stratos-dev] " François Ozog
2021-08-13  5:10       ` AKASHI Takahiro
2021-09-01  8:57         ` Alex Bennée
2021-09-01  8:57           ` [virtio-dev] " Alex Bennée
2021-08-17 10:41   ` Stefan Hajnoczi
2021-08-17 10:41     ` [virtio-dev] " Stefan Hajnoczi
2021-08-23  6:25     ` AKASHI Takahiro
2021-08-23  9:58       ` Stefan Hajnoczi
2021-08-23  9:58         ` [virtio-dev] " Stefan Hajnoczi
2021-08-25 10:29         ` AKASHI Takahiro
2021-08-25 15:02           ` Stefan Hajnoczi
2021-08-25 15:02             ` [virtio-dev] " Stefan Hajnoczi
2021-09-01 12:53     ` Alex Bennée
2021-09-01 12:53       ` [virtio-dev] " Alex Bennée
2021-09-02  9:12       ` Stefan Hajnoczi
2021-09-02  9:12         ` [virtio-dev] " Stefan Hajnoczi
2021-09-03  8:06       ` AKASHI Takahiro
2021-09-03  9:28         ` Alex Bennée
2021-09-03  9:28           ` [virtio-dev] " Alex Bennée
2021-09-06  2:23           ` AKASHI Takahiro
2021-09-07  2:41             ` [Stratos-dev] " Christopher Clark
2021-09-07  2:41               ` [virtio-dev] " Christopher Clark
2021-09-10  2:50               ` AKASHI Takahiro
2021-09-10  9:35               ` Alex Bennée
2021-09-10  9:35                 ` [virtio-dev] " Alex Bennée
2021-09-13 23:51             ` Stefano Stabellini
2021-09-14  6:08               ` [Stratos-dev] " François Ozog
2021-09-14 14:25               ` Alex Bennée
2021-09-14 14:25                 ` [virtio-dev] " Alex Bennée
2021-09-14 17:38               ` [Stratos-dev] " Trilok Soni
2021-09-15  3:29                 ` Stefano Stabellini
2021-09-15 23:50                   ` Trilok Soni
2021-09-16  2:11                     ` Stefano Stabellini
2021-08-05 15:48 ` [virtio-dev] " Stefan Hajnoczi
2021-08-19  9:11 ` [virtio-dev] " Matias Ezequiel Vara Larsen
     [not found]   ` <20210820060558.GB13452@laputa>
2021-08-21 14:08     ` Matias Ezequiel Vara Larsen
     [not found]       ` <20210823012029.GB40863@laputa>
2021-10-04 11:33         ` Matias Ezequiel Vara Larsen
2021-09-01  8:43   ` Alex Bennée
