* [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
@ 2015-08-31 14:11 Michael S. Tsirkin
From: Michael S. Tsirkin @ 2015-08-31 14:11 UTC (permalink / raw)
  To: qemu-devel, virtualization, virtio-dev, opnfv-tech-discuss
  Cc: Jan Kiszka, Claudio.Fontana

Hello!
During the KVM forum, we discussed supporting virtio on top
of ivshmem. I have considered it, and came up with an alternative
that has several advantages over that - please see below.
Comments welcome.

-----

Existing solutions to userspace switching between VMs on the
same host are vhost-user and ivshmem.

vhost-user works by mapping memory of all VMs being bridged into the
switch memory space.

By comparison, ivshmem works by exposing a shared region of memory to all VMs.
VMs are required to use this region to store packets. The switch only
needs access to this region.

Another difference between vhost-user and ivshmem surfaces when polling
is used. With vhost-user, the switch is required to handle
data movement between VMs; if polling is used, this means that one host
CPU needs to be sacrificed for this task.

This is easiest to understand when one of the VMs is
used with VF pass-through. This can be schematically shown below:

+-- VM1 --------------+            +---VM2-----------+
| virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
+---------------------+            +-----------------+


With ivshmem, in theory, communication can happen directly, with the
two VMs polling the shared memory region.


I won't spend time listing advantages of vhost-user over ivshmem.
Instead, having identified two advantages of ivshmem over vhost-user,
below is a proposal to extend vhost-user to gain the advantages
of ivshmem.


1: virtio in the guest can be extended to allow support
for IOMMUs. This provides the guest with full flexibility
over which memory is readable or writable by each device.
By setting up a virtio device for each other VM it needs to
communicate with, a guest gets full control of its security, from
mapping all memory (like with current vhost-user), to only
mapping buffers used for networking (like ivshmem), to
transient mappings for the duration of data transfer only.
This also allows use of VFIO within guests, for improved
security.
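
To make the granularity concrete, here is a minimal sketch of how a
guest driver could create such a transient mapping through VFIO and
tear it down after the transfer; the container fd, buffer address,
IOVA and length are placeholders, and error handling is omitted:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map one buffer into the device-visible IOVA space for the duration
 * of a single transfer.  container_fd is a VFIO type1 container the
 * guest driver already set up. */
static int map_buffer_transiently(int container_fd, void *vaddr,
                                  uint64_t iova, uint64_t len)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)vaddr,
        .iova  = iova,
        .size  = len,
    };
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

/* Undo the mapping once the data transfer is complete. */
static int unmap_buffer(int container_fd, uint64_t iova, uint64_t len)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = iova,
        .size  = len,
    };
    return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}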

vhost-user would need to be extended to send the
mappings programmed by the guest IOMMU.
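
As a rough illustration (not part of the protocol today - the message
number and payload layout below are made up), each map or unmap event
programmed by the guest IOMMU could be forwarded to the vhost-user
peer with something like:

#include <stdint.h>

#define VHOST_USER_IOMMU_UPDATE 32   /* hypothetical request number */

/* Hypothetical payload: one IOMMU (un)map event as programmed by the
 * guest, so the peer can keep its view of accessible memory in sync. */
struct vhost_iommu_update {
    uint64_t iova;       /* bus address as seen by the guest's device */
    uint64_t guest_addr; /* guest physical address backing it */
    uint64_t size;
    uint8_t  perm;       /* bit 0: read allowed, bit 1: write allowed */
    uint8_t  is_unmap;   /* non-zero when the range is being revoked */
};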

2. qemu can be extended to serve as a vhost-user client:
it would receive remote VM mappings over the vhost-user protocol and
map them into another VM's memory.
This mapping can take, for example, the form of
a BAR of a pci device, which I'll call here vhost-pci -
with bus addresses allowed
by VM1's IOMMU mappings being translated into
offsets within this BAR in VM2's physical
memory space.

Since the translation can be a simple one, VM2
can perform it within its vhost-pci device driver.
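
A minimal sketch of that driver-side translation (the region table and
field names are assumptions, just to show how simple the math can be):

#include <stdint.h>
#include <stddef.h>

/* One window exported by VM1's IOMMU mappings, as seen by VM2's
 * vhost-pci driver after it has mapped the BAR. */
struct vhost_pci_region {
    uint64_t bus_addr;   /* start of the window in VM1's bus address space */
    uint64_t bar_offset; /* where that window sits inside the BAR */
    uint64_t len;
    void    *bar_base;   /* VM2's mapping of the vhost-pci BAR */
};

/* Translate a bus address taken from VM1's descriptors into a pointer
 * within the BAR mapping, or NULL if VM1's IOMMU does not allow it. */
static void *vhost_pci_translate(const struct vhost_pci_region *r,
                                 uint64_t bus_addr, uint64_t len)
{
    if (bus_addr < r->bus_addr || bus_addr + len > r->bus_addr + r->len)
        return NULL;
    return (char *)r->bar_base + r->bar_offset + (bus_addr - r->bus_addr);
}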

While this setup would be the most useful with polling,
VM1's ioeventfd can also be mapped to
VM2's irqfd, and vice versa, so that the VMs
can trigger interrupts to each other without the need
for a helper thread on the host.
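
As a sketch of the wiring (assuming the eventfd has already been passed
between the two QEMU processes, e.g. over a unix socket with SCM_RIGHTS;
the doorbell address and GSI are placeholders), the existing
KVM_IOEVENTFD and KVM_IRQFD ioctls are enough:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Register the same eventfd as VM1's ioeventfd and VM2's irqfd, so a
 * doorbell write in VM1 raises an interrupt in VM2 with no host thread
 * in the signalling path. */
static int wire_doorbell(int vm1_fd, int vm2_fd, int efd,
                         uint64_t doorbell_gpa, uint32_t gsi)
{
    struct kvm_ioeventfd ioev;
    struct kvm_irqfd irqfd;

    memset(&ioev, 0, sizeof(ioev));
    ioev.addr = doorbell_gpa;  /* doorbell register in VM1 */
    ioev.len  = 4;
    ioev.fd   = efd;
    if (ioctl(vm1_fd, KVM_IOEVENTFD, &ioev) < 0)
        return -1;

    memset(&irqfd, 0, sizeof(irqfd));
    irqfd.fd  = efd;           /* same eventfd on the receiving side */
    irqfd.gsi = gsi;           /* interrupt line to raise in VM2 */
    return ioctl(vm2_fd, KVM_IRQFD, &irqfd);
}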


The resulting channel might look something like the following:

+-- VM1 --------------+  +---VM2-----------+
| virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
+---------------------+  +-----------------+

Comparing the two diagrams, a vhost-user thread on the host is
no longer required, reducing host CPU utilization when
polling is active.  At the same time, VM2 cannot access all of VM1's
memory - it is limited by the IOMMU configuration set up by VM1.


Advantages over ivshmem:

- more flexibility: endpoint VMs do not have to place data at any
  specific locations to use the device; in practice this likely
  means fewer data copies.
- better standardization/code reuse:
  virtio changes within guests would be fairly easy to implement
  and would also benefit other backends besides vhost-user;
  standard hotplug interfaces can be used to add and remove these
  channels as VMs are added or removed.
- migration support:
  it's easy to implement since ownership of memory is well defined.
  For example, during migration VM2 can notify the hypervisor of VM1
  by updating a dirty bitmap each time it writes into VM1's memory
  (see the sketch below).
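
A minimal sketch of that dirty tracking, loosely modelled on the
existing vhost dirty log (one bit per 4K page of VM1's memory; the
bitmap layout here is an assumption for illustration):

#include <stdint.h>

#define DIRTY_PAGE_SHIFT 12

/* Called whenever VM2's side writes into VM1's memory through the
 * vhost-pci BAR; VM1's hypervisor scans this bitmap during migration
 * to find pages it must resend.  A real implementation would set the
 * bit atomically. */
static void log_mark_dirty(uint8_t *log, uint64_t vm1_guest_addr)
{
    uint64_t page = vm1_guest_addr >> DIRTY_PAGE_SHIFT;

    log[page / 8] |= (uint8_t)(1u << (page % 8));
}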

Thanks,

-- 
MST

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
From: Nakajima, Jun @ 2015-08-31 18:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Jan Kiszka, Claudio.Fontana, qemu-devel,
	Linux Virtualization, opnfv-tech-discuss

On Mon, Aug 31, 2015 at 7:11 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> Hello!
> During the KVM forum, we discussed supporting virtio on top
> of ivshmem. I have considered it, and came up with an alternative
> that has several advantages over that - please see below.
> Comments welcome.

Hi Michael,

I like this, and it should be able to achieve what I presented at KVM
Forum (vhost-user-shmem).
Comments below.

>
> -----
>
> Existing solutions to userspace switching between VMs on the
> same host are vhost-user and ivshmem.
>
> vhost-user works by mapping memory of all VMs being bridged into the
> switch memory space.
>
> By comparison, ivshmem works by exposing a shared region of memory to all VMs.
> VMs are required to use this region to store packets. The switch only
> needs access to this region.
>
> Another difference between vhost-user and ivshmem surfaces when polling
> is used. With vhost-user, the switch is required to handle
> data movement between VMs, if using polling, this means that 1 host CPU
> needs to be sacrificed for this task.
>
> This is easiest to understand when one of the VMs is
> used with VF pass-through. This can be schematically shown below:
>
> +-- VM1 --------------+            +---VM2-----------+
> | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
> +---------------------+            +-----------------+
>
>
> With ivshmem in theory communication can happen directly, with two VMs
> polling the shared memory region.
>
>
> I won't spend time listing advantages of vhost-user over ivshmem.
> Instead, having identified two advantages of ivshmem over vhost-user,
> below is a proposal to extend vhost-user to gain the advantages
> of ivshmem.
>
>
> 1: virtio in guest can be extended to allow support
> for IOMMUs. This provides guest with full flexibility
> about memory which is readable or write able by each device.

I assume that by "use of VFIO" you meant VFIO only for virtio.  To get
VFIO working for general direct I/O (including VFs) in guests, as you
know, we need to virtualize the IOMMU (e.g. VT-d) and the interrupt
remapping table on x86 (i.e. nested VT-d).

> By setting up a virtio device for each other VM we need to
> communicate to, guest gets full control of its security, from
> mapping all memory (like with current vhost-user) to only
> mapping buffers used for networking (like ivshmem) to
> transient mappings for the duration of data transfer only.

And I think that we can use VMFUNC to have such transient mappings.

> This also allows use of VFIO within guests, for improved
> security.
>
> vhost user would need to be extended to send the
> mappings programmed by guest IOMMU.

Right. We need to think about cases where other VMs (VM3, etc.) join
the group or some existing VM leaves.
PCI hot-plug should work there (as you point out under "Advantages over
ivshmem" below).

>
> 2. qemu can be extended to serve as a vhost-user client:
> remote VM mappings over the vhost-user protocol, and
> map them into another VM's memory.
> This mapping can take, for example, the form of
> a BAR of a pci device, which I'll call here vhost-pci -
> with bus address allowed
> by VM1's IOMMU mappings being translated into
> offsets within this BAR within VM2's physical
> memory space.

I think it's sensible.

>
> Since the translation can be a simple one, VM2
> can perform it within its vhost-pci device driver.
>
> While this setup would be the most useful with polling,
> VM1's ioeventfd can also be mapped to
> another VM2's irqfd, and vice versa, such that VMs
> can trigger interrupts to each other without need
> for a helper thread on the host.
>
>
> The resulting channel might look something like the following:
>
> +-- VM1 --------------+  +---VM2-----------+
> | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
> +---------------------+  +-----------------+
>
> comparing the two diagrams, a vhost-user thread on the host is
> no longer required, reducing the host CPU utilization when
> polling is active.  At the same time, VM2 can not access all of VM1's
> memory - it is limited by the iommu configuration setup by VM1.
>
>
> Advantages over ivshmem:
>
> - more flexibility, endpoint VMs do not have to place data at any
>   specific locations to use the device, in practice this likely
>   means less data copies.
> - better standardization/code reuse
>   virtio changes within guests would be fairly easy to implement
>   and would also benefit other backends, besides vhost-user
>   standard hotplug interfaces can be used to add and remove these
>   channels as VMs are added or removed.
> - migration support
>   It's easy to implement since ownership of memory is well defined.
>   For example, during migration VM2 can notify hypervisor of VM1
>   by updating dirty bitmap each time is writes into VM1 memory.

Also, the ivshmem functionality could be implemented with this proposal:
- the vswitch (or some VM) allocates memory regions in its address space, and
- it arranges for the IOMMU mappings on the VMs to be translated into those regions

>
> Thanks,
>
> --
> MST
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization


-- 
Jun
Intel Open Source Technology Center

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
From: Varun Sethi @ 2015-09-01  3:03 UTC (permalink / raw)
  To: Nakajima, Jun, Michael S. Tsirkin
  Cc: virtio-dev, Jan Kiszka, Claudio.Fontana, qemu-devel,
	Linux Virtualization, opnfv-tech-discuss

Hi Michael,
When you talk about VFIO in the guest, is it with a purely emulated IOMMU in QEMU?
Also, I am not clear on the following points:
1. How would transient memory be mapped via the BAR in the backend VM?
2. How would the backend VM update the dirty page bitmap for the frontend VM?

Regards
Varun

> -----Original Message-----
> From: qemu-devel-bounces+varun.sethi=freescale.com@nongnu.org
> [mailto:qemu-devel-bounces+varun.sethi=freescale.com@nongnu.org] On
> Behalf Of Nakajima, Jun
> Sent: Monday, August 31, 2015 1:36 PM
> To: Michael S. Tsirkin
> Cc: virtio-dev@lists.oasis-open.org; Jan Kiszka;
> Claudio.Fontana@huawei.com; qemu-devel@nongnu.org; Linux
> Virtualization; opnfv-tech-discuss@lists.opnfv.org
> Subject: Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm
> communication
> 
> On Mon, Aug 31, 2015 at 7:11 AM, Michael S. Tsirkin <mst@redhat.com>
> wrote:
> > Hello!
> > During the KVM forum, we discussed supporting virtio on top of
> > ivshmem. I have considered it, and came up with an alternative that
> > has several advantages over that - please see below.
> > Comments welcome.
> 
> Hi Michael,
> 
> I like this, and it should be able to achieve what I presented at KVM Forum
> (vhost-user-shmem).
> Comments below.
> 
> >
> > -----
> >
> > Existing solutions to userspace switching between VMs on the same host
> > are vhost-user and ivshmem.
> >
> > vhost-user works by mapping memory of all VMs being bridged into the
> > switch memory space.
> >
> > By comparison, ivshmem works by exposing a shared region of memory to
> all VMs.
> > VMs are required to use this region to store packets. The switch only
> > needs access to this region.
> >
> > Another difference between vhost-user and ivshmem surfaces when
> > polling is used. With vhost-user, the switch is required to handle
> > data movement between VMs, if using polling, this means that 1 host
> > CPU needs to be sacrificed for this task.
> >
> > This is easiest to understand when one of the VMs is used with VF
> > pass-through. This can be schematically shown below:
> >
> > +-- VM1 --------------+            +---VM2-----------+
> > | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
> > +---------------------+            +-----------------+
> >
> >
> > With ivshmem in theory communication can happen directly, with two VMs
> > polling the shared memory region.
> >
> >
> > I won't spend time listing advantages of vhost-user over ivshmem.
> > Instead, having identified two advantages of ivshmem over vhost-user,
> > below is a proposal to extend vhost-user to gain the advantages of
> > ivshmem.
> >
> >
> > 1: virtio in guest can be extended to allow support for IOMMUs. This
> > provides guest with full flexibility about memory which is readable or
> > write able by each device.
> 
> I assume that you meant VFIO only for virtio by "use of VFIO".  To get VFIO
> working for general direct-I/O (including VFs) in guests, as you know, we
> need to virtualize IOMMU (e.g. VT-d) and the interrupt remapping table on
> x86 (i.e. nested VT-d).
> 
> > By setting up a virtio device for each other VM we need to communicate
> > to, guest gets full control of its security, from mapping all memory
> > (like with current vhost-user) to only mapping buffers used for
> > networking (like ivshmem) to transient mappings for the duration of
> > data transfer only.
> 
> And I think that we can use VMFUNC to have such transient mappings.
> 
> > This also allows use of VFIO within guests, for improved security.
> >
> > vhost user would need to be extended to send the mappings programmed
> > by guest IOMMU.
> 
> Right. We need to think about cases where other VMs (VM3, etc.) join the
> group or some existing VM leaves.
> PCI hot-plug should work there (as you point out at "Advantages over
> ivshmem" below).
> 
> >
> > 2. qemu can be extended to serve as a vhost-user client:
> > remote VM mappings over the vhost-user protocol, and map them into
> > another VM's memory.
> > This mapping can take, for example, the form of a BAR of a pci device,
> > which I'll call here vhost-pci - with bus address allowed by VM1's
> > IOMMU mappings being translated into offsets within this BAR within
> > VM2's physical memory space.
> 
> I think it's sensible.
> 
> >
> > Since the translation can be a simple one, VM2 can perform it within
> > its vhost-pci device driver.
> >
> > While this setup would be the most useful with polling, VM1's
> > ioeventfd can also be mapped to another VM2's irqfd, and vice versa,
> > such that VMs can trigger interrupts to each other without need for a
> > helper thread on the host.
> >
> >
> > The resulting channel might look something like the following:
> >
> > +-- VM1 --------------+  +---VM2-----------+
> > | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
> > +---------------------+  +-----------------+
> >
> > comparing the two diagrams, a vhost-user thread on the host is no
> > longer required, reducing the host CPU utilization when polling is
> > active.  At the same time, VM2 can not access all of VM1's memory - it
> > is limited by the iommu configuration setup by VM1.
> >
> >
> > Advantages over ivshmem:
> >
> > - more flexibility, endpoint VMs do not have to place data at any
> >   specific locations to use the device, in practice this likely
> >   means less data copies.
> > - better standardization/code reuse
> >   virtio changes within guests would be fairly easy to implement
> >   and would also benefit other backends, besides vhost-user
> >   standard hotplug interfaces can be used to add and remove these
> >   channels as VMs are added or removed.
> > - migration support
> >   It's easy to implement since ownership of memory is well defined.
> >   For example, during migration VM2 can notify hypervisor of VM1
> >   by updating dirty bitmap each time is writes into VM1 memory.
> 
> Also, the ivshmem functionality could be implemented by this proposal:
> - vswitch (or some VM) allocates memory regions in its address space, and
> - it sets up that IOMMU mappings on the VMs be translated into the regions
> 
> >
> > Thanks,
> >
> > --
> > MST
> > _______________________________________________
> > Virtualization mailing list
> > Virtualization@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/virtualization
> 
> 
> --
> Jun
> Intel Open Source Technology Center


* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
From: Jan Kiszka @ 2015-09-01  7:35 UTC (permalink / raw)
  To: Michael S. Tsirkin, qemu-devel, virtualization, virtio-dev,
	opnfv-tech-discuss
  Cc: Varun Sethi, Claudio.Fontana, Nakajima, Jun

On 2015-08-31 16:11, Michael S. Tsirkin wrote:
> Hello!
> During the KVM forum, we discussed supporting virtio on top
> of ivshmem.

No, not on top of ivshmem. On top of shared memory. Our model is
different from the simplistic ivshmem.

> I have considered it, and came up with an alternative
> that has several advantages over that - please see below.
> Comments welcome.
> 
> -----
> 
> Existing solutions to userspace switching between VMs on the
> same host are vhost-user and ivshmem.
> 
> vhost-user works by mapping memory of all VMs being bridged into the
> switch memory space.
> 
> By comparison, ivshmem works by exposing a shared region of memory to all VMs.
> VMs are required to use this region to store packets. The switch only
> needs access to this region.
> 
> Another difference between vhost-user and ivshmem surfaces when polling
> is used. With vhost-user, the switch is required to handle
> data movement between VMs, if using polling, this means that 1 host CPU
> needs to be sacrificed for this task.
> 
> This is easiest to understand when one of the VMs is
> used with VF pass-through. This can be schematically shown below:
> 
> +-- VM1 --------------+            +---VM2-----------+
> | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
> +---------------------+            +-----------------+
> 
> 
> With ivshmem in theory communication can happen directly, with two VMs
> polling the shared memory region.
> 
> 
> I won't spend time listing advantages of vhost-user over ivshmem.
> Instead, having identified two advantages of ivshmem over vhost-user,
> below is a proposal to extend vhost-user to gain the advantages
> of ivshmem.
> 
> 
> 1: virtio in guest can be extended to allow support
> for IOMMUs. This provides guest with full flexibility
> about memory which is readable or write able by each device.
> By setting up a virtio device for each other VM we need to
> communicate to, guest gets full control of its security, from
> mapping all memory (like with current vhost-user) to only
> mapping buffers used for networking (like ivshmem) to
> transient mappings for the duration of data transfer only.
> This also allows use of VFIO within guests, for improved
> security.
> 
> vhost user would need to be extended to send the
> mappings programmed by guest IOMMU.
> 
> 2. qemu can be extended to serve as a vhost-user client:
> remote VM mappings over the vhost-user protocol, and
> map them into another VM's memory.
> This mapping can take, for example, the form of
> a BAR of a pci device, which I'll call here vhost-pci - 
> with bus address allowed
> by VM1's IOMMU mappings being translated into
> offsets within this BAR within VM2's physical
> memory space.
> 
> Since the translation can be a simple one, VM2
> can perform it within its vhost-pci device driver.
> 
> While this setup would be the most useful with polling,
> VM1's ioeventfd can also be mapped to
> another VM2's irqfd, and vice versa, such that VMs
> can trigger interrupts to each other without need
> for a helper thread on the host.
> 
> 
> The resulting channel might look something like the following:
> 
> +-- VM1 --------------+  +---VM2-----------+
> | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
> +---------------------+  +-----------------+
> 
> comparing the two diagrams, a vhost-user thread on the host is
> no longer required, reducing the host CPU utilization when
> polling is active.  At the same time, VM2 can not access all of VM1's
> memory - it is limited by the iommu configuration setup by VM1.
> 
> 
> Advantages over ivshmem:
> 
> - more flexibility, endpoint VMs do not have to place data at any
>   specific locations to use the device, in practice this likely
>   means less data copies.
> - better standardization/code reuse
>   virtio changes within guests would be fairly easy to implement
>   and would also benefit other backends, besides vhost-user
>   standard hotplug interfaces can be used to add and remove these
>   channels as VMs are added or removed.
> - migration support
>   It's easy to implement since ownership of memory is well defined.
>   For example, during migration VM2 can notify hypervisor of VM1
>   by updating dirty bitmap each time is writes into VM1 memory.
> 
> Thanks,
> 

This sounds like a different interface to a concept very similar to
Xen's grant table, no? Well, there might be benefits for some use cases,
but for ours this is too dynamic. We'd like to avoid runtime remappings
controlled by guest activities, which is exactly what this model
requires.

Another shortcoming: if VM1 does not trust (security- or safety-wise) VM2
while preparing a message for it, it has to keep the buffer invisible
to VM2 until it is completed and signed, hashed, etc. That means it has
to reprogram the IOMMU frequently. With the concept we discussed at KVM
Forum, there would be shared memory mapped read-only to VM2 while being
R/W for VM1. That would resolve this issue without the need for costly
remappings.

Leaving all the implementation and interface details aside, this
discussion is first of all about two fundamentally different approaches:
static shared memory windows vs. dynamically remapped shared windows (a
third one would be copying in the hypervisor, but I suppose we all agree
that the whole exercise is about avoiding that). Which way do we want or
have to go?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
From: Michael S. Tsirkin @ 2015-09-01  8:01 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: virtio-dev, Claudio.Fontana, qemu-devel, virtualization,
	Nakajima, Jun, Varun Sethi, opnfv-tech-discuss

On Tue, Sep 01, 2015 at 09:35:21AM +0200, Jan Kiszka wrote:
> On 2015-08-31 16:11, Michael S. Tsirkin wrote:
> > Hello!
> > During the KVM forum, we discussed supporting virtio on top
> > of ivshmem.
> 
> No, not on top of ivshmem. On top of shared memory. Our model is
> different from the simplistic ivshmem.
> 
> > I have considered it, and came up with an alternative
> > that has several advantages over that - please see below.
> > Comments welcome.
> > 
> > -----
> > 
> > Existing solutions to userspace switching between VMs on the
> > same host are vhost-user and ivshmem.
> > 
> > vhost-user works by mapping memory of all VMs being bridged into the
> > switch memory space.
> > 
> > By comparison, ivshmem works by exposing a shared region of memory to all VMs.
> > VMs are required to use this region to store packets. The switch only
> > needs access to this region.
> > 
> > Another difference between vhost-user and ivshmem surfaces when polling
> > is used. With vhost-user, the switch is required to handle
> > data movement between VMs, if using polling, this means that 1 host CPU
> > needs to be sacrificed for this task.
> > 
> > This is easiest to understand when one of the VMs is
> > used with VF pass-through. This can be schematically shown below:
> > 
> > +-- VM1 --------------+            +---VM2-----------+
> > | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
> > +---------------------+            +-----------------+
> > 
> > 
> > With ivshmem in theory communication can happen directly, with two VMs
> > polling the shared memory region.
> > 
> > 
> > I won't spend time listing advantages of vhost-user over ivshmem.
> > Instead, having identified two advantages of ivshmem over vhost-user,
> > below is a proposal to extend vhost-user to gain the advantages
> > of ivshmem.
> > 
> > 
> > 1: virtio in guest can be extended to allow support
> > for IOMMUs. This provides guest with full flexibility
> > about memory which is readable or write able by each device.
> > By setting up a virtio device for each other VM we need to
> > communicate to, guest gets full control of its security, from
> > mapping all memory (like with current vhost-user) to only
> > mapping buffers used for networking (like ivshmem) to
> > transient mappings for the duration of data transfer only.
> > This also allows use of VFIO within guests, for improved
> > security.
> > 
> > vhost user would need to be extended to send the
> > mappings programmed by guest IOMMU.
> > 
> > 2. qemu can be extended to serve as a vhost-user client:
> > remote VM mappings over the vhost-user protocol, and
> > map them into another VM's memory.
> > This mapping can take, for example, the form of
> > a BAR of a pci device, which I'll call here vhost-pci - 
> > with bus address allowed
> > by VM1's IOMMU mappings being translated into
> > offsets within this BAR within VM2's physical
> > memory space.
> > 
> > Since the translation can be a simple one, VM2
> > can perform it within its vhost-pci device driver.
> > 
> > While this setup would be the most useful with polling,
> > VM1's ioeventfd can also be mapped to
> > another VM2's irqfd, and vice versa, such that VMs
> > can trigger interrupts to each other without need
> > for a helper thread on the host.
> > 
> > 
> > The resulting channel might look something like the following:
> > 
> > +-- VM1 --------------+  +---VM2-----------+
> > | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
> > +---------------------+  +-----------------+
> > 
> > comparing the two diagrams, a vhost-user thread on the host is
> > no longer required, reducing the host CPU utilization when
> > polling is active.  At the same time, VM2 can not access all of VM1's
> > memory - it is limited by the iommu configuration setup by VM1.
> > 
> > 
> > Advantages over ivshmem:
> > 
> > - more flexibility, endpoint VMs do not have to place data at any
> >   specific locations to use the device, in practice this likely
> >   means less data copies.
> > - better standardization/code reuse
> >   virtio changes within guests would be fairly easy to implement
> >   and would also benefit other backends, besides vhost-user
> >   standard hotplug interfaces can be used to add and remove these
> >   channels as VMs are added or removed.
> > - migration support
> >   It's easy to implement since ownership of memory is well defined.
> >   For example, during migration VM2 can notify hypervisor of VM1
> >   by updating dirty bitmap each time is writes into VM1 memory.
> > 
> > Thanks,
> > 
> 
> This sounds like a different interface to a concept very similar to
> Xen's grant table, no?

Yes, in the sense that grant tables are also about memory sharing and
include permissions.
But we are emulating an IOMMU and keeping the PV part
as simple as possible (e.g. an offset within a BAR),
without attaching any policy to it.
Xen is fundamentally a PV interface.

> Well, there might be benefits for some use cases,
> for ours this is too dynamic, in fact. We'd like to avoid remappings
> during runtime controlled by guest activities, which is clearly required
> for this model.

The dynamic part is up to the guest. For example, a userspace PMD within
the guest would create mostly static mappings using VFIO.

> Another shortcoming: If VM1 does not trust (security or safety-wise) VM2
> while preparing a message for it, it has to keep the buffer invisible
> for VM2 until it is completed and signed, hashed etc. That means it has
> to reprogram the IOMMU frequently. With the concept we discussed at KVM
> Forum, there would be shared memory mapped read-only to VM2 while being
> R/W for VM1. That would resolve this issue without the need for costly
> remappings.

IOMMU allows read-only mappings too. It's all up to the guest.

> Leaving all the implementation and interface details aside, this
> discussion is first of all about two fundamentally different approaches:
> static shared memory windows vs. dynamically remapped shared windows (a
> third one would be copying in the hypervisor, but I suppose we all agree
> that the whole exercise is about avoiding that). Which way do we want or
> have to go?
> 
> Jan

Dynamic is a superset of static: you can always make it static if you
wish. Static has the advantage of simplicity, but that's lost once you
realize you need to invent interfaces to make it work.  Since we can use
existing IOMMU interfaces for the dynamic one, what's the disadvantage?


Let me put it another way: any security model you come up with
should also be useful for bare-metal OS isolation from a device.
That's a useful test for checking whether whatever we come
up with makes sense, and it's much better than inventing our own.


> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SES-DE
> Corporate Competence Center Embedded Linux

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
From: Michael S. Tsirkin @ 2015-09-01  8:17 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: virtio-dev, Jan Kiszka, Claudio.Fontana, qemu-devel,
	Linux Virtualization, opnfv-tech-discuss

On Mon, Aug 31, 2015 at 11:35:55AM -0700, Nakajima, Jun wrote:
> On Mon, Aug 31, 2015 at 7:11 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > Hello!
> > During the KVM forum, we discussed supporting virtio on top
> > of ivshmem. I have considered it, and came up with an alternative
> > that has several advantages over that - please see below.
> > Comments welcome.
> 
> Hi Michael,
> 
> I like this, and it should be able to achieve what I presented at KVM
> Forum (vhost-user-shmem).
> Comments below.
> 
> >
> > -----
> >
> > Existing solutions to userspace switching between VMs on the
> > same host are vhost-user and ivshmem.
> >
> > vhost-user works by mapping memory of all VMs being bridged into the
> > switch memory space.
> >
> > By comparison, ivshmem works by exposing a shared region of memory to all VMs.
> > VMs are required to use this region to store packets. The switch only
> > needs access to this region.
> >
> > Another difference between vhost-user and ivshmem surfaces when polling
> > is used. With vhost-user, the switch is required to handle
> > data movement between VMs, if using polling, this means that 1 host CPU
> > needs to be sacrificed for this task.
> >
> > This is easiest to understand when one of the VMs is
> > used with VF pass-through. This can be schematically shown below:
> >
> > +-- VM1 --------------+            +---VM2-----------+
> > | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
> > +---------------------+            +-----------------+
> >
> >
> > With ivshmem in theory communication can happen directly, with two VMs
> > polling the shared memory region.
> >
> >
> > I won't spend time listing advantages of vhost-user over ivshmem.
> > Instead, having identified two advantages of ivshmem over vhost-user,
> > below is a proposal to extend vhost-user to gain the advantages
> > of ivshmem.
> >
> >
> > 1: virtio in guest can be extended to allow support
> > for IOMMUs. This provides guest with full flexibility
> > about memory which is readable or writable by each device.
> 
> I assume that you meant VFIO only for virtio by "use of VFIO".  To get
> VFIO working for general direct-I/O (including VFs) in guests, as you
> know, we need to virtualize IOMMU (e.g. VT-d) and the interrupt
> remapping table on x86 (i.e. nested VT-d).

Not necessarily: if pmd is used, mappings stay mostly static,
and there are no interrupts, so existing IOMMU emulation in qemu
will do the job.
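
To make that concrete, the static mapping a userspace pmd sets up through
VFIO from inside the guest is roughly the following (a sketch: the usual
container/group setup and error handling are omitted, and 'container_fd'
is assumed to be an already configured VFIO container):

#include <linux/vfio.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Sketch: map one pinned packet-buffer pool at a fixed IOVA, once, at
 * startup.  After this the IOMMU mappings never change, so the emulated
 * IOMMU is not on the data path. */
static int map_pool(int container_fd, void *pool, uint64_t iova, uint64_t size)
{
        struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = (uintptr_t)pool,
                .iova  = iova,
                .size  = size,
        };

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}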


> > By setting up a virtio device for each other VM we need to
> > communicate to, guest gets full control of its security, from
> > mapping all memory (like with current vhost-user) to only
> > mapping buffers used for networking (like ivshmem) to
> > transient mappings for the duration of data transfer only.
> 
> And I think that we can use VMFUNC to have such transient mappings.

Interesting. There are two points to make here:


1. To create transient mappings, VMFUNC isn't strictly required.
Instead, mappings can be created when the first access by VM2
within the BAR triggers a page fault.
I guess VMFUNC could remove this first page fault: the hypervisor maps the
host PTE into the alternative view, and VMFUNC then makes the
VM2 PTE valid - this might be important if mappings are very dynamic
and there are many page faults.

2. To invalidate mappings, VMFUNC isn't sufficient, since the
translation caches of other CPUs need to be invalidated as well.
I don't think VMFUNC can do this.




> > This also allows use of VFIO within guests, for improved
> > security.
> >
> > vhost user would need to be extended to send the
> > mappings programmed by guest IOMMU.
> 
> Right. We need to think about cases where other VMs (VM3, etc.) join
> the group or some existing VM leaves.
> PCI hot-plug should work there (as you point out at "Advantages over
> ivshmem" below).
> 
> >
> > 2. qemu can be extended to serve as a vhost-user client:
> > remote VM mappings over the vhost-user protocol, and
> > map them into another VM's memory.
> > This mapping can take, for example, the form of
> > a BAR of a pci device, which I'll call here vhost-pci -
> > with bus address allowed
> > by VM1's IOMMU mappings being translated into
> > offsets within this BAR within VM2's physical
> > memory space.
> 
> I think it's sensible.
> 
> >
> > Since the translation can be a simple one, VM2
> > can perform it within its vhost-pci device driver.
> >
> > While this setup would be the most useful with polling,
> > VM1's ioeventfd can also be mapped to
> > another VM2's irqfd, and vice versa, such that VMs
> > can trigger interrupts to each other without need
> > for a helper thread on the host.
> >
> >
> > The resulting channel might look something like the following:
> >
> > +-- VM1 --------------+  +---VM2-----------+
> > | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
> > +---------------------+  +-----------------+
> >
> > comparing the two diagrams, a vhost-user thread on the host is
> > no longer required, reducing the host CPU utilization when
> > polling is active.  At the same time, VM2 can not access all of VM1's
> > memory - it is limited by the iommu configuration setup by VM1.
> >
> >
> > Advantages over ivshmem:
> >
> > - more flexibility, endpoint VMs do not have to place data at any
> >   specific locations to use the device, in practice this likely
> >   means less data copies.
> > - better standardization/code reuse
> >   virtio changes within guests would be fairly easy to implement
> >   and would also benefit other backends, besides vhost-user
> >   standard hotplug interfaces can be used to add and remove these
> >   channels as VMs are added or removed.
> > - migration support
> >   It's easy to implement since ownership of memory is well defined.
> >   For example, during migration VM2 can notify hypervisor of VM1
> >   by updating dirty bitmap each time it writes into VM1 memory.
> 
> Also, the ivshmem functionality could be implemented by this proposal:
> - vswitch (or some VM) allocates memory regions in its address space, and
> - it sets up that IOMMU mappings on the VMs be translated into the regions

I agree it's possible, but that's not something that exists on real
hardware. It's not clear to me what the security implications are
of having VM2 control VM1's IOMMU. Having each VM control its own IOMMU
seems more straightforward.


> >
> > Thanks,
> >
> > --
> > MST
> > _______________________________________________
> > Virtualization mailing list
> > Virtualization@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/virtualization
> 
> 
> -- 
> Jun
> Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-01  3:03   ` [Qemu-devel] " Varun Sethi
@ 2015-09-01  8:30       ` Michael S. Tsirkin
  0 siblings, 0 replies; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-01  8:30 UTC (permalink / raw)
  To: Varun Sethi
  Cc: virtio-dev, Jan Kiszka, Claudio.Fontana, qemu-devel,
	Linux Virtualization, Nakajima, Jun, opnfv-tech-discuss

On Tue, Sep 01, 2015 at 03:03:12AM +0000, Varun Sethi wrote:
> Hi Michael,
> When you talk about VFIO in guest, is it with a purely emulated IOMMU in Qemu?

This can use the emulated IOMMU in Qemu.
That's probably fast enough if mappings are mostly static.
We can also add a PV-IOMMU if necessary.

> Also, I am not clear on the following points:
> 1. How transient memory would be mapped using BAR in the backend VM

The simplest way is that
each update sends a vhost-user message; the backend gets it,
mmaps it into backend QEMU, and makes it part of a RAM memory slot.

Or - backend QEMU could detect a page fault on access and fetch the
IOMMU mapping from frontend QEMU - using vhost-user messages or
shared memory.
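
As a sketch of the first option (the message itself doesn't exist yet,
and VhostPciDev with its BAR field are invented names), backend QEMU
could turn each mapping update into a RAM subregion of the vhost-pci BAR
using the existing memory API, roughly like this:

#include <sys/mman.h>
/* plus the usual QEMU memory API headers */

/* Sketch: handle a hypothetical "IOMMU mapping added" vhost-user message.
 * 'fd' is the memory fd sent by the frontend, 'fd_offset' the offset of
 * the mapped range within it, 'bar_offset' where the range should appear
 * inside the vhost-pci BAR seen by VM2. */
static void vhost_pci_add_window(VhostPciDev *dev, int fd,
                                 uint64_t fd_offset, uint64_t bar_offset,
                                 uint64_t size)
{
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, fd_offset);
    if (ptr == MAP_FAILED) {
        return;
    }

    MemoryRegion *mr = g_new0(MemoryRegion, 1);

    /* Expose the frontend's pages as RAM inside the BAR; KVM then maps
     * them into VM2 like any other memory slot. */
    memory_region_init_ram_ptr(mr, OBJECT(dev), "vhost-pci-window",
                               size, ptr);
    memory_region_add_subregion(&dev->bar, bar_offset, mr);
}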




> 2. How would the backend VM update the dirty page bitmap for the frontend VM
> 
> Regards
> Varun

The easiest way to implement is probably for backend QEMU to set up dirty tracking
for the relevant slot (upon getting a vhost-user message
from the frontend), then retrieve the dirty map
from kvm and record it in a shared memory region
(when to do it? We could have an eventfd and/or a vhost-user message to
trigger this from the frontend QEMU, or just use a timer).

An alternative is for the backend VM to get access to the dirty log
(e.g. map it within the BAR) and update it directly in shared memory.
That seems like more work.

Marc-André Lureau recently sent patches to support passing the
dirty log around; these would be useful.
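
For reference, the retrieval step in the first approach is the existing
KVM_GET_DIRTY_LOG ioctl. A minimal sketch, assuming the slot was
registered with KVM_MEM_LOG_DIRTY_PAGES and 'bitmap' holds one bit per
page of the slot:

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Sketch: pull the dirty bitmap for one memory slot from KVM so it can
 * be copied into the shared region read by the frontend. */
static int sync_dirty_bitmap(int vm_fd, uint32_t slot, void *bitmap)
{
        struct kvm_dirty_log log = {
                .slot = slot,
                .dirty_bitmap = bitmap, /* one bit per page in the slot */
        };

        /* KVM fills 'bitmap' with the pages dirtied since the last call
         * and clears its internal copy. */
        return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
}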


> > -----Original Message-----
> > From: qemu-devel-bounces+varun.sethi=freescale.com@nongnu.org
> > [mailto:qemu-devel-bounces+varun.sethi=freescale.com@nongnu.org] On
> > Behalf Of Nakajima, Jun
> > Sent: Monday, August 31, 2015 1:36 PM
> > To: Michael S. Tsirkin
> > Cc: virtio-dev@lists.oasis-open.org; Jan Kiszka;
> > Claudio.Fontana@huawei.com; qemu-devel@nongnu.org; Linux
> > Virtualization; opnfv-tech-discuss@lists.opnfv.org
> > Subject: Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm
> > communication
> > 
> > On Mon, Aug 31, 2015 at 7:11 AM, Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > > Hello!
> > > During the KVM forum, we discussed supporting virtio on top of
> > > ivshmem. I have considered it, and came up with an alternative that
> > > has several advantages over that - please see below.
> > > Comments welcome.
> > 
> > Hi Michael,
> > 
> > I like this, and it should be able to achieve what I presented at KVM Forum
> > (vhost-user-shmem).
> > Comments below.
> > 
> > >
> > > -----
> > >
> > > Existing solutions to userspace switching between VMs on the same host
> > > are vhost-user and ivshmem.
> > >
> > > vhost-user works by mapping memory of all VMs being bridged into the
> > > switch memory space.
> > >
> > > By comparison, ivshmem works by exposing a shared region of memory to
> > all VMs.
> > > VMs are required to use this region to store packets. The switch only
> > > needs access to this region.
> > >
> > > Another difference between vhost-user and ivshmem surfaces when
> > > polling is used. With vhost-user, the switch is required to handle
> > > data movement between VMs, if using polling, this means that 1 host
> > > CPU needs to be sacrificed for this task.
> > >
> > > This is easiest to understand when one of the VMs is used with VF
> > > pass-through. This can be schematically shown below:
> > >
> > > +-- VM1 --------------+            +---VM2-----------+
> > > | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
> > > +---------------------+            +-----------------+
> > >
> > >
> > > With ivshmem in theory communication can happen directly, with two VMs
> > > polling the shared memory region.
> > >
> > >
> > > I won't spend time listing advantages of vhost-user over ivshmem.
> > > Instead, having identified two advantages of ivshmem over vhost-user,
> > > below is a proposal to extend vhost-user to gain the advantages of
> > > ivshmem.
> > >
> > >
> > > 1: virtio in guest can be extended to allow support for IOMMUs. This
> > > provides guest with full flexibility about memory which is readable or
> > > writable by each device.
> > 
> > I assume that you meant VFIO only for virtio by "use of VFIO".  To get VFIO
> > working for general direct-I/O (including VFs) in guests, as you know, we
> > need to virtualize IOMMU (e.g. VT-d) and the interrupt remapping table on
> > x86 (i.e. nested VT-d).
> > 
> > > By setting up a virtio device for each other VM we need to communicate
> > > to, guest gets full control of its security, from mapping all memory
> > > (like with current vhost-user) to only mapping buffers used for
> > > networking (like ivshmem) to transient mappings for the duration of
> > > data transfer only.
> > 
> > And I think that we can use VMFUNC to have such transient mappings.
> > 
> > > This also allows use of VFIO within guests, for improved security.
> > >
> > > vhost user would need to be extended to send the mappings programmed
> > > by guest IOMMU.
> > 
> > Right. We need to think about cases where other VMs (VM3, etc.) join the
> > group or some existing VM leaves.
> > PCI hot-plug should work there (as you point out at "Advantages over
> > ivshmem" below).
> > 
> > >
> > > 2. qemu can be extended to serve as a vhost-user client:
> > > remote VM mappings over the vhost-user protocol, and map them into
> > > another VM's memory.
> > > This mapping can take, for example, the form of a BAR of a pci device,
> > > which I'll call here vhost-pci - with bus address allowed by VM1's
> > > IOMMU mappings being translated into offsets within this BAR within
> > > VM2's physical memory space.
> > 
> > I think it's sensible.
> > 
> > >
> > > Since the translation can be a simple one, VM2 can perform it within
> > > its vhost-pci device driver.
> > >
> > > While this setup would be the most useful with polling, VM1's
> > > ioeventfd can also be mapped to another VM2's irqfd, and vice versa,
> > > such that VMs can trigger interrupts to each other without need for a
> > > helper thread on the host.
> > >
> > >
> > > The resulting channel might look something like the following:
> > >
> > > +-- VM1 --------------+  +---VM2-----------+
> > > | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
> > > +---------------------+  +-----------------+
> > >
> > > comparing the two diagrams, a vhost-user thread on the host is no
> > > longer required, reducing the host CPU utilization when polling is
> > > active.  At the same time, VM2 can not access all of VM1's memory - it
> > > is limited by the iommu configuration setup by VM1.
> > >
> > >
> > > Advantages over ivshmem:
> > >
> > > - more flexibility, endpoint VMs do not have to place data at any
> > >   specific locations to use the device, in practice this likely
> > >   means less data copies.
> > > - better standardization/code reuse
> > >   virtio changes within guests would be fairly easy to implement
> > >   and would also benefit other backends, besides vhost-user
> > >   standard hotplug interfaces can be used to add and remove these
> > >   channels as VMs are added or removed.
> > > - migration support
> > >   It's easy to implement since ownership of memory is well defined.
> > >   For example, during migration VM2 can notify hypervisor of VM1
> > >   by updating dirty bitmap each time it writes into VM1 memory.
> > 
> > Also, the ivshmem functionality could be implemented by this proposal:
> > - vswitch (or some VM) allocates memory regions in its address space, and
> > - it sets up that IOMMU mappings on the VMs be translated into the regions
> > 
> > >
> > > Thanks,
> > >
> > > --
> > > MST
> > > _______________________________________________
> > > Virtualization mailing list
> > > Virtualization@lists.linux-foundation.org
> > > https://lists.linuxfoundation.org/mailman/listinfo/virtualization
> > 
> > 
> > --
> > Jun
> > Intel Open Source Technology Center
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-01  8:01   ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-09-01  9:11       ` Jan Kiszka
  0 siblings, 0 replies; 80+ messages in thread
From: Jan Kiszka @ 2015-09-01  9:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Claudio.Fontana, qemu-devel, virtualization,
	Nakajima, Jun, Varun Sethi, opnfv-tech-discuss

On 2015-09-01 10:01, Michael S. Tsirkin wrote:
> On Tue, Sep 01, 2015 at 09:35:21AM +0200, Jan Kiszka wrote:
>> Leaving all the implementation and interface details aside, this
>> discussion is first of all about two fundamentally different approaches:
>> static shared memory windows vs. dynamically remapped shared windows (a
>> third one would be copying in the hypervisor, but I suppose we all agree
>> that the whole exercise is about avoiding that). Which way do we want or
>> have to go?
>>
>> Jan
> 
> Dynamic is a superset of static: you can always make it static if you
> wish. Static has the advantage of simplicity, but that's lost once you
> realize you need to invent interfaces to make it work.  Since we can use
> existing IOMMU interfaces for the dynamic one, what's the disadvantage?

Complexity. Having to emulate even more of an IOMMU in the hypervisor
(we already have to do a bit for VT-d IR in Jailhouse) and doing this
per platform (AMD IOMMU, ARM SMMU, ...) is out of scope for us. In that
sense, generic grant tables would be more appealing. But what we would
actually need is an interface that is only *optionally* configured by a
guest for dynamic scenarios, otherwise preconfigured by the hypervisor
for static setups. And we need guests that support both. That's the
challenge.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-01  9:11       ` Jan Kiszka
  (?)
@ 2015-09-01  9:24       ` Michael S. Tsirkin
  2015-09-01 14:09           ` Jan Kiszka
  -1 siblings, 1 reply; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-01  9:24 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: virtio-dev, Claudio.Fontana, qemu-devel, virtualization,
	Nakajima, Jun, Varun Sethi, opnfv-tech-discuss

On Tue, Sep 01, 2015 at 11:11:52AM +0200, Jan Kiszka wrote:
> On 2015-09-01 10:01, Michael S. Tsirkin wrote:
> > On Tue, Sep 01, 2015 at 09:35:21AM +0200, Jan Kiszka wrote:
> >> Leaving all the implementation and interface details aside, this
> >> discussion is first of all about two fundamentally different approaches:
> >> static shared memory windows vs. dynamically remapped shared windows (a
> >> third one would be copying in the hypervisor, but I suppose we all agree
> >> that the whole exercise is about avoiding that). Which way do we want or
> >> have to go?
> >>
> >> Jan
> > 
> > Dynamic is a superset of static: you can always make it static if you
> > wish. Static has the advantage of simplicity, but that's lost once you
> > realize you need to invent interfaces to make it work.  Since we can use
> > existing IOMMU interfaces for the dynamic one, what's the disadvantage?
> 
> Complexity. Having to emulate even more of an IOMMU in the hypervisor
> (we already have to do a bit for VT-d IR in Jailhouse) and doing this
> per platform (AMD IOMMU, ARM SMMU, ...) is out of scope for us. In that
> sense, generic grant tables would be more appealing.

That's not how we do things for KVM, PV features need to be
modular and interchangeable with emulation.

If you just want something that's cross-platform and easy to
implement, just build a PV IOMMU. Maybe use virtio for this.
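
Purely as an illustration of what a virtio-based PV IOMMU request could
look like (this layout is invented here, not an existing spec), a map
request placed on the control queue might be as small as:

#include <linux/types.h>

/* Hypothetical request for a PV IOMMU built on virtio; the device would
 * answer with a status byte.  Field layout is made up for illustration. */
struct pv_iommu_map_req {
        __le64 iova;            /* bus address the guest exposes to the peer */
        __le64 gpa;             /* guest-physical address backing it */
        __le64 size;
        __le32 perm;            /* bit 0: peer may read, bit 1: peer may write */
        __le32 reserved;
};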

> But what we would
> actually need is an interface that is only *optionally* configured by a
> guest for dynamic scenarios, otherwise preconfigured by the hypervisor
> for static setups. And we need guests that support both. That's the
> challenge.
> 
> Jan

That's already there for IOMMUs: vfio does the static setup by default,
enabling iommu by guests is optional.

> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SES-DE
> Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-01  9:24       ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-09-01 14:09           ` Jan Kiszka
  0 siblings, 0 replies; 80+ messages in thread
From: Jan Kiszka @ 2015-09-01 14:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Claudio.Fontana, qemu-devel, virtualization,
	Nakajima, Jun, Varun Sethi, opnfv-tech-discuss

On 2015-09-01 11:24, Michael S. Tsirkin wrote:
> On Tue, Sep 01, 2015 at 11:11:52AM +0200, Jan Kiszka wrote:
>> On 2015-09-01 10:01, Michael S. Tsirkin wrote:
>>> On Tue, Sep 01, 2015 at 09:35:21AM +0200, Jan Kiszka wrote:
>>>> Leaving all the implementation and interface details aside, this
>>>> discussion is first of all about two fundamentally different approaches:
>>>> static shared memory windows vs. dynamically remapped shared windows (a
>>>> third one would be copying in the hypervisor, but I suppose we all agree
>>>> that the whole exercise is about avoiding that). Which way do we want or
>>>> have to go?
>>>>
>>>> Jan
>>>
>>> Dynamic is a superset of static: you can always make it static if you
>>> wish. Static has the advantage of simplicity, but that's lost once you
>>> realize you need to invent interfaces to make it work.  Since we can use
>>> existing IOMMU interfaces for the dynamic one, what's the disadvantage?
>>
>> Complexity. Having to emulate even more of an IOMMU in the hypervisor
>> (we already have to do a bit for VT-d IR in Jailhouse) and doing this
>> per platform (AMD IOMMU, ARM SMMU, ...) is out of scope for us. In that
>> sense, generic grant tables would be more appealing.
> 
> That's not how we do things for KVM, PV features need to be
> modular and interchangeable with emulation.

I know, and we may have to make some compromise for Jailhouse if that
brings us valuable standardization and broad guest support. But we will
surely not support an arbitrary number of IOMMU models for that reason.

> 
> If you just want something that's cross-platform and easy to
> implement, just build a PV IOMMU. Maybe use virtio for this.

That is likely required to keep the complexity manageable and to allow
static preconfiguration.

Well, we could declare our virtio-shmem device to be an IOMMU device
that controls access of a remote VM to RAM of the one that owns the
device. In the static case, this access may at most be enabled/disabled
but not moved around. The static regions would have to be discoverable
for the VM (register read-back), and the guest's firmware will likely
have to declare those ranges reserved to the guest OS.

In the dynamic case, the guest would be able to create an alternative
mapping. We would probably have to define a generic page table structure
for that. Or do you rather have some MPU-like control structure in mind,
more similar to the memory region descriptions vhost is already using?
Also not yet clear to me is how the vhost-pci device and the
translations it will have to do should look for VM2.
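
If we went the MPU-like route, the control structure could stay close to
the memory region table vhost already uses (struct vhost_memory_region),
just with an explicit permission field. A made-up sketch of one such
window descriptor, only to illustrate the idea:

#include <linux/types.h>

/* Hypothetical descriptor for one shared window, read back from device
 * registers in the static case and rewritten by the guest in the dynamic
 * case.  Invented for illustration, not a proposed ABI. */
struct shmem_window_desc {
        __u64 guest_phys_addr;  /* where the window sits in this guest */
        __u64 size;
        __u64 peer_offset;      /* offset of the window in the peer's BAR */
        __u64 perm;             /* bit 0: peer may read, bit 1: peer may write */
};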

> 
>> But what we would
>> actually need is an interface that is only *optionally* configured by a
>> guest for dynamic scenarios, otherwise preconfigured by the hypervisor
>> for static setups. And we need guests that support both. That's the
>> challenge.
>>
>> Jan
> 
> That's already there for IOMMUs: vfio does the static setup by default,
> enabling iommu by guests is optional.

Cannot follow yet how vfio comes into play regarding some preconfigured
virtual IOMMU.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-01 14:09           ` Jan Kiszka
  (?)
  (?)
@ 2015-09-01 14:34           ` Michael S. Tsirkin
  2015-09-01 15:34               ` Jan Kiszka
  -1 siblings, 1 reply; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-01 14:34 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: virtio-dev, Claudio.Fontana, qemu-devel, virtualization,
	Nakajima, Jun, Varun Sethi, opnfv-tech-discuss

On Tue, Sep 01, 2015 at 04:09:44PM +0200, Jan Kiszka wrote:
> On 2015-09-01 11:24, Michael S. Tsirkin wrote:
> > On Tue, Sep 01, 2015 at 11:11:52AM +0200, Jan Kiszka wrote:
> >> On 2015-09-01 10:01, Michael S. Tsirkin wrote:
> >>> On Tue, Sep 01, 2015 at 09:35:21AM +0200, Jan Kiszka wrote:
> >>>> Leaving all the implementation and interface details aside, this
> >>>> discussion is first of all about two fundamentally different approaches:
> >>>> static shared memory windows vs. dynamically remapped shared windows (a
> >>>> third one would be copying in the hypervisor, but I suppose we all agree
> >>>> that the whole exercise is about avoiding that). Which way do we want or
> >>>> have to go?
> >>>>
> >>>> Jan
> >>>
> >>> Dynamic is a superset of static: you can always make it static if you
> >>> wish. Static has the advantage of simplicity, but that's lost once you
> >>> realize you need to invent interfaces to make it work.  Since we can use
> >>> existing IOMMU interfaces for the dynamic one, what's the disadvantage?
> >>
> >> Complexity. Having to emulate even more of an IOMMU in the hypervisor
> >> (we already have to do a bit for VT-d IR in Jailhouse) and doing this
> >> per platform (AMD IOMMU, ARM SMMU, ...) is out of scope for us. In that
> >> sense, generic grant tables would be more appealing.
> > 
> > That's not how we do things for KVM, PV features need to be
> > modular and interchangeable with emulation.
> 
> I know, and we may have to make some compromise for Jailhouse if that
> brings us valuable standardization and broad guest support. But we will
> surely not support an arbitrary amount of IOMMU models for that reason.
> 
> > 
> > If you just want something that's cross-platform and easy to
> > implement, just build a PV IOMMU. Maybe use virtio for this.
> 
> That is likely required to keep the complexity manageable and to allow
> static preconfiguration.

Real IOMMUs allow static configuration just fine. This is exactly
what VFIO uses.

> Well, we could declare our virtio-shmem device to be an IOMMU device
> that controls access of a remote VM to RAM of the one that owns the
> device. In the static case, this access may at most be enabled/disabled
> but not moved around. The static regions would have to be discoverable
> for the VM (register read-back), and the guest's firmware will likely
> have to declare those ranges reserved to the guest OS.
> In the dynamic case, the guest would be able to create an alternative
> mapping.


I don't think we want a special device just to support the
static case. It might be a bit less code to write, but
eventually it should be up to the guest.
Fundamentally, it's policy that the host has no business
dictating.

> We would probably have to define a generic page table structure
> for that. Or do you rather have some MPU-like control structure in mind,
> more similar to the memory region descriptions vhost is already using?

I don't care much. Page tables use less memory if a lot of memory needs
to be covered. OTOH if you want to use virtio (e.g. to allow command
batching) that likely means commands to manipulate the IOMMU, and
maintaining it all on the host. You decide.


> Also not yet clear to me are how the vhost-pci device and the
> translations it will have to do should look like for VM2.

I think we can use vhost-pci BAR + VM1 bus address as the
VM2 physical address. In other words, all memory exposed to
virtio-pci by VM1 through its IOMMU is mapped into the BAR of
vhost-pci.

Bus addresses can be validated to make sure they fit
in the BAR.
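
A sketch of the resulting translation in VM2's vhost-pci driver
(names are invented; 'bar_base' is the guest-physical base of the
vhost-pci BAR as seen by VM2):

#include <errno.h>
#include <stdint.h>

/* Sketch: turn a VM1 bus address found in a descriptor into a VM2
 * physical address inside the vhost-pci BAR, rejecting anything that
 * does not fit in the window.  Whether the page is actually mapped
 * (and writable) is still decided by VM1's IOMMU programming. */
static int vhost_pci_translate(uint64_t bar_base, uint64_t bar_size,
                               uint64_t bus_addr, uint64_t len,
                               uint64_t *vm2_phys)
{
        if (bus_addr >= bar_size || len > bar_size - bus_addr)
                return -ERANGE;

        *vm2_phys = bar_base + bus_addr;
        return 0;
}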


One issue to consider is that VM1 can trick VM2 into writing
into a bus address that isn't mapped in the IOMMU, or
is mapped read-only.
We would probably have to teach KVM to handle this somehow,
e.g. exit to QEMU, or even just ignore the access. Maybe notify the
guest, e.g. by setting a bit in the config space of the device,
to avoid an easy DoS.



> > 
> >> But what we would
> >> actually need is an interface that is only *optionally* configured by a
> >> guest for dynamic scenarios, otherwise preconfigured by the hypervisor
> >> for static setups. And we need guests that support both. That's the
> >> challenge.
> >>
> >> Jan
> > 
> > That's already there for IOMMUs: vfio does the static setup by default,
> > enabling iommu by guests is optional.
> 
> Cannot follow yet how vfio comes into play regarding some preconfigured
> virtual IOMMU.
> 
> Jan
> 
> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SES-DE
> Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-01 14:34           ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-09-01 15:34               ` Jan Kiszka
  0 siblings, 0 replies; 80+ messages in thread
From: Jan Kiszka @ 2015-09-01 15:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Claudio.Fontana, qemu-devel, virtualization,
	Nakajima, Jun, Varun Sethi, opnfv-tech-discuss

On 2015-09-01 16:34, Michael S. Tsirkin wrote:
> On Tue, Sep 01, 2015 at 04:09:44PM +0200, Jan Kiszka wrote:
>> On 2015-09-01 11:24, Michael S. Tsirkin wrote:
>>> On Tue, Sep 01, 2015 at 11:11:52AM +0200, Jan Kiszka wrote:
>>>> On 2015-09-01 10:01, Michael S. Tsirkin wrote:
>>>>> On Tue, Sep 01, 2015 at 09:35:21AM +0200, Jan Kiszka wrote:
>>>>>> Leaving all the implementation and interface details aside, this
>>>>>> discussion is first of all about two fundamentally different approaches:
>>>>>> static shared memory windows vs. dynamically remapped shared windows (a
>>>>>> third one would be copying in the hypervisor, but I suppose we all agree
>>>>>> that the whole exercise is about avoiding that). Which way do we want or
>>>>>> have to go?
>>>>>>
>>>>>> Jan
>>>>>
>>>>> Dynamic is a superset of static: you can always make it static if you
>>>>> wish. Static has the advantage of simplicity, but that's lost once you
>>>>> realize you need to invent interfaces to make it work.  Since we can use
>>>>> existing IOMMU interfaces for the dynamic one, what's the disadvantage?
>>>>
>>>> Complexity. Having to emulate even more of an IOMMU in the hypervisor
>>>> (we already have to do a bit for VT-d IR in Jailhouse) and doing this
>>>> per platform (AMD IOMMU, ARM SMMU, ...) is out of scope for us. In that
>>>> sense, generic grant tables would be more appealing.
>>>
>>> That's not how we do things for KVM, PV features need to be
>>> modular and interchangeable with emulation.
>>
>> I know, and we may have to make some compromise for Jailhouse if that
>> brings us valuable standardization and broad guest support. But we will
>> surely not support an arbitrary amount of IOMMU models for that reason.
>>
>>>
>>> If you just want something that's cross-platform and easy to
>>> implement, just build a PV IOMMU. Maybe use virtio for this.
>>
>> That is likely required to keep the complexity manageable and to allow
>> static preconfiguration.
> 
> Real IOMMU allow static configuration just fine. This is exactly
> what VFIO uses.

Please specify more precisely which feature in which IOMMU you are
referring to. Also, given that you refer to VFIO, I suspect we have
different things in mind. I'm talking about an IOMMU device model, like
the one we have in QEMU now for VT-d. That one is not at all
preconfigured by the host for VFIO.

> 
>> Well, we could declare our virtio-shmem device to be an IOMMU device
>> that controls access of a remote VM to RAM of the one that owns the
>> device. In the static case, this access may at most be enabled/disabled
>> but not moved around. The static regions would have to be discoverable
>> for the VM (register read-back), and the guest's firmware will likely
>> have to declare those ranges reserved to the guest OS.
>> In the dynamic case, the guest would be able to create an alternative
>> mapping.
> 
> 
> I don't think we want a special device just to support the
> static case. It might be a bit less code to write, but
> eventually it should be up to the guest.
> Fundamentally, it's policy that host has no business
> dictating.

"A bit less" is to be validated, and I doubt its just "a bit". But if
KVM and its guests will also support some PV-IOMMU that we can reuse for
our scenarios, than that is fine. KVM would not have to mandate support
for it while we would, that's all.

> 
>> We would probably have to define a generic page table structure
>> for that. Or do you rather have some MPU-like control structure in mind,
>> more similar to the memory region descriptions vhost is already using?
> 
> I don't care much. Page tables use less memory if a lot of memory needs
> to be covered. OTOH if you want to use virtio (e.g. to allow command
> batching) that likely means commands to manipulate the IOMMU, and
> maintaining it all on the host. You decide.

I don't care very much about the dynamic case as we won't support it
anyway. However, if the configuration concept used for it is applicable
to static mode as well, then we could reuse it. But preconfiguration
will require a register-based region description, I suspect.
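
As an illustrative sketch only (the register offsets and helpers below are
invented, not a proposed layout), such a register read-back of the static
regions could look like:

#include <stdint.h>

#define REG_NUM_REGIONS  0x00   /* invented offsets */
#define REG_REGION_SEL   0x08
#define REG_REGION_BASE  0x10
#define REG_REGION_SIZE  0x18

uint64_t mmio_read64(void *mmio, unsigned int off);              /* assumed */
void     mmio_write64(void *mmio, unsigned int off, uint64_t v); /* assumed */

static void discover_static_regions(void *mmio)
{
        uint64_t n = mmio_read64(mmio, REG_NUM_REGIONS);

        for (uint64_t i = 0; i < n; i++) {
                mmio_write64(mmio, REG_REGION_SEL, i);
                uint64_t base = mmio_read64(mmio, REG_REGION_BASE);
                uint64_t size = mmio_read64(mmio, REG_REGION_SIZE);
                /* the firmware/OS would mark [base, base + size) reserved */
                (void)base;
                (void)size;
        }
}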

> 
>> Also not yet clear to me are how the vhost-pci device and the
>> translations it will have to do should look like for VM2.
> 
> I think we can use vhost-pci BAR + VM1 bus address as the
> VM2 physical address. In other words, all memory exposed to
> virtio-pci by VM1 through it's IOMMU is mapped into BAR of
> vhost-pci.
> 
> Bus addresses can be validated to make sure they fit
> in the BAR.

Sounds simple but may become challenging for VMs that have many such
devices (in order to connect to many, possibly large, VMs).

> 
> 
> One issue to consider is that VM1 can trick VM2 into writing
> into bus address that isn't mapped in the IOMMU, or
> is mapped read-only.
> We probably would have to teach KVM to handle this somehow,
> e.g. exit to QEMU, or even just ignore. Maybe notify guest
> e.g. by setting a bit in the config space of the device,
> to avoid easy DOS.

Well, that would be trivial for VM1 to check if there are only one or
two memory windows. Relying on the hypervisor to handle it may be
unacceptable for real-time VMs.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-01 15:34               ` Jan Kiszka
@ 2015-09-01 16:02                 ` Michael S. Tsirkin
  -1 siblings, 0 replies; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-01 16:02 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: virtio-dev, Claudio.Fontana, qemu-devel, virtualization,
	Nakajima, Jun, Varun Sethi, opnfv-tech-discuss

On Tue, Sep 01, 2015 at 05:34:37PM +0200, Jan Kiszka wrote:
> On 2015-09-01 16:34, Michael S. Tsirkin wrote:
> > On Tue, Sep 01, 2015 at 04:09:44PM +0200, Jan Kiszka wrote:
> >> On 2015-09-01 11:24, Michael S. Tsirkin wrote:
> >>> On Tue, Sep 01, 2015 at 11:11:52AM +0200, Jan Kiszka wrote:
> >>>> On 2015-09-01 10:01, Michael S. Tsirkin wrote:
> >>>>> On Tue, Sep 01, 2015 at 09:35:21AM +0200, Jan Kiszka wrote:
> >>>>>> Leaving all the implementation and interface details aside, this
> >>>>>> discussion is first of all about two fundamentally different approaches:
> >>>>>> static shared memory windows vs. dynamically remapped shared windows (a
> >>>>>> third one would be copying in the hypervisor, but I suppose we all agree
> >>>>>> that the whole exercise is about avoiding that). Which way do we want or
> >>>>>> have to go?
> >>>>>>
> >>>>>> Jan
> >>>>>
> >>>>> Dynamic is a superset of static: you can always make it static if you
> >>>>> wish. Static has the advantage of simplicity, but that's lost once you
> >>>>> realize you need to invent interfaces to make it work.  Since we can use
> >>>>> existing IOMMU interfaces for the dynamic one, what's the disadvantage?
> >>>>
> >>>> Complexity. Having to emulate even more of an IOMMU in the hypervisor
> >>>> (we already have to do a bit for VT-d IR in Jailhouse) and doing this
> >>>> per platform (AMD IOMMU, ARM SMMU, ...) is out of scope for us. In that
> >>>> sense, generic grant tables would be more appealing.
> >>>
> >>> That's not how we do things for KVM, PV features need to be
> >>> modular and interchangeable with emulation.
> >>
> >> I know, and we may have to make some compromise for Jailhouse if that
> >> brings us valuable standardization and broad guest support. But we will
> >> surely not support an arbitrary amount of IOMMU models for that reason.
> >>
> >>>
> >>> If you just want something that's cross-platform and easy to
> >>> implement, just build a PV IOMMU. Maybe use virtio for this.
> >>
> >> That is likely required to keep the complexity manageable and to allow
> >> static preconfiguration.
> > 
> > Real IOMMU allow static configuration just fine. This is exactly
> > what VFIO uses.
> 
> Please specify more precisely which feature in which IOMMU you are
> referring to. Also, given that you refer to VFIO, I suspect we have
> different thing in mind. I'm talking about an IOMMU device model, like
> the one we have in QEMU now for VT-d. That one is not at all
> preconfigured by the host for VFIO.

I really just mean that VFIO creates a mostly static IOMMU configuration.

It's configured by the guest, not the host.

I don't see host control over configuration as being particularly important.


> > 
> >> Well, we could declare our virtio-shmem device to be an IOMMU device
> >> that controls access of a remote VM to RAM of the one that owns the
> >> device. In the static case, this access may at most be enabled/disabled
> >> but not moved around. The static regions would have to be discoverable
> >> for the VM (register read-back), and the guest's firmware will likely
> >> have to declare those ranges reserved to the guest OS.
> >> In the dynamic case, the guest would be able to create an alternative
> >> mapping.
> > 
> > 
> > I don't think we want a special device just to support the
> > static case. It might be a bit less code to write, but
> > eventually it should be up to the guest.
> > Fundamentally, it's policy that host has no business
> > dictating.
> 
> "A bit less" is to be validated, and I doubt its just "a bit". But if
> KVM and its guests will also support some PV-IOMMU that we can reuse for
> our scenarios, than that is fine. KVM would not have to mandate support
> for it while we would, that's all.

Someone will have to do this work.

> > 
> >> We would probably have to define a generic page table structure
> >> for that. Or do you rather have some MPU-like control structure in mind,
> >> more similar to the memory region descriptions vhost is already using?
> > 
> > I don't care much. Page tables use less memory if a lot of memory needs
> > to be covered. OTOH if you want to use virtio (e.g. to allow command
> > batching) that likely means commands to manipulate the IOMMU, and
> > maintaining it all on the host. You decide.
> 
> I don't care very much about the dynamic case as we won't support it
> anyway. However, if the configuration concept used for it is applicable
> to static mode as well, then we could reuse it. But preconfiguration
> will required register-based region description, I suspect.

I don't know what you mean by preconfiguration exactly.

Do you want the host to configure the IOMMU? Why not let the
guest do this?


> > 
> >> Also not yet clear to me are how the vhost-pci device and the
> >> translations it will have to do should look like for VM2.
> > 
> > I think we can use vhost-pci BAR + VM1 bus address as the
> > VM2 physical address. In other words, all memory exposed to
> > virtio-pci by VM1 through it's IOMMU is mapped into BAR of
> > vhost-pci.
> > 
> > Bus addresses can be validated to make sure they fit
> > in the BAR.
> 
> Sounds simple but may become challenging for VMs that have many of such
> devices (in order to connect to many possibly large VMs).

You don't need to be able to map all guest memory if you know the
guest won't try to allow device access to all of it.
It's a question of how good the bus address allocator is.

> > 
> > 
> > One issue to consider is that VM1 can trick VM2 into writing
> > into bus address that isn't mapped in the IOMMU, or
> > is mapped read-only.
> > We probably would have to teach KVM to handle this somehow,
> > e.g. exit to QEMU, or even just ignore. Maybe notify guest
> > e.g. by setting a bit in the config space of the device,
> > to avoid easy DOS.
> 
> Well, that would be trivial for VM1 to check if there are only one or
> two memory windows. Relying on the hypervisor to handle it may be
> unacceptable for real-time VMs.
> 
> Jan

Why? real-time != fast. I doubt you can avoid vm exits completely.

> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SES-DE
> Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-01 16:02                 ` Michael S. Tsirkin
@ 2015-09-01 16:28                   ` Jan Kiszka
  -1 siblings, 0 replies; 80+ messages in thread
From: Jan Kiszka @ 2015-09-01 16:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Claudio.Fontana, qemu-devel, virtualization,
	Nakajima, Jun, Varun Sethi, opnfv-tech-discuss

On 2015-09-01 18:02, Michael S. Tsirkin wrote:
> On Tue, Sep 01, 2015 at 05:34:37PM +0200, Jan Kiszka wrote:
>> On 2015-09-01 16:34, Michael S. Tsirkin wrote:
>>> On Tue, Sep 01, 2015 at 04:09:44PM +0200, Jan Kiszka wrote:
>>>> On 2015-09-01 11:24, Michael S. Tsirkin wrote:
>>>>> On Tue, Sep 01, 2015 at 11:11:52AM +0200, Jan Kiszka wrote:
>>>>>> On 2015-09-01 10:01, Michael S. Tsirkin wrote:
>>>>>>> On Tue, Sep 01, 2015 at 09:35:21AM +0200, Jan Kiszka wrote:
>>>>>>>> Leaving all the implementation and interface details aside, this
>>>>>>>> discussion is first of all about two fundamentally different approaches:
>>>>>>>> static shared memory windows vs. dynamically remapped shared windows (a
>>>>>>>> third one would be copying in the hypervisor, but I suppose we all agree
>>>>>>>> that the whole exercise is about avoiding that). Which way do we want or
>>>>>>>> have to go?
>>>>>>>>
>>>>>>>> Jan
>>>>>>>
>>>>>>> Dynamic is a superset of static: you can always make it static if you
>>>>>>> wish. Static has the advantage of simplicity, but that's lost once you
>>>>>>> realize you need to invent interfaces to make it work.  Since we can use
>>>>>>> existing IOMMU interfaces for the dynamic one, what's the disadvantage?
>>>>>>
>>>>>> Complexity. Having to emulate even more of an IOMMU in the hypervisor
>>>>>> (we already have to do a bit for VT-d IR in Jailhouse) and doing this
>>>>>> per platform (AMD IOMMU, ARM SMMU, ...) is out of scope for us. In that
>>>>>> sense, generic grant tables would be more appealing.
>>>>>
>>>>> That's not how we do things for KVM, PV features need to be
>>>>> modular and interchangeable with emulation.
>>>>
>>>> I know, and we may have to make some compromise for Jailhouse if that
>>>> brings us valuable standardization and broad guest support. But we will
>>>> surely not support an arbitrary amount of IOMMU models for that reason.
>>>>
>>>>>
>>>>> If you just want something that's cross-platform and easy to
>>>>> implement, just build a PV IOMMU. Maybe use virtio for this.
>>>>
>>>> That is likely required to keep the complexity manageable and to allow
>>>> static preconfiguration.
>>>
>>> Real IOMMU allow static configuration just fine. This is exactly
>>> what VFIO uses.
>>
>> Please specify more precisely which feature in which IOMMU you are
>> referring to. Also, given that you refer to VFIO, I suspect we have
>> different thing in mind. I'm talking about an IOMMU device model, like
>> the one we have in QEMU now for VT-d. That one is not at all
>> preconfigured by the host for VFIO.
> 
> I really just mean that VFIO creates a mostly static IOMMU configuration.
> 
> It's configured by the guest, not the host.

OK, that resolves my confusion.

> 
> I don't see host control over configuration as being particularly important.

We do, see below.

> 
> 
>>>
>>>> Well, we could declare our virtio-shmem device to be an IOMMU device
>>>> that controls access of a remote VM to RAM of the one that owns the
>>>> device. In the static case, this access may at most be enabled/disabled
>>>> but not moved around. The static regions would have to be discoverable
>>>> for the VM (register read-back), and the guest's firmware will likely
>>>> have to declare those ranges reserved to the guest OS.
>>>> In the dynamic case, the guest would be able to create an alternative
>>>> mapping.
>>>
>>>
>>> I don't think we want a special device just to support the
>>> static case. It might be a bit less code to write, but
>>> eventually it should be up to the guest.
>>> Fundamentally, it's policy that host has no business
>>> dictating.
>>
>> "A bit less" is to be validated, and I doubt its just "a bit". But if
>> KVM and its guests will also support some PV-IOMMU that we can reuse for
>> our scenarios, than that is fine. KVM would not have to mandate support
>> for it while we would, that's all.
> 
> Someone will have to do this work.
> 
>>>
>>>> We would probably have to define a generic page table structure
>>>> for that. Or do you rather have some MPU-like control structure in mind,
>>>> more similar to the memory region descriptions vhost is already using?
>>>
>>> I don't care much. Page tables use less memory if a lot of memory needs
>>> to be covered. OTOH if you want to use virtio (e.g. to allow command
>>> batching) that likely means commands to manipulate the IOMMU, and
>>> maintaining it all on the host. You decide.
>>
>> I don't care very much about the dynamic case as we won't support it
>> anyway. However, if the configuration concept used for it is applicable
>> to static mode as well, then we could reuse it. But preconfiguration
>> will required register-based region description, I suspect.
> 
> I don't know what you mean by preconfiguration exactly.
> 
> Do you want the host to configure the IOMMU? Why not let the
> guest do this?

We simply freeze GPA-to-HPA mappings during runtime. Avoids having to
validate and synchronize guest-triggered changes.

>>>
>>>> Also not yet clear to me are how the vhost-pci device and the
>>>> translations it will have to do should look like for VM2.
>>>
>>> I think we can use vhost-pci BAR + VM1 bus address as the
>>> VM2 physical address. In other words, all memory exposed to
>>> virtio-pci by VM1 through it's IOMMU is mapped into BAR of
>>> vhost-pci.
>>>
>>> Bus addresses can be validated to make sure they fit
>>> in the BAR.
>>
>> Sounds simple but may become challenging for VMs that have many of such
>> devices (in order to connect to many possibly large VMs).
> 
> You don't need to be able to map all guest memory if you know
> guest won't try to allow device access to all of it.
> It's a question of how good is the bus address allocator.

But those BARs need to allocate a guest-physical address range as large
as the other guest's RAM, possibly even larger if that RAM is not
contiguous, and you can't put other resources into potential holes
because VM2 does not know where those holes will be.

> 
>>>
>>>
>>> One issue to consider is that VM1 can trick VM2 into writing
>>> into bus address that isn't mapped in the IOMMU, or
>>> is mapped read-only.
>>> We probably would have to teach KVM to handle this somehow,
>>> e.g. exit to QEMU, or even just ignore. Maybe notify guest
>>> e.g. by setting a bit in the config space of the device,
>>> to avoid easy DOS.
>>
>> Well, that would be trivial for VM1 to check if there are only one or
>> two memory windows. Relying on the hypervisor to handle it may be
>> unacceptable for real-time VMs.
>>
>> Jan
> 
> Why? real-time != fast. I doubt you can avoid vm exits completely.

We can; that's one property of Jailhouse (on x86; ARM is waiting for GICv4).

Real-time == deterministic. And if you have such vm exits potentially in
your code path, you have them always - for worst-case analysis. One may
argue about probability in certain scenarios, but if the triggering side
is malicious, probability may become 1.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-01  8:17   ` Michael S. Tsirkin
@ 2015-09-01 22:56       ` Nakajima, Jun
  0 siblings, 0 replies; 80+ messages in thread
From: Nakajima, Jun @ 2015-09-01 22:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Jan Kiszka, Claudio.Fontana, qemu-devel,
	Linux Virtualization, opnfv-tech-discuss

My previous email was bounced by virtio-dev@lists.oasis-open.org.
I tried to subscribe to it, but to no avail...

On Tue, Sep 1, 2015 at 1:17 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Aug 31, 2015 at 11:35:55AM -0700, Nakajima, Jun wrote:
>> On Mon, Aug 31, 2015 at 7:11 AM, Michael S. Tsirkin <mst@redhat.com> wrote:

>> > 1: virtio in guest can be extended to allow support
>> > for IOMMUs. This provides guest with full flexibility
>> > about memory which is readable or write able by each device.
>>
>> I assume that you meant VFIO only for virtio by "use of VFIO".  To get
>> VFIO working for general direct-I/O (including VFs) in guests, as you
>> know, we need to virtualize IOMMU (e.g. VT-d) and the interrupt
>> remapping table on x86 (i.e. nested VT-d).
>
> Not necessarily: if pmd is used, mappings stay mostly static,
> and there are no interrupts, so existing IOMMU emulation in qemu
> will do the job.

OK. It would work, although we would need to engage additional/complex
code in the guests when we are just doing memory operations under the
hood.

>> > By setting up a virtio device for each other VM we need to
>> > communicate to, guest gets full control of its security, from
>> > mapping all memory (like with current vhost-user) to only
>> > mapping buffers used for networking (like ivshmem) to
>> > transient mappings for the duration of data transfer only.
>>
>> And I think that we can use VMFUNC to have such transient mappings.
>
> Interesting. There are two points to make here:
>
>
> 1. To create transient mappings, VMFUNC isn't strictly required.
> Instead, mappings can be created when first access by VM2
> within BAR triggers a page fault.
> I guess VMFUNC could remove this first pagefault by hypervisor mapping
> host PTE into the alternative view, then VMFUNC making
> VM2 PTE valid - might be important if mappings are very dynamic
> so there are many pagefaults.

I agree that VMFUNC isn't strictly required. It would provide a
performance optimization.
And I think it can add some level of protection as well, because you
might want to keep guest physical memory (part or all of VM1's memory)
mapped at the BAR of VM2 all the time. The IOMMU on VM1 can limit the
address ranges accessed by VM2, but such a restriction becomes loose
as you want the mappings static and thus large enough.
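
For reference, a rough sketch of how a guest could invoke the EPTP-switching
VM function, assuming the hypervisor has populated the EPTP list and enabled
VM functions for that guest (illustrative only):

#include <stdint.h>

/* VM function 0 is EPTP switching: EAX selects the function, ECX the
 * index into the EPTP list the hypervisor installed beforehand. */
static inline void eptp_switch(uint32_t eptp_index)
{
        asm volatile("vmfunc"
                     : /* no outputs */
                     : "a" (0), "c" (eptp_index)
                     : "memory");
}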

>
> 2. To invalidate mappings, VMFUNC isn't sufficient since
> translation cache of other CPUs needs to be invalidated.
> I don't think VMFUNC can do this.

I don't think we need to invalidate mappings often. And if we do, we
need to invalidate EPT anyway.

>>
>> Also, the ivshmem functionality could be implemented by this proposal:
>> - vswitch (or some VM) allocates memory regions in its address space, and
>> - it sets up that IOMMU mappings on the VMs be translated into the regions
>
> I agree it's possible, but that's not something that exists on real
> hardware. It's not clear to me what are the security implications
> of having VM2 control IOMMU of VM1. Having each VM control its own IOMMU
> seems more straight-forward.

I meant the vswitch's IOMMU. It can be a bare-metal (or host) process or
a VM. For a bare-metal process, it's basically VFIO, where the virtual
address is used as the bus address. Each VM accesses the shared memory
using vhost-pci BAR + bus (i.e. virtual) address.


-- 
Jun
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-01 16:28                   ` Jan Kiszka
@ 2015-09-02  0:01                     ` Nakajima, Jun
  -1 siblings, 0 replies; 80+ messages in thread
From: Nakajima, Jun @ 2015-09-02  0:01 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: virtio-dev, Michael S. Tsirkin, Claudio.Fontana, qemu-devel,
	Linux Virtualization, Varun Sethi, opnfv-tech-discuss

On Tue, Sep 1, 2015 at 9:28 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2015-09-01 18:02, Michael S. Tsirkin wrote:
...
>> You don't need to be able to map all guest memory if you know
>> guest won't try to allow device access to all of it.
>> It's a question of how good is the bus address allocator.
>
> But those BARs need to allocate a guest-physical address range as large
> as the other guest's RAM is, possibly even larger if that RAM is not
> contiguous, and you can't put other resources into potential holes
> because VM2 does not know where those holes will be.
>

I think you can allocate such guest-physical address ranges
efficiently if each BAR sets the base of each memory region reported
by VHOST_SET_MEM_TABLE, for example. The issue is that we would need
up to 8 of them (VHOST_MEMORY_MAX_NREGIONS) vs. the 6 BARs defined by
PCI-SIG.
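
A small sketch of that per-region translation (struct and names invented for
illustration): each VM1 memory region gets its own BAR in VM2, so a VM1 bus
address becomes "matching BAR base + offset within the region".

#include <stdint.h>
#include <stdbool.h>

struct mapped_region {
        uint64_t vm1_start;             /* region start on VM1's bus */
        uint64_t size;
        uint64_t vm2_bar_base;          /* where VM2 mapped the matching BAR */
};

static bool vm1_addr_to_vm2(const struct mapped_region *r, int n,
                            uint64_t vm1_addr, uint64_t *vm2_phys)
{
        for (int i = 0; i < n; i++) {
                if (vm1_addr >= r[i].vm1_start &&
                    vm1_addr - r[i].vm1_start < r[i].size) {
                        *vm2_phys = r[i].vm2_bar_base +
                                    (vm1_addr - r[i].vm1_start);
                        return true;
                }
        }
        return false;
}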

-- 
Jun
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-02  0:01                     ` Nakajima, Jun
@ 2015-09-02 12:15                       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-02 12:15 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: virtio-dev, Jan Kiszka, Claudio.Fontana, qemu-devel,
	Linux Virtualization, Igor Mammedov, Varun Sethi,
	opnfv-tech-discuss

On Tue, Sep 01, 2015 at 05:01:07PM -0700, Nakajima, Jun wrote:
> On Tue, Sep 1, 2015 at 9:28 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> > On 2015-09-01 18:02, Michael S. Tsirkin wrote:
> ...
> >> You don't need to be able to map all guest memory if you know
> >> guest won't try to allow device access to all of it.
> >> It's a question of how good is the bus address allocator.
> >
> > But those BARs need to allocate a guest-physical address range as large
> > as the other guest's RAM is, possibly even larger if that RAM is not
> > contiguous, and you can't put other resources into potential holes
> > because VM2 does not know where those holes will be.
> >
> 
> I think you can allocate such guest-physical address ranges
> efficiently if each BAR sets the base of each memory region reported
> by VHOST_SET_MEM_TABLE, for example.  The issue is that we would need
> to 8 (VHOST_MEMORY_MAX_NREGIONS) of them vs. 6 (defined by PCI-SIG).

Besides, 8 is not even a limit: we merged a patch that allows making it
larger.

> -- 
> Jun
> Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-02 12:15                       ` Michael S. Tsirkin
  (?)
  (?)
@ 2015-09-03  4:45                       ` Nakajima, Jun
  2015-09-03  8:09                         ` Michael S. Tsirkin
  2015-09-03  8:09                         ` Michael S. Tsirkin
  -1 siblings, 2 replies; 80+ messages in thread
From: Nakajima, Jun @ 2015-09-03  4:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Jan Kiszka, Claudio.Fontana, qemu-devel,
	Linux Virtualization, Igor Mammedov, Varun Sethi,
	opnfv-tech-discuss

BTW, can you please take a look at the following URL to see whether my
understanding is correct? Our engineers are saying that they are not
really sure they understood your proposal (especially around the
IOMMU), and I drew a figure, adding notes...

https://wiki.opnfv.org/vm2vm_mst

Thanks,
-- 
Jun
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-01 16:28                   ` Jan Kiszka
@ 2015-09-03  8:08                     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-03  8:08 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: virtio-dev, Claudio.Fontana, qemu-devel, virtualization,
	Nakajima, Jun, Varun Sethi, opnfv-tech-discuss

On Tue, Sep 01, 2015 at 06:28:28PM +0200, Jan Kiszka wrote:
> On 2015-09-01 18:02, Michael S. Tsirkin wrote:
> > On Tue, Sep 01, 2015 at 05:34:37PM +0200, Jan Kiszka wrote:
> >> On 2015-09-01 16:34, Michael S. Tsirkin wrote:
> >>> On Tue, Sep 01, 2015 at 04:09:44PM +0200, Jan Kiszka wrote:
> >>>> On 2015-09-01 11:24, Michael S. Tsirkin wrote:
> >>>>> On Tue, Sep 01, 2015 at 11:11:52AM +0200, Jan Kiszka wrote:
> >>>>>> On 2015-09-01 10:01, Michael S. Tsirkin wrote:
> >>>>>>> On Tue, Sep 01, 2015 at 09:35:21AM +0200, Jan Kiszka wrote:
> >>>>>>>> Leaving all the implementation and interface details aside, this
> >>>>>>>> discussion is first of all about two fundamentally different approaches:
> >>>>>>>> static shared memory windows vs. dynamically remapped shared windows (a
> >>>>>>>> third one would be copying in the hypervisor, but I suppose we all agree
> >>>>>>>> that the whole exercise is about avoiding that). Which way do we want or
> >>>>>>>> have to go?
> >>>>>>>>
> >>>>>>>> Jan
> >>>>>>>
> >>>>>>> Dynamic is a superset of static: you can always make it static if you
> >>>>>>> wish. Static has the advantage of simplicity, but that's lost once you
> >>>>>>> realize you need to invent interfaces to make it work.  Since we can use
> >>>>>>> existing IOMMU interfaces for the dynamic one, what's the disadvantage?
> >>>>>>
> >>>>>> Complexity. Having to emulate even more of an IOMMU in the hypervisor
> >>>>>> (we already have to do a bit for VT-d IR in Jailhouse) and doing this
> >>>>>> per platform (AMD IOMMU, ARM SMMU, ...) is out of scope for us. In that
> >>>>>> sense, generic grant tables would be more appealing.
> >>>>>
> >>>>> That's not how we do things for KVM, PV features need to be
> >>>>> modular and interchangeable with emulation.
> >>>>
> >>>> I know, and we may have to make some compromise for Jailhouse if that
> >>>> brings us valuable standardization and broad guest support. But we will
> >>>> surely not support an arbitrary amount of IOMMU models for that reason.
> >>>>
> >>>>>
> >>>>> If you just want something that's cross-platform and easy to
> >>>>> implement, just build a PV IOMMU. Maybe use virtio for this.
> >>>>
> >>>> That is likely required to keep the complexity manageable and to allow
> >>>> static preconfiguration.
> >>>
> >>> Real IOMMU allow static configuration just fine. This is exactly
> >>> what VFIO uses.
> >>
> >> Please specify more precisely which feature in which IOMMU you are
> >> referring to. Also, given that you refer to VFIO, I suspect we have
> >> different thing in mind. I'm talking about an IOMMU device model, like
> >> the one we have in QEMU now for VT-d. That one is not at all
> >> preconfigured by the host for VFIO.
> > 
> > I really just mean that VFIO creates a mostly static IOMMU configuration.
> > 
> > It's configured by the guest, not the host.
> 
> OK, that resolves my confusion.
> 
> > 
> > I don't see host control over configuration as being particularly important.
> 
> We do, see below.
> 
> > 
> > 
> >>>
> >>>> Well, we could declare our virtio-shmem device to be an IOMMU device
> >>>> that controls access of a remote VM to RAM of the one that owns the
> >>>> device. In the static case, this access may at most be enabled/disabled
> >>>> but not moved around. The static regions would have to be discoverable
> >>>> for the VM (register read-back), and the guest's firmware will likely
> >>>> have to declare those ranges reserved to the guest OS.
> >>>> In the dynamic case, the guest would be able to create an alternative
> >>>> mapping.
> >>>
> >>>
> >>> I don't think we want a special device just to support the
> >>> static case. It might be a bit less code to write, but
> >>> eventually it should be up to the guest.
> >>> Fundamentally, it's policy that host has no business
> >>> dictating.
> >>
> >> "A bit less" is to be validated, and I doubt its just "a bit". But if
> >> KVM and its guests will also support some PV-IOMMU that we can reuse for
> >> our scenarios, than that is fine. KVM would not have to mandate support
> >> for it while we would, that's all.
> > 
> > Someone will have to do this work.
> > 
> >>>
> >>>> We would probably have to define a generic page table structure
> >>>> for that. Or do you rather have some MPU-like control structure in mind,
> >>>> more similar to the memory region descriptions vhost is already using?
> >>>
> >>> I don't care much. Page tables use less memory if a lot of memory needs
> >>> to be covered. OTOH if you want to use virtio (e.g. to allow command
> >>> batching) that likely means commands to manipulate the IOMMU, and
> >>> maintaining it all on the host. You decide.
> >>
> >> I don't care very much about the dynamic case as we won't support it
> >> anyway. However, if the configuration concept used for it is applicable
> >> to static mode as well, then we could reuse it. But preconfiguration
> >> will required register-based region description, I suspect.
> > 
> > I don't know what you mean by preconfiguration exactly.
> > 
> > Do you want the host to configure the IOMMU? Why not let the
> > guest do this?
> 
> We simply freeze GPA-to-HPA mappings during runtime. Avoids having to
> validate and synchronize guest-triggered changes.

Fine, but this assumes the guest does very specific things, right?
E.g. should the guest reconfigure a device's BAR, you would have
to change the GPA-to-HPA mappings?


> >>>
> >>>> Also not yet clear to me are how the vhost-pci device and the
> >>>> translations it will have to do should look like for VM2.
> >>>
> >>> I think we can use vhost-pci BAR + VM1 bus address as the
> >>> VM2 physical address. In other words, all memory exposed to
> >>> virtio-pci by VM1 through it's IOMMU is mapped into BAR of
> >>> vhost-pci.
> >>>
> >>> Bus addresses can be validated to make sure they fit
> >>> in the BAR.
> >>
> >> Sounds simple but may become challenging for VMs that have many of such
> >> devices (in order to connect to many possibly large VMs).
> > 
> > You don't need to be able to map all guest memory if you know
> > guest won't try to allow device access to all of it.
> > It's a question of how good is the bus address allocator.
> 
> But those BARs need to allocate a guest-physical address range as large
> as the other guest's RAM is, possibly even larger if that RAM is not
> contiguous, and you can't put other resources into potential holes
> because VM2 does not know where those holes will be.

No - only the RAM that you want addressable by VM2.

IOW if you wish, you actually can create a shared memory device,
make it accessible to the IOMMU and place some or all
data there.
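
As a rough sketch of the translation this implies on the VM2 side
(names are hypothetical; it assumes a single contiguous window where
the VM1 bus address is used directly as an offset into the vhost-pci
BAR):

#include <stdint.h>
#include <stddef.h>

struct vhost_pci_window {
        void     *bar_va;    /* VM2 mapping of the vhost-pci BAR */
        uint64_t  bar_size;  /* size of the exposed window       */
};

/* Translate a VM1 bus address (taken from a vring descriptor) into a
 * pointer inside the BAR mapping; NULL if VM1 handed us an address
 * that is not covered by the window. */
static void *vm1_bus_to_vm2_ptr(const struct vhost_pci_window *w,
                                uint64_t bus_addr, uint64_t len)
{
        if (bus_addr >= w->bar_size || len > w->bar_size - bus_addr)
                return NULL;
        return (uint8_t *)w->bar_va + bus_addr;
}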




> > 
> >>>
> >>>
> >>> One issue to consider is that VM1 can trick VM2 into writing
> >>> into bus address that isn't mapped in the IOMMU, or
> >>> is mapped read-only.
> >>> We probably would have to teach KVM to handle this somehow,
> >>> e.g. exit to QEMU, or even just ignore. Maybe notify guest
> >>> e.g. by setting a bit in the config space of the device,
> >>> to avoid easy DOS.
> >>
> >> Well, that would be trivial for VM1 to check if there are only one or
> >> two memory windows. Relying on the hypervisor to handle it may be
> >> unacceptable for real-time VMs.
> >>
> >> Jan
> > 
> > Why? real-time != fast. I doubt you can avoid vm exits completely.
> 
> We can, one property of Jailhouse (on x86, ARM is waiting for GICv4).
> 
> Real-time == deterministic. And if you have such vm exits potentially in
> your code path, you have them always - for worst-case analysis. One may
> argue about probability in certain scenarios, but if the triggering side
> is malicious, probability may become 1.
> 
> Jan

You are doing a special hypervisor anyway, so I think you could
detect that setup is done and freeze the configuration.

If a VM attempts to modify mappings afterwards, you can treat it as
malicious and ignore it, kill it, or whatever.
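
A minimal sketch of that freeze policy, with hypothetical
hypervisor-side names (not actual Jailhouse code):

#include <errno.h>
#include <stdbool.h>

struct vm;                                       /* opaque per-VM state */
struct map_request { unsigned long gpa, hpa, size, flags; };

static bool config_frozen;                       /* set once setup is complete */

extern int apply_map_update(struct vm *vm, const struct map_request *req);

static int handle_map_update(struct vm *vm, const struct map_request *req)
{
        if (config_frozen)
                return -EPERM;                   /* or stop the VM as malicious */
        return apply_map_update(vm, req);
}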



> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SES-DE
> Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-03  4:45                       ` [Qemu-devel] " Nakajima, Jun
@ 2015-09-03  8:09                         ` Michael S. Tsirkin
  2015-09-03  8:09                         ` Michael S. Tsirkin
  1 sibling, 0 replies; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-03  8:09 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: virtio-dev, Jan Kiszka, Claudio.Fontana, qemu-devel,
	Linux Virtualization, Igor Mammedov, Varun Sethi,
	opnfv-tech-discuss

On Wed, Sep 02, 2015 at 09:45:45PM -0700, Nakajima, Jun wrote:
> BTW, can you please take a look at the following URL to see my
> understanding is correct? Our engineers are saying that they are not
> really sure if they understood your proposal (especially around
> IOMMU), and I drew a figure, adding notes...
> 
> https://wiki.opnfv.org/vm2vm_mst
> 
> Thanks,

I think you got it right, thanks for putting this together!

> -- 
> Jun
> Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-03  8:08                     ` Michael S. Tsirkin
@ 2015-09-03  8:21                       ` Jan Kiszka
  -1 siblings, 0 replies; 80+ messages in thread
From: Jan Kiszka @ 2015-09-03  8:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Claudio.Fontana, qemu-devel, virtualization,
	Nakajima, Jun, Varun Sethi, opnfv-tech-discuss

On 2015-09-03 10:08, Michael S. Tsirkin wrote:
> On Tue, Sep 01, 2015 at 06:28:28PM +0200, Jan Kiszka wrote:
>> On 2015-09-01 18:02, Michael S. Tsirkin wrote:
>>> On Tue, Sep 01, 2015 at 05:34:37PM +0200, Jan Kiszka wrote:
>>>> On 2015-09-01 16:34, Michael S. Tsirkin wrote:
>>>>> On Tue, Sep 01, 2015 at 04:09:44PM +0200, Jan Kiszka wrote:
>>>>>> On 2015-09-01 11:24, Michael S. Tsirkin wrote:
>>>>>>> On Tue, Sep 01, 2015 at 11:11:52AM +0200, Jan Kiszka wrote:
>>>>>>>> On 2015-09-01 10:01, Michael S. Tsirkin wrote:
>>>>>>>>> On Tue, Sep 01, 2015 at 09:35:21AM +0200, Jan Kiszka wrote:
>>>>>>>>>> Leaving all the implementation and interface details aside, this
>>>>>>>>>> discussion is first of all about two fundamentally different approaches:
>>>>>>>>>> static shared memory windows vs. dynamically remapped shared windows (a
>>>>>>>>>> third one would be copying in the hypervisor, but I suppose we all agree
>>>>>>>>>> that the whole exercise is about avoiding that). Which way do we want or
>>>>>>>>>> have to go?
>>>>>>>>>>
>>>>>>>>>> Jan
>>>>>>>>>
>>>>>>>>> Dynamic is a superset of static: you can always make it static if you
>>>>>>>>> wish. Static has the advantage of simplicity, but that's lost once you
>>>>>>>>> realize you need to invent interfaces to make it work.  Since we can use
>>>>>>>>> existing IOMMU interfaces for the dynamic one, what's the disadvantage?
>>>>>>>>
>>>>>>>> Complexity. Having to emulate even more of an IOMMU in the hypervisor
>>>>>>>> (we already have to do a bit for VT-d IR in Jailhouse) and doing this
>>>>>>>> per platform (AMD IOMMU, ARM SMMU, ...) is out of scope for us. In that
>>>>>>>> sense, generic grant tables would be more appealing.
>>>>>>>
>>>>>>> That's not how we do things for KVM, PV features need to be
>>>>>>> modular and interchangeable with emulation.
>>>>>>
>>>>>> I know, and we may have to make some compromise for Jailhouse if that
>>>>>> brings us valuable standardization and broad guest support. But we will
>>>>>> surely not support an arbitrary amount of IOMMU models for that reason.
>>>>>>
>>>>>>>
>>>>>>> If you just want something that's cross-platform and easy to
>>>>>>> implement, just build a PV IOMMU. Maybe use virtio for this.
>>>>>>
>>>>>> That is likely required to keep the complexity manageable and to allow
>>>>>> static preconfiguration.
>>>>>
>>>>> Real IOMMU allow static configuration just fine. This is exactly
>>>>> what VFIO uses.
>>>>
>>>> Please specify more precisely which feature in which IOMMU you are
>>>> referring to. Also, given that you refer to VFIO, I suspect we have
>>>> different thing in mind. I'm talking about an IOMMU device model, like
>>>> the one we have in QEMU now for VT-d. That one is not at all
>>>> preconfigured by the host for VFIO.
>>>
>>> I really just mean that VFIO creates a mostly static IOMMU configuration.
>>>
>>> It's configured by the guest, not the host.
>>
>> OK, that resolves my confusion.
>>
>>>
>>> I don't see host control over configuration as being particularly important.
>>
>> We do, see below.
>>
>>>
>>>
>>>>>
>>>>>> Well, we could declare our virtio-shmem device to be an IOMMU device
>>>>>> that controls access of a remote VM to RAM of the one that owns the
>>>>>> device. In the static case, this access may at most be enabled/disabled
>>>>>> but not moved around. The static regions would have to be discoverable
>>>>>> for the VM (register read-back), and the guest's firmware will likely
>>>>>> have to declare those ranges reserved to the guest OS.
>>>>>> In the dynamic case, the guest would be able to create an alternative
>>>>>> mapping.
>>>>>
>>>>>
>>>>> I don't think we want a special device just to support the
>>>>> static case. It might be a bit less code to write, but
>>>>> eventually it should be up to the guest.
>>>>> Fundamentally, it's policy that host has no business
>>>>> dictating.
>>>>
>>>> "A bit less" is to be validated, and I doubt its just "a bit". But if
>>>> KVM and its guests will also support some PV-IOMMU that we can reuse for
>>>> our scenarios, than that is fine. KVM would not have to mandate support
>>>> for it while we would, that's all.
>>>
>>> Someone will have to do this work.
>>>
>>>>>
>>>>>> We would probably have to define a generic page table structure
>>>>>> for that. Or do you rather have some MPU-like control structure in mind,
>>>>>> more similar to the memory region descriptions vhost is already using?
>>>>>
>>>>> I don't care much. Page tables use less memory if a lot of memory needs
>>>>> to be covered. OTOH if you want to use virtio (e.g. to allow command
>>>>> batching) that likely means commands to manipulate the IOMMU, and
>>>>> maintaining it all on the host. You decide.
>>>>
>>>> I don't care very much about the dynamic case as we won't support it
>>>> anyway. However, if the configuration concept used for it is applicable
>>>> to static mode as well, then we could reuse it. But preconfiguration
>>>> will required register-based region description, I suspect.
>>>
>>> I don't know what you mean by preconfiguration exactly.
>>>
>>> Do you want the host to configure the IOMMU? Why not let the
>>> guest do this?
>>
>> We simply freeze GPA-to-HPA mappings during runtime. Avoids having to
>> validate and synchronize guest-triggered changes.
> 
> Fine, but this assumes guest does very specific things, right?
> E.g. should guest reconfigure device's BAR, you would have
> to change GPA to HPA mappings?
> 

Yes, that's why we only support size exploration, not reallocation.
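
(For reference, a minimal sketch of guest-side BAR size exploration,
i.e. the standard PCI probe of writing all-ones to a 32-bit memory BAR
and reading back the size mask; the config-space accessors here are
hypothetical:)

#include <stdint.h>

extern uint32_t pci_cfg_read32(uint16_t bdf, unsigned int off);
extern void pci_cfg_write32(uint16_t bdf, unsigned int off, uint32_t val);

static uint64_t bar_size(uint16_t bdf, unsigned int bar_off)
{
        uint32_t orig = pci_cfg_read32(bdf, bar_off);
        uint32_t mask;

        pci_cfg_write32(bdf, bar_off, 0xffffffff);
        mask = pci_cfg_read32(bdf, bar_off) & ~0xfU;  /* drop flag bits      */
        pci_cfg_write32(bdf, bar_off, orig);          /* restore, never move */

        /* 64-bit memory BARs would also probe the upper dword. */
        return mask ? (uint64_t)(~mask) + 1 : 0;
}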

> 
>>>>>
>>>>>> Also not yet clear to me are how the vhost-pci device and the
>>>>>> translations it will have to do should look like for VM2.
>>>>>
>>>>> I think we can use vhost-pci BAR + VM1 bus address as the
>>>>> VM2 physical address. In other words, all memory exposed to
>>>>> virtio-pci by VM1 through it's IOMMU is mapped into BAR of
>>>>> vhost-pci.
>>>>>
>>>>> Bus addresses can be validated to make sure they fit
>>>>> in the BAR.
>>>>
>>>> Sounds simple but may become challenging for VMs that have many of such
>>>> devices (in order to connect to many possibly large VMs).
>>>
>>> You don't need to be able to map all guest memory if you know
>>> guest won't try to allow device access to all of it.
>>> It's a question of how good is the bus address allocator.
>>
>> But those BARs need to allocate a guest-physical address range as large
>> as the other guest's RAM is, possibly even larger if that RAM is not
>> contiguous, and you can't put other resources into potential holes
>> because VM2 does not know where those holes will be.
> 
> No - only the RAM that you want addressable by VM2.

That's in the hands of VM1, not VM2 or the hypervisor, in the case of
reconfigurable mappings. It's indeed a non-issue in our static case.

> 
> IOW if you wish, you actually can create a shared memory device,
> make it accessible to the IOMMU and place some or all
> data there.
> 

Actually, that could also be something more sophisticated, including
virtio-net, IF that device is able to express its DMA window
restrictions (a bit like 32-bit PCI devices being restricted to <4G
addresses, or ISA devices to <1M).
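
Purely as an illustration of what "expressing a DMA window" could mean
(none of these structures exist today, the names are made up):

#include <stdbool.h>
#include <stdint.h>

struct dma_window_cfg {                 /* read from device config space */
        uint64_t base;                  /* lowest usable bus address     */
        uint64_t size;                  /* length of the usable window   */
};

static bool addr_in_window(const struct dma_window_cfg *w,
                           uint64_t bus_addr, uint64_t len)
{
        return bus_addr >= w->base &&
               bus_addr - w->base <= w->size &&
               len <= w->size - (bus_addr - w->base);
}

/* Buffers outside the window would have to be bounced through a
 * preallocated area that is known to live inside it. */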

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-03  8:21                       ` Jan Kiszka
  (?)
  (?)
@ 2015-09-03  8:37                       ` Michael S. Tsirkin
  2015-09-03 10:25                           ` Jan Kiszka
  -1 siblings, 1 reply; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-03  8:37 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: virtio-dev, Claudio.Fontana, qemu-devel, virtualization,
	Nakajima, Jun, Varun Sethi, opnfv-tech-discuss

On Thu, Sep 03, 2015 at 10:21:28AM +0200, Jan Kiszka wrote:
> On 2015-09-03 10:08, Michael S. Tsirkin wrote:
> > On Tue, Sep 01, 2015 at 06:28:28PM +0200, Jan Kiszka wrote:
> >> On 2015-09-01 18:02, Michael S. Tsirkin wrote:
> >>> On Tue, Sep 01, 2015 at 05:34:37PM +0200, Jan Kiszka wrote:
> >>>> On 2015-09-01 16:34, Michael S. Tsirkin wrote:
> >>>>> On Tue, Sep 01, 2015 at 04:09:44PM +0200, Jan Kiszka wrote:
> >>>>>> On 2015-09-01 11:24, Michael S. Tsirkin wrote:
> >>>>>>> On Tue, Sep 01, 2015 at 11:11:52AM +0200, Jan Kiszka wrote:
> >>>>>>>> On 2015-09-01 10:01, Michael S. Tsirkin wrote:
> >>>>>>>>> On Tue, Sep 01, 2015 at 09:35:21AM +0200, Jan Kiszka wrote:
> >>>>>>>>>> Leaving all the implementation and interface details aside, this
> >>>>>>>>>> discussion is first of all about two fundamentally different approaches:
> >>>>>>>>>> static shared memory windows vs. dynamically remapped shared windows (a
> >>>>>>>>>> third one would be copying in the hypervisor, but I suppose we all agree
> >>>>>>>>>> that the whole exercise is about avoiding that). Which way do we want or
> >>>>>>>>>> have to go?
> >>>>>>>>>>
> >>>>>>>>>> Jan
> >>>>>>>>>
> >>>>>>>>> Dynamic is a superset of static: you can always make it static if you
> >>>>>>>>> wish. Static has the advantage of simplicity, but that's lost once you
> >>>>>>>>> realize you need to invent interfaces to make it work.  Since we can use
> >>>>>>>>> existing IOMMU interfaces for the dynamic one, what's the disadvantage?
> >>>>>>>>
> >>>>>>>> Complexity. Having to emulate even more of an IOMMU in the hypervisor
> >>>>>>>> (we already have to do a bit for VT-d IR in Jailhouse) and doing this
> >>>>>>>> per platform (AMD IOMMU, ARM SMMU, ...) is out of scope for us. In that
> >>>>>>>> sense, generic grant tables would be more appealing.
> >>>>>>>
> >>>>>>> That's not how we do things for KVM, PV features need to be
> >>>>>>> modular and interchangeable with emulation.
> >>>>>>
> >>>>>> I know, and we may have to make some compromise for Jailhouse if that
> >>>>>> brings us valuable standardization and broad guest support. But we will
> >>>>>> surely not support an arbitrary amount of IOMMU models for that reason.
> >>>>>>
> >>>>>>>
> >>>>>>> If you just want something that's cross-platform and easy to
> >>>>>>> implement, just build a PV IOMMU. Maybe use virtio for this.
> >>>>>>
> >>>>>> That is likely required to keep the complexity manageable and to allow
> >>>>>> static preconfiguration.
> >>>>>
> >>>>> Real IOMMU allow static configuration just fine. This is exactly
> >>>>> what VFIO uses.
> >>>>
> >>>> Please specify more precisely which feature in which IOMMU you are
> >>>> referring to. Also, given that you refer to VFIO, I suspect we have
> >>>> different thing in mind. I'm talking about an IOMMU device model, like
> >>>> the one we have in QEMU now for VT-d. That one is not at all
> >>>> preconfigured by the host for VFIO.
> >>>
> >>> I really just mean that VFIO creates a mostly static IOMMU configuration.
> >>>
> >>> It's configured by the guest, not the host.
> >>
> >> OK, that resolves my confusion.
> >>
> >>>
> >>> I don't see host control over configuration as being particularly important.
> >>
> >> We do, see below.
> >>
> >>>
> >>>
> >>>>>
> >>>>>> Well, we could declare our virtio-shmem device to be an IOMMU device
> >>>>>> that controls access of a remote VM to RAM of the one that owns the
> >>>>>> device. In the static case, this access may at most be enabled/disabled
> >>>>>> but not moved around. The static regions would have to be discoverable
> >>>>>> for the VM (register read-back), and the guest's firmware will likely
> >>>>>> have to declare those ranges reserved to the guest OS.
> >>>>>> In the dynamic case, the guest would be able to create an alternative
> >>>>>> mapping.
> >>>>>
> >>>>>
> >>>>> I don't think we want a special device just to support the
> >>>>> static case. It might be a bit less code to write, but
> >>>>> eventually it should be up to the guest.
> >>>>> Fundamentally, it's policy that host has no business
> >>>>> dictating.
> >>>>
> >>>> "A bit less" is to be validated, and I doubt its just "a bit". But if
> >>>> KVM and its guests will also support some PV-IOMMU that we can reuse for
> >>>> our scenarios, than that is fine. KVM would not have to mandate support
> >>>> for it while we would, that's all.
> >>>
> >>> Someone will have to do this work.
> >>>
> >>>>>
> >>>>>> We would probably have to define a generic page table structure
> >>>>>> for that. Or do you rather have some MPU-like control structure in mind,
> >>>>>> more similar to the memory region descriptions vhost is already using?
> >>>>>
> >>>>> I don't care much. Page tables use less memory if a lot of memory needs
> >>>>> to be covered. OTOH if you want to use virtio (e.g. to allow command
> >>>>> batching) that likely means commands to manipulate the IOMMU, and
> >>>>> maintaining it all on the host. You decide.
> >>>>
> >>>> I don't care very much about the dynamic case as we won't support it
> >>>> anyway. However, if the configuration concept used for it is applicable
> >>>> to static mode as well, then we could reuse it. But preconfiguration
> >>>> will required register-based region description, I suspect.
> >>>
> >>> I don't know what you mean by preconfiguration exactly.
> >>>
> >>> Do you want the host to configure the IOMMU? Why not let the
> >>> guest do this?
> >>
> >> We simply freeze GPA-to-HPA mappings during runtime. Avoids having to
> >> validate and synchronize guest-triggered changes.
> > 
> > Fine, but this assumes guest does very specific things, right?
> > E.g. should guest reconfigure device's BAR, you would have
> > to change GPA to HPA mappings?
> > 
> 
> Yes, that's why we only support size exploration, not reallocation.
> 
> > 
> >>>>>
> >>>>>> Also not yet clear to me are how the vhost-pci device and the
> >>>>>> translations it will have to do should look like for VM2.
> >>>>>
> >>>>> I think we can use vhost-pci BAR + VM1 bus address as the
> >>>>> VM2 physical address. In other words, all memory exposed to
> >>>>> virtio-pci by VM1 through it's IOMMU is mapped into BAR of
> >>>>> vhost-pci.
> >>>>>
> >>>>> Bus addresses can be validated to make sure they fit
> >>>>> in the BAR.
> >>>>
> >>>> Sounds simple but may become challenging for VMs that have many of such
> >>>> devices (in order to connect to many possibly large VMs).
> >>>
> >>> You don't need to be able to map all guest memory if you know
> >>> guest won't try to allow device access to all of it.
> >>> It's a question of how good is the bus address allocator.
> >>
> >> But those BARs need to allocate a guest-physical address range as large
> >> as the other guest's RAM is, possibly even larger if that RAM is not
> >> contiguous, and you can't put other resources into potential holes
> >> because VM2 does not know where those holes will be.
> > 
> > No - only the RAM that you want addressable by VM2.
> 
> That's in the hand of VM1, not VM2 or the hypervisor, in case of
> reconfigurable mapping. It's indeed a non-issue in our static case.
> 
> > 
> > IOW if you wish, you actually can create a shared memory device,
> > make it accessible to the IOMMU and place some or all
> > data there.
> > 
> 
> Actually, that could also be something more sophisticated, including
> virtio-net, IF that device will be able to express its DMA window
> restrictions (a bit like 32-bit PCI devices being restricted to <4G
> addresses or ISA devices <1M).
> 
> Jan

Actually, it's the bus restriction, not the device restriction.

So if you want to use bounce buffers in the name of security or
real-time requirements, you should be able to do this if virtio uses the
DMA API.
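
For the driver side, "virtio uses the DMA API" boils down to obtaining
the addresses placed into vring descriptors via the DMA-mapping
interface, so that an IOMMU or bounce buffering can be interposed
transparently. A minimal sketch (virtio internals omitted, helper name
hypothetical):

#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Map one tx buffer and return, via *out, the bus address that goes
 * into the vring descriptor. */
static int map_tx_buf(struct device *dev, void *buf, size_t len,
                      dma_addr_t *out)
{
        *out = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        return dma_mapping_error(dev, *out) ? -ENOMEM : 0;
}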


> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SES-DE
> Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-03  8:37                       ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-09-03 10:25                           ` Jan Kiszka
  0 siblings, 0 replies; 80+ messages in thread
From: Jan Kiszka @ 2015-09-03 10:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Claudio.Fontana, qemu-devel, virtualization,
	Nakajima, Jun, Varun Sethi, opnfv-tech-discuss

On 2015-09-03 10:37, Michael S. Tsirkin wrote:
> On Thu, Sep 03, 2015 at 10:21:28AM +0200, Jan Kiszka wrote:
>> On 2015-09-03 10:08, Michael S. Tsirkin wrote:
>>>
>>> IOW if you wish, you actually can create a shared memory device,
>>> make it accessible to the IOMMU and place some or all
>>> data there.
>>>
>>
>> Actually, that could also be something more sophisticated, including
>> virtio-net, IF that device will be able to express its DMA window
>> restrictions (a bit like 32-bit PCI devices being restricted to <4G
>> addresses or ISA devices <1M).
>>
>> Jan
> 
> Actually, it's the bus restriction, not the device restriction.
> 
> So if you want to use bounce buffers in the name of security or
> real-time requirements, you should be able to do this if virtio uses the
> DMA API.

Bounce buffers would only be the simplest option (though fine for the
low-rate traffic that we also have in mind, like virtual consoles).
Given properly sized regions, even if fixed, and the right
communication stacks, you can allocate application buffers directly in
those regions and avoid most or all copying.
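
To give an idea of the direct-allocation variant (hypothetical names, a
trivial bump allocator over a fixed, pre-sized shared region):

#include <stdint.h>
#include <stddef.h>

struct shm_region {
        uint8_t *base;          /* mapping of the shared window */
        size_t   size;
        size_t   next;          /* bump-allocator cursor        */
};

/* Hand out cache-line aligned application buffers directly inside the
 * shared region, so no copy into a separate DMA area is needed. */
static void *shm_alloc(struct shm_region *r, size_t len)
{
        void *p;

        len = (len + 63) & ~(size_t)63;
        if (len > r->size - r->next)
                return NULL;    /* region exhausted */
        p = r->base + r->next;
        r->next += len;
        return p;
}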

In any case, if we manage to address this variation along with your
proposal, that would help tremendously.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-08-31 14:11 [Qemu-devel] rfc: vhost user enhancements for vm2vm communication Michael S. Tsirkin
  2015-08-31 18:35   ` Nakajima, Jun
  2015-09-01  7:35   ` Jan Kiszka
@ 2015-09-07 12:38 ` Claudio Fontana
  2015-09-09  6:40   ` [opnfv-tech-discuss] " Zhang, Yang Z
                     ` (3 more replies)
  2015-09-07 12:38 ` Claudio Fontana
  2015-09-14 16:00   ` Stefan Hajnoczi
  4 siblings, 4 replies; 80+ messages in thread
From: Claudio Fontana @ 2015-09-07 12:38 UTC (permalink / raw)
  To: Michael S. Tsirkin, qemu-devel, virtualization, virtio-dev,
	opnfv-tech-discuss
  Cc: Jan Kiszka

Coming late to the party, 

On 31.08.2015 16:11, Michael S. Tsirkin wrote:
> Hello!
> During the KVM forum, we discussed supporting virtio on top
> of ivshmem. I have considered it, and came up with an alternative
> that has several advantages over that - please see below.
> Comments welcome.

as Jan mentioned, we actually discussed a virtio-shmem device which would incorporate the advantages of ivshmem (so there would be no need for a separate ivshmem device). It would use the well-known virtio interface, take advantage of the new virtio-1 virtqueue layout to split the r/w and read-only rings as seen from the two sides, and also make use of BAR0, which has been freed up for use by the device.

This way it would be possible to share both the rings and the actual buffer memory in the PCI BARs. The guest VMs could decide to use the shared memory regions directly as prepared by the hypervisor (in the Jailhouse case) or by QEMU/KVM, or to perform their own validation of the input, depending on the use case.

Of course, in this case the communication between VMs needs to be pre-configured and is quite static (which is actually beneficial in our use case).

But still, in your proposed solution, each VM needs to be pre-configured to communicate with a specific other VM using a separate device, right?

But I wonder if we are addressing the same problem: in your case you are looking at a shared memory pool potentially visible to all VMs (the vhost-user case), while in the virtio-shmem proposal we discussed, we were assuming separate, dedicated regions for every channel.
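
Just to make the static, per-channel layout concrete, one possible
description of such a shared window could look like the following
(purely illustrative, nothing here is specified anywhere):

#include <stdint.h>

struct virtio_shmem_layout {
        uint64_t tx_ring_off;   /* writable by this VM, read-only for the peer */
        uint64_t tx_ring_size;
        uint64_t rx_ring_off;   /* read-only for this VM                       */
        uint64_t rx_ring_size;
        uint64_t buf_off;       /* shared buffer area                          */
        uint64_t buf_size;
};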

Ciao,

Claudio

> 
> -----
> 
> Existing solutions to userspace switching between VMs on the
> same host are vhost-user and ivshmem.
> 
> vhost-user works by mapping memory of all VMs being bridged into the
> switch memory space.
> 
> By comparison, ivshmem works by exposing a shared region of memory to all VMs.
> VMs are required to use this region to store packets. The switch only
> needs access to this region.
> 
> Another difference between vhost-user and ivshmem surfaces when polling
> is used. With vhost-user, the switch is required to handle
> data movement between VMs, if using polling, this means that 1 host CPU
> needs to be sacrificed for this task.
> 
> This is easiest to understand when one of the VMs is
> used with VF pass-through. This can be schematically shown below:
> 
> +-- VM1 --------------+            +---VM2-----------+
> | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
> +---------------------+            +-----------------+
> 
> 
> With ivshmem in theory communication can happen directly, with two VMs
> polling the shared memory region.
> 
> 
> I won't spend time listing advantages of vhost-user over ivshmem.
> Instead, having identified two advantages of ivshmem over vhost-user,
> below is a proposal to extend vhost-user to gain the advantages
> of ivshmem.
> 
> 
> 1: virtio in guest can be extended to allow support
> for IOMMUs. This provides guest with full flexibility
> about memory which is readable or write able by each device.
> By setting up a virtio device for each other VM we need to
> communicate to, guest gets full control of its security, from
> mapping all memory (like with current vhost-user) to only
> mapping buffers used for networking (like ivshmem) to
> transient mappings for the duration of data transfer only.
> This also allows use of VFIO within guests, for improved
> security.
> 
> vhost user would need to be extended to send the
> mappings programmed by guest IOMMU.
> 
> 2. qemu can be extended to serve as a vhost-user client:
> remote VM mappings over the vhost-user protocol, and
> map them into another VM's memory.
> This mapping can take, for example, the form of
> a BAR of a pci device, which I'll call here vhost-pci - 
> with bus address allowed
> by VM1's IOMMU mappings being translated into
> offsets within this BAR within VM2's physical
> memory space.
> 
> Since the translation can be a simple one, VM2
> can perform it within its vhost-pci device driver.
> 
> While this setup would be the most useful with polling,
> VM1's ioeventfd can also be mapped to
> another VM2's irqfd, and vice versa, such that VMs
> can trigger interrupts to each other without need
> for a helper thread on the host.
> 
> 
> The resulting channel might look something like the following:
> 
> +-- VM1 --------------+  +---VM2-----------+
> | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
> +---------------------+  +-----------------+
> 
> comparing the two diagrams, a vhost-user thread on the host is
> no longer required, reducing the host CPU utilization when
> polling is active.  At the same time, VM2 can not access all of VM1's
> memory - it is limited by the iommu configuration setup by VM1.
> 
> 
> Advantages over ivshmem:
> 
> - more flexibility, endpoint VMs do not have to place data at any
>   specific locations to use the device, in practice this likely
>   means less data copies.
> - better standardization/code reuse
>   virtio changes within guests would be fairly easy to implement
>   and would also benefit other backends, besides vhost-user
>   standard hotplug interfaces can be used to add and remove these
>   channels as VMs are added or removed.
> - migration support
>   It's easy to implement since ownership of memory is well defined.
>   For example, during migration VM2 can notify hypervisor of VM1
>   by updating dirty bitmap each time is writes into VM1 memory.
> 
> Thanks,
> 


-- 
Claudio Fontana
Server Virtualization Architect
Huawei Technologies Duesseldorf GmbH
Riesstraße 25 - 80992 München

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] [opnfv-tech-discuss] rfc: vhost user enhancements for vm2vm communication
  2015-09-07 12:38 ` [Qemu-devel] " Claudio Fontana
  2015-09-09  6:40   ` [opnfv-tech-discuss] " Zhang, Yang Z
@ 2015-09-09  6:40   ` Zhang, Yang Z
  2015-09-09  8:39     ` Claudio Fontana
  2015-09-09  8:39     ` [opnfv-tech-discuss] rfc: vhost user enhancements for vm2vm communication Claudio Fontana
  2015-09-09  7:06   ` [Qemu-devel] " Michael S. Tsirkin
  2015-09-09  7:06   ` Michael S. Tsirkin
  3 siblings, 2 replies; 80+ messages in thread
From: Zhang, Yang Z @ 2015-09-09  6:40 UTC (permalink / raw)
  To: Claudio Fontana, Michael S. Tsirkin, qemu-devel, virtualization,
	virtio-dev, opnfv-tech-discuss
  Cc: Jan Kiszka

Claudio Fontana wrote on 2015-09-07:
> Coming late to the party,
> 
> On 31.08.2015 16:11, Michael S. Tsirkin wrote:
>> Hello!
>> During the KVM forum, we discussed supporting virtio on top
>> of ivshmem. I have considered it, and came up with an alternative
>> that has several advantages over that - please see below.
>> Comments welcome.
> 
> as Jan mentioned we actually discussed a virtio-shmem device which would
> incorporate the advantages of ivshmem (so no need for a separate ivshmem
> device), which would use the well known virtio interface, taking advantage of
> the new virtio-1 virtqueue layout to split r/w and read-only rings as seen from
> the two sides, and make use also of BAR0 which has been freed up for use by
> the device.

Interesting! Can you elaborate on it?

> 
> This way it would be possible to share the rings and the actual memory
> for the buffers in the PCI bars. The guest VMs could decide to use the
> shared memory regions directly as prepared by the hypervisor (in the

"the shared memory regions" here means share another VM's memory or like ivshmem?

> jailhouse case) or QEMU/KVM, or perform their own validation on the
> input depending on the use case.
> 
> Of course the communication between VMs needs in this case to be
> pre-configured and is quite static (which is actually beneficial in our use case).

Does pre-configured mean that the user knows which VMs will talk to each other and configures this when booting the guests (i.e. on the QEMU command line)?

> 
> But still in your proposed solution, each VM needs to be pre-configured to
> communicate with a specific other VM using a separate device right?
> 
> But I wonder if we are addressing the same problem.. in your case you are
> looking at having a shared memory pool for all VMs potentially visible to all VMs
> (the vhost-user case), while in the virtio-shmem proposal we discussed we
> were assuming specific different regions for every channel.
> 
> Ciao,
> 
> Claudio
> 
> 
>


Best regards,
Yang

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-07 12:38 ` [Qemu-devel] " Claudio Fontana
  2015-09-09  6:40   ` [opnfv-tech-discuss] " Zhang, Yang Z
  2015-09-09  6:40   ` [Qemu-devel] " Zhang, Yang Z
@ 2015-09-09  7:06   ` Michael S. Tsirkin
  2015-09-11 15:39       ` Claudio Fontana
  2015-09-09  7:06   ` Michael S. Tsirkin
  3 siblings, 1 reply; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-09  7:06 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: opnfv-tech-discuss, virtio-dev, Jan Kiszka, qemu-devel, virtualization

On Mon, Sep 07, 2015 at 02:38:34PM +0200, Claudio Fontana wrote:
> Coming late to the party, 
> 
> On 31.08.2015 16:11, Michael S. Tsirkin wrote:
> > Hello!
> > During the KVM forum, we discussed supporting virtio on top
> > of ivshmem. I have considered it, and came up with an alternative
> > that has several advantages over that - please see below.
> > Comments welcome.
> 
> as Jan mentioned we actually discussed a virtio-shmem device which would incorporate the advantages of ivshmem (so no need for a separate ivshmem device), which would use the well known virtio interface, taking advantage of the new virtio-1 virtqueue layout to split r/w and read-only rings as seen from the two sides, and make use also of BAR0 which has been freed up for use by the device.
> 
> This way it would be possible to share the rings and the actual memory for the buffers in the PCI bars. The guest VMs could decide to use the shared memory regions directly as prepared by the hypervisor (in the jailhouse case) or QEMU/KVM, or perform their own validation on the input depending on the use case.
> 
> Of course the communication between VMs needs in this case to be pre-configured and is quite static (which is actually beneficial in our use case).
> 
> But still in your proposed solution, each VM needs to be pre-configured to communicate with a specific other VM using a separate device right?
> 
> But I wonder if we are addressing the same problem.. in your case you are looking at having a shared memory pool for all VMs potentially visible to all VMs (the vhost-user case), while in the virtio-shmem proposal we discussed we were assuming specific different regions for every channel.
> 
> Ciao,
> 
> Claudio

The problem, as I see it, is to allow inter-VM communication with
polling (to get very low latencies), but with polling done within the
VMs only, without the need to run a host thread (which, when polling,
uses up a host CPU).

What was proposed was to simply change virtio to allow
"offset within BAR" instead of PA.
This would allow VM2VM communication if there are only 2 VMs,
but if data needs to be sent to multiple VMs, you
must copy it.

Additionally, it's a single-purpose feature: you can use it from
a userspace PMD, but Linux will never use it.


My proposal is a superset: don't require that BAR memory is
used; use IOMMU translation tables instead.
This way, data can be sent to multiple VMs by sharing the same
memory with them all.

It is still possible to put data in some device BAR if that's
what the guest wants to do: just program the IOMMU to limit
virtio to the memory range that is within this BAR.

Another advantage here is that the feature is more generally useful.
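
To sketch what using IOMMU translation tables could look like on the
receiving side, here is a toy model; all structures and names are
invented for illustration and this is not the vhost-user protocol or
any existing API. The idea is that each mapping the sending guest
programs into its vIOMMU is relayed to the peer, which translates
descriptor bus addresses through that table and rejects anything that
was never granted:

    /* Toy model: translate a bus address (IOVA) used by the sending
     * guest into an offset in the window the peer can access.  A miss
     * means the guest never granted access to that range. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct iommu_map {
            uint64_t iova;       /* bus address as seen by the virtio device */
            uint64_t len;
            uint64_t peer_off;   /* where the range appears on the peer side */
            bool     writable;
    };

    static bool xlate(const struct iommu_map *maps, size_t n,
                      uint64_t iova, uint64_t len, uint64_t *off)
    {
            for (size_t i = 0; i < n; i++) {
                    if (iova >= maps[i].iova &&
                        iova + len <= maps[i].iova + maps[i].len) {
                            *off = maps[i].peer_off + (iova - maps[i].iova);
                            return true;
                    }
            }
            return false;
    }

Limiting the device to a BAR, as described above, then simply means
that every entry in this table happens to point into that BAR.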


> > 
> > -----
> > 
> > Existing solutions to userspace switching between VMs on the
> > same host are vhost-user and ivshmem.
> > 
> > vhost-user works by mapping memory of all VMs being bridged into the
> > switch memory space.
> > 
> > By comparison, ivshmem works by exposing a shared region of memory to all VMs.
> > VMs are required to use this region to store packets. The switch only
> > needs access to this region.
> > 
> > Another difference between vhost-user and ivshmem surfaces when polling
> > is used. With vhost-user, the switch is required to handle
> > data movement between VMs, if using polling, this means that 1 host CPU
> > needs to be sacrificed for this task.
> > 
> > This is easiest to understand when one of the VMs is
> > used with VF pass-through. This can be schematically shown below:
> > 
> > +-- VM1 --------------+            +---VM2-----------+
> > | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
> > +---------------------+            +-----------------+
> > 
> > 
> > With ivshmem in theory communication can happen directly, with two VMs
> > polling the shared memory region.
> > 
> > 
> > I won't spend time listing advantages of vhost-user over ivshmem.
> > Instead, having identified two advantages of ivshmem over vhost-user,
> > below is a proposal to extend vhost-user to gain the advantages
> > of ivshmem.
> > 
> > 
> > 1: virtio in guest can be extended to allow support
> > for IOMMUs. This provides guest with full flexibility
> > about memory which is readable or write able by each device.
> > By setting up a virtio device for each other VM we need to
> > communicate to, guest gets full control of its security, from
> > mapping all memory (like with current vhost-user) to only
> > mapping buffers used for networking (like ivshmem) to
> > transient mappings for the duration of data transfer only.
> > This also allows use of VFIO within guests, for improved
> > security.
> > 
> > vhost user would need to be extended to send the
> > mappings programmed by guest IOMMU.
> > 
> > 2. qemu can be extended to serve as a vhost-user client:
> > remote VM mappings over the vhost-user protocol, and
> > map them into another VM's memory.
> > This mapping can take, for example, the form of
> > a BAR of a pci device, which I'll call here vhost-pci - 
> > with bus address allowed
> > by VM1's IOMMU mappings being translated into
> > offsets within this BAR within VM2's physical
> > memory space.
> > 
> > Since the translation can be a simple one, VM2
> > can perform it within its vhost-pci device driver.
> > 
> > While this setup would be the most useful with polling,
> > VM1's ioeventfd can also be mapped to
> > another VM2's irqfd, and vice versa, such that VMs
> > can trigger interrupts to each other without need
> > for a helper thread on the host.
> > 
> > 
> > The resulting channel might look something like the following:
> > 
> > +-- VM1 --------------+  +---VM2-----------+
> > | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
> > +---------------------+  +-----------------+
> > 
> > comparing the two diagrams, a vhost-user thread on the host is
> > no longer required, reducing the host CPU utilization when
> > polling is active.  At the same time, VM2 can not access all of VM1's
> > memory - it is limited by the iommu configuration setup by VM1.
> > 
> > 
> > Advantages over ivshmem:
> > 
> > - more flexibility, endpoint VMs do not have to place data at any
> >   specific locations to use the device, in practice this likely
> >   means less data copies.
> > - better standardization/code reuse
> >   virtio changes within guests would be fairly easy to implement
> >   and would also benefit other backends, besides vhost-user
> >   standard hotplug interfaces can be used to add and remove these
> >   channels as VMs are added or removed.
> > - migration support
> >   It's easy to implement since ownership of memory is well defined.
> >   For example, during migration VM2 can notify hypervisor of VM1
> >   by updating dirty bitmap each time is writes into VM1 memory.
> > 
> > Thanks,
> > 
> 
> 
> -- 
> Claudio Fontana
> Server Virtualization Architect
> Huawei Technologies Duesseldorf GmbH
> Riesstraße 25 - 80992 München

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] [opnfv-tech-discuss] rfc: vhost user enhancements for vm2vm communication
  2015-09-09  6:40   ` [Qemu-devel] " Zhang, Yang Z
@ 2015-09-09  8:39     ` Claudio Fontana
  2015-09-18 16:29       ` [Qemu-devel] RFC: virtio-peer shared memory based peer communication device Claudio Fontana
  2015-09-18 16:29       ` Claudio Fontana
  2015-09-09  8:39     ` [opnfv-tech-discuss] rfc: vhost user enhancements for vm2vm communication Claudio Fontana
  1 sibling, 2 replies; 80+ messages in thread
From: Claudio Fontana @ 2015-09-09  8:39 UTC (permalink / raw)
  To: Zhang, Yang Z, Michael S. Tsirkin, qemu-devel, virtualization,
	virtio-dev, opnfv-tech-discuss
  Cc: Jan Kiszka

On 09.09.2015 08:40, Zhang, Yang Z wrote:
> Claudio Fontana wrote on 2015-09-07:
>> Coming late to the party,
>>
>> On 31.08.2015 16:11, Michael S. Tsirkin wrote:
>>> Hello!
>>> During the KVM forum, we discussed supporting virtio on top
>>> of ivshmem. I have considered it, and came up with an alternative
>>> that has several advantages over that - please see below.
>>> Comments welcome.
>>
>> as Jan mentioned we actually discussed a virtio-shmem device which would
>> incorporate the advantages of ivshmem (so no need for a separate ivshmem
>> device), which would use the well known virtio interface, taking advantage of
>> the new virtio-1 virtqueue layout to split r/w and read-only rings as seen from
>> the two sides, and make use also of BAR0 which has been freed up for use by
>> the device.
> 
> Interesting! Can you elaborate it? 


Yes, I will post a more detailed proposal in the coming days.


>>
>> This way it would be possible to share the rings and the actual memory
>> for the buffers in the PCI bars. The guest VMs could decide to use the
>> shared memory regions directly as prepared by the hypervisor (in the
> 
> "the shared memory regions" here means share another VM's memory or like ivshmem?


It's explicitly about sharing memory between two specific VMs, as set up by the virtualization environment.


>> jailhouse case) or QEMU/KVM, or perform their own validation on the
>> input depending on the use case.
>>
>> Of course the communication between VMs needs in this case to be
>> pre-configured and is quite static (which is actually beneficial in our use case).
> 
> pre-configured means user knows which VMs will talk to each other and configure it when booting guest(i.e. in Qemu command line)?

Yes.

Ciao,

Claudio

> 
>>
>> But still in your proposed solution, each VM needs to be pre-configured to
>> communicate with a specific other VM using a separate device right?
>>
>> But I wonder if we are addressing the same problem.. in your case you are
>> looking at having a shared memory pool for all VMs potentially visible to all VMs
>> (the vhost-user case), while in the virtio-shmem proposal we discussed we
>> were assuming specific different regions for every channel.
>>
>> Ciao,
>>
>> Claudio

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-09  7:06   ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-09-11 15:39       ` Claudio Fontana
  0 siblings, 0 replies; 80+ messages in thread
From: Claudio Fontana @ 2015-09-11 15:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: opnfv-tech-discuss, virtio-dev, Jan Kiszka, qemu-devel, virtualization

On 09.09.2015 09:06, Michael S. Tsirkin wrote:
> On Mon, Sep 07, 2015 at 02:38:34PM +0200, Claudio Fontana wrote:
>> Coming late to the party, 
>>
>> On 31.08.2015 16:11, Michael S. Tsirkin wrote:
>>> Hello!
>>> During the KVM forum, we discussed supporting virtio on top
>>> of ivshmem. I have considered it, and came up with an alternative
>>> that has several advantages over that - please see below.
>>> Comments welcome.
>>
>> as Jan mentioned we actually discussed a virtio-shmem device which would incorporate the advantages of ivshmem (so no need for a separate ivshmem device), which would use the well known virtio interface, taking advantage of the new virtio-1 virtqueue layout to split r/w and read-only rings as seen from the two sides, and make use also of BAR0 which has been freed up for use by the device.
>>
>> This way it would be possible to share the rings and the actual memory for the buffers in the PCI bars. The guest VMs could decide to use the shared memory regions directly as prepared by the hypervisor (in the jailhouse case) or QEMU/KVM, or perform their own validation on the input depending on the use case.
>>
>> Of course the communication between VMs needs in this case to be pre-configured and is quite static (which is actually beneficial in our use case).
>>
>> But still in your proposed solution, each VM needs to be pre-configured to communicate with a specific other VM using a separate device right?
>>
>> But I wonder if we are addressing the same problem.. in your case you are looking at having a shared memory pool for all VMs potentially visible to all VMs (the vhost-user case), while in the virtio-shmem proposal we discussed we were assuming specific different regions for every channel.
>>
>> Ciao,
>>
>> Claudio
> 
> The problem, as I see it, is to allow inter-vm communication with
> polling (to get very low latencies) but polling within VMs only, without
> need to run a host thread (which when polling uses up a host CPU).
> 
> What was proposed was to simply change virtio to allow
> "offset within BAR" instead of PA.

There are many consequences to this: offset within BAR alone is not enough, and there are multiple things at the virtio level that need sorting out.
We also need to consider virtio-mmio etc.

> This would allow VM2VM communication if there are only 2 VMs,
> but if data needs to be sent to multiple VMs, you
> must copy it.

Not necessarily; however, getting it to work (sharing the backend window and arbitrating the multicast) is really hard.

> 
> Additionally, it's a single-purpose feature: you can use it from
> a userspace PMD but linux will never use it.
> 
> 
> My proposal is a superset: don't require that BAR memory is
> used, use IOMMU translation tables.
> This way, data can be sent to multiple VMs by sharing the same
> memory with them all.

Can you describe in detail how your proposal deals with the arbitration necessary for multicast handling?

> 
> It is still possible to put data in some device BAR if that's
> what the guest wants to do: just program the IOMMU to limit
> virtio to the memory range that is within this BAR.
> 
> Another advantage here is that the feature is more generally useful.
> 
> 
>>>
>>> -----
>>>
>>> Existing solutions to userspace switching between VMs on the
>>> same host are vhost-user and ivshmem.
>>>
>>> vhost-user works by mapping memory of all VMs being bridged into the
>>> switch memory space.
>>>
>>> By comparison, ivshmem works by exposing a shared region of memory to all VMs.
>>> VMs are required to use this region to store packets. The switch only
>>> needs access to this region.
>>>
>>> Another difference between vhost-user and ivshmem surfaces when polling
>>> is used. With vhost-user, the switch is required to handle
>>> data movement between VMs, if using polling, this means that 1 host CPU
>>> needs to be sacrificed for this task.
>>>
>>> This is easiest to understand when one of the VMs is
>>> used with VF pass-through. This can be schematically shown below:
>>>
>>> +-- VM1 --------------+            +---VM2-----------+
>>> | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
>>> +---------------------+            +-----------------+
>>>
>>>
>>> With ivshmem in theory communication can happen directly, with two VMs
>>> polling the shared memory region.
>>>
>>>
>>> I won't spend time listing advantages of vhost-user over ivshmem.
>>> Instead, having identified two advantages of ivshmem over vhost-user,
>>> below is a proposal to extend vhost-user to gain the advantages
>>> of ivshmem.
>>>
>>>
>>> 1: virtio in guest can be extended to allow support
>>> for IOMMUs. This provides guest with full flexibility
>>> about memory which is readable or write able by each device.
>>> By setting up a virtio device for each other VM we need to
>>> communicate to, guest gets full control of its security, from
>>> mapping all memory (like with current vhost-user) to only
>>> mapping buffers used for networking (like ivshmem) to
>>> transient mappings for the duration of data transfer only.
>>> This also allows use of VFIO within guests, for improved
>>> security.
>>>
>>> vhost user would need to be extended to send the
>>> mappings programmed by guest IOMMU.
>>>
>>> 2. qemu can be extended to serve as a vhost-user client:
>>> remote VM mappings over the vhost-user protocol, and
>>> map them into another VM's memory.
>>> This mapping can take, for example, the form of
>>> a BAR of a pci device, which I'll call here vhost-pci - 
>>> with bus address allowed
>>> by VM1's IOMMU mappings being translated into
>>> offsets within this BAR within VM2's physical
>>> memory space.
>>>
>>> Since the translation can be a simple one, VM2
>>> can perform it within its vhost-pci device driver.
>>>
>>> While this setup would be the most useful with polling,
>>> VM1's ioeventfd can also be mapped to
>>> another VM2's irqfd, and vice versa, such that VMs
>>> can trigger interrupts to each other without need
>>> for a helper thread on the host.
>>>
>>>
>>> The resulting channel might look something like the following:
>>>
>>> +-- VM1 --------------+  +---VM2-----------+
>>> | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
>>> +---------------------+  +-----------------+
>>>
>>> comparing the two diagrams, a vhost-user thread on the host is
>>> no longer required, reducing the host CPU utilization when
>>> polling is active.  At the same time, VM2 can not access all of VM1's
>>> memory - it is limited by the iommu configuration setup by VM1.
>>>
>>>
>>> Advantages over ivshmem:
>>>
>>> - more flexibility, endpoint VMs do not have to place data at any
>>>   specific locations to use the device, in practice this likely
>>>   means less data copies.
>>> - better standardization/code reuse
>>>   virtio changes within guests would be fairly easy to implement
>>>   and would also benefit other backends, besides vhost-user
>>>   standard hotplug interfaces can be used to add and remove these
>>>   channels as VMs are added or removed.
>>> - migration support
>>>   It's easy to implement since ownership of memory is well defined.
>>>   For example, during migration VM2 can notify hypervisor of VM1
>>>   by updating dirty bitmap each time is writes into VM1 memory.
>>>
>>> Thanks,
>>>
>>
>>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-09-11 15:39       ` Claudio Fontana
  (?)
  (?)
@ 2015-09-13  9:12       ` Michael S. Tsirkin
  2015-09-14  0:43           ` Zhang, Yang Z
  -1 siblings, 1 reply; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-13  9:12 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: opnfv-tech-discuss, virtio-dev, Jan Kiszka, qemu-devel, virtualization

On Fri, Sep 11, 2015 at 05:39:07PM +0200, Claudio Fontana wrote:
> On 09.09.2015 09:06, Michael S. Tsirkin wrote:
> > On Mon, Sep 07, 2015 at 02:38:34PM +0200, Claudio Fontana wrote:
> >> Coming late to the party, 
> >>
> >> On 31.08.2015 16:11, Michael S. Tsirkin wrote:
> >>> Hello!
> >>> During the KVM forum, we discussed supporting virtio on top
> >>> of ivshmem. I have considered it, and came up with an alternative
> >>> that has several advantages over that - please see below.
> >>> Comments welcome.
> >>
> >> as Jan mentioned we actually discussed a virtio-shmem device which would incorporate the advantages of ivshmem (so no need for a separate ivshmem device), which would use the well known virtio interface, taking advantage of the new virtio-1 virtqueue layout to split r/w and read-only rings as seen from the two sides, and make use also of BAR0 which has been freed up for use by the device.
> >>
> >> This way it would be possible to share the rings and the actual memory for the buffers in the PCI bars. The guest VMs could decide to use the shared memory regions directly as prepared by the hypervisor (in the jailhouse case) or QEMU/KVM, or perform their own validation on the input depending on the use case.
> >>
> >> Of course the communication between VMs needs in this case to be pre-configured and is quite static (which is actually beneficial in our use case).
> >>
> >> But still in your proposed solution, each VM needs to be pre-configured to communicate with a specific other VM using a separate device right?
> >>
> >> But I wonder if we are addressing the same problem.. in your case you are looking at having a shared memory pool for all VMs potentially visible to all VMs (the vhost-user case), while in the virtio-shmem proposal we discussed we were assuming specific different regions for every channel.
> >>
> >> Ciao,
> >>
> >> Claudio
> > 
> > The problem, as I see it, is to allow inter-vm communication with
> > polling (to get very low latencies) but polling within VMs only, without
> > need to run a host thread (which when polling uses up a host CPU).
> > 
> > What was proposed was to simply change virtio to allow
> > "offset within BAR" instead of PA.
> 
> There are many consequences to this, offset within BAR alone is not enough, there are multiple things at the virtio level that need sorting out.
> Also we need to consider virtio-mmio etc.
> 
> > This would allow VM2VM communication if there are only 2 VMs,
> > but if data needs to be sent to multiple VMs, you
> > must copy it.
> 
> Not necessarily, however getting it to work (sharing the backend window and arbitrating the multicast) is really hard.
> 
> > 
> > Additionally, it's a single-purpose feature: you can use it from
> > a userspace PMD but linux will never use it.
> > 
> > 
> > My proposal is a superset: don't require that BAR memory is
> > used, use IOMMU translation tables.
> > This way, data can be sent to multiple VMs by sharing the same
> > memory with them all.
> 
> Can you describe in detail how your proposal deals with the arbitration necessary for multicast handling?

Basically it falls out naturally. Consider a Linux guest as an example,
and assume dynamic mappings for simplicity.

Multicast is done by a bridge on the guest side. That code clones the
skb (reference-counting its page fragments) and passes it to multiple
ports. Each of these ports will program the IOMMU to grant the
corresponding device read access to the fragments.
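
A rough standalone sketch of that flow is below; it is purely
illustrative, grant_ro_mapping() stands in for the per-device vIOMMU
programming, and none of this is actual kernel code:

    /* Illustrative model only: one payload shared with N destinations
     * by granting each a read-only mapping instead of copying it. */
    #include <stddef.h>
    #include <stdio.h>

    struct frag {
            const void *data;
            size_t      len;
            unsigned    refcnt;     /* page-fragment style reference count */
    };

    /* Hypothetical helper: would program the vIOMMU so that 'dev' may
     * read the fragment; here it just logs what would be mapped. */
    static void grant_ro_mapping(int dev, const struct frag *f)
    {
            printf("dev%d: map %zu bytes read-only\n", dev, f->len);
    }

    static void multicast(struct frag *f, const int *devs, size_t ndevs)
    {
            for (size_t i = 0; i < ndevs; i++) {
                    f->refcnt++;            /* clone keeps a reference, no copy */
                    grant_ro_mapping(devs[i], f);
            }
    }

    int main(void)
    {
            static const char payload[] = "hello";
            struct frag f = { payload, sizeof(payload), 1 };
            const int ports[] = { 0, 1, 2 };
            multicast(&f, ports, 3);
            return 0;
    }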



> > 
> > It is still possible to put data in some device BAR if that's
> > what the guest wants to do: just program the IOMMU to limit
> > virtio to the memory range that is within this BAR.
> > 
> > Another advantage here is that the feature is more generally useful.
> > 
> > 
> >>>
> >>> -----
> >>>
> >>> Existing solutions to userspace switching between VMs on the
> >>> same host are vhost-user and ivshmem.
> >>>
> >>> vhost-user works by mapping memory of all VMs being bridged into the
> >>> switch memory space.
> >>>
> >>> By comparison, ivshmem works by exposing a shared region of memory to all VMs.
> >>> VMs are required to use this region to store packets. The switch only
> >>> needs access to this region.
> >>>
> >>> Another difference between vhost-user and ivshmem surfaces when polling
> >>> is used. With vhost-user, the switch is required to handle
> >>> data movement between VMs, if using polling, this means that 1 host CPU
> >>> needs to be sacrificed for this task.
> >>>
> >>> This is easiest to understand when one of the VMs is
> >>> used with VF pass-through. This can be schematically shown below:
> >>>
> >>> +-- VM1 --------------+            +---VM2-----------+
> >>> | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
> >>> +---------------------+            +-----------------+
> >>>
> >>>
> >>> With ivshmem in theory communication can happen directly, with two VMs
> >>> polling the shared memory region.
> >>>
> >>>
> >>> I won't spend time listing advantages of vhost-user over ivshmem.
> >>> Instead, having identified two advantages of ivshmem over vhost-user,
> >>> below is a proposal to extend vhost-user to gain the advantages
> >>> of ivshmem.
> >>>
> >>>
> >>> 1: virtio in guest can be extended to allow support
> >>> for IOMMUs. This provides guest with full flexibility
> >>> about memory which is readable or write able by each device.
> >>> By setting up a virtio device for each other VM we need to
> >>> communicate to, guest gets full control of its security, from
> >>> mapping all memory (like with current vhost-user) to only
> >>> mapping buffers used for networking (like ivshmem) to
> >>> transient mappings for the duration of data transfer only.
> >>> This also allows use of VFIO within guests, for improved
> >>> security.
> >>>
> >>> vhost user would need to be extended to send the
> >>> mappings programmed by guest IOMMU.
> >>>
> >>> 2. qemu can be extended to serve as a vhost-user client:
> >>> remote VM mappings over the vhost-user protocol, and
> >>> map them into another VM's memory.
> >>> This mapping can take, for example, the form of
> >>> a BAR of a pci device, which I'll call here vhost-pci - 
> >>> with bus address allowed
> >>> by VM1's IOMMU mappings being translated into
> >>> offsets within this BAR within VM2's physical
> >>> memory space.
> >>>
> >>> Since the translation can be a simple one, VM2
> >>> can perform it within its vhost-pci device driver.
> >>>
> >>> While this setup would be the most useful with polling,
> >>> VM1's ioeventfd can also be mapped to
> >>> another VM2's irqfd, and vice versa, such that VMs
> >>> can trigger interrupts to each other without need
> >>> for a helper thread on the host.
> >>>
> >>>
> >>> The resulting channel might look something like the following:
> >>>
> >>> +-- VM1 --------------+  +---VM2-----------+
> >>> | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
> >>> +---------------------+  +-----------------+
> >>>
> >>> comparing the two diagrams, a vhost-user thread on the host is
> >>> no longer required, reducing the host CPU utilization when
> >>> polling is active.  At the same time, VM2 can not access all of VM1's
> >>> memory - it is limited by the iommu configuration setup by VM1.
> >>>
> >>>
> >>> Advantages over ivshmem:
> >>>
> >>> - more flexibility, endpoint VMs do not have to place data at any
> >>>   specific locations to use the device, in practice this likely
> >>>   means less data copies.
> >>> - better standardization/code reuse
> >>>   virtio changes within guests would be fairly easy to implement
> >>>   and would also benefit other backends, besides vhost-user
> >>>   standard hotplug interfaces can be used to add and remove these
> >>>   channels as VMs are added or removed.
> >>> - migration support
> >>>   It's easy to implement since ownership of memory is well defined.
> >>>   For example, during migration VM2 can notify hypervisor of VM1
> >>>   by updating dirty bitmap each time is writes into VM1 memory.
> >>>
> >>> Thanks,
> >>>
> >>
> >>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: rfc: vhost user enhancements for vm2vm communication
  2015-09-11 15:39       ` Claudio Fontana
  (?)
@ 2015-09-13  9:12       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-13  9:12 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: opnfv-tech-discuss, virtio-dev, Jan Kiszka, qemu-devel, virtualization

On Fri, Sep 11, 2015 at 05:39:07PM +0200, Claudio Fontana wrote:
> On 09.09.2015 09:06, Michael S. Tsirkin wrote:
> > On Mon, Sep 07, 2015 at 02:38:34PM +0200, Claudio Fontana wrote:
> >> Coming late to the party, 
> >>
> >> On 31.08.2015 16:11, Michael S. Tsirkin wrote:
> >>> Hello!
> >>> During the KVM forum, we discussed supporting virtio on top
> >>> of ivshmem. I have considered it, and came up with an alternative
> >>> that has several advantages over that - please see below.
> >>> Comments welcome.
> >>
> >> as Jan mentioned we actually discussed a virtio-shmem device which would incorporate the advantages of ivshmem (so no need for a separate ivshmem device), which would use the well known virtio interface, taking advantage of the new virtio-1 virtqueue layout to split r/w and read-only rings as seen from the two sides, and make use also of BAR0 which has been freed up for use by the device.
> >>
> >> This way it would be possible to share the rings and the actual memory for the buffers in the PCI bars. The guest VMs could decide to use the shared memory regions directly as prepared by the hypervisor (in the jailhouse case) or QEMU/KVM, or perform their own validation on the input depending on the use case.
> >>
> >> Of course the communication between VMs needs in this case to be pre-configured and is quite static (which is actually beneficial in our use case).
> >>
> >> But still in your proposed solution, each VM needs to be pre-configured to communicate with a specific other VM using a separate device right?
> >>
> >> But I wonder if we are addressing the same problem.. in your case you are looking at having a shared memory pool for all VMs potentially visible to all VMs (the vhost-user case), while in the virtio-shmem proposal we discussed we were assuming specific different regions for every channel.
> >>
> >> Ciao,
> >>
> >> Claudio
> > 
> > The problem, as I see it, is to allow inter-vm communication with
> > polling (to get very low latencies) but polling within VMs only, without
> > need to run a host thread (which when polling uses up a host CPU).
> > 
> > What was proposed was to simply change virtio to allow
> > "offset within BAR" instead of PA.
> 
> There are many consequences to this, offset within BAR alone is not enough, there are multiple things at the virtio level that need sorting out.
> Also we need to consider virtio-mmio etc.
> 
> > This would allow VM2VM communication if there are only 2 VMs,
> > but if data needs to be sent to multiple VMs, you
> > must copy it.
> 
> Not necessarily, however getting it to work (sharing the backend window and arbitrating the multicast) is really hard.
> 
> > 
> > Additionally, it's a single-purpose feature: you can use it from
> > a userspace PMD but linux will never use it.
> > 
> > 
> > My proposal is a superset: don't require that BAR memory is
> > used, use IOMMU translation tables.
> > This way, data can be sent to multiple VMs by sharing the same
> > memory with them all.
> 
> Can you describe in detail how your proposal deals with the arbitration necessary for multicast handling?

Basically it falls out naturally. Consider linux guest as an example,
and assume dynamic mappings for simplicity.

Multicast is done by a bridge on the guest side. That code clones the
skb (reference-counting page fragments) and passes it to multiple ports.
Each of these ports will program the IOMMU to give the relevant
device read access to the fragments.
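
To make that concrete, here is a rough userspace model of the idea
(just a sketch: iommu_map_ro()/iommu_unmap() are placeholders for
whatever mechanism actually programs the per-device IOMMU, and the
refcount stands in for the skb page-fragment refcounting):

#include <stdio.h>
#include <stddef.h>

struct frag {
    void *data;
    size_t len;
    int refcount;
};

/* placeholder: grant device 'dev' read access to [data, data+len) */
static void iommu_map_ro(int dev, void *data, size_t len)
{
    printf("dev %d: map %p +%zu read-only\n", dev, data, len);
}

/* placeholder: revoke the mapping once the device is done with it */
static void iommu_unmap(int dev, void *data, size_t len)
{
    printf("dev %d: unmap %p +%zu\n", dev, data, len);
}

static void send_to_port(int dev, struct frag *f)
{
    f->refcount++;                      /* the "clone" shares the fragment */
    iommu_map_ro(dev, f->data, f->len);
    /* ... post a descriptor pointing at the mapped fragment ... */
}

static void tx_complete(int dev, struct frag *f)
{
    iommu_unmap(dev, f->data, f->len);
    f->refcount--;                      /* reusable once this reaches 0 */
}

int main(void)
{
    char payload[64] = "some packet data";
    struct frag f = { payload, sizeof(payload), 1 };
    int port;

    for (port = 0; port < 3; port++)    /* bridge floods to 3 ports */
        send_to_port(port, &f);
    for (port = 0; port < 3; port++)    /* each port completes transmit */
        tx_complete(port, &f);
    return 0;
}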



> > 
> > It is still possible to put data in some device BAR if that's
> > what the guest wants to do: just program the IOMMU to limit
> > virtio to the memory range that is within this BAR.
> > 
> > Another advantage here is that the feature is more generally useful.
> > 
> > 
> >>>
> >>> -----
> >>>
> >>> Existing solutions to userspace switching between VMs on the
> >>> same host are vhost-user and ivshmem.
> >>>
> >>> vhost-user works by mapping memory of all VMs being bridged into the
> >>> switch memory space.
> >>>
> >>> By comparison, ivshmem works by exposing a shared region of memory to all VMs.
> >>> VMs are required to use this region to store packets. The switch only
> >>> needs access to this region.
> >>>
> >>> Another difference between vhost-user and ivshmem surfaces when polling
> >>> is used. With vhost-user, the switch is required to handle
> >>> data movement between VMs, if using polling, this means that 1 host CPU
> >>> needs to be sacrificed for this task.
> >>>
> >>> This is easiest to understand when one of the VMs is
> >>> used with VF pass-through. This can be schematically shown below:
> >>>
> >>> +-- VM1 --------------+            +---VM2-----------+
> >>> | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
> >>> +---------------------+            +-----------------+
> >>>
> >>>
> >>> With ivshmem in theory communication can happen directly, with two VMs
> >>> polling the shared memory region.
> >>>
> >>>
> >>> I won't spend time listing advantages of vhost-user over ivshmem.
> >>> Instead, having identified two advantages of ivshmem over vhost-user,
> >>> below is a proposal to extend vhost-user to gain the advantages
> >>> of ivshmem.
> >>>
> >>>
> >>> 1: virtio in guest can be extended to allow support
> >>> for IOMMUs. This provides guest with full flexibility
> >>> about memory which is readable or write able by each device.
> >>> By setting up a virtio device for each other VM we need to
> >>> communicate to, guest gets full control of its security, from
> >>> mapping all memory (like with current vhost-user) to only
> >>> mapping buffers used for networking (like ivshmem) to
> >>> transient mappings for the duration of data transfer only.
> >>> This also allows use of VFIO within guests, for improved
> >>> security.
> >>>
> >>> vhost user would need to be extended to send the
> >>> mappings programmed by guest IOMMU.
> >>>
> >>> 2. qemu can be extended to serve as a vhost-user client:
> >>> remote VM mappings over the vhost-user protocol, and
> >>> map them into another VM's memory.
> >>> This mapping can take, for example, the form of
> >>> a BAR of a pci device, which I'll call here vhost-pci - 
> >>> with bus address allowed
> >>> by VM1's IOMMU mappings being translated into
> >>> offsets within this BAR within VM2's physical
> >>> memory space.
> >>>
> >>> Since the translation can be a simple one, VM2
> >>> can perform it within its vhost-pci device driver.
> >>>
> >>> While this setup would be the most useful with polling,
> >>> VM1's ioeventfd can also be mapped to
> >>> another VM2's irqfd, and vice versa, such that VMs
> >>> can trigger interrupts to each other without need
> >>> for a helper thread on the host.
> >>>
> >>>
> >>> The resulting channel might look something like the following:
> >>>
> >>> +-- VM1 --------------+  +---VM2-----------+
> >>> | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
> >>> +---------------------+  +-----------------+
> >>>
> >>> comparing the two diagrams, a vhost-user thread on the host is
> >>> no longer required, reducing the host CPU utilization when
> >>> polling is active.  At the same time, VM2 can not access all of VM1's
> >>> memory - it is limited by the iommu configuration setup by VM1.
> >>>
> >>>
> >>> Advantages over ivshmem:
> >>>
> >>> - more flexibility, endpoint VMs do not have to place data at any
> >>>   specific locations to use the device, in practice this likely
> >>>   means less data copies.
> >>> - better standardization/code reuse
> >>>   virtio changes within guests would be fairly easy to implement
> >>>   and would also benefit other backends, besides vhost-user
> >>>   standard hotplug interfaces can be used to add and remove these
> >>>   channels as VMs are added or removed.
> >>> - migration support
> >>>   It's easy to implement since ownership of memory is well defined.
> >>>   For example, during migration VM2 can notify hypervisor of VM1
> >>>   by updating dirty bitmap each time is writes into VM1 memory.
> >>>
> >>> Thanks,
> >>>
> >>
> >>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] [opnfv-tech-discuss] rfc: vhost user enhancements for vm2vm communication
  2015-09-13  9:12       ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-09-14  0:43           ` Zhang, Yang Z
  0 siblings, 0 replies; 80+ messages in thread
From: Zhang, Yang Z @ 2015-09-14  0:43 UTC (permalink / raw)
  To: Michael S. Tsirkin, Claudio Fontana
  Cc: qemu-devel, virtio-dev, virtualization, opnfv-tech-discuss, Jan Kiszka

Michael S. Tsirkin wrote on 2015-09-13:
> On Fri, Sep 11, 2015 at 05:39:07PM +0200, Claudio Fontana wrote:
>> On 09.09.2015 09:06, Michael S. Tsirkin wrote:
>> 
>> There are many consequences to this, offset within BAR alone is not
>> enough, there are multiple things at the virtio level that need sorting
>> out. Also we need to consider virtio-mmio etc.
>> 
>>> This would allow VM2VM communication if there are only 2 VMs, but
>>> if data needs to be sent to multiple VMs, you must copy it.
>> 
>> Not necessarily, however getting it to work (sharing the backend window
>> and arbitrating the multicast) is really hard.
>> 
>>> 
>>> Additionally, it's a single-purpose feature: you can use it from a
>>> userspace PMD but linux will never use it.
>>> 
>>> 
>>> My proposal is a superset: don't require that BAR memory is used,
>>> use IOMMU translation tables.
>>> This way, data can be sent to multiple VMs by sharing the same
>>> memory with them all.
>> 
>> Can you describe in detail how your proposal deals with the
>> arbitration
> necessary for multicast handling?
> 
> Basically it falls out naturally. Consider linux guest as an example,
> and assume dynamic mappings for simplicity.
> 
> Multicast is done by a bridge on the guest side. That code clones the
> skb (reference-counting page fragments) and passes it to multiple ports.
> Each of these will program the IOMMU to allow read access to the
> fragments to the relevant device.

How would this work with a vswitch on the host side, like OVS? The flow table is inside the host, but the guest cannot see it.

Best regards,
Yang

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] [virtio-dev] rfc: vhost user enhancements for vm2vm communication
  2015-08-31 14:11 [Qemu-devel] rfc: vhost user enhancements for vm2vm communication Michael S. Tsirkin
@ 2015-09-14 16:00   ` Stefan Hajnoczi
  2015-09-01  7:35   ` Jan Kiszka
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 80+ messages in thread
From: Stefan Hajnoczi @ 2015-09-14 16:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Jan Kiszka, Claudio.Fontana, qemu-devel,
	virtualization, opnfv-tech-discuss

On Mon, Aug 31, 2015 at 05:11:02PM +0300, Michael S. Tsirkin wrote:
> The resulting channel might look something like the following:
> 
> +-- VM1 --------------+  +---VM2-----------+
> | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
> +---------------------+  +-----------------+
> 
> comparing the two diagrams, a vhost-user thread on the host is
> no longer required, reducing the host CPU utilization when
> polling is active.  At the same time, VM2 can not access all of VM1's
> memory - it is limited by the iommu configuration setup by VM1.

Can this use virtio's vring?  If standard virtio devices (net, blk, etc)
cannot be used because this scheme requires new descriptor rings or
memory layout, then this is more an "ivshmem 2.0" than "virtio".

I'm not clear on how vhost-pci works - is this a host kernel component
that updates VM2's memory mappings when VM1 changes iommu entries?

In VM2 there is a userspace network router.  It can mmap the VF's BARs
to access the physical network.  What about the virtual NIC to VM1 -
how does the userspace network router access it?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [Qemu-devel] RFC: virtio-peer shared memory based peer communication device
  2015-09-09  8:39     ` Claudio Fontana
@ 2015-09-18 16:29       ` Claudio Fontana
  2015-09-18 21:11           ` Paolo Bonzini
                           ` (2 more replies)
  2015-09-18 16:29       ` Claudio Fontana
  1 sibling, 3 replies; 80+ messages in thread
From: Claudio Fontana @ 2015-09-18 16:29 UTC (permalink / raw)
  To: Zhang, Yang Z, Michael S. Tsirkin, qemu-devel, virtualization,
	virtio-dev, opnfv-tech-discuss
  Cc: Jan Kiszka

Hello,

this is a first RFC for virtio-peer 0.1, which is still very much a work in progress:

https://github.com/hw-claudio/virtio-peer/wiki

It is also available as PDF there, but the text is reproduced here for commenting:

Peer shared memory communication device (virtio-peer)

General Overview

(I recommend looking at the PDF for some clarifying pictures)

The Virtio Peer shared memory communication device (virtio-peer) is a virtual device which allows high-performance, low-latency guest-to-guest communication. It uses a new queue extension feature, tentatively called VIRTIO_F_WINDOW, which indicates that descriptor tables, available and used rings, and Queue Data reside in physical memory ranges called Windows, each identified by a unique identifier called a WindowID.

Each queue is configured to belong to a specific WindowID, and during queue identification and configuration, the Physical Guest Addresses in the queue configuration fields are to be considered as offsets in octets from the start of the corresponding Window.

For example for PCI, in the virtio_pci_common_cfg structure these fields are affected:

le64 queue_desc;
le64 queue_avail;
le64 queue_used;

For MMIO instead these MMIO Device layout fields are affected:

QueueDescLow, QueueDescHigh
QueueAvailLow, QueueAvailHigh
QueueUsedLow, QueueUsedHigh

For PCI a new virtio_pci_cap of cfg type VIRTIO_PCI_CAP_WINDOW_CFG is defined.

It contains the following fields:

struct virtio_pci_window_cap {
   struct virtio_pci_cap cap;
}

This configuration structure is used to identify the existing Windows, their WindowIDs, ranges and flags. The WindowID is read from the cap.bar field. The Window's starting physical guest address is calculated from the contents of the PCI BAR register with index WindowID plus cap.offset. The Window size is read from the cap.length field.
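
For illustration, the Window lookup described above might look like
this in a driver (a minimal sketch; bar_base[], cap_bar, cap_offset
and cap_length stand for values the driver reads elsewhere):

#include <stdint.h>

struct window {
    uint8_t  id;
    uint64_t base;      /* physical guest address of the Window */
    uint32_t size;
};

/* WindowID = cap.bar; base = BAR[WindowID] + cap.offset; size = cap.length */
static struct window window_from_cap(const uint64_t bar_base[6],
                                     uint8_t cap_bar,
                                     uint32_t cap_offset,
                                     uint32_t cap_length)
{
    struct window w = {
        .id   = cap_bar,
        .base = bar_base[cap_bar] + cap_offset,
        .size = cap_length,
    };
    return w;
}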

XXX TODO XXX describe also the new MMIO registers here.
Virtqueue discovery:

We are faced with two main options with regard to virtqueue discovery in this model.

OPTION1: The simplest option is to make the previous fields read-only when using Windows, and have the virtualization environment / hypervisor provide the starting addresses of the descriptor table, avail ring and used ring, possibly allowing more flexibility on the Queue Data.

OPTION2: The other option is to have the guest completely in control of the allocation decisions inside its write Window, including the starting addresses of the virtqueue data structures inside the Window, and to provide a simple virtqueue peer initialization mechanism.

The virtio-peer device is the simplest device implementation which makes use of the Window feature, containing only two virtqueues. In addition to the Desc Table and Rings, these virtqueues also contain Queue Data areas inside the respective Windows. It uses two Windows, one for data which is read-only for the driver (read Window), and a separate one for data which is read-write for the driver (write Window).

In the Descriptor Table of each virtqueue, the field le64 addr; is added to the Queue Data address of the corresponding Window to obtain the physical guest address of a buffer. A length value in a descriptor which reaches beyond the Queue Data area is invalid, and its use causes undefined behavior.
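
A sketch of the buffer lookup and validity check this implies (names
are illustrative; queue_data_base and queue_data_size describe the
Queue Data area of the Window the queue belongs to):

#include <stdint.h>

/* desc.addr is an offset into the Queue Data area; the buffer's
 * physical guest address is queue_data_base + desc.addr, and any
 * descriptor reaching past the Queue Data area is invalid. */
static int resolve_buffer(uint64_t queue_data_base, uint32_t queue_data_size,
                          uint64_t desc_addr, uint32_t desc_len,
                          uint64_t *buf_pa)
{
    if (desc_addr > queue_data_size ||
        desc_len > queue_data_size - desc_addr)
        return -1;          /* invalid: undefined behavior per the text */
    *buf_pa = queue_data_base + desc_addr;
    return 0;
}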

The driver must consider the Desc Table, Avail Ring and Queue Data area of the receiveq as read-only, and its Used Ring as read-write. The Desc Table, Avail Ring and Queue Data of the receiveq are therefore allocated inside the read Window, while its Used Ring is allocated in the write Window.

The driver must consider the Desc Table, Avail Ring and Queue Data area of the transmitq as read-write, and its Used Ring as read-only. The Desc Table, Avail Ring and Queue Data of the transmitq are therefore allocated inside the write Window, while its Used Ring is allocated in the read Window.

Note that in OPTION1, this is done by the hypervisor, while in OPTION2, this is fully under control of the peers (with some hypervisor involvement during initialization).

5.7.1 Device ID
13

5.7.2 Virtqueues
0 receiveq (RX), 1 transmitq (TX)

5.7.3 Feature Bits
Possibly VIRTIO_F_MULTICAST (not clear yet; left out for now)

5.7.4 Device configuration layout

struct virtio_peer_config {
    le64 queue_data_offset;
    le32 queue_data_size;
    u8 queue_flags; /* read-only flags */
    u8 queue_window_idr; /* read-only */
    u8 queue_window_idw; /* read-only */
}

The fields above are queue-specific, and are thus selected by writing to the queue selector field in the common configuration structure.

queue_data_offset is the offset of the Queue Data area from the start of the Window, and queue_data_size is the size of the Queue Data area. For the Read Window, queue_data_offset and queue_data_size are read-only. For the Write Window, they are read-write.

The queue_flags field is a flags bitfield with the following bit already defined: (1) = FLAGS_REMOTE: this queue's descriptor table, avail ring and data are read-only and initialized by the remote peer, while the used ring is initialized by the driver. If this flag is not set, this queue's descriptor table, avail ring and data are read-write and initialized by the driver, while the used ring is initialized by the remote peer. queue_window_idr and queue_window_idw identify the read Window and write Window for this queue (Window IDs).
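
In C terms the flag could be encoded and tested as follows (a sketch;
only the bit value comes from the text above, assuming "(1)" means the
bit of value 1, and the macro and function names are made up):

#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_PEER_QUEUE_FLAGS_REMOTE  (1u << 0)   /* FLAGS_REMOTE */

/* If set, this queue's desc table, avail ring and data are initialized
 * by the remote peer and are read-only locally; only the used ring is
 * initialized by the local driver. */
static bool queue_is_remote(uint8_t queue_flags)
{
    return queue_flags & VIRTIO_PEER_QUEUE_FLAGS_REMOTE;
}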

5.7.5 Device Initialization

Initialization of the virtqueues follows the generic procedure for Virtqueue Initialization, with the following modifications.

OPTION1: the driver replaces the "Allocate and zero" step for the data structures, and the write to the queue configuration registers, with a read from the queue configuration registers to obtain the addresses of the virtqueue data structures.

OPTION2: for each virtqueue, the driver allocates and zeroes only the read-write data structures as usual, skipping the read-only queue structures, which from the driver's point of view are initialized by the device (they are meant to be initialized by the peer). The queue_flags configuration field can be used to determine which structures are to be initialized locally, and the queue window id registers indicate the Windows used to reach the data structures.

Under OPTION2, this feature adds the requirement to enable all virtqueues before DRIVER_OK (which is already done in practice, as usual by writing 1 to the queue_enable field). If the driver reads back the queue_enable field for a queue which has not also been enabled by the remote peer, the device returns 0 (disabled) until the remote peer has initialized its own share of the data structures for the corresponding virtqueue. All queue configuration fields which still need remote initialization (queue_desc, queue_avail, queue_used) have a reset value of 0.

When the feature bit is detected, the virtio driver will delay setting the DRIVER_OK status for the device. When both peers have enabled the queues by writing 1 to the queue_enable fields, the driver will be notified via a configuration change interrupt (VIRTIO_PCI_ISR_CONFIG). This allows the driver to read the necessary queue configuration fields as initialized by the remote peer, and then proceed to set the DRIVER_OK status for the device to signal the completion of the initialization steps.
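
To summarize, a rough outline of the OPTION2 handshake from the
driver's side (all the helpers below are placeholders for the
transport-specific accessors, not real API names):

#include <stdbool.h>
#include <stdint.h>

struct vq_cfg;
extern bool     vq_locally_owned(const struct vq_cfg *q); /* from queue_flags */
extern void     vq_alloc_local_parts(struct vq_cfg *q);   /* allocate and zero */
extern void     vq_write_enable(struct vq_cfg *q, uint16_t val);
extern uint16_t vq_read_enable(const struct vq_cfg *q);
extern void     wait_for_config_change(void);   /* VIRTIO_PCI_ISR_CONFIG */
extern void     vq_read_peer_addrs(struct vq_cfg *q);
extern void     set_driver_ok(void);

static void virtio_peer_driver_init(struct vq_cfg *qs[], int nqueues)
{
    bool all_enabled;
    int i;

    for (i = 0; i < nqueues; i++) {
        if (vq_locally_owned(qs[i]))
            vq_alloc_local_parts(qs[i]);
        vq_write_enable(qs[i], 1);
    }

    /* queue_enable reads back 0 until the remote peer has also
     * enabled its side of the queue. */
    do {
        all_enabled = true;
        for (i = 0; i < nqueues; i++)
            if (vq_read_enable(qs[i]) == 0)
                all_enabled = false;
        if (!all_enabled)
            wait_for_config_change();
    } while (!all_enabled);

    for (i = 0; i < nqueues; i++)
        vq_read_peer_addrs(qs[i]);  /* peer-initialized fields now valid */

    set_driver_ok();
}
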
5.7.6 Device Operation

Data is received from the peer on the receive virtqueue. Data is transmitted to the peer using the transmit virtqueue.
5.7.6.1

(omitted)
5.7.6.2 Transmitting data

Transmitting a chunk of data of arbitrary size is done by following the steps 3.2.1 to 3.2.1.4. The device will update the used field as described in 3.2.2.
5.7.6.2.1 Packet Transmission Interrupt

(omitted)
5.7.6.3 Receiving data

Receiving data consists of the driver checking the receiveq available ring to find the receive buffers. The procedure is the one usually performed by the device, involving an update of the Used Ring and a notification, as described in chapter 3.2.2.
5.7.xxx: Additional notes and TODOs

Just a note: the Indirect Descriptors feature (VIRTIO_RING_F_INDIRECT) may not be compatible with this feature, and thus will not be negotiated by the device (?verify)

Notification mechanisms need to be looked at in detail. Mostly we should be able to reuse the existing notification mechanisms; for the OPTION2 configuration change we have identified the ISR_CONFIG notification method above.

MMIO needs to be written down.

PCI capabilities need to be checked again, and the fields in CFG_WINDOW in particular. An alternative could be to extend the pci common configuration structure for the queue-specific extensions, but that seems incompatible with multiple features involving similar extensions. MMIO needs to be considered too, as it is less extensible.

MULTICAST is out of scope for these notes, but with some hard work it seems feasible to avoid copies by sharing at least the transmit buffer in the producer; however, the use case with peers being added and removed dynamically requires a much more complex study. Can this be solved with multiple queues, one for each peer, and configuration change interrupts that can disable a queue in the producer when a peer leaves, without taking down the whole device? This would need much more study.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] RFC: virtio-peer shared memory based peer communication device
  2015-09-18 16:29       ` [Qemu-devel] RFC: virtio-peer shared memory based peer communication device Claudio Fontana
@ 2015-09-18 21:11           ` Paolo Bonzini
  2015-09-21 12:13         ` [Qemu-devel] " Michael S. Tsirkin
  2015-09-21 12:13         ` Michael S. Tsirkin
  2 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2015-09-18 21:11 UTC (permalink / raw)
  To: Claudio Fontana, Zhang, Yang Z, Michael S. Tsirkin, qemu-devel,
	virtualization, virtio-dev, opnfv-tech-discuss
  Cc: Jan Kiszka



On 18/09/2015 18:29, Claudio Fontana wrote:
> 
> this is a first RFC for virtio-peer 0.1, which is still very much a work in progress:
> 
> https://github.com/hw-claudio/virtio-peer/wiki
> 
> It is also available as PDF there, but the text is reproduced here for commenting:
> 
> Peer shared memory communication device (virtio-peer)

Apart from the windows idea, how does virtio-peer compare to virtio-rpmsg?

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] RFC: virtio-peer shared memory based peer communication device
  2015-09-18 21:11           ` Paolo Bonzini
@ 2015-09-21 10:47             ` Jan Kiszka
  -1 siblings, 0 replies; 80+ messages in thread
From: Jan Kiszka @ 2015-09-21 10:47 UTC (permalink / raw)
  To: Paolo Bonzini, Claudio Fontana, Zhang, Yang Z,
	Michael S. Tsirkin, qemu-devel, virtualization, virtio-dev,
	opnfv-tech-discuss

On 2015-09-18 23:11, Paolo Bonzini wrote:
> On 18/09/2015 18:29, Claudio Fontana wrote:
>>
>> this is a first RFC for virtio-peer 0.1, which is still very much a work in progress:
>>
>> https://github.com/hw-claudio/virtio-peer/wiki
>>
>> It is also available as PDF there, but the text is reproduced here for commenting:
>>
>> Peer shared memory communication device (virtio-peer)
> 
> Apart from the windows idea, how does virtio-peer compare to virtio-rpmsg?

rpmsg is a very specialized thing. It targets single AMP cores, assuming
that those have full access to the main memory. And it is also a
centralized approach where all messages go through the main Linux
instance. I suspect we could cover that use case as well with a generic
inter-VM shared memory device, but I haven't thought about all the details yet.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] RFC: virtio-peer shared memory based peer communication device
  2015-09-18 16:29       ` [Qemu-devel] RFC: virtio-peer shared memory based peer communication device Claudio Fontana
  2015-09-18 21:11           ` Paolo Bonzini
@ 2015-09-21 12:13         ` Michael S. Tsirkin
  2015-09-21 12:32             ` Jan Kiszka
  2015-09-21 12:13         ` Michael S. Tsirkin
  2 siblings, 1 reply; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-21 12:13 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: virtio-dev, Jan Kiszka, qemu-devel, virtualization, Zhang,
	Yang Z, opnfv-tech-discuss

On Fri, Sep 18, 2015 at 06:29:27PM +0200, Claudio Fontana wrote:
> Hello,
> 
> this is a first RFC for virtio-peer 0.1, which is still very much a work in progress:
> 
> https://github.com/hw-claudio/virtio-peer/wiki
> 
> It is also available as PDF there, but the text is reproduced here for commenting:
> 
> Peer shared memory communication device (virtio-peer)
> 
> General Overview
> 
> (I recommend looking at the PDF for some clarifying pictures)
> 
> The Virtio Peer shared memory communication device (virtio-peer) is a
> virtual device which allows high performance low latency guest to
> guest communication. It uses a new queue extension feature tentatively
> called VIRTIO_F_WINDOW which indicates that descriptor tables,
> available and used rings and Queue Data reside in physical memory
> ranges called Windows, each identified with an unique identifier
> called WindowID.

So if I had to summarize the difference from regular virtio,
I'd say the main one is that this uses window id + offset
instead of the physical address.


My question is - why do it?

All windows are in memory space, are they not?

How about the guest using full physical addresses,
and the hypervisor sending the window physical address
to VM2?

VM2 can use that to find both the window id and the offset.


This way at least VM1 can use regular virtio without changes.
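
A minimal sketch of the lookup this implies on the VM2 side, assuming
the hypervisor has handed VM2 the guest physical base and size of each
window (names are illustrative):

#include <stdint.h>

struct window_desc {
    uint64_t base;      /* VM1 guest physical base of the window */
    uint64_t size;
};

/* Map a VM1 guest physical address to (window id, offset). */
static int pa_to_window(const struct window_desc *w, int nwindows,
                        uint64_t pa, int *window_id, uint64_t *offset)
{
    int i;

    for (i = 0; i < nwindows; i++) {
        if (pa >= w[i].base && pa - w[i].base < w[i].size) {
            *window_id = i;
            *offset = pa - w[i].base;
            return 0;
        }
    }
    return -1;          /* address not covered by any window */
}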

-- 
MST

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] RFC: virtio-peer shared memory based peer communication device
  2015-09-21 10:47             ` Jan Kiszka
@ 2015-09-21 12:15               ` Paolo Bonzini
  -1 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2015-09-21 12:15 UTC (permalink / raw)
  To: Jan Kiszka, Claudio Fontana, Zhang, Yang Z, Michael S. Tsirkin,
	qemu-devel, virtualization, virtio-dev, opnfv-tech-discuss



On 21/09/2015 12:47, Jan Kiszka wrote:
>> > Apart from the windows idea, how does virtio-peer compare to virtio-rpmsg?
> rpmsg is a very specialized thing. It targets single AMP cores, assuming
> that those have full access to the main memory.

Yes, this is why I did say "apart from the windows idea".

> And it is also a
> centralized approach where all message go through the main Linux
> instance. I suspect we could cover that use case as well with generic
> inter-vm shared memory device, but I didn't think about all details yet.

The virtqueue handling seems very similar between the two.  However,
the messages for rpmsg have a small header (struct rpmsg_hdr in
include/linux/rpmsg.h), and there is a weird feature bit, VIRTIO_RPMSG_F_NS.
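
For reference, the header is indeed tiny; as far as I remember it
looks roughly like this (reproduced from memory, using the kernel's
u32/u16/u8 types - check the kernel sources for the authoritative
definition):

struct rpmsg_hdr {
    u32 src;        /* source endpoint address */
    u32 dst;        /* destination endpoint address */
    u32 reserved;
    u16 len;        /* payload length */
    u16 flags;
    u8 data[0];     /* payload follows */
} __attribute__((packed));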

So I guess virtio-rpmsg and virtio-peer are about as similar as
virtio-serial and virtio-peer.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] RFC: virtio-peer shared memory based peer communication device
  2015-09-21 12:13         ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-09-21 12:32             ` Jan Kiszka
  0 siblings, 0 replies; 80+ messages in thread
From: Jan Kiszka @ 2015-09-21 12:32 UTC (permalink / raw)
  To: Michael S. Tsirkin, Claudio Fontana
  Cc: Zhang, Yang Z, virtio-dev, opnfv-tech-discuss, qemu-devel,
	virtualization

On 2015-09-21 14:13, Michael S. Tsirkin wrote:
> On Fri, Sep 18, 2015 at 06:29:27PM +0200, Claudio Fontana wrote:
>> Hello,
>>
>> this is a first RFC for virtio-peer 0.1, which is still very much a work in progress:
>>
>> https://github.com/hw-claudio/virtio-peer/wiki
>>
>> It is also available as PDF there, but the text is reproduced here for commenting:
>>
>> Peer shared memory communication device (virtio-peer)
>>
>> General Overview
>>
>> (I recommend looking at the PDF for some clarifying pictures)
>>
>> The Virtio Peer shared memory communication device (virtio-peer) is a
>> virtual device which allows high performance low latency guest to
>> guest communication. It uses a new queue extension feature tentatively
>> called VIRTIO_F_WINDOW which indicates that descriptor tables,
>> available and used rings and Queue Data reside in physical memory
>> ranges called Windows, each identified with an unique identifier
>> called WindowID.
> 
> So if I had to summarize the difference from regular virtio,
> I'd say the main one is that this uses window id + offset
> instead of the physical address.
> 
> 
> My question is - why do it?
> 
> All windows are in memory space, are they not?
> 
> How about guest using full physical addresses,
> and hypervisor sending the window physical address
> to VM2?
> 
> VM2 can uses that to find both window id and offset.
> 
> 
> This way at least VM1 can use regular virtio without changes.

What would be the value of having different drivers in VM1 and VM2,
specifically if both run Linux?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] RFC: virtio-peer shared memory based peer communication device
  2015-09-21 12:32             ` Jan Kiszka
  (?)
@ 2015-09-24 10:04             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-09-24 10:04 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: virtio-dev, Claudio Fontana, qemu-devel, virtualization, Zhang,
	Yang Z, opnfv-tech-discuss

On Mon, Sep 21, 2015 at 02:32:10PM +0200, Jan Kiszka wrote:
> On 2015-09-21 14:13, Michael S. Tsirkin wrote:
> > On Fri, Sep 18, 2015 at 06:29:27PM +0200, Claudio Fontana wrote:
> >> Hello,
> >>
> >> this is a first RFC for virtio-peer 0.1, which is still very much a work in progress:
> >>
> >> https://github.com/hw-claudio/virtio-peer/wiki
> >>
> >> It is also available as PDF there, but the text is reproduced here for commenting:
> >>
> >> Peer shared memory communication device (virtio-peer)
> >>
> >> General Overview
> >>
> >> (I recommend looking at the PDF for some clarifying pictures)
> >>
> >> The Virtio Peer shared memory communication device (virtio-peer) is a
> >> virtual device which allows high performance low latency guest to
> >> guest communication. It uses a new queue extension feature tentatively
> >> called VIRTIO_F_WINDOW which indicates that descriptor tables,
> >> available and used rings and Queue Data reside in physical memory
> >> ranges called Windows, each identified with an unique identifier
> >> called WindowID.
> > 
> > So if I had to summarize the difference from regular virtio,
> > I'd say the main one is that this uses window id + offset
> > instead of the physical address.
> > 
> > 
> > My question is - why do it?
> > 
> > All windows are in memory space, are they not?
> > 
> > How about guest using full physical addresses,
> > and hypervisor sending the window physical address
> > to VM2?
> > 
> > VM2 can uses that to find both window id and offset.
> > 
> > 
> > This way at least VM1 can use regular virtio without changes.
> 
> What would be the value of having different drivers in VM1 and VM2,
> specifically if both run Linux?
> 
> Jan

It's common to have a VM act as a switch between others.

In this setup, there's value in being able to support existing guests as
endpoints, with new drivers only required for the switch.

> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SES-DE
> Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-08-31 18:35   ` Nakajima, Jun
                     ` (5 preceding siblings ...)
  (?)
@ 2015-10-06 21:42   ` Nakajima, Jun
  2015-10-07  5:39     ` Michael S. Tsirkin
  2015-10-07  5:39     ` [Qemu-devel] " Michael S. Tsirkin
  -1 siblings, 2 replies; 80+ messages in thread
From: Nakajima, Jun @ 2015-10-06 21:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Jan Kiszka, Claudio.Fontana, qemu-devel,
	Linux Virtualization, opnfv-tech-discuss

Hi Michael,

Looks like the discussions tapered off, but do you have a plan to
implement this if people are eventually fine with it? We want to
extend this to support multiple VMs.

On Mon, Aug 31, 2015 at 11:35 AM, Nakajima, Jun <jun.nakajima@intel.com> wrote:
> On Mon, Aug 31, 2015 at 7:11 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> Hello!
>> During the KVM forum, we discussed supporting virtio on top
>> of ivshmem. I have considered it, and came up with an alternative
>> that has several advantages over that - please see below.
>> Comments welcome.
>
> Hi Michael,
>
> I like this, and it should be able to achieve what I presented at KVM
> Forum (vhost-user-shmem).
> Comments below.
>
>>
>> -----
>>
>> Existing solutions to userspace switching between VMs on the
>> same host are vhost-user and ivshmem.
>>
>> vhost-user works by mapping memory of all VMs being bridged into the
>> switch memory space.
>>
>> By comparison, ivshmem works by exposing a shared region of memory to all VMs.
>> VMs are required to use this region to store packets. The switch only
>> needs access to this region.
>>
>> Another difference between vhost-user and ivshmem surfaces when polling
>> is used. With vhost-user, the switch is required to handle
>> data movement between VMs; if polling is used, this means that one host CPU
>> needs to be sacrificed for this task.
>>
>> This is easiest to understand when one of the VMs is
>> used with VF pass-through. This can be schematically shown below:
>>
>> +-- VM1 --------------+            +---VM2-----------+
>> | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
>> +---------------------+            +-----------------+
>>
>>
>> With ivshmem in theory communication can happen directly, with two VMs
>> polling the shared memory region.
>>
>>
>> I won't spend time listing advantages of vhost-user over ivshmem.
>> Instead, having identified two advantages of ivshmem over vhost-user,
>> below is a proposal to extend vhost-user to gain the advantages
>> of ivshmem.
>>
>>
>> 1: virtio in guest can be extended to allow support
>> for IOMMUs. This provides guest with full flexibility
>> about memory which is readable or writable by each device.
>
> I assume that you meant VFIO only for virtio by "use of VFIO".  To get
> VFIO working for general direct-I/O (including VFs) in guests, as you
> know, we need to virtualize IOMMU (e.g. VT-d) and the interrupt
> remapping table on x86 (i.e. nested VT-d).
>
>> By setting up a virtio device for each other VM we need to
>> communicate to, guest gets full control of its security, from
>> mapping all memory (like with current vhost-user) to only
>> mapping buffers used for networking (like ivshmem) to
>> transient mappings for the duration of data transfer only.
>
> And I think that we can use VMFUNC to have such transient mappings.
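
For reference, the EPTP-switching flavour of VMFUNC can be driven from the
guest roughly as below (sketch only; it assumes the hypervisor has already
populated an EPTP list in which entry 'view' exposes the peer's buffers):

/* VMFUNC leaf 0 is EPTP switching: EAX = 0, ECX = index into the EPTP list.
 * An invalid index causes a VM exit, so the hypervisor stays in control. */
static inline void eptp_switch_view(unsigned int view)
{
    asm volatile("vmfunc"
                 : /* no outputs */
                 : "a" (0), "c" (view)
                 : "memory");
}

The driver would switch to the wider view, copy the data, and switch back,
so the extra mapping only exists for the duration of the transfer.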
>
>> This also allows use of VFIO within guests, for improved
>> security.
>>
>> vhost user would need to be extended to send the
>> mappings programmed by guest IOMMU.
>
> Right. We need to think about cases where other VMs (VM3, etc.) join
> the group or some existing VM leaves.
> PCI hot-plug should work there (as you point out at "Advantages over
> ivshmem" below).
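
One way to picture such an extension (purely a sketch - the message layout
below is invented for illustration, it is not an existing vhost-user
message): VM1's QEMU forwards every guest IOMMU map/unmap to the peer over
the vhost-user socket, e.g.:

#include <stdint.h>

struct vhost_user_iommu_update {
    uint64_t iova;    /* bus address programmed into the virtio device in VM1 */
    uint64_t size;
    uint64_t offset;  /* location within the memory already shared with the peer */
    uint8_t  perm;    /* bit 0 = read allowed, bit 1 = write allowed */
    uint8_t  op;      /* 0 = map, 1 = unmap (invalidate) */
} __attribute__((packed));

A peer joining or leaving could then amount to opening or closing another
socket and replaying the currently active mappings.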
>
>>
>> 2. qemu can be extended to serve as a vhost-user client:
>> remote VM mappings over the vhost-user protocol, and
>> map them into another VM's memory.
>> This mapping can take, for example, the form of
>> a BAR of a pci device, which I'll call here vhost-pci -
>> with bus address allowed
>> by VM1's IOMMU mappings being translated into
>> offsets within this BAR within VM2's physical
>> memory space.
>
> I think it's sensible.
>
>>
>> Since the translation can be a simple one, VM2
>> can perform it within its vhost-pci device driver.
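
As a rough illustration of how simple that translation can be (structure
and field names assumed, not taken from any existing driver):

#include <stdint.h>

struct vhost_pci_region {
    uint64_t vm1_base;     /* start of the mapped VM1 region, in VM1 bus addresses */
    uint64_t bar_offset;   /* where that region begins inside VM2's vhost-pci BAR  */
    uint64_t size;
};

/* VM1 bus address -> offset inside the BAR as seen by VM2.  The caller is
 * assumed to have checked the address against the advertised mappings. */
static uint64_t vm1_to_bar_offset(const struct vhost_pci_region *r,
                                  uint64_t vm1_addr)
{
    return r->bar_offset + (vm1_addr - r->vm1_base);
}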
>>
>> While this setup would be the most useful with polling,
>> VM1's ioeventfd can also be mapped to
>> another VM2's irqfd, and vice versa, such that VMs
>> can trigger interrupts to each other without need
>> for a helper thread on the host.
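
A minimal sketch of that wiring on the host side, using the existing KVM
ioeventfd/irqfd ioctls (the fds, doorbell address and gsi are made up for
illustration):

#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Connect a doorbell MMIO address in VM1 straight to interrupt 'gsi' in
 * VM2; vm1_fd and vm2_fd are the KVM VM file descriptors. */
static int wire_kick(int vm1_fd, int vm2_fd, uint64_t doorbell_gpa, uint32_t gsi)
{
    int efd = eventfd(0, EFD_CLOEXEC);
    if (efd < 0)
        return -1;

    struct kvm_ioeventfd io = {
        .addr = doorbell_gpa,   /* a write by VM1 here signals efd in-kernel */
        .len  = 4,
        .fd   = efd,
    };
    if (ioctl(vm1_fd, KVM_IOEVENTFD, &io) < 0)
        return -1;

    struct kvm_irqfd irq = {
        .fd  = efd,             /* the same eventfd injects gsi into VM2 */
        .gsi = gsi,
    };
    return ioctl(vm2_fd, KVM_IRQFD, &irq);
}

With this in place a kick travels from VM1 to VM2 entirely inside the
kernel, with no vhost thread involved.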
>>
>>
>> The resulting channel might look something like the following:
>>
>> +-- VM1 --------------+  +---VM2-----------+
>> | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
>> +---------------------+  +-----------------+
>>
>> comparing the two diagrams, a vhost-user thread on the host is
>> no longer required, reducing the host CPU utilization when
>> polling is active.  At the same time, VM2 cannot access all of VM1's
>> memory - it is limited by the iommu configuration set up by VM1.
>>
>>
>> Advantages over ivshmem:
>>
>> - more flexibility, endpoint VMs do not have to place data at any
>>   specific locations to use the device, in practice this likely
>>   means fewer data copies.
>> - better standardization/code reuse
>>   virtio changes within guests would be fairly easy to implement
>>   and would also benefit other backends, besides vhost-user
>>   standard hotplug interfaces can be used to add and remove these
>>   channels as VMs are added or removed.
>> - migration support
>>   It's easy to implement since ownership of memory is well defined.
>>   For example, during migration VM2 can notify hypervisor of VM1
>>   by updating the dirty bitmap each time it writes into VM1 memory.
>
> Also, the ivshmem functionality could be implemented by this proposal:
> - vswitch (or some VM) allocates memory regions in its address space, and
> - it sets up the IOMMU mappings so that accesses from the VMs are translated into those regions
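
On the migration point above, the dirty tracking on the VM2 side can be as
small as a per-page bitmap over the BAR, for example (sketch; 4 KiB pages
and the bitmap location are assumptions):

#include <stdint.h>

#define PAGE_SHIFT    12
#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Mark every 4 KiB page of VM1 memory written through the BAR as dirty;
 * the hypervisor merges this into VM1's migration dirty bitmap. */
static void mark_dirty(unsigned long *bitmap, uint64_t bar_offset, uint64_t len)
{
    for (uint64_t pfn = bar_offset >> PAGE_SHIFT;
         pfn <= (bar_offset + len - 1) >> PAGE_SHIFT; pfn++)
        __atomic_fetch_or(&bitmap[pfn / BITS_PER_LONG],
                          1UL << (pfn % BITS_PER_LONG), __ATOMIC_RELAXED);
}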
>
>>
>> Thanks,
>>
>> --
>> MST



-- 
Jun
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 80+ messages in thread


* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
  2015-10-06 21:42   ` [Qemu-devel] " Nakajima, Jun
  2015-10-07  5:39     ` Michael S. Tsirkin
@ 2015-10-07  5:39     ` Michael S. Tsirkin
  1 sibling, 0 replies; 80+ messages in thread
From: Michael S. Tsirkin @ 2015-10-07  5:39 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: virtio-dev, Jan Kiszka, Claudio.Fontana, qemu-devel,
	Linux Virtualization, opnfv-tech-discuss

On Tue, Oct 06, 2015 at 02:42:34PM -0700, Nakajima, Jun wrote:
> Hi Michael,
> 
> Looks like the discussions tapered off, but do you have a plan to
> implement this if people are eventually fine with it? We want to
> extend this to support multiple VMs.

Absolutely. We are just back from holidays, and started looking at who
does what. If anyone wants to help, that'd also be nice.


> On Mon, Aug 31, 2015 at 11:35 AM, Nakajima, Jun <jun.nakajima@intel.com> wrote:
> > On Mon, Aug 31, 2015 at 7:11 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> Hello!
> >> During the KVM forum, we discussed supporting virtio on top
> >> of ivshmem. I have considered it, and came up with an alternative
> >> that has several advantages over that - please see below.
> >> Comments welcome.
> >
> > Hi Michael,
> >
> > I like this, and it should be able to achieve what I presented at KVM
> > Forum (vhost-user-shmem).
> > Comments below.
> >
> >>
> >> -----
> >>
> >> Existing solutions to userspace switching between VMs on the
> >> same host are vhost-user and ivshmem.
> >>
> >> vhost-user works by mapping memory of all VMs being bridged into the
> >> switch memory space.
> >>
> >> By comparison, ivshmem works by exposing a shared region of memory to all VMs.
> >> VMs are required to use this region to store packets. The switch only
> >> needs access to this region.
> >>
> >> Another difference between vhost-user and ivshmem surfaces when polling
> >> is used. With vhost-user, the switch is required to handle
> >> data movement between VMs; if polling is used, this means that one host CPU
> >> needs to be sacrificed for this task.
> >>
> >> This is easiest to understand when one of the VMs is
> >> used with VF pass-through. This can be schematically shown below:
> >>
> >> +-- VM1 --------------+            +---VM2-----------+
> >> | virtio-pci          +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
> >> +---------------------+            +-----------------+
> >>
> >>
> >> With ivshmem in theory communication can happen directly, with two VMs
> >> polling the shared memory region.
> >>
> >>
> >> I won't spend time listing advantages of vhost-user over ivshmem.
> >> Instead, having identified two advantages of ivshmem over vhost-user,
> >> below is a proposal to extend vhost-user to gain the advantages
> >> of ivshmem.
> >>
> >>
> >> 1: virtio in guest can be extended to allow support
> >> for IOMMUs. This provides guest with full flexibility
> >> about memory which is readable or writable by each device.
> >
> > I assume that you meant VFIO only for virtio by "use of VFIO".  To get
> > VFIO working for general direct-I/O (including VFs) in guests, as you
> > know, we need to virtualize IOMMU (e.g. VT-d) and the interrupt
> > remapping table on x86 (i.e. nested VT-d).
> >
> >> By setting up a virtio device for each other VM we need to
> >> communicate to, guest gets full control of its security, from
> >> mapping all memory (like with current vhost-user) to only
> >> mapping buffers used for networking (like ivshmem) to
> >> transient mappings for the duration of data transfer only.
> >
> > And I think that we can use VMFUNC to have such transient mappings.
> >
> >> This also allows use of VFIO within guests, for improved
> >> security.
> >>
> >> vhost user would need to be extended to send the
> >> mappings programmed by guest IOMMU.
> >
> > Right. We need to think about cases where other VMs (VM3, etc.) join
> > the group or some existing VM leaves.
> > PCI hot-plug should work there (as you point out at "Advantages over
> > ivshmem" below).
> >
> >>
> >> 2. qemu can be extended to serve as a vhost-user client:
> >> remote VM mappings over the vhost-user protocol, and
> >> map them into another VM's memory.
> >> This mapping can take, for example, the form of
> >> a BAR of a pci device, which I'll call here vhost-pci -
> >> with bus address allowed
> >> by VM1's IOMMU mappings being translated into
> >> offsets within this BAR within VM2's physical
> >> memory space.
> >
> > I think it's sensible.
> >
> >>
> >> Since the translation can be a simple one, VM2
> >> can perform it within its vhost-pci device driver.
> >>
> >> While this setup would be the most useful with polling,
> >> VM1's ioeventfd can also be mapped to
> >> another VM2's irqfd, and vice versa, such that VMs
> >> can trigger interrupts to each other without need
> >> for a helper thread on the host.
> >>
> >>
> >> The resulting channel might look something like the following:
> >>
> >> +-- VM1 --------------+  +---VM2-----------+
> >> | virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
> >> +---------------------+  +-----------------+
> >>
> >> comparing the two diagrams, a vhost-user thread on the host is
> >> no longer required, reducing the host CPU utilization when
> >> polling is active.  At the same time, VM2 cannot access all of VM1's
> >> memory - it is limited by the iommu configuration set up by VM1.
> >>
> >>
> >> Advantages over ivshmem:
> >>
> >> - more flexibility, endpoint VMs do not have to place data at any
> >>   specific locations to use the device, in practice this likely
> >>   means fewer data copies.
> >> - better standardization/code reuse
> >>   virtio changes within guests would be fairly easy to implement
> >>   and would also benefit other backends, besides vhost-user
> >>   standard hotplug interfaces can be used to add and remove these
> >>   channels as VMs are added or removed.
> >> - migration support
> >>   It's easy to implement since ownership of memory is well defined.
> >>   For example, during migration VM2 can notify hypervisor of VM1
> >>   by updating the dirty bitmap each time it writes into VM1 memory.
> >
> > Also, the ivshmem functionality could be implemented by this proposal:
> > - vswitch (or some VM) allocates memory regions in its address space, and
> > - it sets up the IOMMU mappings so that accesses from the VMs are translated into those regions
> >
> >>
> >> Thanks,
> >>
> >> --
> >> MST
> 
> 
> 
> -- 
> Jun
> Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 80+ messages in thread


* Re: [Qemu-devel] rfc: vhost user enhancements for vm2vm communication
@ 2016-03-17 12:56 Bret Ketchum
  0 siblings, 0 replies; 80+ messages in thread
From: Bret Ketchum @ 2016-03-17 12:56 UTC (permalink / raw)
  To: qemu-devel

On Wed, Oct 07, 2015 at 08:39:40 +0300, Tsirkin, Michael wrote:
>> Looks like the discussions tapered off, but do you have a plan to
>> implement this if people are eventually fine with it? We want to
>> extend this to support multiple VMs.
>
> Absolutely. We are just back from holidays, and started looking at who
> does what. If anyone wants to help, that'd also be nice.

     Michael, will you share any progress on this topic?

^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2016-03-17 12:57 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-31 14:11 [Qemu-devel] rfc: vhost user enhancements for vm2vm communication Michael S. Tsirkin
2015-08-31 18:35 ` Nakajima, Jun
2015-08-31 18:35   ` Nakajima, Jun
2015-09-01  3:03   ` [Qemu-devel] " Varun Sethi
2015-09-01  8:30     ` Michael S. Tsirkin
2015-09-01  8:30       ` Michael S. Tsirkin
2015-09-01  3:03   ` Varun Sethi
2015-09-01  8:17   ` Michael S. Tsirkin
2015-09-01 22:56     ` Nakajima, Jun
2015-09-01 22:56       ` Nakajima, Jun
2015-09-01  8:17   ` Michael S. Tsirkin
2015-10-06 21:42   ` Nakajima, Jun
2015-10-06 21:42   ` [Qemu-devel] " Nakajima, Jun
2015-10-07  5:39     ` Michael S. Tsirkin
2015-10-07  5:39     ` [Qemu-devel] " Michael S. Tsirkin
2015-09-01  7:35 ` Jan Kiszka
2015-09-01  7:35   ` Jan Kiszka
2015-09-01  8:01   ` Michael S. Tsirkin
2015-09-01  8:01   ` [Qemu-devel] " Michael S. Tsirkin
2015-09-01  9:11     ` Jan Kiszka
2015-09-01  9:11       ` Jan Kiszka
2015-09-01  9:24       ` [Qemu-devel] " Michael S. Tsirkin
2015-09-01 14:09         ` Jan Kiszka
2015-09-01 14:09           ` Jan Kiszka
2015-09-01 14:34           ` Michael S. Tsirkin
2015-09-01 14:34           ` [Qemu-devel] " Michael S. Tsirkin
2015-09-01 15:34             ` Jan Kiszka
2015-09-01 15:34               ` Jan Kiszka
2015-09-01 16:02               ` [Qemu-devel] " Michael S. Tsirkin
2015-09-01 16:02                 ` Michael S. Tsirkin
2015-09-01 16:28                 ` [Qemu-devel] " Jan Kiszka
2015-09-01 16:28                   ` Jan Kiszka
2015-09-02  0:01                   ` [Qemu-devel] " Nakajima, Jun
2015-09-02  0:01                     ` Nakajima, Jun
2015-09-02 12:15                     ` [Qemu-devel] " Michael S. Tsirkin
2015-09-02 12:15                       ` Michael S. Tsirkin
2015-09-03  4:45                       ` Nakajima, Jun
2015-09-03  4:45                       ` [Qemu-devel] " Nakajima, Jun
2015-09-03  8:09                         ` Michael S. Tsirkin
2015-09-03  8:09                         ` Michael S. Tsirkin
2015-09-03  8:08                   ` [Qemu-devel] " Michael S. Tsirkin
2015-09-03  8:08                     ` Michael S. Tsirkin
2015-09-03  8:21                     ` [Qemu-devel] " Jan Kiszka
2015-09-03  8:21                       ` Jan Kiszka
2015-09-03  8:37                       ` Michael S. Tsirkin
2015-09-03  8:37                       ` [Qemu-devel] " Michael S. Tsirkin
2015-09-03 10:25                         ` Jan Kiszka
2015-09-03 10:25                           ` Jan Kiszka
2015-09-01  9:24       ` Michael S. Tsirkin
2015-09-07 12:38 ` [Qemu-devel] " Claudio Fontana
2015-09-09  6:40   ` [opnfv-tech-discuss] " Zhang, Yang Z
2015-09-09  6:40   ` [Qemu-devel] " Zhang, Yang Z
2015-09-09  8:39     ` Claudio Fontana
2015-09-18 16:29       ` [Qemu-devel] RFC: virtio-peer shared memory based peer communication device Claudio Fontana
2015-09-18 21:11         ` Paolo Bonzini
2015-09-18 21:11           ` Paolo Bonzini
2015-09-21 10:47           ` [Qemu-devel] " Jan Kiszka
2015-09-21 10:47             ` Jan Kiszka
2015-09-21 12:15             ` [Qemu-devel] " Paolo Bonzini
2015-09-21 12:15               ` Paolo Bonzini
2015-09-21 12:13         ` [Qemu-devel] " Michael S. Tsirkin
2015-09-21 12:32           ` Jan Kiszka
2015-09-21 12:32             ` Jan Kiszka
2015-09-24 10:04             ` [Qemu-devel] " Michael S. Tsirkin
2015-09-24 10:04             ` Michael S. Tsirkin
2015-09-21 12:13         ` Michael S. Tsirkin
2015-09-18 16:29       ` Claudio Fontana
2015-09-09  8:39     ` [opnfv-tech-discuss] rfc: vhost user enhancements for vm2vm communication Claudio Fontana
2015-09-09  7:06   ` [Qemu-devel] " Michael S. Tsirkin
2015-09-11 15:39     ` Claudio Fontana
2015-09-11 15:39       ` Claudio Fontana
2015-09-13  9:12       ` Michael S. Tsirkin
2015-09-13  9:12       ` [Qemu-devel] " Michael S. Tsirkin
2015-09-14  0:43         ` [Qemu-devel] [opnfv-tech-discuss] " Zhang, Yang Z
2015-09-14  0:43           ` Zhang, Yang Z
2015-09-09  7:06   ` Michael S. Tsirkin
2015-09-07 12:38 ` Claudio Fontana
2015-09-14 16:00 ` [Qemu-devel] [virtio-dev] " Stefan Hajnoczi
2015-09-14 16:00   ` Stefan Hajnoczi
2016-03-17 12:56 [Qemu-devel] " Bret Ketchum
