* [virtio-dev] Some thoughts on zero-copy between VM domains for discussion
@ 2022-01-06 17:03 Alex Bennée
  2022-01-10  9:49 ` Stefan Hajnoczi
  2022-01-19 17:10 ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 6+ messages in thread
From: Alex Bennée @ 2022-01-06 17:03 UTC (permalink / raw)
  To: stratos-dev, virtio-dev
  Cc: Sumit Semwal, John Stultz, Tom Gall, Jean-Philippe Brucker,
	Christopher Clark, Will Deacon, Arnd Bergmann, Vincent Guittot,
	Joakim Bech, Mike Holmes, Trilok Soni, Srivatsa Vaddagiri,
	Anmar Oueja, Stefan Hajnoczi, Jan Kiszka


Hi,

To start the new year I thought I would dump some of my thoughts on
zero-copy between VM domains. For project Stratos we've gamely avoided
thinking too hard about this while we've been concentrating on solving
more tractable problems. However we can't put it off forever, so let's
work through the problem.

Memory Sharing
==============

For any zero-copy to work there has to be memory sharing between the
domains. For traditional KVM this isn't a problem as the host kernel
already has access to the whole address space of all its guests.
However, type-1 setups (and now pKVM) are less promiscuous about
sharing their address space across the domains.

We've discussed options like dynamically sharing individual regions in
the past (maybe via IOMMU hooks). However, given the performance
requirements I think that is ruled out in favour of sharing
appropriately sized blocks of memory. Either one of the two domains
has to explicitly share a chunk of its memory with the other or the
hypervisor has to allocate the memory and make it visible to both.
What considerations do we have to take into account to do this?

 * the actual HW device may only have the ability to DMA to certain
   areas of the physical address space.
 * there may be alignment requirements for HW to access structures (e.g.
   GPU buffers/blocks)
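
To make those constraints concrete, here is a purely illustrative
sketch of the information the allocating domain would have to
communicate; none of these names are an existing interface:

  /* Purely illustrative: what a HW domain might hand to the hypervisor
   * when proposing a region to share. All names here are made up.
   */
  #include <stdint.h>

  struct shared_region_req {
          uint64_t base_gpa;       /* guest-physical base in the sharing domain */
          uint64_t size;           /* size in bytes */
          uint64_t dma_addr_limit; /* highest address the device can DMA to */
          uint32_t align;          /* minimum HW alignment (e.g. GPU block size) */
          uint32_t flags;          /* e.g. read-only for the sharee, cache hints */
  };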

Which domain should do the sharing? The hypervisor itself likely
doesn't have all the information to make the choice, but in a
distributed driver world it won't always be the Dom0/Host equivalent.
While the domain with the HW driver in it will know what the HW needs,
it might not know whether the GPAs being used map to real PAs the
device can actually reach.

I think this means that for useful memory sharing we need the
allocation to be done by the HW domain, with support from the
hypervisor to validate that the region meets all the physical bus
requirements.
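
A minimal sketch of that split, reusing the illustrative struct above
together with some equally hypothetical hypervisor calls, might look
like:

  /* Hypothetical flow: the HW domain proposes a region, the hypervisor
   * validates and maps it. The hyp_* calls and HYP_MAP_RW are made up
   * purely to illustrate the split of responsibilities.
   */
  int share_region_with_frontend(struct shared_region_req *req, int fe_domain)
  {
          int rc;

          /* check the GPA range is backed by PAs which satisfy the DMA
           * and alignment constraints of the physical bus */
          rc = hyp_validate_region(req);
          if (rc)
                  return rc;

          /* map the validated region into the frontend's stage-2,
           * typically with restricted permissions */
          return hyp_map_into_domain(req, fe_domain, HYP_MAP_RW);
  }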

Buffer Allocation
=================

Ultimately I think the majority of the work that will be needed comes
down to how buffer allocation is handled in the kernels. This is also
the area I'm least familiar with so I look forward to feedback from
those with deeper kernel knowledge.

For Linux there already exists the concept of DMA-reachable regions
that take into account the potentially restricted set of addresses
that HW can DMA to. However we are now adding a second constraint:
where the data is eventually going to end up.

For example the HW domain may be talking to a network device, but the
packet data from that device might be going to two other domains. We
wouldn't want to share a single region for received network packets
between both domains because that would leak information, so the
network driver needs to know which shared region to allocate from and
hope the HW allows us to filter the packets appropriately (maybe via a
VLAN tag). I suspect the pure HW solution of simply exposing separate
virtual functions directly to each domain is going to remain the
preserve of expensive enterprise kit for some time.
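
To illustrate the kind of plumbing this implies (a sketch only; the
per-destination pools and the structure below are hypothetical), the
receive path would need to pick its buffer pool based on where the
packet is ultimately headed:

  /* Sketch: pick the buffer pool for an incoming packet based on its
   * eventual destination domain, derived here from the VLAN tag.
   */
  #include <stddef.h>
  #include <stdint.h>

  struct dest_pool {
          int      domain_id; /* destination VM */
          uint16_t vlan_id;   /* tag used to steer traffic to that VM */
          void    *pool;      /* allocator over the region shared with that VM */
  };

  static struct dest_pool *pool_for_vlan(struct dest_pool *pools, size_t n,
                                         uint16_t vlan)
  {
          for (size_t i = 0; i < n; i++)
                  if (pools[i].vlan_id == vlan)
                          return &pools[i];
          return NULL; /* unknown destination: fall back to bounce buffering */
  }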

Should the work be divided up between sub-systems? Both the network
and block device sub-systems have their own allocation strategies and
would need some knowledge about the final destination of their data.
What other driver sub-systems are going to need support for this sort
of zero-copy forwarding? While it would be nice for every VM
transaction to be zero-copy, we don't really need to solve it for
low-speed transports.

Transparent fallback and scaling
================================

As we know, memory is always a precious resource that we never have
enough of. The more we carve up memory regions for particular tasks,
the less flexibility the system as a whole has to make efficient use
of it. We can almost guarantee that whatever number we pick for a
given VM-to-VM conduit will be wrong. Any memory allocation system
based on regions will have to be able to fall back gracefully to using
other memory in the HW domain and rely on traditional bounce buffering
approaches while under heavy load. The problem for VirtIO backends is
then working out when data destined for the FE domain needs this
bounce buffer treatment. This will involve tracking destination domain
metadata somewhere in the system so it can be queried quickly.
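
A sketch of the decision the backend would have to make on each
completion (region_contains() and bounce_copy_to_shared() are
hypothetical helpers):

  /* Sketch: check whether the data already lives in memory the FE
   * domain can see; if not, bounce it into the shared region.
   */
  #include <stddef.h>

  struct shared_region; /* opaque handle on the region shared with the FE */

  static void *buffer_for_frontend(struct shared_region *region,
                                   void *data, size_t len)
  {
          if (region_contains(region, data, len))
                  return data; /* zero-copy path */

          /* under memory pressure the data was allocated elsewhere, so
           * fall back to copying it into the shared region */
          return bounce_copy_to_shared(region, data, len);
  }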

Is there a cross-over here with the kernel's existing support for NUMA
architectures? It seems to me there are similar questions about the
best place to put memory, so perhaps we could treat the various VM
domains as different NUMA nodes?
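
If the NUMA analogy holds, exposing the region shared with a given
destination as its own memory node would let existing allocator
plumbing do some of the work. A very rough sketch, where
alloc_pages_node() is the normal kernel API and deriving a node id
from "destination domain" is the speculative part:

  #include <linux/gfp.h>

  static struct page *alloc_for_destination(int dest_node, unsigned int order)
  {
          /* __GFP_THISNODE: fail rather than spill into other nodes */
          return alloc_pages_node(dest_node, GFP_KERNEL | __GFP_THISNODE, order);
  }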

Finally there is the question of scaling. While mapping individual
transactions would be painfully slow, we need to think about how
dynamic a modern system is. For example, do you size your shared
network region to cope with a full HD video stream? Most of the time
the user won't be doing anything nearly as network intensive.
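
As a back-of-envelope illustration (numbers purely indicative): a
compressed 1080p stream at ~8 Mbit/s needs only ~100 KB to buffer
100 ms of data, while a saturated 10 Gbit/s link needs ~125 MB for the
same 100 ms, so any statically chosen size is likely to be out by
orders of magnitude for somebody.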

Of course the dynamic addition (and removal) of shared memory regions
brings in more potential synchronisation problems: ensuring shared
memory isn't accessed by either side once it has been taken down. We
would need some sort of assurance that the sharee has finished with
all the data in a given region before the sharer brings the share
down.
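
One way to picture that teardown handshake (a sketch, not a proposal
for a concrete protocol; the states and the notification mechanism are
hypothetical):

  enum region_state {
          REGION_ACTIVE,   /* both sides may use the region */
          REGION_DRAINING, /* sharer wants it back; no new buffers handed out */
          REGION_REVOKED,  /* sharee has acked; safe to unmap and reuse */
  };

  struct shared_region_ctl {
          enum region_state state;
          /* ... outstanding buffer count, notification channel, etc. */
  };

  /* sharer side */
  static void begin_teardown(struct shared_region_ctl *r)
  {
          r->state = REGION_DRAINING;
          notify_sharee(r); /* e.g. a virtqueue event or doorbell */
          wait_for_ack(r);  /* sharee confirms all buffers are returned */
          r->state = REGION_REVOKED;
          /* only now can the region safely be unmapped or reused */
  }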

Conclusion
==========

This long text hasn't even attempted to come up with a zero-copy
architecture for Linux VMs. I'm hoping as we discuss this we can capture
all the various constraints any such system is going to need to deal
with. So my final questions are:

 - what other constraints do we need to take into account?
 - can we leverage existing sub-systems to build this support?

I look forward to your thoughts ;-)

-- 
Alex Bennée


* Re: [virtio-dev] Some thoughts on zero-copy between VM domains for discussion
  2022-01-06 17:03 [virtio-dev] Some thoughts on zero-copy between VM domains for discussion Alex Bennée
@ 2022-01-10  9:49 ` Stefan Hajnoczi
  2022-01-14 14:28   ` Alex Bennée
  2022-01-19 17:10 ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 6+ messages in thread
From: Stefan Hajnoczi @ 2022-01-10  9:49 UTC (permalink / raw)
  To: Alex Bennée
  Cc: stratos-dev, virtio-dev, Sumit Semwal, John Stultz, Tom Gall,
	Jean-Philippe Brucker, Christopher Clark, Will Deacon,
	Arnd Bergmann, Vincent Guittot, Joakim Bech, Mike Holmes,
	Trilok Soni, Srivatsa Vaddagiri, Anmar Oueja, Jan Kiszka

On Thu, Jan 06, 2022 at 05:03:38PM +0000, Alex Bennée wrote:
> 
> Hi,
> 
> To start the new year I thought I would dump some of my thoughts on
> zero-copy between VM domains.
<snip>
> 
>  - what other constraints do we need to take into account?
>  - can we leverage existing sub-systems to build this support?
> 
> I look forward to your thoughts ;-)

(Side note: Shared Virtual Addressing (https://lwn.net/Articles/747230/)
is an interesting IOMMU feature. It would be neat to have a CPU
equivalent where loads and stores from/to another address space could be
done cheaply with CPU support. I don't think this is possible today and
that's why software IOMMUs are slow for per-transaction page protection.
In other words, a virtio-net TX VM would set up a page table allowing
read access only to the TX buffers and vring and the virtual network
switch VM would have the capability to access the vring and buffers
through the TX VM's dedicated address space.)

Some storage and networking applications use buffered I/O where the
guest kernel owns the DMA buffer while others use zero-copy I/O where
guest userspace pages are pinned for DMA. I think both cases need to be
considered.

Are guest userspace-visible API changes allowed (e.g. making the
userspace application aware at buffer allocation time)? Ideally the
requirement would be that zero-copy must work for unmodified
applications, but I'm not sure if that's realistic.

By the way, VIRTIO 1.2 introduces Shared Memory Regions. These are
memory regions (e.g. PCI BAR ranges) that the guest driver can map. If
the host pages must come from a special pool then Shared Memory Regions
would be one way to configure the guest to use this special zero-copy
area for data buffers and even vrings. New VIRTIO feature bits and
transport-specific information may need to be added to do this.
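
As a rough illustration, a guest driver already picks up such a region
via the virtio core. Something along these lines (modelled loosely on
the virtio-fs DAX window; error handling is elided and the shmid value
is device-specific):

  #include <linux/io.h>
  #include <linux/virtio_config.h>

  static void *map_shm_region(struct virtio_device *vdev, u8 shmid, size_t *len)
  {
          struct virtio_shm_region shm;

          if (!virtio_get_shm_region(vdev, &shm, shmid))
                  return NULL;

          *len = shm.len;
          /* device-provided memory, so map it explicitly rather than
           * going through the normal page allocator */
          return devm_memremap(&vdev->dev, shm.addr, shm.len, MEMREMAP_WB);
  }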

Stefan


* Re: [virtio-dev] Some thoughts on zero-copy between VM domains for discussion
  2022-01-10  9:49 ` Stefan Hajnoczi
@ 2022-01-14 14:28   ` Alex Bennée
  2022-01-14 17:16     ` Jean-Philippe Brucker
  2022-01-17 10:15     ` Stefan Hajnoczi
  0 siblings, 2 replies; 6+ messages in thread
From: Alex Bennée @ 2022-01-14 14:28 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: stratos-dev, virtio-dev, Sumit Semwal, John Stultz, Tom Gall,
	Jean-Philippe Brucker, Christopher Clark, Will Deacon,
	Arnd Bergmann, Vincent Guittot, Joakim Bech, Mike Holmes,
	Trilok Soni, Srivatsa Vaddagiri, Anmar Oueja, Jan Kiszka


Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Thu, Jan 06, 2022 at 05:03:38PM +0000, Alex Bennée wrote:
>> 
>> Hi,
>> 
>> To start the new year I thought I would dump some of my thoughts on
>> zero-copy between VM domains. For project Stratos we've gamely avoided
>> thinking too hard about this while we've been concentrating on solving
>> more tractable problems. However we can't put it off forever, so let's
>> work through the problem.
>> 
>> Memory Sharing
<snip>
>> Buffer Allocation
<snip>
>> Transparent fallback and scaling
<snip>
>> 
>>  - what other constraints do we need to take into account?
>>  - can we leverage existing sub-systems to build this support?
>> 
>> I look forward to your thoughts ;-)
>
> (Side note: Shared Virtual Addressing (https://lwn.net/Articles/747230/)
> is an interesting IOMMU feature. It would be neat to have a CPU
> equivalent where loads and stores from/to another address space could be
> done cheaply with CPU support. I don't think this is possible today and
> that's why software IOMMUs are slow for per-transaction page protection.
> In other words, a virtio-net TX VM would set up a page table allowing
> read access only to the TX buffers and vring and the virtual network
> switch VM would have the capability to access the vring and buffers
> through the TX VM's dedicated address space.)

Does binding a device to an address space mean the driver allocations
will automatically be done from that address space, or do the drivers
need modifying to take advantage of that? Jean-Philippe?

> Some storage and networking applications use buffered I/O where the
> guest kernel owns the DMA buffer while others use zero-copy I/O where
> guest userspace pages are pinned for DMA. I think both cases need to be
> considered.
>
> Are guest userspace-visible API changes allowed (e.g. making the
> userspace application aware at buffer allocation time)?

I assume you mean enhanced rather than breaking APIs here? I don't see
why not. Certainly for the vhost-user backends we are writing we aren't
beholden to sticking to an old API.

> Ideally the
> requirement would be that zero-copy must work for unmodified
> applications, but I'm not sure if that's realistic.
>
> By the way, VIRTIO 1.2 introduces Shared Memory Regions. These are
> memory regions (e.g. PCI BAR ranges) that the guest driver can map. If
> the host pages must come from a special pool then Shared Memory Regions
> would be one way to configure the guest to use this special zero-copy
> area for data buffers and even vrings. New VIRTIO feature bits and
> transport-specific information may need to be added to do this.

Are these fixed sizes or could we accommodate a growing/shrinking region?

Thanks for the pointers.

-- 
Alex Bennée


* Re: [virtio-dev] Some thoughts on zero-copy between VM domains for discussion
  2022-01-14 14:28   ` Alex Bennée
@ 2022-01-14 17:16     ` Jean-Philippe Brucker
  2022-01-17 10:15     ` Stefan Hajnoczi
  1 sibling, 0 replies; 6+ messages in thread
From: Jean-Philippe Brucker @ 2022-01-14 17:16 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Stefan Hajnoczi, stratos-dev, virtio-dev, Sumit Semwal,
	John Stultz, Tom Gall, Christopher Clark, Will Deacon,
	Arnd Bergmann, Vincent Guittot, Joakim Bech, Mike Holmes,
	Trilok Soni, Srivatsa Vaddagiri, Anmar Oueja, Jan Kiszka

On Fri, Jan 14, 2022 at 02:28:11PM +0000, Alex Bennée wrote:
> 
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> 
> > On Thu, Jan 06, 2022 at 05:03:38PM +0000, Alex Bennée wrote:
> >> 
> >> Hi,
> >> 
> >> To start the new year I thought I would dump some of my thoughts on
> >> zero-copy between VM domains. For project Stratos we've gamely avoided
> >> thinking too hard about this while we've been concentrating on solving
> >> more tractable problems. However we can't put it off forever, so let's
> >> work through the problem.
> >> 
> >> Memory Sharing
> <snip>
> >> Buffer Allocation
> <snip>
> >> Transparent fallback and scaling
> <snip>
> >> 
> >>  - what other constraints do we need to take into account?
> >>  - can we leverage existing sub-systems to build this support?
> >> 
> >> I look forward to your thoughts ;-)
> >
> > (Side note: Shared Virtual Addressing (https://lwn.net/Articles/747230/)
> > is an interesting IOMMU feature. It would be neat to have a CPU
> > equivalent where loads and stores from/to another address space could be
> > done cheaply with CPU support. I don't think this is possible today and
> > that's why software IOMMUs are slow for per-transaction page protection.
> > In other words, a virtio-net TX VM would set up a page table allowing
> > read access only to the TX buffers and vring and the virtual network
> > switch VM would have the capability to access the vring and buffers
> > through the TX VM's dedicated address space.)
> 
> Does binding a device to an address space mean the driver allocations
> will be automatically done from the address space or do the drivers need
> modifying to take advantage of that? Jean-Philippe?

The drivers do need modifying and the APIs are very different: with SVA
you're assigning partitions of devices (for example virtqueues) to
different applications. So the kernel driver only does management and
userspace takes care of the data path. A device partition accesses the
whole address space of the process it is assigned to, so there is no
explicit DMA allocation.

Note that it also requires special features from the device (multiple
address spaces with PCIe PASID, recoverable DMA page faults with PRI).
And the concept doesn't necessarily fit nicely with all device classes -
you probably don't want to load a full network stack in any application
that needs to send a couple of packets. One that demonstrates the concept
well in my opinion is the crypto/compression accelerators that use SVA in
Linux [1]: any application that needs fast compression or encryption on
some of its memory can open a queue in the accelerator, submit jobs with
pointers to input and output buffers and wait for completion while doing
something else. It is several orders of magnitude faster than letting the
CPU do this work and only the core mm deals with memory management.

Anyway this seems mostly off-topic. As Stefan said, what would be
useful for our problem is help from the CPUs to share bits of address
space without disclosing the whole VM memory. At the moment any vIOMMU
mapping needs to be shadowed by the hypervisor somehow. Unless we use
static shared memory regions, the frontend VM needs to tell the
hypervisor which pages it chooses to share with the backend VM, and
the hypervisor has to map and later unmap those pages in the backend
VM's stage-2. I think that operation requires two context switches to
the hypervisor any way we look at it.
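
Spelled out with made-up hypercall names (BACKEND_VM_ID and the hyp_*
calls are all hypothetical), just to show where the transitions to the
hypervisor land:

  int share_buffer_with_backend(unsigned long gfn, unsigned long nr_pages)
  {
          /* switch #1: frontend asks the hypervisor to map the pages
           * into the backend's stage-2 (read-only here) */
          return hyp_grant_pages(BACKEND_VM_ID, gfn, nr_pages, GRANT_READ);
  }

  int reclaim_buffer_from_backend(unsigned long gfn, unsigned long nr_pages)
  {
          /* switch #2: once the backend is done the mapping is torn
           * down again so the rest of the VM's memory stays private */
          return hyp_revoke_pages(BACKEND_VM_ID, gfn, nr_pages);
  }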

Thanks,
Jean

[1] https://lore.kernel.org/linux-iommu/1579097568-17542-1-git-send-email-zhangfei.gao@linaro.org/

> 
> > Some storage and networking applications use buffered I/O where the
> > guest kernel owns the DMA buffer while others use zero-copy I/O where
> > guest userspace pages are pinned for DMA. I think both cases need to be
> > considered.
> >
> > Are guest userspace-visible API changes allowed (e.g. making the
> > userspace application aware at buffer allocation time)?
> 
> I assume you mean enhanced rather than breaking APIs here? I don't see
> why not. Certainly for the vhost-user backends we are writing we aren't
> beholden to sticking to an old API.
> 
> > Ideally the
> > requirement would be that zero-copy must work for unmodified
> > applications, but I'm not sure if that's realistic.
> >
> > By the way, VIRTIO 1.2 introduces Shared Memory Regions. These are
> > memory regions (e.g. PCI BAR ranges) that the guest driver can map. If
> > the host pages must come from a special pool then Shared Memory Regions
> > would be one way to configure the guest to use this special zero-copy
> > area for data buffers and even vrings. New VIRTIO feature bits and
> > transport-specific information may need to be added to do this.
> 
> Are these fixed sizes or could we accommodate a growing/shrinking region?
> 
> Thanks for the pointers.
> 
> -- 
> Alex Bennée


* Re: [virtio-dev] Some thoughts on zero-copy between VM domains for discussion
  2022-01-14 14:28   ` Alex Bennée
  2022-01-14 17:16     ` Jean-Philippe Brucker
@ 2022-01-17 10:15     ` Stefan Hajnoczi
  1 sibling, 0 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2022-01-17 10:15 UTC (permalink / raw)
  To: Alex Bennée
  Cc: stratos-dev, virtio-dev, Sumit Semwal, John Stultz, Tom Gall,
	Jean-Philippe Brucker, Christopher Clark, Will Deacon,
	Arnd Bergmann, Vincent Guittot, Joakim Bech, Mike Holmes,
	Trilok Soni, Srivatsa Vaddagiri, Anmar Oueja, Jan Kiszka

[-- Attachment #1: Type: text/plain, Size: 3425 bytes --]

On Fri, Jan 14, 2022 at 02:28:11PM +0000, Alex Bennée wrote:
> 
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> 
> > On Thu, Jan 06, 2022 at 05:03:38PM +0000, Alex Bennée wrote:
> >> 
> >> Hi,
> >> 
> >> To start the new year I thought I would dump some of my thoughts on
> >> zero-copy between VM domains. For project Stratos we've gamely avoided
> >> thinking too hard about this while we've been concentrating on solving
> >> more tractable problems. However we can't put it off forever, so let's
> >> work through the problem.
> >> 
> >> Memory Sharing
> <snip>
> >> Buffer Allocation
> <snip>
> >> Transparent fallback and scaling
> <snip>
> >> 
> >>  - what other constraints do we need to take into account?
> >>  - can we leverage existing sub-systems to build this support?
> >> 
> >> I look forward to your thoughts ;-)
> >
> > (Side note: Shared Virtual Addressing (https://lwn.net/Articles/747230/)
> > is an interesting IOMMU feature. It would be neat to have a CPU
> > equivalent where loads and stores from/to another address space could be
> > done cheaply with CPU support. I don't think this is possible today and
> > that's why software IOMMUs are slow for per-transaction page protection.
> > In other words, a virtio-net TX VM would set up a page table allowing
> > read access only to the TX buffers and vring and the virtual network
> > switch VM would have the capability to access the vring and buffers
> > through the TX VM's dedicated address space.)
> 
> Does binding a device to an address space mean the driver allocations
> will be automatically done from the address space or do the drivers need
> modifying to take advantage of that? Jean-Philippe?
> 
> > Some storage and networking applications use buffered I/O where the
> > guest kernel owns the DMA buffer while others use zero-copy I/O where
> > guest userspace pages are pinned for DMA. I think both cases need to be
> > considered.
> >
> > Are guest userspace-visible API changes allowed (e.g. making the
> > userspace application aware at buffer allocation time)?
> 
> I assume you mean enhanced rather than breaking APIs here? I don't see
> why not. Certainly for the vhost-user backends we are writing we aren't
> beholden to sticking to an old API.

I was thinking about guest applications. Is the goal to run unmodified
guest applications or is it acceptable to require new APIs (e.g. for
buffer allocation)?

> > Ideally the
> > requirement would be that zero-copy must work for unmodified
> > applications, but I'm not sure if that's realistic.
> >
> > By the way, VIRTIO 1.2 introduces Shared Memory Regions. These are
> > memory regions (e.g. PCI BAR ranges) that the guest driver can map. If
> > the host pages must come from a special pool then Shared Memory Regions
> > would be one way to configure the guest to use this special zero-copy
> > area for data buffers and even vrings. New VIRTIO feature bits and
> > transport-specific information may need to be added to do this.
> 
> Are these fixed sizes or could we accommodate a growing/shrinking region?

Shared Memory Regions are currently static although it may be possible
to extend the spec. The virtio-pci transport represents Shared Memory
Regions using PCI Capabilities and I'm not sure if the PCI Capabilities
list is allowed to change at runtime.

Stefan


* Re: [virtio-dev] Some thoughts on zero-copy between VM domains for discussion
  2022-01-06 17:03 [virtio-dev] Some thoughts on zero-copy between VM domains for discussion Alex Bennée
  2022-01-10  9:49 ` Stefan Hajnoczi
@ 2022-01-19 17:10 ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 6+ messages in thread
From: Dr. David Alan Gilbert @ 2022-01-19 17:10 UTC (permalink / raw)
  To: Alex Bennée
  Cc: stratos-dev, virtio-dev, Sumit Semwal, John Stultz, Tom Gall,
	Jean-Philippe Brucker, Christopher Clark, Will Deacon,
	Arnd Bergmann, Vincent Guittot, Joakim Bech, Mike Holmes,
	Trilok Soni, Srivatsa Vaddagiri, Anmar Oueja, Stefan Hajnoczi,
	Jan Kiszka

* Alex Bennée (alex.bennee@linaro.org) wrote:
> 
> Hi,
> 
> To start the new year I thought I would dump some of my thoughts on
> zero-copy between VM domains. For project Stratos we've gamely avoided
> thinking too hard about this while we've been concentrating on solving
> more tractable problems. However we can't put it off forever, so let's
> work through the problem.

Can you explain a bit more about the use case you're trying to solve?
There are lots of different reasons to share memory between VMs, all
of which have different required semantics.

To add to your list of things to think about, consider live migration:
when both devices sharing the memory can change it, there needs to be
some interconnection with the migration dirty page tracking. For
postcopy there needs to be some interaction with the concept of when
each VM stops running on the source and flips over.

Dave

<snip>
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


