Reducing vdpa migration downtime because of memory pin / maps

All of lore.kernel.org
 help / color / mirror / Atom feed

* Reducing vdpa migration downtime because of memory pin / maps
@ 2023-04-05 11:37 Eugenio Perez Martin
  2023-04-10  2:14 ` Jason Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Eugenio Perez Martin @ 2023-04-05 11:37 UTC (permalink / raw)
  To: qemu-level
  Cc: Jason Wang, Michael Tsirkin, Si-Wei Liu, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert

Hi!

As mentioned in the last upstream virtio-networking meeting, one of
the factors that adds more downtime to migration is the handling of
the guest memory (pin, map, etc). At this moment this handling is
bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
destination device waits until all the guest memory / state is
migrated to start pinning all the memory.

The proposal is to bind it to the char device life cycle (open vs
close), so all the guest memory can be pinned for all the guest / qemu
lifecycle.

This has two main problems:
* At this moment the reset semantics forces the vdpa device to unmap
all the memory. So this change needs a vhost vdpa feature flag.
* This may increase the initialization time. Maybe we can delay it if
qemu is not the destination of a LM. Anyway I think this should be
done as an optimization on top.

Any ideas or comments in this regard?

Thanks!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-04-05 11:37 Reducing vdpa migration downtime because of memory pin / maps Eugenio Perez Martin
@ 2023-04-10  2:14 ` Jason Wang
  2023-04-10  3:16   ` longpeng2--- via
  2023-04-11 12:33 ` Eugenio Perez Martin
  2023-06-06 22:44 ` Si-Wei Liu
  2 siblings, 1 reply; 27+ messages in thread
From: Jason Wang @ 2023-04-10  2:14 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-level, Michael Tsirkin, Si-Wei Liu, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert

On Wed, Apr 5, 2023 at 7:38 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> Hi!
>
> As mentioned in the last upstream virtio-networking meeting, one of
> the factors that adds more downtime to migration is the handling of
> the guest memory (pin, map, etc). At this moment this handling is
> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> destination device waits until all the guest memory / state is
> migrated to start pinning all the memory.
>
> The proposal is to bind it to the char device life cycle (open vs
> close), so all the guest memory can be pinned for all the guest / qemu
> lifecycle.
>
> This has two main problems:
> * At this moment the reset semantics forces the vdpa device to unmap
> all the memory. So this change needs a vhost vdpa feature flag.

Is this true? I didn't find any codes to unmap the memory in
vhost_vdpa_set_status().

Thanks

> * This may increase the initialization time. Maybe we can delay it if
> qemu is not the destination of a LM. Anyway I think this should be
> done as an optimization on top.
>
> Any ideas or comments in this regard?
>
> Thanks!
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-04-10  2:14 ` Jason Wang
@ 2023-04-10  3:16   ` longpeng2--- via
  2023-04-10  3:21     ` Jason Wang
  0 siblings, 1 reply; 27+ messages in thread
From: longpeng2--- via @ 2023-04-10  3:16 UTC (permalink / raw)
  To: Jason Wang, Eugenio Perez Martin
  Cc: qemu-level, Michael Tsirkin, Si-Wei Liu, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Gonglei (Arei)



在 2023/4/10 10:14, Jason Wang 写道:
> On Wed, Apr 5, 2023 at 7:38 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>>
>> Hi!
>>
>> As mentioned in the last upstream virtio-networking meeting, one of
>> the factors that adds more downtime to migration is the handling of
>> the guest memory (pin, map, etc). At this moment this handling is
>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
>> destination device waits until all the guest memory / state is
>> migrated to start pinning all the memory.
>>
>> The proposal is to bind it to the char device life cycle (open vs
>> close), so all the guest memory can be pinned for all the guest / qemu
>> lifecycle.
>>
>> This has two main problems:
>> * At this moment the reset semantics forces the vdpa device to unmap
>> all the memory. So this change needs a vhost vdpa feature flag.
> 
> Is this true? I didn't find any codes to unmap the memory in
> vhost_vdpa_set_status().
> 

It could depend on the vendor driver, for example, the vdpasim would do 
something like that.

vhost_vdpa_set_status->vdpa_reset->vdpasim_reset->vdpasim_do_reset->vhost_iotlb_reset

> Thanks
> 
>> * This may increase the initialization time. Maybe we can delay it if
>> qemu is not the destination of a LM. Anyway I think this should be
>> done as an optimization on top.
>>
>> Any ideas or comments in this regard?
>>
>> Thanks!
>>
> 
> .


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-04-10  3:16   ` longpeng2--- via
@ 2023-04-10  3:21     ` Jason Wang
  2023-04-10  9:04       ` Eugenio Perez Martin
  0 siblings, 1 reply; 27+ messages in thread
From: Jason Wang @ 2023-04-10  3:21 UTC (permalink / raw)
  To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  Cc: Eugenio Perez Martin, qemu-level, Michael Tsirkin, Si-Wei Liu,
	Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert

On Mon, Apr 10, 2023 at 11:17 AM Longpeng (Mike, Cloud Infrastructure
Service Product Dept.) <longpeng2@huawei.com> wrote:
>
>
>
> 在 2023/4/10 10:14, Jason Wang 写道:
> > On Wed, Apr 5, 2023 at 7:38 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >>
> >> Hi!
> >>
> >> As mentioned in the last upstream virtio-networking meeting, one of
> >> the factors that adds more downtime to migration is the handling of
> >> the guest memory (pin, map, etc). At this moment this handling is
> >> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> >> destination device waits until all the guest memory / state is
> >> migrated to start pinning all the memory.
> >>
> >> The proposal is to bind it to the char device life cycle (open vs
> >> close), so all the guest memory can be pinned for all the guest / qemu
> >> lifecycle.
> >>
> >> This has two main problems:
> >> * At this moment the reset semantics forces the vdpa device to unmap
> >> all the memory. So this change needs a vhost vdpa feature flag.
> >
> > Is this true? I didn't find any codes to unmap the memory in
> > vhost_vdpa_set_status().
> >
>
> It could depend on the vendor driver, for example, the vdpasim would do
> something like that.
>
> vhost_vdpa_set_status->vdpa_reset->vdpasim_reset->vdpasim_do_reset->vhost_iotlb_reset

This looks like a bug. Or I wonder if any user space depends on this
behaviour, if yes, we really need a new flag then.

Thanks

>
> > Thanks
> >
> >> * This may increase the initialization time. Maybe we can delay it if
> >> qemu is not the destination of a LM. Anyway I think this should be
> >> done as an optimization on top.
> >>
> >> Any ideas or comments in this regard?
> >>
> >> Thanks!
> >>
> >
> > .
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-04-10  3:21     ` Jason Wang
@ 2023-04-10  9:04       ` Eugenio Perez Martin
  2023-04-11  2:25         ` Jason Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Eugenio Perez Martin @ 2023-04-10  9:04 UTC (permalink / raw)
  To: Jason Wang
  Cc: Longpeng (Mike, Cloud Infrastructure Service Product Dept.),
	qemu-level, Michael Tsirkin, Si-Wei Liu, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert

On Mon, Apr 10, 2023 at 5:22 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Apr 10, 2023 at 11:17 AM Longpeng (Mike, Cloud Infrastructure
> Service Product Dept.) <longpeng2@huawei.com> wrote:
> >
> >
> >
> > 在 2023/4/10 10:14, Jason Wang 写道:
> > > On Wed, Apr 5, 2023 at 7:38 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >>
> > >> Hi!
> > >>
> > >> As mentioned in the last upstream virtio-networking meeting, one of
> > >> the factors that adds more downtime to migration is the handling of
> > >> the guest memory (pin, map, etc). At this moment this handling is
> > >> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> > >> destination device waits until all the guest memory / state is
> > >> migrated to start pinning all the memory.
> > >>
> > >> The proposal is to bind it to the char device life cycle (open vs
> > >> close), so all the guest memory can be pinned for all the guest / qemu
> > >> lifecycle.
> > >>
> > >> This has two main problems:
> > >> * At this moment the reset semantics forces the vdpa device to unmap
> > >> all the memory. So this change needs a vhost vdpa feature flag.
> > >
> > > Is this true? I didn't find any codes to unmap the memory in
> > > vhost_vdpa_set_status().
> > >
> >
> > It could depend on the vendor driver, for example, the vdpasim would do
> > something like that.
> >
> > vhost_vdpa_set_status->vdpa_reset->vdpasim_reset->vdpasim_do_reset->vhost_iotlb_reset
>
> This looks like a bug. Or I wonder if any user space depends on this
> behaviour, if yes, we really need a new flag then.
>

My understanding was that we depend on this for cases like qemu
crashes. We don't do an unmap(-1ULL) or anything like that to make
sure the device is clean when we bind a second qemu to the same
device. That's why I think that close() should clean them. Or maybe
even open().

The only other option I see is to remove the whole vhost-vdpa device
every time, or am I missing something?

Thanks!



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-04-10  9:04       ` Eugenio Perez Martin
@ 2023-04-11  2:25         ` Jason Wang
  2023-04-11  6:28           ` Eugenio Perez Martin
  0 siblings, 1 reply; 27+ messages in thread
From: Jason Wang @ 2023-04-11  2:25 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Longpeng (Mike, Cloud Infrastructure Service Product Dept.),
	qemu-level, Michael Tsirkin, Si-Wei Liu, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert

On Mon, Apr 10, 2023 at 5:05 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, Apr 10, 2023 at 5:22 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, Apr 10, 2023 at 11:17 AM Longpeng (Mike, Cloud Infrastructure
> > Service Product Dept.) <longpeng2@huawei.com> wrote:
> > >
> > >
> > >
> > > 在 2023/4/10 10:14, Jason Wang 写道:
> > > > On Wed, Apr 5, 2023 at 7:38 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >>
> > > >> Hi!
> > > >>
> > > >> As mentioned in the last upstream virtio-networking meeting, one of
> > > >> the factors that adds more downtime to migration is the handling of
> > > >> the guest memory (pin, map, etc). At this moment this handling is
> > > >> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> > > >> destination device waits until all the guest memory / state is
> > > >> migrated to start pinning all the memory.
> > > >>
> > > >> The proposal is to bind it to the char device life cycle (open vs
> > > >> close), so all the guest memory can be pinned for all the guest / qemu
> > > >> lifecycle.
> > > >>
> > > >> This has two main problems:
> > > >> * At this moment the reset semantics forces the vdpa device to unmap
> > > >> all the memory. So this change needs a vhost vdpa feature flag.
> > > >
> > > > Is this true? I didn't find any codes to unmap the memory in
> > > > vhost_vdpa_set_status().
> > > >
> > >
> > > It could depend on the vendor driver, for example, the vdpasim would do
> > > something like that.
> > >
> > > vhost_vdpa_set_status->vdpa_reset->vdpasim_reset->vdpasim_do_reset->vhost_iotlb_reset
> >
> > This looks like a bug. Or I wonder if any user space depends on this
> > behaviour, if yes, we really need a new flag then.
> >
>
> My understanding was that we depend on this for cases like qemu
> crashes. We don't do an unmap(-1ULL) or anything like that to make
> sure the device is clean when we bind a second qemu to the same
> device. That's why I think that close() should clean them.

In vhost_vdpa_release() we do:

vhost_vdpa_release()
    vhost_vdpa_cleanup()
        for_each_as()
            vhost_vdpa_remove_as()
                vhost_vdpa_iotlb_unmap(0ULL, 0ULL - 1)
        vhost_vdpa_free_domain()

Anything wrong here?

Conceptually, the address mapping is not a part of the abstraction for
a virtio device now. So resetting the memory mapping during virtio
device reset seems wrong.

Thanks

> Or maybe
> even open().
>
> The only other option I see is to remove the whole vhost-vdpa device
> every time, or am I missing something?
>
> Thanks!
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-04-11  2:25         ` Jason Wang
@ 2023-04-11  6:28           ` Eugenio Perez Martin
  2023-04-11  6:36             ` Jason Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Eugenio Perez Martin @ 2023-04-11  6:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Longpeng (Mike, Cloud Infrastructure Service Product Dept.),
	qemu-level, Michael Tsirkin, Si-Wei Liu, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert

On Tue, Apr 11, 2023 at 4:26 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Apr 10, 2023 at 5:05 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Mon, Apr 10, 2023 at 5:22 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Mon, Apr 10, 2023 at 11:17 AM Longpeng (Mike, Cloud Infrastructure
> > > Service Product Dept.) <longpeng2@huawei.com> wrote:
> > > >
> > > >
> > > >
> > > > 在 2023/4/10 10:14, Jason Wang 写道:
> > > > > On Wed, Apr 5, 2023 at 7:38 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > >>
> > > > >> Hi!
> > > > >>
> > > > >> As mentioned in the last upstream virtio-networking meeting, one of
> > > > >> the factors that adds more downtime to migration is the handling of
> > > > >> the guest memory (pin, map, etc). At this moment this handling is
> > > > >> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> > > > >> destination device waits until all the guest memory / state is
> > > > >> migrated to start pinning all the memory.
> > > > >>
> > > > >> The proposal is to bind it to the char device life cycle (open vs
> > > > >> close), so all the guest memory can be pinned for all the guest / qemu
> > > > >> lifecycle.
> > > > >>
> > > > >> This has two main problems:
> > > > >> * At this moment the reset semantics forces the vdpa device to unmap
> > > > >> all the memory. So this change needs a vhost vdpa feature flag.
> > > > >
> > > > > Is this true? I didn't find any codes to unmap the memory in
> > > > > vhost_vdpa_set_status().
> > > > >
> > > >
> > > > It could depend on the vendor driver, for example, the vdpasim would do
> > > > something like that.
> > > >
> > > > vhost_vdpa_set_status->vdpa_reset->vdpasim_reset->vdpasim_do_reset->vhost_iotlb_reset
> > >
> > > This looks like a bug. Or I wonder if any user space depends on this
> > > behaviour, if yes, we really need a new flag then.
> > >
> >
> > My understanding was that we depend on this for cases like qemu
> > crashes. We don't do an unmap(-1ULL) or anything like that to make
> > sure the device is clean when we bind a second qemu to the same
> > device. That's why I think that close() should clean them.
>
> In vhost_vdpa_release() we do:
>
> vhost_vdpa_release()
>     vhost_vdpa_cleanup()
>         for_each_as()
>             vhost_vdpa_remove_as()
>                 vhost_vdpa_iotlb_unmap(0ULL, 0ULL - 1)
>         vhost_vdpa_free_domain()
>
> Anything wrong here?
>

No, I think we just trusted in different pre-existing cleanup points
"semantics".

> Conceptually, the address mapping is not a part of the abstraction for
> a virtio device now. So resetting the memory mapping during virtio
> device reset seems wrong.
>

I agree. So then no change in the kernel should be needed but to
revert this cleanup on device reset. I guess we should document it
ops->reset just in case?

Thanks!



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-04-11  6:28           ` Eugenio Perez Martin
@ 2023-04-11  6:36             ` Jason Wang
  0 siblings, 0 replies; 27+ messages in thread
From: Jason Wang @ 2023-04-11  6:36 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Longpeng (Mike, Cloud Infrastructure Service Product Dept.),
	qemu-level, Michael Tsirkin, Si-Wei Liu, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert

On Tue, Apr 11, 2023 at 2:29 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Tue, Apr 11, 2023 at 4:26 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, Apr 10, 2023 at 5:05 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Mon, Apr 10, 2023 at 5:22 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Mon, Apr 10, 2023 at 11:17 AM Longpeng (Mike, Cloud Infrastructure
> > > > Service Product Dept.) <longpeng2@huawei.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > 在 2023/4/10 10:14, Jason Wang 写道:
> > > > > > On Wed, Apr 5, 2023 at 7:38 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > >>
> > > > > >> Hi!
> > > > > >>
> > > > > >> As mentioned in the last upstream virtio-networking meeting, one of
> > > > > >> the factors that adds more downtime to migration is the handling of
> > > > > >> the guest memory (pin, map, etc). At this moment this handling is
> > > > > >> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> > > > > >> destination device waits until all the guest memory / state is
> > > > > >> migrated to start pinning all the memory.
> > > > > >>
> > > > > >> The proposal is to bind it to the char device life cycle (open vs
> > > > > >> close), so all the guest memory can be pinned for all the guest / qemu
> > > > > >> lifecycle.
> > > > > >>
> > > > > >> This has two main problems:
> > > > > >> * At this moment the reset semantics forces the vdpa device to unmap
> > > > > >> all the memory. So this change needs a vhost vdpa feature flag.
> > > > > >
> > > > > > Is this true? I didn't find any codes to unmap the memory in
> > > > > > vhost_vdpa_set_status().
> > > > > >
> > > > >
> > > > > It could depend on the vendor driver, for example, the vdpasim would do
> > > > > something like that.
> > > > >
> > > > > vhost_vdpa_set_status->vdpa_reset->vdpasim_reset->vdpasim_do_reset->vhost_iotlb_reset
> > > >
> > > > This looks like a bug. Or I wonder if any user space depends on this
> > > > behaviour, if yes, we really need a new flag then.
> > > >
> > >
> > > My understanding was that we depend on this for cases like qemu
> > > crashes. We don't do an unmap(-1ULL) or anything like that to make
> > > sure the device is clean when we bind a second qemu to the same
> > > device. That's why I think that close() should clean them.
> >
> > In vhost_vdpa_release() we do:
> >
> > vhost_vdpa_release()
> >     vhost_vdpa_cleanup()
> >         for_each_as()
> >             vhost_vdpa_remove_as()
> >                 vhost_vdpa_iotlb_unmap(0ULL, 0ULL - 1)
> >         vhost_vdpa_free_domain()
> >
> > Anything wrong here?
> >
>
> No, I think we just trusted in different pre-existing cleanup points
> "semantics".
>
> > Conceptually, the address mapping is not a part of the abstraction for
> > a virtio device now. So resetting the memory mapping during virtio
> > device reset seems wrong.
> >
>
> I agree. So then no change in the kernel should be needed but to
> revert this cleanup on device reset.

I think you meant simulator, then yes.

> I guess we should document it
> ops->reset just in case?

Probably.

Thanks

>
> Thanks!
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-04-05 11:37 Reducing vdpa migration downtime because of memory pin / maps Eugenio Perez Martin
  2023-04-10  2:14 ` Jason Wang
@ 2023-04-11 12:33 ` Eugenio Perez Martin
  2023-04-12  5:56   ` Jason Wang
  2023-06-06 22:44 ` Si-Wei Liu
  2 siblings, 1 reply; 27+ messages in thread
From: Eugenio Perez Martin @ 2023-04-11 12:33 UTC (permalink / raw)
  To: qemu-level
  Cc: Jason Wang, Michael Tsirkin, Si-Wei Liu, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert

On Wed, Apr 5, 2023 at 1:37 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> Hi!
>
> As mentioned in the last upstream virtio-networking meeting, one of
> the factors that adds more downtime to migration is the handling of
> the guest memory (pin, map, etc). At this moment this handling is
> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> destination device waits until all the guest memory / state is
> migrated to start pinning all the memory.
>
> The proposal is to bind it to the char device life cycle (open vs
> close), so all the guest memory can be pinned for all the guest / qemu
> lifecycle.
>
> This has two main problems:
> * At this moment the reset semantics forces the vdpa device to unmap
> all the memory. So this change needs a vhost vdpa feature flag.
> * This may increase the initialization time. Maybe we can delay it if
> qemu is not the destination of a LM. Anyway I think this should be
> done as an optimization on top.
>

Expanding on this we could reduce the pinning even more now that vring
supports VA [1] with the emulated CVQ.

Something like:
- Add VHOST_VRING_GROUP_CAN_USE_VA ioctl to check if a given VQ group
capability. Passthrough devices with emulated CVQ would return false
for the dataplane and true for the control vq group.
- If that is true, qemu does not need to map and translate addresses
for CVQ but to directly provide VA for buffers. This avoids pinning,
translations, etc in this case.

Thanks!

[1] https://lore.kernel.org/virtualization/20230404131326.44403-2-sgarzare@redhat.com/



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-04-11 12:33 ` Eugenio Perez Martin
@ 2023-04-12  5:56   ` Jason Wang
  2023-04-12  6:18     ` Jason Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Jason Wang @ 2023-04-12  5:56 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-level, Michael Tsirkin, Si-Wei Liu, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert

On Tue, Apr 11, 2023 at 8:34 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Apr 5, 2023 at 1:37 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > Hi!
> >
> > As mentioned in the last upstream virtio-networking meeting, one of
> > the factors that adds more downtime to migration is the handling of
> > the guest memory (pin, map, etc). At this moment this handling is
> > bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> > destination device waits until all the guest memory / state is
> > migrated to start pinning all the memory.
> >
> > The proposal is to bind it to the char device life cycle (open vs
> > close), so all the guest memory can be pinned for all the guest / qemu
> > lifecycle.
> >
> > This has two main problems:
> > * At this moment the reset semantics forces the vdpa device to unmap
> > all the memory. So this change needs a vhost vdpa feature flag.
> > * This may increase the initialization time. Maybe we can delay it if
> > qemu is not the destination of a LM. Anyway I think this should be
> > done as an optimization on top.
> >
>
> Expanding on this we could reduce the pinning even more now that vring
> supports VA [1] with the emulated CVQ.

Note that VA for hardware means the device needs to support page fault
through either PRI or vendor specific interface.

>
> Something like:
> - Add VHOST_VRING_GROUP_CAN_USE_VA ioctl to check if a given VQ group
> capability. Passthrough devices with emulated CVQ would return false
> for the dataplane and true for the control vq group.
> - If that is true, qemu does not need to map and translate addresses
> for CVQ but to directly provide VA for buffers. This avoids pinning,
> translations, etc in this case.

For CVQ yes, but we only avoid the pinning for CVQ not others.

Thanks

>
> Thanks!
>
> [1] https://lore.kernel.org/virtualization/20230404131326.44403-2-sgarzare@redhat.com/
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-04-12  5:56   ` Jason Wang
@ 2023-04-12  6:18     ` Jason Wang
  2023-04-13  7:27       ` Eugenio Perez Martin
  0 siblings, 1 reply; 27+ messages in thread
From: Jason Wang @ 2023-04-12  6:18 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-level, Michael Tsirkin, Si-Wei Liu, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert

On Wed, Apr 12, 2023 at 1:56 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, Apr 11, 2023 at 8:34 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Wed, Apr 5, 2023 at 1:37 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > Hi!
> > >
> > > As mentioned in the last upstream virtio-networking meeting, one of
> > > the factors that adds more downtime to migration is the handling of
> > > the guest memory (pin, map, etc). At this moment this handling is
> > > bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> > > destination device waits until all the guest memory / state is
> > > migrated to start pinning all the memory.
> > >
> > > The proposal is to bind it to the char device life cycle (open vs
> > > close), so all the guest memory can be pinned for all the guest / qemu
> > > lifecycle.
> > >
> > > This has two main problems:
> > > * At this moment the reset semantics forces the vdpa device to unmap
> > > all the memory. So this change needs a vhost vdpa feature flag.
> > > * This may increase the initialization time. Maybe we can delay it if
> > > qemu is not the destination of a LM. Anyway I think this should be
> > > done as an optimization on top.
> > >
> >
> > Expanding on this we could reduce the pinning even more now that vring
> > supports VA [1] with the emulated CVQ.
>
> Note that VA for hardware means the device needs to support page fault
> through either PRI or vendor specific interface.
>
> >
> > Something like:
> > - Add VHOST_VRING_GROUP_CAN_USE_VA ioctl to check if a given VQ group
> > capability. Passthrough devices with emulated CVQ would return false
> > for the dataplane and true for the control vq group.

We don't even need this actually, since the pinning is not visible to
the userspace. Userspace can only see the IOTLB abstraction actually.

We can invent a group->use_va, then when we attach AS to a group that
can use va, we can avoid the pinning.

Thanks

> > - If that is true, qemu does not need to map and translate addresses
> > for CVQ but to directly provide VA for buffers. This avoids pinning,
> > translations, etc in this case.
>
> For CVQ yes, but we only avoid the pinning for CVQ not others.
>
> Thanks
>
> >
> > Thanks!
> >
> > [1] https://lore.kernel.org/virtualization/20230404131326.44403-2-sgarzare@redhat.com/
> >



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-04-12  6:18     ` Jason Wang
@ 2023-04-13  7:27       ` Eugenio Perez Martin
  0 siblings, 0 replies; 27+ messages in thread
From: Eugenio Perez Martin @ 2023-04-13  7:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-level, Michael Tsirkin, Si-Wei Liu, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert

On Wed, Apr 12, 2023 at 8:19 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Apr 12, 2023 at 1:56 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Tue, Apr 11, 2023 at 8:34 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Wed, Apr 5, 2023 at 1:37 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > Hi!
> > > >
> > > > As mentioned in the last upstream virtio-networking meeting, one of
> > > > the factors that adds more downtime to migration is the handling of
> > > > the guest memory (pin, map, etc). At this moment this handling is
> > > > bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> > > > destination device waits until all the guest memory / state is
> > > > migrated to start pinning all the memory.
> > > >
> > > > The proposal is to bind it to the char device life cycle (open vs
> > > > close), so all the guest memory can be pinned for all the guest / qemu
> > > > lifecycle.
> > > >
> > > > This has two main problems:
> > > > * At this moment the reset semantics forces the vdpa device to unmap
> > > > all the memory. So this change needs a vhost vdpa feature flag.
> > > > * This may increase the initialization time. Maybe we can delay it if
> > > > qemu is not the destination of a LM. Anyway I think this should be
> > > > done as an optimization on top.
> > > >
> > >
> > > Expanding on this we could reduce the pinning even more now that vring
> > > supports VA [1] with the emulated CVQ.
> >
> > Note that VA for hardware means the device needs to support page fault
> > through either PRI or vendor specific interface.
> >
> > >
> > > Something like:
> > > - Add VHOST_VRING_GROUP_CAN_USE_VA ioctl to check if a given VQ group
> > > capability. Passthrough devices with emulated CVQ would return false
> > > for the dataplane and true for the control vq group.
>
> We don't even need this actually, since the pinning is not visible to
> the userspace. Userspace can only see the IOTLB abstraction actually.
>
> We can invent a group->use_va, then when we attach AS to a group that
> can use va, we can avoid the pinning.
>

That would solve one part for sure, but SVQ will keep translating HVA
to SVQ IOVA, and then the kernel needs to translate it back. With
VHOST_VRING_GROUP_CAN_USE_VA, the SVQ and the kernel skip all
translation.

Thanks!

> Thanks
>
> > > - If that is true, qemu does not need to map and translate addresses
> > > for CVQ but to directly provide VA for buffers. This avoids pinning,
> > > translations, etc in this case.
> >
> > For CVQ yes, but we only avoid the pinning for CVQ not others.
> >
> > Thanks
> >
> > >
> > > Thanks!
> > >
> > > [1] https://lore.kernel.org/virtualization/20230404131326.44403-2-sgarzare@redhat.com/
> > >
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-04-05 11:37 Reducing vdpa migration downtime because of memory pin / maps Eugenio Perez Martin
  2023-04-10  2:14 ` Jason Wang
  2023-04-11 12:33 ` Eugenio Perez Martin
@ 2023-06-06 22:44 ` Si-Wei Liu
  2023-06-07  8:08   ` Eugenio Perez Martin
  2 siblings, 1 reply; 27+ messages in thread
From: Si-Wei Liu @ 2023-06-06 22:44 UTC (permalink / raw)
  To: Eugenio Perez Martin, qemu-level
  Cc: Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea

Sorry for reviving this old thread, I lost the best timing to follow up 
on this while I was on vacation. I have been working on this and found 
out some discrepancy, please see below.

On 4/5/23 04:37, Eugenio Perez Martin wrote:
> Hi!
>
> As mentioned in the last upstream virtio-networking meeting, one of
> the factors that adds more downtime to migration is the handling of
> the guest memory (pin, map, etc). At this moment this handling is
> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> destination device waits until all the guest memory / state is
> migrated to start pinning all the memory.
>
> The proposal is to bind it to the char device life cycle (open vs
> close),

Hmmm, really? If it's the life cycle for char device, the next guest / 
qemu launch on the same vhost-vdpa device node won't make it work.

>   so all the guest memory can be pinned for all the guest / qemu
> lifecycle.

I think to tie pinning to guest / qemu process life cycle makes more 
sense. Essentially this pinning part needs to be decoupled from the 
iotlb mapping abstraction layer, and can / should work as a standalone 
uAPI. Such that QEMU at the destination may launch and pin all guest's 
memory as needed without having to start the device, while awaiting any 
incoming migration request. Though problem is, there's no existing vhost 
uAPI that could properly serve as the vehicle for that. SET_OWNER / 
SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against 
introducing a new but clean vhost uAPI for pinning guest pages, subject 
to guest's life cycle?

Another concern is the use_va stuff, originally it tags to the device 
level and is made static at the time of device instantiation, which is 
fine. But others to come just find a new home at per-group level or 
per-vq level struct. Hard to tell whether or not pinning is actually 
needed for the latter use_va friends, as they are essentially tied to 
the virtio life cycle or feature negotiation. While guest / Qemu starts 
way earlier than that. Perhaps just ignore those sub-device level use_va 
usages? Presumably !use_va at the device level is sufficient to infer 
the need of pinning for device?

Regards,
-Siwei

>
> This has two main problems:
> * At this moment the reset semantics forces the vdpa device to unmap
> all the memory. So this change needs a vhost vdpa feature flag.
> * This may increase the initialization time. Maybe we can delay it if
> qemu is not the destination of a LM. Anyway I think this should be
> done as an optimization on top.
>
> Any ideas or comments in this regard?
>
> Thanks!
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-06-06 22:44 ` Si-Wei Liu
@ 2023-06-07  8:08   ` Eugenio Perez Martin
  2023-06-08 22:40     ` Si-Wei Liu
  0 siblings, 1 reply; 27+ messages in thread
From: Eugenio Perez Martin @ 2023-06-07  8:08 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-level, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea

On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
> Sorry for reviving this old thread, I lost the best timing to follow up
> on this while I was on vacation. I have been working on this and found
> out some discrepancy, please see below.
>
> On 4/5/23 04:37, Eugenio Perez Martin wrote:
> > Hi!
> >
> > As mentioned in the last upstream virtio-networking meeting, one of
> > the factors that adds more downtime to migration is the handling of
> > the guest memory (pin, map, etc). At this moment this handling is
> > bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> > destination device waits until all the guest memory / state is
> > migrated to start pinning all the memory.
> >
> > The proposal is to bind it to the char device life cycle (open vs
> > close),
>
> Hmmm, really? If it's the life cycle for char device, the next guest /
> qemu launch on the same vhost-vdpa device node won't make it work.
>

Maybe my sentence was not accurate, but I think we're on the same page here.

Two qemu instances opening the same char device at the same time are
not allowed, and vhost_vdpa_release clean all the maps. So the next
qemu that opens the char device should see a clean device anyway.

>
> >   so all the guest memory can be pinned for all the guest / qemu
> > lifecycle.
>
> I think to tie pinning to guest / qemu process life cycle makes more
> sense. Essentially this pinning part needs to be decoupled from the
> iotlb mapping abstraction layer, and can / should work as a standalone
> uAPI. Such that QEMU at the destination may launch and pin all guest's
> memory as needed without having to start the device, while awaiting any
> incoming migration request. Though problem is, there's no existing vhost
> uAPI that could properly serve as the vehicle for that. SET_OWNER /
> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
> introducing a new but clean vhost uAPI for pinning guest pages, subject
> to guest's life cycle?
>

I think that to pin or not pin memory maps should be a kernel
decision, not to be driven by qemu. I'm not against it if needed, but
let me know if the current "clean at close" address your concerns.

> Another concern is the use_va stuff, originally it tags to the device
> level and is made static at the time of device instantiation, which is
> fine. But others to come just find a new home at per-group level or
> per-vq level struct. Hard to tell whether or not pinning is actually
> needed for the latter use_va friends, as they are essentially tied to
> the virtio life cycle or feature negotiation. While guest / Qemu starts
> way earlier than that. Perhaps just ignore those sub-device level use_va
> usages? Presumably !use_va at the device level is sufficient to infer
> the need of pinning for device?
>

I don't follow this. But I have the feeling that the subject of my
original mail is way more accurate if I would have said just "memory
maps".

I still consider the way to fix it is to actually delegate that to the
kernel vdpa, so it can choose if a particular ASID needs the pin or
not. But let me know if I missed something.

Thanks!

> Regards,
> -Siwei
>
>
> >
> > This has two main problems:
> > * At this moment the reset semantics forces the vdpa device to unmap
> > all the memory. So this change needs a vhost vdpa feature flag.
> > * This may increase the initialization time. Maybe we can delay it if
> > qemu is not the destination of a LM. Anyway I think this should be
> > done as an optimization on top.
> >
> > Any ideas or comments in this regard?
> >
> > Thanks!
> >
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-06-07  8:08   ` Eugenio Perez Martin
@ 2023-06-08 22:40     ` Si-Wei Liu
  2023-06-09  3:18       ` Jason Wang
  2023-06-09 14:32       ` Eugenio Perez Martin
  0 siblings, 2 replies; 27+ messages in thread
From: Si-Wei Liu @ 2023-06-08 22:40 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-level, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea


On 6/7/23 01:08, Eugenio Perez Martin wrote:
> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> Sorry for reviving this old thread, I lost the best timing to follow up
>> on this while I was on vacation. I have been working on this and found
>> out some discrepancy, please see below.
>>
>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
>>> Hi!
>>>
>>> As mentioned in the last upstream virtio-networking meeting, one of
>>> the factors that adds more downtime to migration is the handling of
>>> the guest memory (pin, map, etc). At this moment this handling is
>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
>>> destination device waits until all the guest memory / state is
>>> migrated to start pinning all the memory.
>>>
>>> The proposal is to bind it to the char device life cycle (open vs
>>> close),
>> Hmmm, really? If it's the life cycle for char device, the next guest /
>> qemu launch on the same vhost-vdpa device node won't make it work.
>>
> Maybe my sentence was not accurate, but I think we're on the same page here.
>
> Two qemu instances opening the same char device at the same time are
> not allowed, and vhost_vdpa_release clean all the maps. So the next
> qemu that opens the char device should see a clean device anyway.

I mean the pin can't be done at the time of char device open, where the 
user address space is not known/bound yet. The earliest point possible 
for pinning would be until the vhost_attach_mm() call from SET_OWNER is 
done. Actually I think the counterpart vhost_detach_mm() only gets 
handled in vhost_vdpa_release() at device close time is a resulting 
artifact and amiss of today's vhost protocol - the opposite RESET_OWNER 
call is not made mandatory hence only seen implemented in vhost-net 
device today. One qemu instance could well exec(3) another new qemu 
instance to live upgrade itself while keeping all emulated devices and 
guest alive. The current vhost design simply prohibits this from happening.


>
>>>    so all the guest memory can be pinned for all the guest / qemu
>>> lifecycle.
>> I think to tie pinning to guest / qemu process life cycle makes more
>> sense. Essentially this pinning part needs to be decoupled from the
>> iotlb mapping abstraction layer, and can / should work as a standalone
>> uAPI. Such that QEMU at the destination may launch and pin all guest's
>> memory as needed without having to start the device, while awaiting any
>> incoming migration request. Though problem is, there's no existing vhost
>> uAPI that could properly serve as the vehicle for that. SET_OWNER /
>> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
>> introducing a new but clean vhost uAPI for pinning guest pages, subject
>> to guest's life cycle?
>>
> I think that to pin or not pin memory maps should be a kernel
> decision, not to be driven by qemu.

It's kernel decision for sure. I am with this part.

> I'm not against it if needed, but
> let me know if the current "clean at close" address your concerns.

To better facilitate QEMU exec (live update) case, I propose we add new 
vhost uAPI pair for explicit pinning request - which would live with 
user mm's, or more precisely qemu instance's lifecycle.

>
>> Another concern is the use_va stuff, originally it tags to the device
>> level and is made static at the time of device instantiation, which is
>> fine. But others to come just find a new home at per-group level or
>> per-vq level struct. Hard to tell whether or not pinning is actually
>> needed for the latter use_va friends, as they are essentially tied to
>> the virtio life cycle or feature negotiation. While guest / Qemu starts
>> way earlier than that. Perhaps just ignore those sub-device level use_va
>> usages? Presumably !use_va at the device level is sufficient to infer
>> the need of pinning for device?
>>
> I don't follow this. But I have the feeling that the subject of my
> original mail is way more accurate if I would have said just "memory
> maps".

I think the iotlb layer in vhost-vdpa just provides the abstraction for 
mapping, not pinning. Although in some case mapping implicitly relies on 
pinning for DMA purpose, it doesn't have to tie to that in uAPI 
semantics. We can do explicit on-demand pinning for cases for e.g. 
warming up device at live migration destination.


>
> I still consider the way to fix it is to actually delegate that to the
> kernel vdpa, so it can choose if a particular ASID needs the pin or
> not. But let me know if I missed something.

You can disregard this for now. I will discuss that further with you 
guys while bind_mm and per-group use_va stuffs are landed.

Thanks!
-Siwei



>
> Thanks!
>
>> Regards,
>> -Siwei
>>
>>
>>> This has two main problems:
>>> * At this moment the reset semantics forces the vdpa device to unmap
>>> all the memory. So this change needs a vhost vdpa feature flag.
>>> * This may increase the initialization time. Maybe we can delay it if
>>> qemu is not the destination of a LM. Anyway I think this should be
>>> done as an optimization on top.
>>>
>>> Any ideas or comments in this regard?
>>>
>>> Thanks!
>>>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-06-08 22:40     ` Si-Wei Liu
@ 2023-06-09  3:18       ` Jason Wang
  2023-06-09 14:32       ` Eugenio Perez Martin
  1 sibling, 0 replies; 27+ messages in thread
From: Jason Wang @ 2023-06-09  3:18 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Eugenio Perez Martin, qemu-level, Michael Tsirkin, Longpeng,
	Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea

On Fri, Jun 9, 2023 at 6:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
> On 6/7/23 01:08, Eugenio Perez Martin wrote:
> > On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >> Sorry for reviving this old thread, I lost the best timing to follow up
> >> on this while I was on vacation. I have been working on this and found
> >> out some discrepancy, please see below.
> >>
> >> On 4/5/23 04:37, Eugenio Perez Martin wrote:
> >>> Hi!
> >>>
> >>> As mentioned in the last upstream virtio-networking meeting, one of
> >>> the factors that adds more downtime to migration is the handling of
> >>> the guest memory (pin, map, etc). At this moment this handling is
> >>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> >>> destination device waits until all the guest memory / state is
> >>> migrated to start pinning all the memory.
> >>>
> >>> The proposal is to bind it to the char device life cycle (open vs
> >>> close),
> >> Hmmm, really? If it's the life cycle for char device, the next guest /
> >> qemu launch on the same vhost-vdpa device node won't make it work.
> >>
> > Maybe my sentence was not accurate, but I think we're on the same page here.
> >
> > Two qemu instances opening the same char device at the same time are
> > not allowed, and vhost_vdpa_release clean all the maps. So the next
> > qemu that opens the char device should see a clean device anyway.
>
> I mean the pin can't be done at the time of char device open, where the
> user address space is not known/bound yet. The earliest point possible
> for pinning would be until the vhost_attach_mm() call from SET_OWNER is
> done. Actually I think the counterpart vhost_detach_mm() only gets
> handled in vhost_vdpa_release() at device close time is a resulting
> artifact and amiss of today's vhost protocol - the opposite RESET_OWNER
> call is not made mandatory hence only seen implemented in vhost-net
> device today. One qemu instance could well exec(3) another new qemu
> instance to live upgrade itself while keeping all emulated devices and
> guest alive. The current vhost design simply prohibits this from happening.

I'm not sure I fully understand the issue you mention here. What is
missed and can iommufd help anyhow?

>
>
> >
> >>>    so all the guest memory can be pinned for all the guest / qemu
> >>> lifecycle.
> >> I think to tie pinning to guest / qemu process life cycle makes more
> >> sense. Essentially this pinning part needs to be decoupled from the
> >> iotlb mapping abstraction layer, and can / should work as a standalone
> >> uAPI. Such that QEMU at the destination may launch and pin all guest's
> >> memory as needed without having to start the device, while awaiting any
> >> incoming migration request. Though problem is, there's no existing vhost
> >> uAPI that could properly serve as the vehicle for that. SET_OWNER /
> >> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
> >> introducing a new but clean vhost uAPI for pinning guest pages, subject
> >> to guest's life cycle?
> >>
> > I think that to pin or not pin memory maps should be a kernel
> > decision, not to be driven by qemu.
>
> It's kernel decision for sure. I am with this part.
>
> > I'm not against it if needed, but
> > let me know if the current "clean at close" address your concerns.
>
> To better facilitate QEMU exec (live update) case, I propose we add new
> vhost uAPI pair for explicit pinning request - which would live with
> user mm's, or more precisely qemu instance's lifecycle.

Is this something similar to iommufd?

Btw, I'm not sure whether or not it's good to expose pinning to the
userspace. For example, vhost allows virtual mapping instead of dma
mapping which doesn't require pinning at all.

Thanks

>
> >
> >> Another concern is the use_va stuff, originally it tags to the device
> >> level and is made static at the time of device instantiation, which is
> >> fine. But others to come just find a new home at per-group level or
> >> per-vq level struct. Hard to tell whether or not pinning is actually
> >> needed for the latter use_va friends, as they are essentially tied to
> >> the virtio life cycle or feature negotiation. While guest / Qemu starts
> >> way earlier than that. Perhaps just ignore those sub-device level use_va
> >> usages? Presumably !use_va at the device level is sufficient to infer
> >> the need of pinning for device?
> >>
> > I don't follow this. But I have the feeling that the subject of my
> > original mail is way more accurate if I would have said just "memory
> > maps".
>
> I think the iotlb layer in vhost-vdpa just provides the abstraction for
> mapping, not pinning. Although in some case mapping implicitly relies on
> pinning for DMA purpose, it doesn't have to tie to that in uAPI
> semantics. We can do explicit on-demand pinning for cases for e.g.
> warming up device at live migration destination.
>
>
> >
> > I still consider the way to fix it is to actually delegate that to the
> > kernel vdpa, so it can choose if a particular ASID needs the pin or
> > not. But let me know if I missed something.
>
> You can disregard this for now. I will discuss that further with you
> guys while bind_mm and per-group use_va stuffs are landed.
>
> Thanks!
> -Siwei
>
>
>
> >
> > Thanks!
> >
> >> Regards,
> >> -Siwei
> >>
> >>
> >>> This has two main problems:
> >>> * At this moment the reset semantics forces the vdpa device to unmap
> >>> all the memory. So this change needs a vhost vdpa feature flag.
> >>> * This may increase the initialization time. Maybe we can delay it if
> >>> qemu is not the destination of a LM. Anyway I think this should be
> >>> done as an optimization on top.
> >>>
> >>> Any ideas or comments in this regard?
> >>>
> >>> Thanks!
> >>>
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-06-08 22:40     ` Si-Wei Liu
  2023-06-09  3:18       ` Jason Wang
@ 2023-06-09 14:32       ` Eugenio Perez Martin
  2023-06-27  6:36         ` Si-Wei Liu
  1 sibling, 1 reply; 27+ messages in thread
From: Eugenio Perez Martin @ 2023-06-09 14:32 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea

On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
> On 6/7/23 01:08, Eugenio Perez Martin wrote:
> > On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >> Sorry for reviving this old thread, I lost the best timing to follow up
> >> on this while I was on vacation. I have been working on this and found
> >> out some discrepancy, please see below.
> >>
> >> On 4/5/23 04:37, Eugenio Perez Martin wrote:
> >>> Hi!
> >>>
> >>> As mentioned in the last upstream virtio-networking meeting, one of
> >>> the factors that adds more downtime to migration is the handling of
> >>> the guest memory (pin, map, etc). At this moment this handling is
> >>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> >>> destination device waits until all the guest memory / state is
> >>> migrated to start pinning all the memory.
> >>>
> >>> The proposal is to bind it to the char device life cycle (open vs
> >>> close),
> >> Hmmm, really? If it's the life cycle for char device, the next guest /
> >> qemu launch on the same vhost-vdpa device node won't make it work.
> >>
> > Maybe my sentence was not accurate, but I think we're on the same page here.
> >
> > Two qemu instances opening the same char device at the same time are
> > not allowed, and vhost_vdpa_release clean all the maps. So the next
> > qemu that opens the char device should see a clean device anyway.
>
> I mean the pin can't be done at the time of char device open, where the
> user address space is not known/bound yet. The earliest point possible
> for pinning would be until the vhost_attach_mm() call from SET_OWNER is
> done.

Maybe we are deviating, let me start again.

Using QEMU code, what I'm proposing is to modify the lifecycle of the
.listener member of struct vhost_vdpa.

At this moment, the memory listener is registered at
vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
and is unregistered in both vhost_vdpa_reset_status and
vhost_vdpa_cleanup.

My original proposal was just to move the memory listener registration
to the last vhost_vdpa_init, and remove the unregister from
vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
be the same, the device should not realize this change.

One of the concerns was that it could delay VM initialization, and I
didn't profile it but I think that may be the case. I'm not sure about
the right fix but I think the change is easy to profile. If that is
the case, we could:
* use a flag (listener->address_space ?) and only register the
listener in _init if waiting for a migration, do it in _start
otherwise.
* something like io_uring, where the map can be done in parallel and
we can fail _start if some of them fails.

> Actually I think the counterpart vhost_detach_mm() only gets
> handled in vhost_vdpa_release() at device close time is a resulting
> artifact and amiss of today's vhost protocol - the opposite RESET_OWNER
> call is not made mandatory hence only seen implemented in vhost-net
> device today. One qemu instance could well exec(3) another new qemu
> instance to live upgrade itself while keeping all emulated devices and
> guest alive. The current vhost design simply prohibits this from happening.
>

Ok, I was not aware of this. Thanks for explaining it!

>
> >
> >>>    so all the guest memory can be pinned for all the guest / qemu
> >>> lifecycle.
> >> I think to tie pinning to guest / qemu process life cycle makes more
> >> sense. Essentially this pinning part needs to be decoupled from the
> >> iotlb mapping abstraction layer, and can / should work as a standalone
> >> uAPI. Such that QEMU at the destination may launch and pin all guest's
> >> memory as needed without having to start the device, while awaiting any
> >> incoming migration request. Though problem is, there's no existing vhost
> >> uAPI that could properly serve as the vehicle for that. SET_OWNER /
> >> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
> >> introducing a new but clean vhost uAPI for pinning guest pages, subject
> >> to guest's life cycle?
> >>
> > I think that to pin or not pin memory maps should be a kernel
> > decision, not to be driven by qemu.
>
> It's kernel decision for sure. I am with this part.
>
> > I'm not against it if needed, but
> > let me know if the current "clean at close" address your concerns.
>
> To better facilitate QEMU exec (live update) case, I propose we add new
> vhost uAPI pair for explicit pinning request - which would live with
> user mm's, or more precisely qemu instance's lifecycle.
>

Ok I see your problem better now, but I think it should be solved at
kernel level. Does that live update need to forcefully unpin and pin
the memory again, or that is just a consequence of how it works the
memory listener right now?

Why not extend the RESET_OWNER to the rest of devices? It seems the
most natural way to me.

Thanks!


> >
> >> Another concern is the use_va stuff, originally it tags to the device
> >> level and is made static at the time of device instantiation, which is
> >> fine. But others to come just find a new home at per-group level or
> >> per-vq level struct. Hard to tell whether or not pinning is actually
> >> needed for the latter use_va friends, as they are essentially tied to
> >> the virtio life cycle or feature negotiation. While guest / Qemu starts
> >> way earlier than that. Perhaps just ignore those sub-device level use_va
> >> usages? Presumably !use_va at the device level is sufficient to infer
> >> the need of pinning for device?
> >>
> > I don't follow this. But I have the feeling that the subject of my
> > original mail is way more accurate if I would have said just "memory
> > maps".
>
> I think the iotlb layer in vhost-vdpa just provides the abstraction for
> mapping, not pinning. Although in some case mapping implicitly relies on
> pinning for DMA purpose, it doesn't have to tie to that in uAPI
> semantics. We can do explicit on-demand pinning for cases for e.g.
> warming up device at live migration destination.
>
>
> >
> > I still consider the way to fix it is to actually delegate that to the
> > kernel vdpa, so it can choose if a particular ASID needs the pin or
> > not. But let me know if I missed something.
>
> You can disregard this for now. I will discuss that further with you
> guys while bind_mm and per-group use_va stuffs are landed.
>
> Thanks!
> -Siwei
>
>
>
> >
> > Thanks!
> >
> >> Regards,
> >> -Siwei
> >>
> >>
> >>> This has two main problems:
> >>> * At this moment the reset semantics forces the vdpa device to unmap
> >>> all the memory. So this change needs a vhost vdpa feature flag.
> >>> * This may increase the initialization time. Maybe we can delay it if
> >>> qemu is not the destination of a LM. Anyway I think this should be
> >>> done as an optimization on top.
> >>>
> >>> Any ideas or comments in this regard?
> >>>
> >>> Thanks!
> >>>
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-06-09 14:32       ` Eugenio Perez Martin
@ 2023-06-27  6:36         ` Si-Wei Liu
  2023-07-05 18:03           ` Eugenio Perez Martin
  0 siblings, 1 reply; 27+ messages in thread
From: Si-Wei Liu @ 2023-06-27  6:36 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea



On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
> On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>> On 6/7/23 01:08, Eugenio Perez Martin wrote:
>>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>> Sorry for reviving this old thread, I lost the best timing to follow up
>>>> on this while I was on vacation. I have been working on this and found
>>>> out some discrepancy, please see below.
>>>>
>>>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
>>>>> Hi!
>>>>>
>>>>> As mentioned in the last upstream virtio-networking meeting, one of
>>>>> the factors that adds more downtime to migration is the handling of
>>>>> the guest memory (pin, map, etc). At this moment this handling is
>>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
>>>>> destination device waits until all the guest memory / state is
>>>>> migrated to start pinning all the memory.
>>>>>
>>>>> The proposal is to bind it to the char device life cycle (open vs
>>>>> close),
>>>> Hmmm, really? If it's the life cycle for char device, the next guest /
>>>> qemu launch on the same vhost-vdpa device node won't make it work.
>>>>
>>> Maybe my sentence was not accurate, but I think we're on the same page here.
>>>
>>> Two qemu instances opening the same char device at the same time are
>>> not allowed, and vhost_vdpa_release clean all the maps. So the next
>>> qemu that opens the char device should see a clean device anyway.
>> I mean the pin can't be done at the time of char device open, where the
>> user address space is not known/bound yet. The earliest point possible
>> for pinning would be until the vhost_attach_mm() call from SET_OWNER is
>> done.
> Maybe we are deviating, let me start again.
>
> Using QEMU code, what I'm proposing is to modify the lifecycle of the
> .listener member of struct vhost_vdpa.
>
> At this moment, the memory listener is registered at
> vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
> and is unregistered in both vhost_vdpa_reset_status and
> vhost_vdpa_cleanup.
>
> My original proposal was just to move the memory listener registration
> to the last vhost_vdpa_init, and remove the unregister from
> vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
> be the same, the device should not realize this change.
This can address LM downtime latency for sure, but it won't help 
downtime during dynamic SVQ switch - which still needs to go through the 
full unmap/map cycle (that includes the slow part for pinning) from 
passthrough to SVQ mode. Be noted not every device could work with a 
separate ASID for SVQ descriptors. The fix should expect to work on 
normal vDPA vendor devices without a separate descriptor ASID, with 
platform IOMMU underneath or with on-chip IOMMU.

>
> One of the concerns was that it could delay VM initialization, and I
> didn't profile it but I think that may be the case.
Yes, that's the concern here - we should not introduce regression to 
normal VM boot process/time. In case of large VM it's very easy to see 
the side effect if we go this way.

>   I'm not sure about
> the right fix but I think the change is easy to profile. If that is
> the case, we could:
> * use a flag (listener->address_space ?) and only register the
> listener in _init if waiting for a migration, do it in _start
> otherwise.
Just doing this alone won't help SVQ mode switch downtime, see the 
reason stated above.

> * something like io_uring, where the map can be done in parallel and
> we can fail _start if some of them fails.
This can alleviate the problem somehow, but still sub-optimal and not 
scalable with larger size. I'd like zero or least pinning to be done at 
the SVQ switch or migration time.

>
>> Actually I think the counterpart vhost_detach_mm() only gets
>> handled in vhost_vdpa_release() at device close time is a resulting
>> artifact and amiss of today's vhost protocol - the opposite RESET_OWNER
>> call is not made mandatory hence only seen implemented in vhost-net
>> device today. One qemu instance could well exec(3) another new qemu
>> instance to live upgrade itself while keeping all emulated devices and
>> guest alive. The current vhost design simply prohibits this from happening.
>>
> Ok, I was not aware of this. Thanks for explaining it!
>
>>>>>     so all the guest memory can be pinned for all the guest / qemu
>>>>> lifecycle.
>>>> I think to tie pinning to guest / qemu process life cycle makes more
>>>> sense. Essentially this pinning part needs to be decoupled from the
>>>> iotlb mapping abstraction layer, and can / should work as a standalone
>>>> uAPI. Such that QEMU at the destination may launch and pin all guest's
>>>> memory as needed without having to start the device, while awaiting any
>>>> incoming migration request. Though problem is, there's no existing vhost
>>>> uAPI that could properly serve as the vehicle for that. SET_OWNER /
>>>> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
>>>> introducing a new but clean vhost uAPI for pinning guest pages, subject
>>>> to guest's life cycle?
>>>>
>>> I think that to pin or not pin memory maps should be a kernel
>>> decision, not to be driven by qemu.
>> It's kernel decision for sure. I am with this part.
>>
>>> I'm not against it if needed, but
>>> let me know if the current "clean at close" address your concerns.
>> To better facilitate QEMU exec (live update) case, I propose we add new
>> vhost uAPI pair for explicit pinning request - which would live with
>> user mm's, or more precisely qemu instance's lifecycle.
>>
> Ok I see your problem better now, but I think it should be solved at
> kernel level. Does that live update need to forcefully unpin and pin
> the memory again,
No, it should avoid the unpin&pin cycle, otherwise it'd defeat the 
downtime expectation. The exec(3)'d process should inherit the page 
pinning and/or mlock accounting from the original QEMU process, while 
keeping original page pinning intact. Physical page mappings for DMA can 
be kept as is to avoid the need of reprogramming device, though in this 
case the existing vhost iotlb entries should be updated to reflect the 
new HVA in the exec(3)'d QEMU process.

>   or that is just a consequence of how it works the
> memory listener right now?
>
> Why not extend the RESET_OWNER to the rest of devices? It seems the
> most natural way to me.
Not sure, I think RESET_OWNER might be too heavy weighted to implement 
live update, and people are not clear what the exact semantics are by 
using it (which part of the device state is being reset, and how 
much)... In addition, people working on iommufd intended to make this a 
"one-shot" ioctl e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:

https://lore.kernel.org/kvm/Y5Ibvv9PNMifi0NF@ziepe.ca/

New uAPI to just change ownership of mm seems a better fit to me...

Thanks,
-Siwei

>
> Thanks!
>
>
>>>> Another concern is the use_va stuff, originally it tags to the device
>>>> level and is made static at the time of device instantiation, which is
>>>> fine. But others to come just find a new home at per-group level or
>>>> per-vq level struct. Hard to tell whether or not pinning is actually
>>>> needed for the latter use_va friends, as they are essentially tied to
>>>> the virtio life cycle or feature negotiation. While guest / Qemu starts
>>>> way earlier than that. Perhaps just ignore those sub-device level use_va
>>>> usages? Presumably !use_va at the device level is sufficient to infer
>>>> the need of pinning for device?
>>>>
>>> I don't follow this. But I have the feeling that the subject of my
>>> original mail is way more accurate if I would have said just "memory
>>> maps".
>> I think the iotlb layer in vhost-vdpa just provides the abstraction for
>> mapping, not pinning. Although in some case mapping implicitly relies on
>> pinning for DMA purpose, it doesn't have to tie to that in uAPI
>> semantics. We can do explicit on-demand pinning for cases for e.g.
>> warming up device at live migration destination.
>>
>>
>>> I still consider the way to fix it is to actually delegate that to the
>>> kernel vdpa, so it can choose if a particular ASID needs the pin or
>>> not. But let me know if I missed something.
>> You can disregard this for now. I will discuss that further with you
>> guys while bind_mm and per-group use_va stuffs are landed.
>>
>> Thanks!
>> -Siwei
>>
>>
>>
>>> Thanks!
>>>
>>>> Regards,
>>>> -Siwei
>>>>
>>>>
>>>>> This has two main problems:
>>>>> * At this moment the reset semantics forces the vdpa device to unmap
>>>>> all the memory. So this change needs a vhost vdpa feature flag.
>>>>> * This may increase the initialization time. Maybe we can delay it if
>>>>> qemu is not the destination of a LM. Anyway I think this should be
>>>>> done as an optimization on top.
>>>>>
>>>>> Any ideas or comments in this regard?
>>>>>
>>>>> Thanks!
>>>>>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-06-27  6:36         ` Si-Wei Liu
@ 2023-07-05 18:03           ` Eugenio Perez Martin
  2023-07-06  0:13             ` Si-Wei Liu
  0 siblings, 1 reply; 27+ messages in thread
From: Eugenio Perez Martin @ 2023-07-05 18:03 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea

On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
> > On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >> On 6/7/23 01:08, Eugenio Perez Martin wrote:
> >>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>> Sorry for reviving this old thread, I lost the best timing to follow up
> >>>> on this while I was on vacation. I have been working on this and found
> >>>> out some discrepancy, please see below.
> >>>>
> >>>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
> >>>>> Hi!
> >>>>>
> >>>>> As mentioned in the last upstream virtio-networking meeting, one of
> >>>>> the factors that adds more downtime to migration is the handling of
> >>>>> the guest memory (pin, map, etc). At this moment this handling is
> >>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> >>>>> destination device waits until all the guest memory / state is
> >>>>> migrated to start pinning all the memory.
> >>>>>
> >>>>> The proposal is to bind it to the char device life cycle (open vs
> >>>>> close),
> >>>> Hmmm, really? If it's the life cycle for char device, the next guest /
> >>>> qemu launch on the same vhost-vdpa device node won't make it work.
> >>>>
> >>> Maybe my sentence was not accurate, but I think we're on the same page here.
> >>>
> >>> Two qemu instances opening the same char device at the same time are
> >>> not allowed, and vhost_vdpa_release clean all the maps. So the next
> >>> qemu that opens the char device should see a clean device anyway.
> >> I mean the pin can't be done at the time of char device open, where the
> >> user address space is not known/bound yet. The earliest point possible
> >> for pinning would be until the vhost_attach_mm() call from SET_OWNER is
> >> done.
> > Maybe we are deviating, let me start again.
> >
> > Using QEMU code, what I'm proposing is to modify the lifecycle of the
> > .listener member of struct vhost_vdpa.
> >
> > At this moment, the memory listener is registered at
> > vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
> > and is unregistered in both vhost_vdpa_reset_status and
> > vhost_vdpa_cleanup.
> >
> > My original proposal was just to move the memory listener registration
> > to the last vhost_vdpa_init, and remove the unregister from
> > vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
> > be the same, the device should not realize this change.
> This can address LM downtime latency for sure, but it won't help
> downtime during dynamic SVQ switch - which still needs to go through the
> full unmap/map cycle (that includes the slow part for pinning) from
> passthrough to SVQ mode. Be noted not every device could work with a
> separate ASID for SVQ descriptors. The fix should expect to work on
> normal vDPA vendor devices without a separate descriptor ASID, with
> platform IOMMU underneath or with on-chip IOMMU.
>

At this moment the SVQ switch is very inefficient mapping-wise, as it
unmap all the GPA->HVA maps and overrides it. In particular, SVQ is
allocated in low regions of the iova space, and then the guest memory
is allocated in this new IOVA region incrementally.

We can optimize that if we place SVQ in a free GPA area instead. All
of the "translations" still need to be done, to ensure the guest
doesn't have access to SVQ vring. That way, qemu will not send all the
unmaps & maps, only the new ones. And vhost/vdpa does not need to call
unpin_user_page / pin_user_pages for all the guest memory.

More optimizations include the batching of the SVQ vrings.

> >
> > One of the concerns was that it could delay VM initialization, and I
> > didn't profile it but I think that may be the case.
> Yes, that's the concern here - we should not introduce regression to
> normal VM boot process/time. In case of large VM it's very easy to see
> the side effect if we go this way.
>
> >   I'm not sure about
> > the right fix but I think the change is easy to profile. If that is
> > the case, we could:
> > * use a flag (listener->address_space ?) and only register the
> > listener in _init if waiting for a migration, do it in _start
> > otherwise.
> Just doing this alone won't help SVQ mode switch downtime, see the
> reason stated above.
>
> > * something like io_uring, where the map can be done in parallel and
> > we can fail _start if some of them fails.
> This can alleviate the problem somehow, but still sub-optimal and not
> scalable with larger size. I'd like zero or least pinning to be done at
> the SVQ switch or migration time.
>

To reduce even further the pinning at SVQ time we would need to
preallocate SVQ vrings before suspending the device.

> >
> >> Actually I think the counterpart vhost_detach_mm() only gets
> >> handled in vhost_vdpa_release() at device close time is a resulting
> >> artifact and amiss of today's vhost protocol - the opposite RESET_OWNER
> >> call is not made mandatory hence only seen implemented in vhost-net
> >> device today. One qemu instance could well exec(3) another new qemu
> >> instance to live upgrade itself while keeping all emulated devices and
> >> guest alive. The current vhost design simply prohibits this from happening.
> >>
> > Ok, I was not aware of this. Thanks for explaining it!
> >
> >>>>>     so all the guest memory can be pinned for all the guest / qemu
> >>>>> lifecycle.
> >>>> I think to tie pinning to guest / qemu process life cycle makes more
> >>>> sense. Essentially this pinning part needs to be decoupled from the
> >>>> iotlb mapping abstraction layer, and can / should work as a standalone
> >>>> uAPI. Such that QEMU at the destination may launch and pin all guest's
> >>>> memory as needed without having to start the device, while awaiting any
> >>>> incoming migration request. Though problem is, there's no existing vhost
> >>>> uAPI that could properly serve as the vehicle for that. SET_OWNER /
> >>>> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
> >>>> introducing a new but clean vhost uAPI for pinning guest pages, subject
> >>>> to guest's life cycle?
> >>>>
> >>> I think that to pin or not pin memory maps should be a kernel
> >>> decision, not to be driven by qemu.
> >> It's kernel decision for sure. I am with this part.
> >>
> >>> I'm not against it if needed, but
> >>> let me know if the current "clean at close" address your concerns.
> >> To better facilitate QEMU exec (live update) case, I propose we add new
> >> vhost uAPI pair for explicit pinning request - which would live with
> >> user mm's, or more precisely qemu instance's lifecycle.
> >>
> > Ok I see your problem better now, but I think it should be solved at
> > kernel level. Does that live update need to forcefully unpin and pin
> > the memory again,
> No, it should avoid the unpin&pin cycle, otherwise it'd defeat the
> downtime expectation. The exec(3)'d process should inherit the page
> pinning and/or mlock accounting from the original QEMU process, while
> keeping original page pinning intact. Physical page mappings for DMA can
> be kept as is to avoid the need of reprogramming device, though in this
> case the existing vhost iotlb entries should be updated to reflect the
> new HVA in the exec(3)'d QEMU process.
>
> >   or that is just a consequence of how it works the
> > memory listener right now?
> >
> > Why not extend the RESET_OWNER to the rest of devices? It seems the
> > most natural way to me.
> Not sure, I think RESET_OWNER might be too heavy weighted to implement
> live update, and people are not clear what the exact semantics are by
> using it (which part of the device state is being reset, and how
> much)... In addition, people working on iommufd intended to make this a
> "one-shot" ioctl e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:
>
> https://lore.kernel.org/kvm/Y5Ibvv9PNMifi0NF@ziepe.ca/
>
> New uAPI to just change ownership of mm seems a better fit to me...
>

I'm not sure about the right solution here, but there are other
proposals to batch ioctls. But maybe creating a new one fits better.

Thanks!

> Thanks,
> -Siwei
>
> >
> > Thanks!
> >
> >
> >>>> Another concern is the use_va stuff, originally it tags to the device
> >>>> level and is made static at the time of device instantiation, which is
> >>>> fine. But others to come just find a new home at per-group level or
> >>>> per-vq level struct. Hard to tell whether or not pinning is actually
> >>>> needed for the latter use_va friends, as they are essentially tied to
> >>>> the virtio life cycle or feature negotiation. While guest / Qemu starts
> >>>> way earlier than that. Perhaps just ignore those sub-device level use_va
> >>>> usages? Presumably !use_va at the device level is sufficient to infer
> >>>> the need of pinning for device?
> >>>>
> >>> I don't follow this. But I have the feeling that the subject of my
> >>> original mail is way more accurate if I would have said just "memory
> >>> maps".
> >> I think the iotlb layer in vhost-vdpa just provides the abstraction for
> >> mapping, not pinning. Although in some case mapping implicitly relies on
> >> pinning for DMA purpose, it doesn't have to tie to that in uAPI
> >> semantics. We can do explicit on-demand pinning for cases for e.g.
> >> warming up device at live migration destination.
> >>
> >>
> >>> I still consider the way to fix it is to actually delegate that to the
> >>> kernel vdpa, so it can choose if a particular ASID needs the pin or
> >>> not. But let me know if I missed something.
> >> You can disregard this for now. I will discuss that further with you
> >> guys while bind_mm and per-group use_va stuffs are landed.
> >>
> >> Thanks!
> >> -Siwei
> >>
> >>
> >>
> >>> Thanks!
> >>>
> >>>> Regards,
> >>>> -Siwei
> >>>>
> >>>>
> >>>>> This has two main problems:
> >>>>> * At this moment the reset semantics forces the vdpa device to unmap
> >>>>> all the memory. So this change needs a vhost vdpa feature flag.
> >>>>> * This may increase the initialization time. Maybe we can delay it if
> >>>>> qemu is not the destination of a LM. Anyway I think this should be
> >>>>> done as an optimization on top.
> >>>>>
> >>>>> Any ideas or comments in this regard?
> >>>>>
> >>>>> Thanks!
> >>>>>
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-07-05 18:03           ` Eugenio Perez Martin
@ 2023-07-06  0:13             ` Si-Wei Liu
  2023-07-06  5:46               ` Eugenio Perez Martin
  0 siblings, 1 reply; 27+ messages in thread
From: Si-Wei Liu @ 2023-07-06  0:13 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea



On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote:
> On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
>>> On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>> On 6/7/23 01:08, Eugenio Perez Martin wrote:
>>>>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> Sorry for reviving this old thread, I lost the best timing to follow up
>>>>>> on this while I was on vacation. I have been working on this and found
>>>>>> out some discrepancy, please see below.
>>>>>>
>>>>>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> As mentioned in the last upstream virtio-networking meeting, one of
>>>>>>> the factors that adds more downtime to migration is the handling of
>>>>>>> the guest memory (pin, map, etc). At this moment this handling is
>>>>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
>>>>>>> destination device waits until all the guest memory / state is
>>>>>>> migrated to start pinning all the memory.
>>>>>>>
>>>>>>> The proposal is to bind it to the char device life cycle (open vs
>>>>>>> close),
>>>>>> Hmmm, really? If it's the life cycle for char device, the next guest /
>>>>>> qemu launch on the same vhost-vdpa device node won't make it work.
>>>>>>
>>>>> Maybe my sentence was not accurate, but I think we're on the same page here.
>>>>>
>>>>> Two qemu instances opening the same char device at the same time are
>>>>> not allowed, and vhost_vdpa_release clean all the maps. So the next
>>>>> qemu that opens the char device should see a clean device anyway.
>>>> I mean the pin can't be done at the time of char device open, where the
>>>> user address space is not known/bound yet. The earliest point possible
>>>> for pinning would be until the vhost_attach_mm() call from SET_OWNER is
>>>> done.
>>> Maybe we are deviating, let me start again.
>>>
>>> Using QEMU code, what I'm proposing is to modify the lifecycle of the
>>> .listener member of struct vhost_vdpa.
>>>
>>> At this moment, the memory listener is registered at
>>> vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
>>> and is unregistered in both vhost_vdpa_reset_status and
>>> vhost_vdpa_cleanup.
>>>
>>> My original proposal was just to move the memory listener registration
>>> to the last vhost_vdpa_init, and remove the unregister from
>>> vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
>>> be the same, the device should not realize this change.
>> This can address LM downtime latency for sure, but it won't help
>> downtime during dynamic SVQ switch - which still needs to go through the
>> full unmap/map cycle (that includes the slow part for pinning) from
>> passthrough to SVQ mode. Be noted not every device could work with a
>> separate ASID for SVQ descriptors. The fix should expect to work on
>> normal vDPA vendor devices without a separate descriptor ASID, with
>> platform IOMMU underneath or with on-chip IOMMU.
>>
> At this moment the SVQ switch is very inefficient mapping-wise, as it
> unmap all the GPA->HVA maps and overrides it. In particular, SVQ is
> allocated in low regions of the iova space, and then the guest memory
> is allocated in this new IOVA region incrementally.
Yep. The key to build this fast path for SVQ switching I think is to 
maintain the identity mapping for the passthrough queues so that QEMU 
can reuse the old mappings for guest memory (e.g. GIOVA identity mapped 
to GPA) while incrementally adding new mappings for SVQ vrings.

>
> We can optimize that if we place SVQ in a free GPA area instead.
Here's a question though: it might not be hard to find a free GPA range 
for the non-vIOMMU case (allocate iova from beyond the 48bit or 52bit 
ranges), but I'm not sure if easy to find a free GIOVA range for the 
vIOMMU case - particularly this has to work in the same entire 64bit 
IOVA address ranges that (for now) QEMU won't be able to "reserve" a 
specific IOVA ranges for SVQ from the vIOMMU. Do you foresee this can be 
done for every QEMU emulated vIOMMU (intel-iommu amd-iommu, arm smmu and 
virito-iommu) so that we can call it out as a generic means for SVQ 
switching optimization?

If this QEMU/vIOMMU "hack" is not universally feasible, I would rather 
build a fast path in the kernel via a new vhost IOTLB command, say 
INVALIDATE_AND_UPDATE_ALL, to atomically flush all existing 
(passthrough) mappings and update to use the SVQ ones in a single batch, 
while keeping the pages for guest memory always pinned (the kernel will 
make this decision). This doesn't expose pinning to userspace, and can 
also fix downtime issue.

>   All
> of the "translations" still need to be done, to ensure the guest
> doesn't have access to SVQ vring. That way, qemu will not send all the
> unmaps & maps, only the new ones. And vhost/vdpa does not need to call
> unpin_user_page / pin_user_pages for all the guest memory.
>
> More optimizations include the batching of the SVQ vrings.
Nods.

>
>>> One of the concerns was that it could delay VM initialization, and I
>>> didn't profile it but I think that may be the case.
>> Yes, that's the concern here - we should not introduce regression to
>> normal VM boot process/time. In case of large VM it's very easy to see
>> the side effect if we go this way.
>>
>>>    I'm not sure about
>>> the right fix but I think the change is easy to profile. If that is
>>> the case, we could:
>>> * use a flag (listener->address_space ?) and only register the
>>> listener in _init if waiting for a migration, do it in _start
>>> otherwise.
>> Just doing this alone won't help SVQ mode switch downtime, see the
>> reason stated above.
>>
>>> * something like io_uring, where the map can be done in parallel and
>>> we can fail _start if some of them fails.
>> This can alleviate the problem somehow, but still sub-optimal and not
>> scalable with larger size. I'd like zero or least pinning to be done at
>> the SVQ switch or migration time.
>>
> To reduce even further the pinning at SVQ time we would need to
> preallocate SVQ vrings before suspending the device.
Yep. Preallocate SVQ vrings at the start of migration, but before 
suspending the device. This will work under the assumption that we can 
reserve or "steal" some ranges from the GPA or GIOVA space...

Thanks,
-Siwei

>
>>>> Actually I think the counterpart vhost_detach_mm() only gets
>>>> handled in vhost_vdpa_release() at device close time is a resulting
>>>> artifact and amiss of today's vhost protocol - the opposite RESET_OWNER
>>>> call is not made mandatory hence only seen implemented in vhost-net
>>>> device today. One qemu instance could well exec(3) another new qemu
>>>> instance to live upgrade itself while keeping all emulated devices and
>>>> guest alive. The current vhost design simply prohibits this from happening.
>>>>
>>> Ok, I was not aware of this. Thanks for explaining it!
>>>
>>>>>>>      so all the guest memory can be pinned for all the guest / qemu
>>>>>>> lifecycle.
>>>>>> I think to tie pinning to guest / qemu process life cycle makes more
>>>>>> sense. Essentially this pinning part needs to be decoupled from the
>>>>>> iotlb mapping abstraction layer, and can / should work as a standalone
>>>>>> uAPI. Such that QEMU at the destination may launch and pin all guest's
>>>>>> memory as needed without having to start the device, while awaiting any
>>>>>> incoming migration request. Though problem is, there's no existing vhost
>>>>>> uAPI that could properly serve as the vehicle for that. SET_OWNER /
>>>>>> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
>>>>>> introducing a new but clean vhost uAPI for pinning guest pages, subject
>>>>>> to guest's life cycle?
>>>>>>
>>>>> I think that to pin or not pin memory maps should be a kernel
>>>>> decision, not to be driven by qemu.
>>>> It's kernel decision for sure. I am with this part.
>>>>
>>>>> I'm not against it if needed, but
>>>>> let me know if the current "clean at close" address your concerns.
>>>> To better facilitate QEMU exec (live update) case, I propose we add new
>>>> vhost uAPI pair for explicit pinning request - which would live with
>>>> user mm's, or more precisely qemu instance's lifecycle.
>>>>
>>> Ok I see your problem better now, but I think it should be solved at
>>> kernel level. Does that live update need to forcefully unpin and pin
>>> the memory again,
>> No, it should avoid the unpin&pin cycle, otherwise it'd defeat the
>> downtime expectation. The exec(3)'d process should inherit the page
>> pinning and/or mlock accounting from the original QEMU process, while
>> keeping original page pinning intact. Physical page mappings for DMA can
>> be kept as is to avoid the need of reprogramming device, though in this
>> case the existing vhost iotlb entries should be updated to reflect the
>> new HVA in the exec(3)'d QEMU process.
>>
>>>    or that is just a consequence of how it works the
>>> memory listener right now?
>>>
>>> Why not extend the RESET_OWNER to the rest of devices? It seems the
>>> most natural way to me.
>> Not sure, I think RESET_OWNER might be too heavy weighted to implement
>> live update, and people are not clear what the exact semantics are by
>> using it (which part of the device state is being reset, and how
>> much)... In addition, people working on iommufd intended to make this a
>> "one-shot" ioctl e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:
>>
>> https://lore.kernel.org/kvm/Y5Ibvv9PNMifi0NF@ziepe.ca/
>>
>> New uAPI to just change ownership of mm seems a better fit to me...
>>
> I'm not sure about the right solution here, but there are other
> proposals to batch ioctls. But maybe creating a new one fits better.
>
> Thanks!
>
>> Thanks,
>> -Siwei
>>
>>> Thanks!
>>>
>>>
>>>>>> Another concern is the use_va stuff, originally it tags to the device
>>>>>> level and is made static at the time of device instantiation, which is
>>>>>> fine. But others to come just find a new home at per-group level or
>>>>>> per-vq level struct. Hard to tell whether or not pinning is actually
>>>>>> needed for the latter use_va friends, as they are essentially tied to
>>>>>> the virtio life cycle or feature negotiation. While guest / Qemu starts
>>>>>> way earlier than that. Perhaps just ignore those sub-device level use_va
>>>>>> usages? Presumably !use_va at the device level is sufficient to infer
>>>>>> the need of pinning for device?
>>>>>>
>>>>> I don't follow this. But I have the feeling that the subject of my
>>>>> original mail is way more accurate if I would have said just "memory
>>>>> maps".
>>>> I think the iotlb layer in vhost-vdpa just provides the abstraction for
>>>> mapping, not pinning. Although in some case mapping implicitly relies on
>>>> pinning for DMA purpose, it doesn't have to tie to that in uAPI
>>>> semantics. We can do explicit on-demand pinning for cases for e.g.
>>>> warming up device at live migration destination.
>>>>
>>>>
>>>>> I still consider the way to fix it is to actually delegate that to the
>>>>> kernel vdpa, so it can choose if a particular ASID needs the pin or
>>>>> not. But let me know if I missed something.
>>>> You can disregard this for now. I will discuss that further with you
>>>> guys while bind_mm and per-group use_va stuffs are landed.
>>>>
>>>> Thanks!
>>>> -Siwei
>>>>
>>>>
>>>>
>>>>> Thanks!
>>>>>
>>>>>> Regards,
>>>>>> -Siwei
>>>>>>
>>>>>>
>>>>>>> This has two main problems:
>>>>>>> * At this moment the reset semantics forces the vdpa device to unmap
>>>>>>> all the memory. So this change needs a vhost vdpa feature flag.
>>>>>>> * This may increase the initialization time. Maybe we can delay it if
>>>>>>> qemu is not the destination of a LM. Anyway I think this should be
>>>>>>> done as an optimization on top.
>>>>>>>
>>>>>>> Any ideas or comments in this regard?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-07-06  0:13             ` Si-Wei Liu
@ 2023-07-06  5:46               ` Eugenio Perez Martin
  2023-07-08  9:14                 ` Si-Wei Liu
  0 siblings, 1 reply; 27+ messages in thread
From: Eugenio Perez Martin @ 2023-07-06  5:46 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea

On Thu, Jul 6, 2023 at 2:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote:
> > On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
> >>> On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>> On 6/7/23 01:08, Eugenio Perez Martin wrote:
> >>>>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> Sorry for reviving this old thread, I lost the best timing to follow up
> >>>>>> on this while I was on vacation. I have been working on this and found
> >>>>>> out some discrepancy, please see below.
> >>>>>>
> >>>>>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
> >>>>>>> Hi!
> >>>>>>>
> >>>>>>> As mentioned in the last upstream virtio-networking meeting, one of
> >>>>>>> the factors that adds more downtime to migration is the handling of
> >>>>>>> the guest memory (pin, map, etc). At this moment this handling is
> >>>>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> >>>>>>> destination device waits until all the guest memory / state is
> >>>>>>> migrated to start pinning all the memory.
> >>>>>>>
> >>>>>>> The proposal is to bind it to the char device life cycle (open vs
> >>>>>>> close),
> >>>>>> Hmmm, really? If it's the life cycle for char device, the next guest /
> >>>>>> qemu launch on the same vhost-vdpa device node won't make it work.
> >>>>>>
> >>>>> Maybe my sentence was not accurate, but I think we're on the same page here.
> >>>>>
> >>>>> Two qemu instances opening the same char device at the same time are
> >>>>> not allowed, and vhost_vdpa_release clean all the maps. So the next
> >>>>> qemu that opens the char device should see a clean device anyway.
> >>>> I mean the pin can't be done at the time of char device open, where the
> >>>> user address space is not known/bound yet. The earliest point possible
> >>>> for pinning would be until the vhost_attach_mm() call from SET_OWNER is
> >>>> done.
> >>> Maybe we are deviating, let me start again.
> >>>
> >>> Using QEMU code, what I'm proposing is to modify the lifecycle of the
> >>> .listener member of struct vhost_vdpa.
> >>>
> >>> At this moment, the memory listener is registered at
> >>> vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
> >>> and is unregistered in both vhost_vdpa_reset_status and
> >>> vhost_vdpa_cleanup.
> >>>
> >>> My original proposal was just to move the memory listener registration
> >>> to the last vhost_vdpa_init, and remove the unregister from
> >>> vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
> >>> be the same, the device should not realize this change.
> >> This can address LM downtime latency for sure, but it won't help
> >> downtime during dynamic SVQ switch - which still needs to go through the
> >> full unmap/map cycle (that includes the slow part for pinning) from
> >> passthrough to SVQ mode. Be noted not every device could work with a
> >> separate ASID for SVQ descriptors. The fix should expect to work on
> >> normal vDPA vendor devices without a separate descriptor ASID, with
> >> platform IOMMU underneath or with on-chip IOMMU.
> >>
> > At this moment the SVQ switch is very inefficient mapping-wise, as it
> > unmap all the GPA->HVA maps and overrides it. In particular, SVQ is
> > allocated in low regions of the iova space, and then the guest memory
> > is allocated in this new IOVA region incrementally.
> Yep. The key to build this fast path for SVQ switching I think is to
> maintain the identity mapping for the passthrough queues so that QEMU
> can reuse the old mappings for guest memory (e.g. GIOVA identity mapped
> to GPA) while incrementally adding new mappings for SVQ vrings.
>
> >
> > We can optimize that if we place SVQ in a free GPA area instead.
> Here's a question though: it might not be hard to find a free GPA range
> for the non-vIOMMU case (allocate iova from beyond the 48bit or 52bit
> ranges), but I'm not sure if easy to find a free GIOVA range for the
> vIOMMU case - particularly this has to work in the same entire 64bit
> IOVA address ranges that (for now) QEMU won't be able to "reserve" a
> specific IOVA ranges for SVQ from the vIOMMU. Do you foresee this can be
> done for every QEMU emulated vIOMMU (intel-iommu amd-iommu, arm smmu and
> virito-iommu) so that we can call it out as a generic means for SVQ
> switching optimization?
>

In the case vIOMMU allocates a new block we will use the same algorithm as now:
* Find a new free IOVA chunk of the same size
* Map this new SVQ IOVA, that may or may not be the same as SVQ

Since we must go through the translation phase to sanitize guest's
available descriptors anyway, it has zero added cost.

Another option would be to move the SVQ vring to a new region, but I
don't see any advantage on maintaining 1:1 mapping at that point.

> If this QEMU/vIOMMU "hack" is not universally feasible, I would rather
> build a fast path in the kernel via a new vhost IOTLB command, say
> INVALIDATE_AND_UPDATE_ALL, to atomically flush all existing
> (passthrough) mappings and update to use the SVQ ones in a single batch,
> while keeping the pages for guest memory always pinned (the kernel will
> make this decision). This doesn't expose pinning to userspace, and can
> also fix downtime issue.
>
> >   All
> > of the "translations" still need to be done, to ensure the guest
> > doesn't have access to SVQ vring. That way, qemu will not send all the
> > unmaps & maps, only the new ones. And vhost/vdpa does not need to call
> > unpin_user_page / pin_user_pages for all the guest memory.
> >
> > More optimizations include the batching of the SVQ vrings.
> Nods.
>
> >
> >>> One of the concerns was that it could delay VM initialization, and I
> >>> didn't profile it but I think that may be the case.
> >> Yes, that's the concern here - we should not introduce regression to
> >> normal VM boot process/time. In case of large VM it's very easy to see
> >> the side effect if we go this way.
> >>
> >>>    I'm not sure about
> >>> the right fix but I think the change is easy to profile. If that is
> >>> the case, we could:
> >>> * use a flag (listener->address_space ?) and only register the
> >>> listener in _init if waiting for a migration, do it in _start
> >>> otherwise.
> >> Just doing this alone won't help SVQ mode switch downtime, see the
> >> reason stated above.
> >>
> >>> * something like io_uring, where the map can be done in parallel and
> >>> we can fail _start if some of them fails.
> >> This can alleviate the problem somehow, but still sub-optimal and not
> >> scalable with larger size. I'd like zero or least pinning to be done at
> >> the SVQ switch or migration time.
> >>
> > To reduce even further the pinning at SVQ time we would need to
> > preallocate SVQ vrings before suspending the device.
> Yep. Preallocate SVQ vrings at the start of migration, but before
> suspending the device. This will work under the assumption that we can
> reserve or "steal" some ranges from the GPA or GIOVA space...
>
> Thanks,
> -Siwei
>
> >
> >>>> Actually I think the counterpart vhost_detach_mm() only gets
> >>>> handled in vhost_vdpa_release() at device close time is a resulting
> >>>> artifact and amiss of today's vhost protocol - the opposite RESET_OWNER
> >>>> call is not made mandatory hence only seen implemented in vhost-net
> >>>> device today. One qemu instance could well exec(3) another new qemu
> >>>> instance to live upgrade itself while keeping all emulated devices and
> >>>> guest alive. The current vhost design simply prohibits this from happening.
> >>>>
> >>> Ok, I was not aware of this. Thanks for explaining it!
> >>>
> >>>>>>>      so all the guest memory can be pinned for all the guest / qemu
> >>>>>>> lifecycle.
> >>>>>> I think to tie pinning to guest / qemu process life cycle makes more
> >>>>>> sense. Essentially this pinning part needs to be decoupled from the
> >>>>>> iotlb mapping abstraction layer, and can / should work as a standalone
> >>>>>> uAPI. Such that QEMU at the destination may launch and pin all guest's
> >>>>>> memory as needed without having to start the device, while awaiting any
> >>>>>> incoming migration request. Though problem is, there's no existing vhost
> >>>>>> uAPI that could properly serve as the vehicle for that. SET_OWNER /
> >>>>>> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
> >>>>>> introducing a new but clean vhost uAPI for pinning guest pages, subject
> >>>>>> to guest's life cycle?
> >>>>>>
> >>>>> I think that to pin or not pin memory maps should be a kernel
> >>>>> decision, not to be driven by qemu.
> >>>> It's kernel decision for sure. I am with this part.
> >>>>
> >>>>> I'm not against it if needed, but
> >>>>> let me know if the current "clean at close" address your concerns.
> >>>> To better facilitate QEMU exec (live update) case, I propose we add new
> >>>> vhost uAPI pair for explicit pinning request - which would live with
> >>>> user mm's, or more precisely qemu instance's lifecycle.
> >>>>
> >>> Ok I see your problem better now, but I think it should be solved at
> >>> kernel level. Does that live update need to forcefully unpin and pin
> >>> the memory again,
> >> No, it should avoid the unpin&pin cycle, otherwise it'd defeat the
> >> downtime expectation. The exec(3)'d process should inherit the page
> >> pinning and/or mlock accounting from the original QEMU process, while
> >> keeping original page pinning intact. Physical page mappings for DMA can
> >> be kept as is to avoid the need of reprogramming device, though in this
> >> case the existing vhost iotlb entries should be updated to reflect the
> >> new HVA in the exec(3)'d QEMU process.
> >>
> >>>    or that is just a consequence of how it works the
> >>> memory listener right now?
> >>>
> >>> Why not extend the RESET_OWNER to the rest of devices? It seems the
> >>> most natural way to me.
> >> Not sure, I think RESET_OWNER might be too heavy weighted to implement
> >> live update, and people are not clear what the exact semantics are by
> >> using it (which part of the device state is being reset, and how
> >> much)... In addition, people working on iommufd intended to make this a
> >> "one-shot" ioctl e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:
> >>
> >> https://lore.kernel.org/kvm/Y5Ibvv9PNMifi0NF@ziepe.ca/
> >>
> >> New uAPI to just change ownership of mm seems a better fit to me...
> >>
> > I'm not sure about the right solution here, but there are other
> > proposals to batch ioctls. But maybe creating a new one fits better.
> >
> > Thanks!
> >
> >> Thanks,
> >> -Siwei
> >>
> >>> Thanks!
> >>>
> >>>
> >>>>>> Another concern is the use_va stuff, originally it tags to the device
> >>>>>> level and is made static at the time of device instantiation, which is
> >>>>>> fine. But others to come just find a new home at per-group level or
> >>>>>> per-vq level struct. Hard to tell whether or not pinning is actually
> >>>>>> needed for the latter use_va friends, as they are essentially tied to
> >>>>>> the virtio life cycle or feature negotiation. While guest / Qemu starts
> >>>>>> way earlier than that. Perhaps just ignore those sub-device level use_va
> >>>>>> usages? Presumably !use_va at the device level is sufficient to infer
> >>>>>> the need of pinning for device?
> >>>>>>
> >>>>> I don't follow this. But I have the feeling that the subject of my
> >>>>> original mail is way more accurate if I would have said just "memory
> >>>>> maps".
> >>>> I think the iotlb layer in vhost-vdpa just provides the abstraction for
> >>>> mapping, not pinning. Although in some case mapping implicitly relies on
> >>>> pinning for DMA purpose, it doesn't have to tie to that in uAPI
> >>>> semantics. We can do explicit on-demand pinning for cases for e.g.
> >>>> warming up device at live migration destination.
> >>>>
> >>>>
> >>>>> I still consider the way to fix it is to actually delegate that to the
> >>>>> kernel vdpa, so it can choose if a particular ASID needs the pin or
> >>>>> not. But let me know if I missed something.
> >>>> You can disregard this for now. I will discuss that further with you
> >>>> guys while bind_mm and per-group use_va stuffs are landed.
> >>>>
> >>>> Thanks!
> >>>> -Siwei
> >>>>
> >>>>
> >>>>
> >>>>> Thanks!
> >>>>>
> >>>>>> Regards,
> >>>>>> -Siwei
> >>>>>>
> >>>>>>
> >>>>>>> This has two main problems:
> >>>>>>> * At this moment the reset semantics forces the vdpa device to unmap
> >>>>>>> all the memory. So this change needs a vhost vdpa feature flag.
> >>>>>>> * This may increase the initialization time. Maybe we can delay it if
> >>>>>>> qemu is not the destination of a LM. Anyway I think this should be
> >>>>>>> done as an optimization on top.
> >>>>>>>
> >>>>>>> Any ideas or comments in this regard?
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-07-06  5:46               ` Eugenio Perez Martin
@ 2023-07-08  9:14                 ` Si-Wei Liu
  2023-07-10  6:04                   ` Eugenio Perez Martin
  0 siblings, 1 reply; 27+ messages in thread
From: Si-Wei Liu @ 2023-07-08  9:14 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea



On 7/5/2023 10:46 PM, Eugenio Perez Martin wrote:
> On Thu, Jul 6, 2023 at 2:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote:
>>> On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
>>>>> On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> On 6/7/23 01:08, Eugenio Perez Martin wrote:
>>>>>>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>> Sorry for reviving this old thread, I lost the best timing to follow up
>>>>>>>> on this while I was on vacation. I have been working on this and found
>>>>>>>> out some discrepancy, please see below.
>>>>>>>>
>>>>>>>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> As mentioned in the last upstream virtio-networking meeting, one of
>>>>>>>>> the factors that adds more downtime to migration is the handling of
>>>>>>>>> the guest memory (pin, map, etc). At this moment this handling is
>>>>>>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
>>>>>>>>> destination device waits until all the guest memory / state is
>>>>>>>>> migrated to start pinning all the memory.
>>>>>>>>>
>>>>>>>>> The proposal is to bind it to the char device life cycle (open vs
>>>>>>>>> close),
>>>>>>>> Hmmm, really? If it's the life cycle for char device, the next guest /
>>>>>>>> qemu launch on the same vhost-vdpa device node won't make it work.
>>>>>>>>
>>>>>>> Maybe my sentence was not accurate, but I think we're on the same page here.
>>>>>>>
>>>>>>> Two qemu instances opening the same char device at the same time are
>>>>>>> not allowed, and vhost_vdpa_release clean all the maps. So the next
>>>>>>> qemu that opens the char device should see a clean device anyway.
>>>>>> I mean the pin can't be done at the time of char device open, where the
>>>>>> user address space is not known/bound yet. The earliest point possible
>>>>>> for pinning would be until the vhost_attach_mm() call from SET_OWNER is
>>>>>> done.
>>>>> Maybe we are deviating, let me start again.
>>>>>
>>>>> Using QEMU code, what I'm proposing is to modify the lifecycle of the
>>>>> .listener member of struct vhost_vdpa.
>>>>>
>>>>> At this moment, the memory listener is registered at
>>>>> vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
>>>>> and is unregistered in both vhost_vdpa_reset_status and
>>>>> vhost_vdpa_cleanup.
>>>>>
>>>>> My original proposal was just to move the memory listener registration
>>>>> to the last vhost_vdpa_init, and remove the unregister from
>>>>> vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
>>>>> be the same, the device should not realize this change.
>>>> This can address LM downtime latency for sure, but it won't help
>>>> downtime during dynamic SVQ switch - which still needs to go through the
>>>> full unmap/map cycle (that includes the slow part for pinning) from
>>>> passthrough to SVQ mode. Be noted not every device could work with a
>>>> separate ASID for SVQ descriptors. The fix should expect to work on
>>>> normal vDPA vendor devices without a separate descriptor ASID, with
>>>> platform IOMMU underneath or with on-chip IOMMU.
>>>>
>>> At this moment the SVQ switch is very inefficient mapping-wise, as it
>>> unmap all the GPA->HVA maps and overrides it. In particular, SVQ is
>>> allocated in low regions of the iova space, and then the guest memory
>>> is allocated in this new IOVA region incrementally.
>> Yep. The key to build this fast path for SVQ switching I think is to
>> maintain the identity mapping for the passthrough queues so that QEMU
>> can reuse the old mappings for guest memory (e.g. GIOVA identity mapped
>> to GPA) while incrementally adding new mappings for SVQ vrings.
>>
>>> We can optimize that if we place SVQ in a free GPA area instead.
>> Here's a question though: it might not be hard to find a free GPA range
>> for the non-vIOMMU case (allocate iova from beyond the 48bit or 52bit
>> ranges), but I'm not sure if easy to find a free GIOVA range for the
>> vIOMMU case - particularly this has to work in the same entire 64bit
>> IOVA address ranges that (for now) QEMU won't be able to "reserve" a
>> specific IOVA ranges for SVQ from the vIOMMU. Do you foresee this can be
>> done for every QEMU emulated vIOMMU (intel-iommu amd-iommu, arm smmu and
>> virito-iommu) so that we can call it out as a generic means for SVQ
>> switching optimization?
>>
> In the case vIOMMU allocates a new block we will use the same algorithm as now:
> * Find a new free IOVA chunk of the same size
> * Map this new SVQ IOVA, that may or may not be the same as SVQ
>
> Since we must go through the translation phase to sanitize guest's
> available descriptors anyway, it has zero added cost.
Not sure I followed, this can work but doesn't seem able to reuse the 
old host kernel mappings for guest memory, hence still requires remap of 
the entire host IOVA ranges when SVQ IOVA comes along. I think by 
maintaining 1:1 identity map on guest memory, we don't have to bother 
tearing down existing HVA->HPA mappings in kernel thus save the 
expensive pinning calls at large. I don't clearly see under this scheme, 
how the new SVQ IOVA may work with potential conflict on IOVA space from 
hotplugged memory - in this case the 1:1 IOVA->GPA identity guest memory 
mapping can't be kept.

> Another option would be to move the SVQ vring to a new region, but I
> don't see any advantage on maintaining 1:1 mapping at that point.
See above. For pinning avoidance point of view (i.e. QEMU software 
optimization on SVQ switching downtime), I think it's crucial to 
maintain 1:1 mapping and move SVQ vring to a specific region. But I'm 
not sure how complex it will be for QEMU to get it implemented in a 
clean way.

Thanks,
-Siwei

>
>> If this QEMU/vIOMMU "hack" is not universally feasible, I would rather
>> build a fast path in the kernel via a new vhost IOTLB command, say
>> INVALIDATE_AND_UPDATE_ALL, to atomically flush all existing
>> (passthrough) mappings and update to use the SVQ ones in a single batch,
>> while keeping the pages for guest memory always pinned (the kernel will
>> make this decision). This doesn't expose pinning to userspace, and can
>> also fix downtime issue.
>>
>>>    All
>>> of the "translations" still need to be done, to ensure the guest
>>> doesn't have access to SVQ vring. That way, qemu will not send all the
>>> unmaps & maps, only the new ones. And vhost/vdpa does not need to call
>>> unpin_user_page / pin_user_pages for all the guest memory.
>>>
>>> More optimizations include the batching of the SVQ vrings.
>> Nods.
>>
>>>>> One of the concerns was that it could delay VM initialization, and I
>>>>> didn't profile it but I think that may be the case.
>>>> Yes, that's the concern here - we should not introduce regression to
>>>> normal VM boot process/time. In case of large VM it's very easy to see
>>>> the side effect if we go this way.
>>>>
>>>>>     I'm not sure about
>>>>> the right fix but I think the change is easy to profile. If that is
>>>>> the case, we could:
>>>>> * use a flag (listener->address_space ?) and only register the
>>>>> listener in _init if waiting for a migration, do it in _start
>>>>> otherwise.
>>>> Just doing this alone won't help SVQ mode switch downtime, see the
>>>> reason stated above.
>>>>
>>>>> * something like io_uring, where the map can be done in parallel and
>>>>> we can fail _start if some of them fails.
>>>> This can alleviate the problem somehow, but still sub-optimal and not
>>>> scalable with larger size. I'd like zero or least pinning to be done at
>>>> the SVQ switch or migration time.
>>>>
>>> To reduce even further the pinning at SVQ time we would need to
>>> preallocate SVQ vrings before suspending the device.
>> Yep. Preallocate SVQ vrings at the start of migration, but before
>> suspending the device. This will work under the assumption that we can
>> reserve or "steal" some ranges from the GPA or GIOVA space...
>>
>> Thanks,
>> -Siwei
>>
>>>>>> Actually I think the counterpart vhost_detach_mm() only gets
>>>>>> handled in vhost_vdpa_release() at device close time is a resulting
>>>>>> artifact and amiss of today's vhost protocol - the opposite RESET_OWNER
>>>>>> call is not made mandatory hence only seen implemented in vhost-net
>>>>>> device today. One qemu instance could well exec(3) another new qemu
>>>>>> instance to live upgrade itself while keeping all emulated devices and
>>>>>> guest alive. The current vhost design simply prohibits this from happening.
>>>>>>
>>>>> Ok, I was not aware of this. Thanks for explaining it!
>>>>>
>>>>>>>>>       so all the guest memory can be pinned for all the guest / qemu
>>>>>>>>> lifecycle.
>>>>>>>> I think to tie pinning to guest / qemu process life cycle makes more
>>>>>>>> sense. Essentially this pinning part needs to be decoupled from the
>>>>>>>> iotlb mapping abstraction layer, and can / should work as a standalone
>>>>>>>> uAPI. Such that QEMU at the destination may launch and pin all guest's
>>>>>>>> memory as needed without having to start the device, while awaiting any
>>>>>>>> incoming migration request. Though problem is, there's no existing vhost
>>>>>>>> uAPI that could properly serve as the vehicle for that. SET_OWNER /
>>>>>>>> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
>>>>>>>> introducing a new but clean vhost uAPI for pinning guest pages, subject
>>>>>>>> to guest's life cycle?
>>>>>>>>
>>>>>>> I think that to pin or not pin memory maps should be a kernel
>>>>>>> decision, not to be driven by qemu.
>>>>>> It's kernel decision for sure. I am with this part.
>>>>>>
>>>>>>> I'm not against it if needed, but
>>>>>>> let me know if the current "clean at close" address your concerns.
>>>>>> To better facilitate QEMU exec (live update) case, I propose we add new
>>>>>> vhost uAPI pair for explicit pinning request - which would live with
>>>>>> user mm's, or more precisely qemu instance's lifecycle.
>>>>>>
>>>>> Ok I see your problem better now, but I think it should be solved at
>>>>> kernel level. Does that live update need to forcefully unpin and pin
>>>>> the memory again,
>>>> No, it should avoid the unpin&pin cycle, otherwise it'd defeat the
>>>> downtime expectation. The exec(3)'d process should inherit the page
>>>> pinning and/or mlock accounting from the original QEMU process, while
>>>> keeping original page pinning intact. Physical page mappings for DMA can
>>>> be kept as is to avoid the need of reprogramming device, though in this
>>>> case the existing vhost iotlb entries should be updated to reflect the
>>>> new HVA in the exec(3)'d QEMU process.
>>>>
>>>>>     or that is just a consequence of how it works the
>>>>> memory listener right now?
>>>>>
>>>>> Why not extend the RESET_OWNER to the rest of devices? It seems the
>>>>> most natural way to me.
>>>> Not sure, I think RESET_OWNER might be too heavy weighted to implement
>>>> live update, and people are not clear what the exact semantics are by
>>>> using it (which part of the device state is being reset, and how
>>>> much)... In addition, people working on iommufd intended to make this a
>>>> "one-shot" ioctl e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:
>>>>
>>>> https://lore.kernel.org/kvm/Y5Ibvv9PNMifi0NF@ziepe.ca/
>>>>
>>>> New uAPI to just change ownership of mm seems a better fit to me...
>>>>
>>> I'm not sure about the right solution here, but there are other
>>> proposals to batch ioctls. But maybe creating a new one fits better.
>>>
>>> Thanks!
>>>
>>>> Thanks,
>>>> -Siwei
>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>>>> Another concern is the use_va stuff, originally it tags to the device
>>>>>>>> level and is made static at the time of device instantiation, which is
>>>>>>>> fine. But others to come just find a new home at per-group level or
>>>>>>>> per-vq level struct. Hard to tell whether or not pinning is actually
>>>>>>>> needed for the latter use_va friends, as they are essentially tied to
>>>>>>>> the virtio life cycle or feature negotiation. While guest / Qemu starts
>>>>>>>> way earlier than that. Perhaps just ignore those sub-device level use_va
>>>>>>>> usages? Presumably !use_va at the device level is sufficient to infer
>>>>>>>> the need of pinning for device?
>>>>>>>>
>>>>>>> I don't follow this. But I have the feeling that the subject of my
>>>>>>> original mail is way more accurate if I would have said just "memory
>>>>>>> maps".
>>>>>> I think the iotlb layer in vhost-vdpa just provides the abstraction for
>>>>>> mapping, not pinning. Although in some case mapping implicitly relies on
>>>>>> pinning for DMA purpose, it doesn't have to tie to that in uAPI
>>>>>> semantics. We can do explicit on-demand pinning for cases for e.g.
>>>>>> warming up device at live migration destination.
>>>>>>
>>>>>>
>>>>>>> I still consider the way to fix it is to actually delegate that to the
>>>>>>> kernel vdpa, so it can choose if a particular ASID needs the pin or
>>>>>>> not. But let me know if I missed something.
>>>>>> You can disregard this for now. I will discuss that further with you
>>>>>> guys while bind_mm and per-group use_va stuffs are landed.
>>>>>>
>>>>>> Thanks!
>>>>>> -Siwei
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> -Siwei
>>>>>>>>
>>>>>>>>
>>>>>>>>> This has two main problems:
>>>>>>>>> * At this moment the reset semantics forces the vdpa device to unmap
>>>>>>>>> all the memory. So this change needs a vhost vdpa feature flag.
>>>>>>>>> * This may increase the initialization time. Maybe we can delay it if
>>>>>>>>> qemu is not the destination of a LM. Anyway I think this should be
>>>>>>>>> done as an optimization on top.
>>>>>>>>>
>>>>>>>>> Any ideas or comments in this regard?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-07-08  9:14                 ` Si-Wei Liu
@ 2023-07-10  6:04                   ` Eugenio Perez Martin
  2023-07-17 19:56                     ` Si-Wei Liu
  0 siblings, 1 reply; 27+ messages in thread
From: Eugenio Perez Martin @ 2023-07-10  6:04 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea

On Sat, Jul 8, 2023 at 11:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 7/5/2023 10:46 PM, Eugenio Perez Martin wrote:
> > On Thu, Jul 6, 2023 at 2:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote:
> >>> On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
> >>>>> On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> On 6/7/23 01:08, Eugenio Perez Martin wrote:
> >>>>>>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>> Sorry for reviving this old thread, I lost the best timing to follow up
> >>>>>>>> on this while I was on vacation. I have been working on this and found
> >>>>>>>> out some discrepancy, please see below.
> >>>>>>>>
> >>>>>>>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
> >>>>>>>>> Hi!
> >>>>>>>>>
> >>>>>>>>> As mentioned in the last upstream virtio-networking meeting, one of
> >>>>>>>>> the factors that adds more downtime to migration is the handling of
> >>>>>>>>> the guest memory (pin, map, etc). At this moment this handling is
> >>>>>>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> >>>>>>>>> destination device waits until all the guest memory / state is
> >>>>>>>>> migrated to start pinning all the memory.
> >>>>>>>>>
> >>>>>>>>> The proposal is to bind it to the char device life cycle (open vs
> >>>>>>>>> close),
> >>>>>>>> Hmmm, really? If it's the life cycle for char device, the next guest /
> >>>>>>>> qemu launch on the same vhost-vdpa device node won't make it work.
> >>>>>>>>
> >>>>>>> Maybe my sentence was not accurate, but I think we're on the same page here.
> >>>>>>>
> >>>>>>> Two qemu instances opening the same char device at the same time are
> >>>>>>> not allowed, and vhost_vdpa_release clean all the maps. So the next
> >>>>>>> qemu that opens the char device should see a clean device anyway.
> >>>>>> I mean the pin can't be done at the time of char device open, where the
> >>>>>> user address space is not known/bound yet. The earliest point possible
> >>>>>> for pinning would be until the vhost_attach_mm() call from SET_OWNER is
> >>>>>> done.
> >>>>> Maybe we are deviating, let me start again.
> >>>>>
> >>>>> Using QEMU code, what I'm proposing is to modify the lifecycle of the
> >>>>> .listener member of struct vhost_vdpa.
> >>>>>
> >>>>> At this moment, the memory listener is registered at
> >>>>> vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
> >>>>> and is unregistered in both vhost_vdpa_reset_status and
> >>>>> vhost_vdpa_cleanup.
> >>>>>
> >>>>> My original proposal was just to move the memory listener registration
> >>>>> to the last vhost_vdpa_init, and remove the unregister from
> >>>>> vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
> >>>>> be the same, the device should not realize this change.
> >>>> This can address LM downtime latency for sure, but it won't help
> >>>> downtime during dynamic SVQ switch - which still needs to go through the
> >>>> full unmap/map cycle (that includes the slow part for pinning) from
> >>>> passthrough to SVQ mode. Be noted not every device could work with a
> >>>> separate ASID for SVQ descriptors. The fix should expect to work on
> >>>> normal vDPA vendor devices without a separate descriptor ASID, with
> >>>> platform IOMMU underneath or with on-chip IOMMU.
> >>>>
> >>> At this moment the SVQ switch is very inefficient mapping-wise, as it
> >>> unmap all the GPA->HVA maps and overrides it. In particular, SVQ is
> >>> allocated in low regions of the iova space, and then the guest memory
> >>> is allocated in this new IOVA region incrementally.
> >> Yep. The key to build this fast path for SVQ switching I think is to
> >> maintain the identity mapping for the passthrough queues so that QEMU
> >> can reuse the old mappings for guest memory (e.g. GIOVA identity mapped
> >> to GPA) while incrementally adding new mappings for SVQ vrings.
> >>
> >>> We can optimize that if we place SVQ in a free GPA area instead.
> >> Here's a question though: it might not be hard to find a free GPA range
> >> for the non-vIOMMU case (allocate iova from beyond the 48bit or 52bit
> >> ranges), but I'm not sure if easy to find a free GIOVA range for the
> >> vIOMMU case - particularly this has to work in the same entire 64bit
> >> IOVA address ranges that (for now) QEMU won't be able to "reserve" a
> >> specific IOVA ranges for SVQ from the vIOMMU. Do you foresee this can be
> >> done for every QEMU emulated vIOMMU (intel-iommu amd-iommu, arm smmu and
> >> virito-iommu) so that we can call it out as a generic means for SVQ
> >> switching optimization?
> >>
> > In the case vIOMMU allocates a new block we will use the same algorithm as now:
> > * Find a new free IOVA chunk of the same size
> > * Map this new SVQ IOVA, that may or may not be the same as SVQ
> >
> > Since we must go through the translation phase to sanitize guest's
> > available descriptors anyway, it has zero added cost.
> Not sure I followed, this can work but doesn't seem able to reuse the
> old host kernel mappings for guest memory, hence still requires remap of
> the entire host IOVA ranges when SVQ IOVA comes along. I think by
> maintaining 1:1 identity map on guest memory, we don't have to bother
> tearing down existing HVA->HPA mappings in kernel thus save the
> expensive pinning calls at large. I don't clearly see under this scheme,
> how the new SVQ IOVA may work with potential conflict on IOVA space from
> hotplugged memory - in this case the 1:1 IOVA->GPA identity guest memory
> mapping can't be kept.
>

There is no need to maintain the 1:1 for memory mapped after the
pinning. The bigger reason to maintain them is to reduce the downtime
because of pinning. After that, we can reuse the method we're using at
this moment, looking for a new free hole for the new map. ew only need
to pin the new map.

> > Another option would be to move the SVQ vring to a new region, but I
> > don't see any advantage on maintaining 1:1 mapping at that point.
> See above. For pinning avoidance point of view (i.e. QEMU software
> optimization on SVQ switching downtime), I think it's crucial to
> maintain 1:1 mapping and move SVQ vring to a specific region. But I'm
> not sure how complex it will be for QEMU to get it implemented in a
> clean way.
>
> Thanks,
> -Siwei
>
> >
> >> If this QEMU/vIOMMU "hack" is not universally feasible, I would rather
> >> build a fast path in the kernel via a new vhost IOTLB command, say
> >> INVALIDATE_AND_UPDATE_ALL, to atomically flush all existing
> >> (passthrough) mappings and update to use the SVQ ones in a single batch,
> >> while keeping the pages for guest memory always pinned (the kernel will
> >> make this decision). This doesn't expose pinning to userspace, and can
> >> also fix downtime issue.
> >>
> >>>    All
> >>> of the "translations" still need to be done, to ensure the guest
> >>> doesn't have access to SVQ vring. That way, qemu will not send all the
> >>> unmaps & maps, only the new ones. And vhost/vdpa does not need to call
> >>> unpin_user_page / pin_user_pages for all the guest memory.
> >>>
> >>> More optimizations include the batching of the SVQ vrings.
> >> Nods.
> >>
> >>>>> One of the concerns was that it could delay VM initialization, and I
> >>>>> didn't profile it but I think that may be the case.
> >>>> Yes, that's the concern here - we should not introduce regression to
> >>>> normal VM boot process/time. In case of large VM it's very easy to see
> >>>> the side effect if we go this way.
> >>>>
> >>>>>     I'm not sure about
> >>>>> the right fix but I think the change is easy to profile. If that is
> >>>>> the case, we could:
> >>>>> * use a flag (listener->address_space ?) and only register the
> >>>>> listener in _init if waiting for a migration, do it in _start
> >>>>> otherwise.
> >>>> Just doing this alone won't help SVQ mode switch downtime, see the
> >>>> reason stated above.
> >>>>
> >>>>> * something like io_uring, where the map can be done in parallel and
> >>>>> we can fail _start if some of them fails.
> >>>> This can alleviate the problem somehow, but still sub-optimal and not
> >>>> scalable with larger size. I'd like zero or least pinning to be done at
> >>>> the SVQ switch or migration time.
> >>>>
> >>> To reduce even further the pinning at SVQ time we would need to
> >>> preallocate SVQ vrings before suspending the device.
> >> Yep. Preallocate SVQ vrings at the start of migration, but before
> >> suspending the device. This will work under the assumption that we can
> >> reserve or "steal" some ranges from the GPA or GIOVA space...
> >>
> >> Thanks,
> >> -Siwei
> >>
> >>>>>> Actually I think the counterpart vhost_detach_mm() only gets
> >>>>>> handled in vhost_vdpa_release() at device close time is a resulting
> >>>>>> artifact and amiss of today's vhost protocol - the opposite RESET_OWNER
> >>>>>> call is not made mandatory hence only seen implemented in vhost-net
> >>>>>> device today. One qemu instance could well exec(3) another new qemu
> >>>>>> instance to live upgrade itself while keeping all emulated devices and
> >>>>>> guest alive. The current vhost design simply prohibits this from happening.
> >>>>>>
> >>>>> Ok, I was not aware of this. Thanks for explaining it!
> >>>>>
> >>>>>>>>>       so all the guest memory can be pinned for all the guest / qemu
> >>>>>>>>> lifecycle.
> >>>>>>>> I think to tie pinning to guest / qemu process life cycle makes more
> >>>>>>>> sense. Essentially this pinning part needs to be decoupled from the
> >>>>>>>> iotlb mapping abstraction layer, and can / should work as a standalone
> >>>>>>>> uAPI. Such that QEMU at the destination may launch and pin all guest's
> >>>>>>>> memory as needed without having to start the device, while awaiting any
> >>>>>>>> incoming migration request. Though problem is, there's no existing vhost
> >>>>>>>> uAPI that could properly serve as the vehicle for that. SET_OWNER /
> >>>>>>>> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
> >>>>>>>> introducing a new but clean vhost uAPI for pinning guest pages, subject
> >>>>>>>> to guest's life cycle?
> >>>>>>>>
> >>>>>>> I think that to pin or not pin memory maps should be a kernel
> >>>>>>> decision, not to be driven by qemu.
> >>>>>> It's kernel decision for sure. I am with this part.
> >>>>>>
> >>>>>>> I'm not against it if needed, but
> >>>>>>> let me know if the current "clean at close" address your concerns.
> >>>>>> To better facilitate QEMU exec (live update) case, I propose we add new
> >>>>>> vhost uAPI pair for explicit pinning request - which would live with
> >>>>>> user mm's, or more precisely qemu instance's lifecycle.
> >>>>>>
> >>>>> Ok I see your problem better now, but I think it should be solved at
> >>>>> kernel level. Does that live update need to forcefully unpin and pin
> >>>>> the memory again,
> >>>> No, it should avoid the unpin&pin cycle, otherwise it'd defeat the
> >>>> downtime expectation. The exec(3)'d process should inherit the page
> >>>> pinning and/or mlock accounting from the original QEMU process, while
> >>>> keeping original page pinning intact. Physical page mappings for DMA can
> >>>> be kept as is to avoid the need of reprogramming device, though in this
> >>>> case the existing vhost iotlb entries should be updated to reflect the
> >>>> new HVA in the exec(3)'d QEMU process.
> >>>>
> >>>>>     or that is just a consequence of how it works the
> >>>>> memory listener right now?
> >>>>>
> >>>>> Why not extend the RESET_OWNER to the rest of devices? It seems the
> >>>>> most natural way to me.
> >>>> Not sure, I think RESET_OWNER might be too heavy weighted to implement
> >>>> live update, and people are not clear what the exact semantics are by
> >>>> using it (which part of the device state is being reset, and how
> >>>> much)... In addition, people working on iommufd intended to make this a
> >>>> "one-shot" ioctl e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:
> >>>>
> >>>> https://lore.kernel.org/kvm/Y5Ibvv9PNMifi0NF@ziepe.ca/
> >>>>
> >>>> New uAPI to just change ownership of mm seems a better fit to me...
> >>>>
> >>> I'm not sure about the right solution here, but there are other
> >>> proposals to batch ioctls. But maybe creating a new one fits better.
> >>>
> >>> Thanks!
> >>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>>> Thanks!
> >>>>>
> >>>>>
> >>>>>>>> Another concern is the use_va stuff, originally it tags to the device
> >>>>>>>> level and is made static at the time of device instantiation, which is
> >>>>>>>> fine. But others to come just find a new home at per-group level or
> >>>>>>>> per-vq level struct. Hard to tell whether or not pinning is actually
> >>>>>>>> needed for the latter use_va friends, as they are essentially tied to
> >>>>>>>> the virtio life cycle or feature negotiation. While guest / Qemu starts
> >>>>>>>> way earlier than that. Perhaps just ignore those sub-device level use_va
> >>>>>>>> usages? Presumably !use_va at the device level is sufficient to infer
> >>>>>>>> the need of pinning for device?
> >>>>>>>>
> >>>>>>> I don't follow this. But I have the feeling that the subject of my
> >>>>>>> original mail is way more accurate if I would have said just "memory
> >>>>>>> maps".
> >>>>>> I think the iotlb layer in vhost-vdpa just provides the abstraction for
> >>>>>> mapping, not pinning. Although in some case mapping implicitly relies on
> >>>>>> pinning for DMA purpose, it doesn't have to tie to that in uAPI
> >>>>>> semantics. We can do explicit on-demand pinning for cases for e.g.
> >>>>>> warming up device at live migration destination.
> >>>>>>
> >>>>>>
> >>>>>>> I still consider the way to fix it is to actually delegate that to the
> >>>>>>> kernel vdpa, so it can choose if a particular ASID needs the pin or
> >>>>>>> not. But let me know if I missed something.
> >>>>>> You can disregard this for now. I will discuss that further with you
> >>>>>> guys while bind_mm and per-group use_va stuffs are landed.
> >>>>>>
> >>>>>> Thanks!
> >>>>>> -Siwei
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> -Siwei
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> This has two main problems:
> >>>>>>>>> * At this moment the reset semantics forces the vdpa device to unmap
> >>>>>>>>> all the memory. So this change needs a vhost vdpa feature flag.
> >>>>>>>>> * This may increase the initialization time. Maybe we can delay it if
> >>>>>>>>> qemu is not the destination of a LM. Anyway I think this should be
> >>>>>>>>> done as an optimization on top.
> >>>>>>>>>
> >>>>>>>>> Any ideas or comments in this regard?
> >>>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>>
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-07-10  6:04                   ` Eugenio Perez Martin
@ 2023-07-17 19:56                     ` Si-Wei Liu
  2023-07-19 10:40                       ` Eugenio Perez Martin
  0 siblings, 1 reply; 27+ messages in thread
From: Si-Wei Liu @ 2023-07-17 19:56 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea

Hey,

I am now back from the break. Sorry for the delayed response, please see 
in line.

On 7/9/2023 11:04 PM, Eugenio Perez Martin wrote:
> On Sat, Jul 8, 2023 at 11:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 7/5/2023 10:46 PM, Eugenio Perez Martin wrote:
>>> On Thu, Jul 6, 2023 at 2:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote:
>>>>> On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
>>>>>>> On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>> On 6/7/23 01:08, Eugenio Perez Martin wrote:
>>>>>>>>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>>>> Sorry for reviving this old thread, I lost the best timing to follow up
>>>>>>>>>> on this while I was on vacation. I have been working on this and found
>>>>>>>>>> out some discrepancy, please see below.
>>>>>>>>>>
>>>>>>>>>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
>>>>>>>>>>> Hi!
>>>>>>>>>>>
>>>>>>>>>>> As mentioned in the last upstream virtio-networking meeting, one of
>>>>>>>>>>> the factors that adds more downtime to migration is the handling of
>>>>>>>>>>> the guest memory (pin, map, etc). At this moment this handling is
>>>>>>>>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
>>>>>>>>>>> destination device waits until all the guest memory / state is
>>>>>>>>>>> migrated to start pinning all the memory.
>>>>>>>>>>>
>>>>>>>>>>> The proposal is to bind it to the char device life cycle (open vs
>>>>>>>>>>> close),
>>>>>>>>>> Hmmm, really? If it's the life cycle for char device, the next guest /
>>>>>>>>>> qemu launch on the same vhost-vdpa device node won't make it work.
>>>>>>>>>>
>>>>>>>>> Maybe my sentence was not accurate, but I think we're on the same page here.
>>>>>>>>>
>>>>>>>>> Two qemu instances opening the same char device at the same time are
>>>>>>>>> not allowed, and vhost_vdpa_release clean all the maps. So the next
>>>>>>>>> qemu that opens the char device should see a clean device anyway.
>>>>>>>> I mean the pin can't be done at the time of char device open, where the
>>>>>>>> user address space is not known/bound yet. The earliest point possible
>>>>>>>> for pinning would be until the vhost_attach_mm() call from SET_OWNER is
>>>>>>>> done.
>>>>>>> Maybe we are deviating, let me start again.
>>>>>>>
>>>>>>> Using QEMU code, what I'm proposing is to modify the lifecycle of the
>>>>>>> .listener member of struct vhost_vdpa.
>>>>>>>
>>>>>>> At this moment, the memory listener is registered at
>>>>>>> vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
>>>>>>> and is unregistered in both vhost_vdpa_reset_status and
>>>>>>> vhost_vdpa_cleanup.
>>>>>>>
>>>>>>> My original proposal was just to move the memory listener registration
>>>>>>> to the last vhost_vdpa_init, and remove the unregister from
>>>>>>> vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
>>>>>>> be the same, the device should not realize this change.
>>>>>> This can address LM downtime latency for sure, but it won't help
>>>>>> downtime during dynamic SVQ switch - which still needs to go through the
>>>>>> full unmap/map cycle (that includes the slow part for pinning) from
>>>>>> passthrough to SVQ mode. Be noted not every device could work with a
>>>>>> separate ASID for SVQ descriptors. The fix should expect to work on
>>>>>> normal vDPA vendor devices without a separate descriptor ASID, with
>>>>>> platform IOMMU underneath or with on-chip IOMMU.
>>>>>>
>>>>> At this moment the SVQ switch is very inefficient mapping-wise, as it
>>>>> unmap all the GPA->HVA maps and overrides it. In particular, SVQ is
>>>>> allocated in low regions of the iova space, and then the guest memory
>>>>> is allocated in this new IOVA region incrementally.
>>>> Yep. The key to build this fast path for SVQ switching I think is to
>>>> maintain the identity mapping for the passthrough queues so that QEMU
>>>> can reuse the old mappings for guest memory (e.g. GIOVA identity mapped
>>>> to GPA) while incrementally adding new mappings for SVQ vrings.
>>>>
>>>>> We can optimize that if we place SVQ in a free GPA area instead.
>>>> Here's a question though: it might not be hard to find a free GPA range
>>>> for the non-vIOMMU case (allocate iova from beyond the 48bit or 52bit
>>>> ranges), but I'm not sure if easy to find a free GIOVA range for the
>>>> vIOMMU case - particularly this has to work in the same entire 64bit
>>>> IOVA address ranges that (for now) QEMU won't be able to "reserve" a
>>>> specific IOVA ranges for SVQ from the vIOMMU. Do you foresee this can be
>>>> done for every QEMU emulated vIOMMU (intel-iommu amd-iommu, arm smmu and
>>>> virito-iommu) so that we can call it out as a generic means for SVQ
>>>> switching optimization?
>>>>
>>> In the case vIOMMU allocates a new block we will use the same algorithm as now:
>>> * Find a new free IOVA chunk of the same size
>>> * Map this new SVQ IOVA, that may or may not be the same as SVQ
>>>
>>> Since we must go through the translation phase to sanitize guest's
>>> available descriptors anyway, it has zero added cost.
>> Not sure I followed, this can work but doesn't seem able to reuse the
>> old host kernel mappings for guest memory, hence still requires remap of
>> the entire host IOVA ranges when SVQ IOVA comes along. I think by
>> maintaining 1:1 identity map on guest memory, we don't have to bother
>> tearing down existing HVA->HPA mappings in kernel thus save the
>> expensive pinning calls at large. I don't clearly see under this scheme,
>> how the new SVQ IOVA may work with potential conflict on IOVA space from
>> hotplugged memory - in this case the 1:1 IOVA->GPA identity guest memory
>> mapping can't be kept.
>>
> There is no need to maintain the 1:1 for memory mapped after the
> pinning. The bigger reason to maintain them is to reduce the downtime
> because of pinning.
Yes, if guest users don't care about SVQ switching downtime there's no 
need to maintain 1:1, and SVQ translation like today should work for 
them. However most live migration users who care about downtime during 
live migration also care about the downtime for SVQ switching - you 
don't want to say that brownout time like 300ms or so in the mid of live 
migration at the cost of complete service loss of 4 to 5 seconds at the 
start of migration is a win to them. What I said above has the 
presumption that we both are looking at (the possibility of) a 
generic/software way to optimize/reduce pinning overhead on the downtime 
- say what can be done at QEMU level or host kernel to avoid pinning (at 
SVQ switch) rather than put burden on every hardware vendor to implement 
a separate ASID for SVQ.

> After that, we can reuse the method we're using at
> this moment, looking for a new free hole for the new map. ew only need
> to pin the new map.
Consider this sequence:
- initially host device uses original GPA for guest memory mapping
- live migration starts off, QEMU's iova tree maps guest memory GPA 1:1 
to IOVA
- SVQ new maps allocated from a free hole on iova tree that happen to 
fall just above the IOVA range for guest memory GPA
- new memory hot plugged to guest while migration is still going on in 
source host. although this hot plugged memory region sits right above 
the guest memory blocks in guest memory space (GPA), it will get new map 
in a separate range (not 1:1 mapped to GPA any more!) from the QEMU iova 
tree. Since QEMU mediates and translates virtqueue access to memory via 
SVQ (all guest memory maps to the same IOVA for SVQ), so far so good
- for some reason live migration fails before VM is able to be migrated 
to the destination host, VM has to resume from the source host immediately
- since live migration had failed, QEMU will unmap SVQ vrings from the 
IOVA tree and then stop shadowing virtqueue access
- the host vDPA device now has incorrect passthrough GPA mapping for the 
part of newly plugged guest memory, as that belongs to the IOVA range 
where SVQ translation is in use

Although it might not be easy for SVQ vrings and plugged memory to 
collide on the same GPA/IOVA range, this is something we should prevent 
from happening in the first place I'd assume. You can say we only look 
from higher IOVA ranges to map SVQ so that the lower range can be 
reserved for latent hot plug memory, but this still needs complicated 
implementation to deal with IOVA range fragmentation and special 
collision prevention from breaking the 1:1 map to guest memory space.

Regards,
-Siwei

>
>>> Another option would be to move the SVQ vring to a new region, but I
>>> don't see any advantage on maintaining 1:1 mapping at that point.
>> See above. For pinning avoidance point of view (i.e. QEMU software
>> optimization on SVQ switching downtime), I think it's crucial to
>> maintain 1:1 mapping and move SVQ vring to a specific region. But I'm
>> not sure how complex it will be for QEMU to get it implemented in a
>> clean way.
>>
>> Thanks,
>> -Siwei
>>
>>>> If this QEMU/vIOMMU "hack" is not universally feasible, I would rather
>>>> build a fast path in the kernel via a new vhost IOTLB command, say
>>>> INVALIDATE_AND_UPDATE_ALL, to atomically flush all existing
>>>> (passthrough) mappings and update to use the SVQ ones in a single batch,
>>>> while keeping the pages for guest memory always pinned (the kernel will
>>>> make this decision). This doesn't expose pinning to userspace, and can
>>>> also fix downtime issue.
>>>>
>>>>>     All
>>>>> of the "translations" still need to be done, to ensure the guest
>>>>> doesn't have access to SVQ vring. That way, qemu will not send all the
>>>>> unmaps & maps, only the new ones. And vhost/vdpa does not need to call
>>>>> unpin_user_page / pin_user_pages for all the guest memory.
>>>>>
>>>>> More optimizations include the batching of the SVQ vrings.
>>>> Nods.
>>>>
>>>>>>> One of the concerns was that it could delay VM initialization, and I
>>>>>>> didn't profile it but I think that may be the case.
>>>>>> Yes, that's the concern here - we should not introduce regression to
>>>>>> normal VM boot process/time. In case of large VM it's very easy to see
>>>>>> the side effect if we go this way.
>>>>>>
>>>>>>>      I'm not sure about
>>>>>>> the right fix but I think the change is easy to profile. If that is
>>>>>>> the case, we could:
>>>>>>> * use a flag (listener->address_space ?) and only register the
>>>>>>> listener in _init if waiting for a migration, do it in _start
>>>>>>> otherwise.
>>>>>> Just doing this alone won't help SVQ mode switch downtime, see the
>>>>>> reason stated above.
>>>>>>
>>>>>>> * something like io_uring, where the map can be done in parallel and
>>>>>>> we can fail _start if some of them fails.
>>>>>> This can alleviate the problem somehow, but still sub-optimal and not
>>>>>> scalable with larger size. I'd like zero or least pinning to be done at
>>>>>> the SVQ switch or migration time.
>>>>>>
>>>>> To reduce even further the pinning at SVQ time we would need to
>>>>> preallocate SVQ vrings before suspending the device.
>>>> Yep. Preallocate SVQ vrings at the start of migration, but before
>>>> suspending the device. This will work under the assumption that we can
>>>> reserve or "steal" some ranges from the GPA or GIOVA space...
>>>>
>>>> Thanks,
>>>> -Siwei
>>>>
>>>>>>>> Actually I think the counterpart vhost_detach_mm() only gets
>>>>>>>> handled in vhost_vdpa_release() at device close time is a resulting
>>>>>>>> artifact and amiss of today's vhost protocol - the opposite RESET_OWNER
>>>>>>>> call is not made mandatory hence only seen implemented in vhost-net
>>>>>>>> device today. One qemu instance could well exec(3) another new qemu
>>>>>>>> instance to live upgrade itself while keeping all emulated devices and
>>>>>>>> guest alive. The current vhost design simply prohibits this from happening.
>>>>>>>>
>>>>>>> Ok, I was not aware of this. Thanks for explaining it!
>>>>>>>
>>>>>>>>>>>        so all the guest memory can be pinned for all the guest / qemu
>>>>>>>>>>> lifecycle.
>>>>>>>>>> I think to tie pinning to guest / qemu process life cycle makes more
>>>>>>>>>> sense. Essentially this pinning part needs to be decoupled from the
>>>>>>>>>> iotlb mapping abstraction layer, and can / should work as a standalone
>>>>>>>>>> uAPI. Such that QEMU at the destination may launch and pin all guest's
>>>>>>>>>> memory as needed without having to start the device, while awaiting any
>>>>>>>>>> incoming migration request. Though problem is, there's no existing vhost
>>>>>>>>>> uAPI that could properly serve as the vehicle for that. SET_OWNER /
>>>>>>>>>> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
>>>>>>>>>> introducing a new but clean vhost uAPI for pinning guest pages, subject
>>>>>>>>>> to guest's life cycle?
>>>>>>>>>>
>>>>>>>>> I think that to pin or not pin memory maps should be a kernel
>>>>>>>>> decision, not to be driven by qemu.
>>>>>>>> It's kernel decision for sure. I am with this part.
>>>>>>>>
>>>>>>>>> I'm not against it if needed, but
>>>>>>>>> let me know if the current "clean at close" address your concerns.
>>>>>>>> To better facilitate QEMU exec (live update) case, I propose we add new
>>>>>>>> vhost uAPI pair for explicit pinning request - which would live with
>>>>>>>> user mm's, or more precisely qemu instance's lifecycle.
>>>>>>>>
>>>>>>> Ok I see your problem better now, but I think it should be solved at
>>>>>>> kernel level. Does that live update need to forcefully unpin and pin
>>>>>>> the memory again,
>>>>>> No, it should avoid the unpin&pin cycle, otherwise it'd defeat the
>>>>>> downtime expectation. The exec(3)'d process should inherit the page
>>>>>> pinning and/or mlock accounting from the original QEMU process, while
>>>>>> keeping original page pinning intact. Physical page mappings for DMA can
>>>>>> be kept as is to avoid the need of reprogramming device, though in this
>>>>>> case the existing vhost iotlb entries should be updated to reflect the
>>>>>> new HVA in the exec(3)'d QEMU process.
>>>>>>
>>>>>>>      or that is just a consequence of how it works the
>>>>>>> memory listener right now?
>>>>>>>
>>>>>>> Why not extend the RESET_OWNER to the rest of devices? It seems the
>>>>>>> most natural way to me.
>>>>>> Not sure, I think RESET_OWNER might be too heavy weighted to implement
>>>>>> live update, and people are not clear what the exact semantics are by
>>>>>> using it (which part of the device state is being reset, and how
>>>>>> much)... In addition, people working on iommufd intended to make this a
>>>>>> "one-shot" ioctl e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:
>>>>>>
>>>>>> https://lore.kernel.org/kvm/Y5Ibvv9PNMifi0NF@ziepe.ca/
>>>>>>
>>>>>> New uAPI to just change ownership of mm seems a better fit to me...
>>>>>>
>>>>> I'm not sure about the right solution here, but there are other
>>>>> proposals to batch ioctls. But maybe creating a new one fits better.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>>>>> Another concern is the use_va stuff, originally it tags to the device
>>>>>>>>>> level and is made static at the time of device instantiation, which is
>>>>>>>>>> fine. But others to come just find a new home at per-group level or
>>>>>>>>>> per-vq level struct. Hard to tell whether or not pinning is actually
>>>>>>>>>> needed for the latter use_va friends, as they are essentially tied to
>>>>>>>>>> the virtio life cycle or feature negotiation. While guest / Qemu starts
>>>>>>>>>> way earlier than that. Perhaps just ignore those sub-device level use_va
>>>>>>>>>> usages? Presumably !use_va at the device level is sufficient to infer
>>>>>>>>>> the need of pinning for device?
>>>>>>>>>>
>>>>>>>>> I don't follow this. But I have the feeling that the subject of my
>>>>>>>>> original mail is way more accurate if I would have said just "memory
>>>>>>>>> maps".
>>>>>>>> I think the iotlb layer in vhost-vdpa just provides the abstraction for
>>>>>>>> mapping, not pinning. Although in some case mapping implicitly relies on
>>>>>>>> pinning for DMA purpose, it doesn't have to tie to that in uAPI
>>>>>>>> semantics. We can do explicit on-demand pinning for cases for e.g.
>>>>>>>> warming up device at live migration destination.
>>>>>>>>
>>>>>>>>
>>>>>>>>> I still consider the way to fix it is to actually delegate that to the
>>>>>>>>> kernel vdpa, so it can choose if a particular ASID needs the pin or
>>>>>>>>> not. But let me know if I missed something.
>>>>>>>> You can disregard this for now. I will discuss that further with you
>>>>>>>> guys while bind_mm and per-group use_va stuffs are landed.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> -Siwei
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> -Siwei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> This has two main problems:
>>>>>>>>>>> * At this moment the reset semantics forces the vdpa device to unmap
>>>>>>>>>>> all the memory. So this change needs a vhost vdpa feature flag.
>>>>>>>>>>> * This may increase the initialization time. Maybe we can delay it if
>>>>>>>>>>> qemu is not the destination of a LM. Anyway I think this should be
>>>>>>>>>>> done as an optimization on top.
>>>>>>>>>>>
>>>>>>>>>>> Any ideas or comments in this regard?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-07-17 19:56                     ` Si-Wei Liu
@ 2023-07-19 10:40                       ` Eugenio Perez Martin
  2023-07-20  0:48                         ` Si-Wei Liu
  0 siblings, 1 reply; 27+ messages in thread
From: Eugenio Perez Martin @ 2023-07-19 10:40 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea

On Mon, Jul 17, 2023 at 9:57 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
> Hey,
>
> I am now back from the break. Sorry for the delayed response, please see
> in line.
>
> On 7/9/2023 11:04 PM, Eugenio Perez Martin wrote:
> > On Sat, Jul 8, 2023 at 11:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 7/5/2023 10:46 PM, Eugenio Perez Martin wrote:
> >>> On Thu, Jul 6, 2023 at 2:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote:
> >>>>> On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>> On 6/7/23 01:08, Eugenio Perez Martin wrote:
> >>>>>>>>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>>>> Sorry for reviving this old thread, I lost the best timing to follow up
> >>>>>>>>>> on this while I was on vacation. I have been working on this and found
> >>>>>>>>>> out some discrepancy, please see below.
> >>>>>>>>>>
> >>>>>>>>>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
> >>>>>>>>>>> Hi!
> >>>>>>>>>>>
> >>>>>>>>>>> As mentioned in the last upstream virtio-networking meeting, one of
> >>>>>>>>>>> the factors that adds more downtime to migration is the handling of
> >>>>>>>>>>> the guest memory (pin, map, etc). At this moment this handling is
> >>>>>>>>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> >>>>>>>>>>> destination device waits until all the guest memory / state is
> >>>>>>>>>>> migrated to start pinning all the memory.
> >>>>>>>>>>>
> >>>>>>>>>>> The proposal is to bind it to the char device life cycle (open vs
> >>>>>>>>>>> close),
> >>>>>>>>>> Hmmm, really? If it's the life cycle for char device, the next guest /
> >>>>>>>>>> qemu launch on the same vhost-vdpa device node won't make it work.
> >>>>>>>>>>
> >>>>>>>>> Maybe my sentence was not accurate, but I think we're on the same page here.
> >>>>>>>>>
> >>>>>>>>> Two qemu instances opening the same char device at the same time are
> >>>>>>>>> not allowed, and vhost_vdpa_release clean all the maps. So the next
> >>>>>>>>> qemu that opens the char device should see a clean device anyway.
> >>>>>>>> I mean the pin can't be done at the time of char device open, where the
> >>>>>>>> user address space is not known/bound yet. The earliest point possible
> >>>>>>>> for pinning would be until the vhost_attach_mm() call from SET_OWNER is
> >>>>>>>> done.
> >>>>>>> Maybe we are deviating, let me start again.
> >>>>>>>
> >>>>>>> Using QEMU code, what I'm proposing is to modify the lifecycle of the
> >>>>>>> .listener member of struct vhost_vdpa.
> >>>>>>>
> >>>>>>> At this moment, the memory listener is registered at
> >>>>>>> vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
> >>>>>>> and is unregistered in both vhost_vdpa_reset_status and
> >>>>>>> vhost_vdpa_cleanup.
> >>>>>>>
> >>>>>>> My original proposal was just to move the memory listener registration
> >>>>>>> to the last vhost_vdpa_init, and remove the unregister from
> >>>>>>> vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
> >>>>>>> be the same, the device should not realize this change.
> >>>>>> This can address LM downtime latency for sure, but it won't help
> >>>>>> downtime during dynamic SVQ switch - which still needs to go through the
> >>>>>> full unmap/map cycle (that includes the slow part for pinning) from
> >>>>>> passthrough to SVQ mode. Be noted not every device could work with a
> >>>>>> separate ASID for SVQ descriptors. The fix should expect to work on
> >>>>>> normal vDPA vendor devices without a separate descriptor ASID, with
> >>>>>> platform IOMMU underneath or with on-chip IOMMU.
> >>>>>>
> >>>>> At this moment the SVQ switch is very inefficient mapping-wise, as it
> >>>>> unmap all the GPA->HVA maps and overrides it. In particular, SVQ is
> >>>>> allocated in low regions of the iova space, and then the guest memory
> >>>>> is allocated in this new IOVA region incrementally.
> >>>> Yep. The key to build this fast path for SVQ switching I think is to
> >>>> maintain the identity mapping for the passthrough queues so that QEMU
> >>>> can reuse the old mappings for guest memory (e.g. GIOVA identity mapped
> >>>> to GPA) while incrementally adding new mappings for SVQ vrings.
> >>>>
> >>>>> We can optimize that if we place SVQ in a free GPA area instead.
> >>>> Here's a question though: it might not be hard to find a free GPA range
> >>>> for the non-vIOMMU case (allocate iova from beyond the 48bit or 52bit
> >>>> ranges), but I'm not sure if easy to find a free GIOVA range for the
> >>>> vIOMMU case - particularly this has to work in the same entire 64bit
> >>>> IOVA address ranges that (for now) QEMU won't be able to "reserve" a
> >>>> specific IOVA ranges for SVQ from the vIOMMU. Do you foresee this can be
> >>>> done for every QEMU emulated vIOMMU (intel-iommu amd-iommu, arm smmu and
> >>>> virito-iommu) so that we can call it out as a generic means for SVQ
> >>>> switching optimization?
> >>>>
> >>> In the case vIOMMU allocates a new block we will use the same algorithm as now:
> >>> * Find a new free IOVA chunk of the same size
> >>> * Map this new SVQ IOVA, that may or may not be the same as SVQ
> >>>
> >>> Since we must go through the translation phase to sanitize guest's
> >>> available descriptors anyway, it has zero added cost.
> >> Not sure I followed, this can work but doesn't seem able to reuse the
> >> old host kernel mappings for guest memory, hence still requires remap of
> >> the entire host IOVA ranges when SVQ IOVA comes along. I think by
> >> maintaining 1:1 identity map on guest memory, we don't have to bother
> >> tearing down existing HVA->HPA mappings in kernel thus save the
> >> expensive pinning calls at large. I don't clearly see under this scheme,
> >> how the new SVQ IOVA may work with potential conflict on IOVA space from
> >> hotplugged memory - in this case the 1:1 IOVA->GPA identity guest memory
> >> mapping can't be kept.
> >>
> > There is no need to maintain the 1:1 for memory mapped after the
> > pinning. The bigger reason to maintain them is to reduce the downtime
> > because of pinning.
> Yes, if guest users don't care about SVQ switching downtime there's no
> need to maintain 1:1, and SVQ translation like today should work for
> them. However most live migration users who care about downtime during
> live migration also care about the downtime for SVQ switching - you
> don't want to say that brownout time like 300ms or so in the mid of live
> migration at the cost of complete service loss of 4 to 5 seconds at the
> start of migration is a win to them. What I said above has the
> presumption that we both are looking at (the possibility of) a
> generic/software way to optimize/reduce pinning overhead on the downtime
> - say what can be done at QEMU level or host kernel to avoid pinning (at
> SVQ switch) rather than put burden on every hardware vendor to implement
> a separate ASID for SVQ.
>

Your assumption is right, let's talk in the example as I think it
would be easier to continue this.

> > After that, we can reuse the method we're using at
> > this moment, looking for a new free hole for the new map. ew only need
> > to pin the new map.
> Consider this sequence:
> - initially host device uses original GPA for guest memory mapping
> - live migration starts off, QEMU's iova tree maps guest memory GPA 1:1
> to IOVA

GPA is already mapped & pinned, there is no need to map it again. all
the remapping is done at the moment, and what I'm proposing is to not
to do so.

> - SVQ new maps allocated from a free hole on iova tree that happen to
> fall just above the IOVA range for guest memory GPA
> - new memory hot plugged to guest while migration is still going on in
> source host. although this hot plugged memory region sits right above
> the guest memory blocks in guest memory space (GPA), it will get new map
> in a separate range (not 1:1 mapped to GPA any more!) from the QEMU iova
> tree. Since QEMU mediates and translates virtqueue access to memory via
> SVQ (all guest memory maps to the same IOVA for SVQ), so far so good
> - for some reason live migration fails before VM is able to be migrated
> to the destination host, VM has to resume from the source host immediately
> - since live migration had failed, QEMU will unmap SVQ vrings from the
> IOVA tree and then stop shadowing virtqueue access
> - the host vDPA device now has incorrect passthrough GPA mapping for the
> part of newly plugged guest memory, as that belongs to the IOVA range
> where SVQ translation is in use
>

I'm assuming we can also support <300ms downtime in that case of
memory hotplug, and that the pin / unpin of SVQ regions is bearable.

As a first step, we can also move the SVQ vrings IOVA in the event of
hot-plugging memory. but that would require a restart of the device
unless RING_RESET is supported. Would that improve the situation?

Another solution is to effectively be smarter in the case of a DMA
remap, and not unpin memory that is going to be pin anyway.

> Although it might not be easy for SVQ vrings and plugged memory to
> collide on the same GPA/IOVA range, this is something we should prevent
> from happening in the first place I'd assume. You can say we only look
> from higher IOVA ranges to map SVQ so that the lower range can be
> reserved for latent hot plug memory, but this still needs complicated
> implementation to deal with IOVA range fragmentation and special
> collision prevention from breaking the 1:1 map to guest memory space.
>
> Regards,
> -Siwei
>
> >
> >>> Another option would be to move the SVQ vring to a new region, but I
> >>> don't see any advantage on maintaining 1:1 mapping at that point.
> >> See above. For pinning avoidance point of view (i.e. QEMU software
> >> optimization on SVQ switching downtime), I think it's crucial to
> >> maintain 1:1 mapping and move SVQ vring to a specific region. But I'm
> >> not sure how complex it will be for QEMU to get it implemented in a
> >> clean way.
> >>
> >> Thanks,
> >> -Siwei
> >>
> >>>> If this QEMU/vIOMMU "hack" is not universally feasible, I would rather
> >>>> build a fast path in the kernel via a new vhost IOTLB command, say
> >>>> INVALIDATE_AND_UPDATE_ALL, to atomically flush all existing
> >>>> (passthrough) mappings and update to use the SVQ ones in a single batch,
> >>>> while keeping the pages for guest memory always pinned (the kernel will
> >>>> make this decision). This doesn't expose pinning to userspace, and can
> >>>> also fix downtime issue.
> >>>>
> >>>>>     All
> >>>>> of the "translations" still need to be done, to ensure the guest
> >>>>> doesn't have access to SVQ vring. That way, qemu will not send all the
> >>>>> unmaps & maps, only the new ones. And vhost/vdpa does not need to call
> >>>>> unpin_user_page / pin_user_pages for all the guest memory.
> >>>>>
> >>>>> More optimizations include the batching of the SVQ vrings.
> >>>> Nods.
> >>>>
> >>>>>>> One of the concerns was that it could delay VM initialization, and I
> >>>>>>> didn't profile it but I think that may be the case.
> >>>>>> Yes, that's the concern here - we should not introduce regression to
> >>>>>> normal VM boot process/time. In case of large VM it's very easy to see
> >>>>>> the side effect if we go this way.
> >>>>>>
> >>>>>>>      I'm not sure about
> >>>>>>> the right fix but I think the change is easy to profile. If that is
> >>>>>>> the case, we could:
> >>>>>>> * use a flag (listener->address_space ?) and only register the
> >>>>>>> listener in _init if waiting for a migration, do it in _start
> >>>>>>> otherwise.
> >>>>>> Just doing this alone won't help SVQ mode switch downtime, see the
> >>>>>> reason stated above.
> >>>>>>
> >>>>>>> * something like io_uring, where the map can be done in parallel and
> >>>>>>> we can fail _start if some of them fails.
> >>>>>> This can alleviate the problem somehow, but still sub-optimal and not
> >>>>>> scalable with larger size. I'd like zero or least pinning to be done at
> >>>>>> the SVQ switch or migration time.
> >>>>>>
> >>>>> To reduce even further the pinning at SVQ time we would need to
> >>>>> preallocate SVQ vrings before suspending the device.
> >>>> Yep. Preallocate SVQ vrings at the start of migration, but before
> >>>> suspending the device. This will work under the assumption that we can
> >>>> reserve or "steal" some ranges from the GPA or GIOVA space...
> >>>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>>>>>> Actually I think the counterpart vhost_detach_mm() only gets
> >>>>>>>> handled in vhost_vdpa_release() at device close time is a resulting
> >>>>>>>> artifact and amiss of today's vhost protocol - the opposite RESET_OWNER
> >>>>>>>> call is not made mandatory hence only seen implemented in vhost-net
> >>>>>>>> device today. One qemu instance could well exec(3) another new qemu
> >>>>>>>> instance to live upgrade itself while keeping all emulated devices and
> >>>>>>>> guest alive. The current vhost design simply prohibits this from happening.
> >>>>>>>>
> >>>>>>> Ok, I was not aware of this. Thanks for explaining it!
> >>>>>>>
> >>>>>>>>>>>        so all the guest memory can be pinned for all the guest / qemu
> >>>>>>>>>>> lifecycle.
> >>>>>>>>>> I think to tie pinning to guest / qemu process life cycle makes more
> >>>>>>>>>> sense. Essentially this pinning part needs to be decoupled from the
> >>>>>>>>>> iotlb mapping abstraction layer, and can / should work as a standalone
> >>>>>>>>>> uAPI. Such that QEMU at the destination may launch and pin all guest's
> >>>>>>>>>> memory as needed without having to start the device, while awaiting any
> >>>>>>>>>> incoming migration request. Though problem is, there's no existing vhost
> >>>>>>>>>> uAPI that could properly serve as the vehicle for that. SET_OWNER /
> >>>>>>>>>> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
> >>>>>>>>>> introducing a new but clean vhost uAPI for pinning guest pages, subject
> >>>>>>>>>> to guest's life cycle?
> >>>>>>>>>>
> >>>>>>>>> I think that to pin or not pin memory maps should be a kernel
> >>>>>>>>> decision, not to be driven by qemu.
> >>>>>>>> It's kernel decision for sure. I am with this part.
> >>>>>>>>
> >>>>>>>>> I'm not against it if needed, but
> >>>>>>>>> let me know if the current "clean at close" address your concerns.
> >>>>>>>> To better facilitate QEMU exec (live update) case, I propose we add new
> >>>>>>>> vhost uAPI pair for explicit pinning request - which would live with
> >>>>>>>> user mm's, or more precisely qemu instance's lifecycle.
> >>>>>>>>
> >>>>>>> Ok I see your problem better now, but I think it should be solved at
> >>>>>>> kernel level. Does that live update need to forcefully unpin and pin
> >>>>>>> the memory again,
> >>>>>> No, it should avoid the unpin&pin cycle, otherwise it'd defeat the
> >>>>>> downtime expectation. The exec(3)'d process should inherit the page
> >>>>>> pinning and/or mlock accounting from the original QEMU process, while
> >>>>>> keeping original page pinning intact. Physical page mappings for DMA can
> >>>>>> be kept as is to avoid the need of reprogramming device, though in this
> >>>>>> case the existing vhost iotlb entries should be updated to reflect the
> >>>>>> new HVA in the exec(3)'d QEMU process.
> >>>>>>
> >>>>>>>      or that is just a consequence of how it works the
> >>>>>>> memory listener right now?
> >>>>>>>
> >>>>>>> Why not extend the RESET_OWNER to the rest of devices? It seems the
> >>>>>>> most natural way to me.
> >>>>>> Not sure, I think RESET_OWNER might be too heavy weighted to implement
> >>>>>> live update, and people are not clear what the exact semantics are by
> >>>>>> using it (which part of the device state is being reset, and how
> >>>>>> much)... In addition, people working on iommufd intended to make this a
> >>>>>> "one-shot" ioctl e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:
> >>>>>>
> >>>>>> https://lore.kernel.org/kvm/Y5Ibvv9PNMifi0NF@ziepe.ca/
> >>>>>>
> >>>>>> New uAPI to just change ownership of mm seems a better fit to me...
> >>>>>>
> >>>>> I'm not sure about the right solution here, but there are other
> >>>>> proposals to batch ioctls. But maybe creating a new one fits better.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>>
> >>>>>>>>>> Another concern is the use_va stuff, originally it tags to the device
> >>>>>>>>>> level and is made static at the time of device instantiation, which is
> >>>>>>>>>> fine. But others to come just find a new home at per-group level or
> >>>>>>>>>> per-vq level struct. Hard to tell whether or not pinning is actually
> >>>>>>>>>> needed for the latter use_va friends, as they are essentially tied to
> >>>>>>>>>> the virtio life cycle or feature negotiation. While guest / Qemu starts
> >>>>>>>>>> way earlier than that. Perhaps just ignore those sub-device level use_va
> >>>>>>>>>> usages? Presumably !use_va at the device level is sufficient to infer
> >>>>>>>>>> the need of pinning for device?
> >>>>>>>>>>
> >>>>>>>>> I don't follow this. But I have the feeling that the subject of my
> >>>>>>>>> original mail is way more accurate if I would have said just "memory
> >>>>>>>>> maps".
> >>>>>>>> I think the iotlb layer in vhost-vdpa just provides the abstraction for
> >>>>>>>> mapping, not pinning. Although in some case mapping implicitly relies on
> >>>>>>>> pinning for DMA purpose, it doesn't have to tie to that in uAPI
> >>>>>>>> semantics. We can do explicit on-demand pinning for cases for e.g.
> >>>>>>>> warming up device at live migration destination.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> I still consider the way to fix it is to actually delegate that to the
> >>>>>>>>> kernel vdpa, so it can choose if a particular ASID needs the pin or
> >>>>>>>>> not. But let me know if I missed something.
> >>>>>>>> You can disregard this for now. I will discuss that further with you
> >>>>>>>> guys while bind_mm and per-group use_va stuffs are landed.
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>> -Siwei
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> -Siwei
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> This has two main problems:
> >>>>>>>>>>> * At this moment the reset semantics forces the vdpa device to unmap
> >>>>>>>>>>> all the memory. So this change needs a vhost vdpa feature flag.
> >>>>>>>>>>> * This may increase the initialization time. Maybe we can delay it if
> >>>>>>>>>>> qemu is not the destination of a LM. Anyway I think this should be
> >>>>>>>>>>> done as an optimization on top.
> >>>>>>>>>>>
> >>>>>>>>>>> Any ideas or comments in this regard?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks!
> >>>>>>>>>>>
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-07-19 10:40                       ` Eugenio Perez Martin
@ 2023-07-20  0:48                         ` Si-Wei Liu
  2023-08-02 12:42                           ` Eugenio Perez Martin
  0 siblings, 1 reply; 27+ messages in thread
From: Si-Wei Liu @ 2023-07-20  0:48 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea



On 7/19/2023 3:40 AM, Eugenio Perez Martin wrote:
> On Mon, Jul 17, 2023 at 9:57 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> Hey,
>>
>> I am now back from the break. Sorry for the delayed response, please see
>> in line.
>>
>> On 7/9/2023 11:04 PM, Eugenio Perez Martin wrote:
>>> On Sat, Jul 8, 2023 at 11:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 7/5/2023 10:46 PM, Eugenio Perez Martin wrote:
>>>>> On Thu, Jul 6, 2023 at 2:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote:
>>>>>>> On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>> On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
>>>>>>>>> On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>>>> On 6/7/23 01:08, Eugenio Perez Martin wrote:
>>>>>>>>>>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>>>>>> Sorry for reviving this old thread, I lost the best timing to follow up
>>>>>>>>>>>> on this while I was on vacation. I have been working on this and found
>>>>>>>>>>>> out some discrepancy, please see below.
>>>>>>>>>>>>
>>>>>>>>>>>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>
>>>>>>>>>>>>> As mentioned in the last upstream virtio-networking meeting, one of
>>>>>>>>>>>>> the factors that adds more downtime to migration is the handling of
>>>>>>>>>>>>> the guest memory (pin, map, etc). At this moment this handling is
>>>>>>>>>>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
>>>>>>>>>>>>> destination device waits until all the guest memory / state is
>>>>>>>>>>>>> migrated to start pinning all the memory.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The proposal is to bind it to the char device life cycle (open vs
>>>>>>>>>>>>> close),
>>>>>>>>>>>> Hmmm, really? If it's the life cycle for char device, the next guest /
>>>>>>>>>>>> qemu launch on the same vhost-vdpa device node won't make it work.
>>>>>>>>>>>>
>>>>>>>>>>> Maybe my sentence was not accurate, but I think we're on the same page here.
>>>>>>>>>>>
>>>>>>>>>>> Two qemu instances opening the same char device at the same time are
>>>>>>>>>>> not allowed, and vhost_vdpa_release clean all the maps. So the next
>>>>>>>>>>> qemu that opens the char device should see a clean device anyway.
>>>>>>>>>> I mean the pin can't be done at the time of char device open, where the
>>>>>>>>>> user address space is not known/bound yet. The earliest point possible
>>>>>>>>>> for pinning would be until the vhost_attach_mm() call from SET_OWNER is
>>>>>>>>>> done.
>>>>>>>>> Maybe we are deviating, let me start again.
>>>>>>>>>
>>>>>>>>> Using QEMU code, what I'm proposing is to modify the lifecycle of the
>>>>>>>>> .listener member of struct vhost_vdpa.
>>>>>>>>>
>>>>>>>>> At this moment, the memory listener is registered at
>>>>>>>>> vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
>>>>>>>>> and is unregistered in both vhost_vdpa_reset_status and
>>>>>>>>> vhost_vdpa_cleanup.
>>>>>>>>>
>>>>>>>>> My original proposal was just to move the memory listener registration
>>>>>>>>> to the last vhost_vdpa_init, and remove the unregister from
>>>>>>>>> vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
>>>>>>>>> be the same, the device should not realize this change.
>>>>>>>> This can address LM downtime latency for sure, but it won't help
>>>>>>>> downtime during dynamic SVQ switch - which still needs to go through the
>>>>>>>> full unmap/map cycle (that includes the slow part for pinning) from
>>>>>>>> passthrough to SVQ mode. Be noted not every device could work with a
>>>>>>>> separate ASID for SVQ descriptors. The fix should expect to work on
>>>>>>>> normal vDPA vendor devices without a separate descriptor ASID, with
>>>>>>>> platform IOMMU underneath or with on-chip IOMMU.
>>>>>>>>
>>>>>>> At this moment the SVQ switch is very inefficient mapping-wise, as it
>>>>>>> unmap all the GPA->HVA maps and overrides it. In particular, SVQ is
>>>>>>> allocated in low regions of the iova space, and then the guest memory
>>>>>>> is allocated in this new IOVA region incrementally.
>>>>>> Yep. The key to build this fast path for SVQ switching I think is to
>>>>>> maintain the identity mapping for the passthrough queues so that QEMU
>>>>>> can reuse the old mappings for guest memory (e.g. GIOVA identity mapped
>>>>>> to GPA) while incrementally adding new mappings for SVQ vrings.
>>>>>>
>>>>>>> We can optimize that if we place SVQ in a free GPA area instead.
>>>>>> Here's a question though: it might not be hard to find a free GPA range
>>>>>> for the non-vIOMMU case (allocate iova from beyond the 48bit or 52bit
>>>>>> ranges), but I'm not sure if easy to find a free GIOVA range for the
>>>>>> vIOMMU case - particularly this has to work in the same entire 64bit
>>>>>> IOVA address ranges that (for now) QEMU won't be able to "reserve" a
>>>>>> specific IOVA ranges for SVQ from the vIOMMU. Do you foresee this can be
>>>>>> done for every QEMU emulated vIOMMU (intel-iommu amd-iommu, arm smmu and
>>>>>> virito-iommu) so that we can call it out as a generic means for SVQ
>>>>>> switching optimization?
>>>>>>
>>>>> In the case vIOMMU allocates a new block we will use the same algorithm as now:
>>>>> * Find a new free IOVA chunk of the same size
>>>>> * Map this new SVQ IOVA, that may or may not be the same as SVQ
>>>>>
>>>>> Since we must go through the translation phase to sanitize guest's
>>>>> available descriptors anyway, it has zero added cost.
>>>> Not sure I followed, this can work but doesn't seem able to reuse the
>>>> old host kernel mappings for guest memory, hence still requires remap of
>>>> the entire host IOVA ranges when SVQ IOVA comes along. I think by
>>>> maintaining 1:1 identity map on guest memory, we don't have to bother
>>>> tearing down existing HVA->HPA mappings in kernel thus save the
>>>> expensive pinning calls at large. I don't clearly see under this scheme,
>>>> how the new SVQ IOVA may work with potential conflict on IOVA space from
>>>> hotplugged memory - in this case the 1:1 IOVA->GPA identity guest memory
>>>> mapping can't be kept.
>>>>
>>> There is no need to maintain the 1:1 for memory mapped after the
>>> pinning. The bigger reason to maintain them is to reduce the downtime
>>> because of pinning.
>> Yes, if guest users don't care about SVQ switching downtime there's no
>> need to maintain 1:1, and SVQ translation like today should work for
>> them. However most live migration users who care about downtime during
>> live migration also care about the downtime for SVQ switching - you
>> don't want to say that brownout time like 300ms or so in the mid of live
>> migration at the cost of complete service loss of 4 to 5 seconds at the
>> start of migration is a win to them. What I said above has the
>> presumption that we both are looking at (the possibility of) a
>> generic/software way to optimize/reduce pinning overhead on the downtime
>> - say what can be done at QEMU level or host kernel to avoid pinning (at
>> SVQ switch) rather than put burden on every hardware vendor to implement
>> a separate ASID for SVQ.
>>
> Your assumption is right, let's talk in the example as I think it
> would be easier to continue this.
>
>>> After that, we can reuse the method we're using at
>>> this moment, looking for a new free hole for the new map. ew only need
>>> to pin the new map.
>> Consider this sequence:
>> - initially host device uses original GPA for guest memory mapping
>> - live migration starts off, QEMU's iova tree maps guest memory GPA 1:1
>> to IOVA
> GPA is already mapped & pinned, there is no need to map it again. all
> the remapping is done at the moment, and what I'm proposing is to not
> to do so.
Right, depending on the specific term but what I was saying is IOVA to 
GPA translation in QEMU's IOVA tree is 1:1 as is.
>
>> - SVQ new maps allocated from a free hole on iova tree that happen to
>> fall just above the IOVA range for guest memory GPA
>> - new memory hot plugged to guest while migration is still going on in
>> source host. although this hot plugged memory region sits right above
>> the guest memory blocks in guest memory space (GPA), it will get new map
>> in a separate range (not 1:1 mapped to GPA any more!) from the QEMU iova
>> tree. Since QEMU mediates and translates virtqueue access to memory via
>> SVQ (all guest memory maps to the same IOVA for SVQ), so far so good
>> - for some reason live migration fails before VM is able to be migrated
>> to the destination host, VM has to resume from the source host immediately
>> - since live migration had failed, QEMU will unmap SVQ vrings from the
>> IOVA tree and then stop shadowing virtqueue access
>> - the host vDPA device now has incorrect passthrough GPA mapping for the
>> part of newly plugged guest memory, as that belongs to the IOVA range
>> where SVQ translation is in use
>>
> I'm assuming we can also support <300ms downtime in that case of
> memory hotplug, and that the pin / unpin of SVQ regions is bearable.
>
> As a first step, we can also move the SVQ vrings IOVA in the event of
> hot-plugging memory. but that would require a restart of the device
> unless RING_RESET is supported. Would that improve the situation?
Hmmm, this may help somehow but restart happens to run conflict with our 
goal to remove any (device) reset in favor of suspend&resume for better 
downtime at LM migration source, and in this case SVQ vring movement as 
well. As mentioned in the meeting, device/ring reset usually falls on 
the hardware slow path, reset latency grows linearly with number of 
queues. It's still being evaluated what will work best on the specific 
hardware, but just want to let you aware that we'd like to get rid of 
any latent scalability blocker from the performance path.
> Another solution is to effectively be smarter in the case of a DMA
> remap, and not unpin memory that is going to be pin anyway.
Yes, that's the goal, though not sure what the "smarter" solution that 
could be. Does it look like what I said below with complex IOVA tree trick?

Thanks,
-Siwei
>
>> Although it might not be easy for SVQ vrings and plugged memory to
>> collide on the same GPA/IOVA range, this is something we should prevent
>> from happening in the first place I'd assume. You can say we only look
>> from higher IOVA ranges to map SVQ so that the lower range can be
>> reserved for latent hot plug memory, but this still needs complicated
>> implementation to deal with IOVA range fragmentation and special
>> collision prevention from breaking the 1:1 map to guest memory space.
>>
>> Regards,
>> -Siwei
>>
>>>>> Another option would be to move the SVQ vring to a new region, but I
>>>>> don't see any advantage on maintaining 1:1 mapping at that point.
>>>> See above. For pinning avoidance point of view (i.e. QEMU software
>>>> optimization on SVQ switching downtime), I think it's crucial to
>>>> maintain 1:1 mapping and move SVQ vring to a specific region. But I'm
>>>> not sure how complex it will be for QEMU to get it implemented in a
>>>> clean way.
>>>>
>>>> Thanks,
>>>> -Siwei
>>>>
>>>>>> If this QEMU/vIOMMU "hack" is not universally feasible, I would rather
>>>>>> build a fast path in the kernel via a new vhost IOTLB command, say
>>>>>> INVALIDATE_AND_UPDATE_ALL, to atomically flush all existing
>>>>>> (passthrough) mappings and update to use the SVQ ones in a single batch,
>>>>>> while keeping the pages for guest memory always pinned (the kernel will
>>>>>> make this decision). This doesn't expose pinning to userspace, and can
>>>>>> also fix downtime issue.
>>>>>>
>>>>>>>      All
>>>>>>> of the "translations" still need to be done, to ensure the guest
>>>>>>> doesn't have access to SVQ vring. That way, qemu will not send all the
>>>>>>> unmaps & maps, only the new ones. And vhost/vdpa does not need to call
>>>>>>> unpin_user_page / pin_user_pages for all the guest memory.
>>>>>>>
>>>>>>> More optimizations include the batching of the SVQ vrings.
>>>>>> Nods.
>>>>>>
>>>>>>>>> One of the concerns was that it could delay VM initialization, and I
>>>>>>>>> didn't profile it but I think that may be the case.
>>>>>>>> Yes, that's the concern here - we should not introduce regression to
>>>>>>>> normal VM boot process/time. In case of large VM it's very easy to see
>>>>>>>> the side effect if we go this way.
>>>>>>>>
>>>>>>>>>       I'm not sure about
>>>>>>>>> the right fix but I think the change is easy to profile. If that is
>>>>>>>>> the case, we could:
>>>>>>>>> * use a flag (listener->address_space ?) and only register the
>>>>>>>>> listener in _init if waiting for a migration, do it in _start
>>>>>>>>> otherwise.
>>>>>>>> Just doing this alone won't help SVQ mode switch downtime, see the
>>>>>>>> reason stated above.
>>>>>>>>
>>>>>>>>> * something like io_uring, where the map can be done in parallel and
>>>>>>>>> we can fail _start if some of them fails.
>>>>>>>> This can alleviate the problem somehow, but still sub-optimal and not
>>>>>>>> scalable with larger size. I'd like zero or least pinning to be done at
>>>>>>>> the SVQ switch or migration time.
>>>>>>>>
>>>>>>> To reduce even further the pinning at SVQ time we would need to
>>>>>>> preallocate SVQ vrings before suspending the device.
>>>>>> Yep. Preallocate SVQ vrings at the start of migration, but before
>>>>>> suspending the device. This will work under the assumption that we can
>>>>>> reserve or "steal" some ranges from the GPA or GIOVA space...
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>
>>>>>>>>>> Actually I think the counterpart vhost_detach_mm() only gets
>>>>>>>>>> handled in vhost_vdpa_release() at device close time is a resulting
>>>>>>>>>> artifact and amiss of today's vhost protocol - the opposite RESET_OWNER
>>>>>>>>>> call is not made mandatory hence only seen implemented in vhost-net
>>>>>>>>>> device today. One qemu instance could well exec(3) another new qemu
>>>>>>>>>> instance to live upgrade itself while keeping all emulated devices and
>>>>>>>>>> guest alive. The current vhost design simply prohibits this from happening.
>>>>>>>>>>
>>>>>>>>> Ok, I was not aware of this. Thanks for explaining it!
>>>>>>>>>
>>>>>>>>>>>>>         so all the guest memory can be pinned for all the guest / qemu
>>>>>>>>>>>>> lifecycle.
>>>>>>>>>>>> I think to tie pinning to guest / qemu process life cycle makes more
>>>>>>>>>>>> sense. Essentially this pinning part needs to be decoupled from the
>>>>>>>>>>>> iotlb mapping abstraction layer, and can / should work as a standalone
>>>>>>>>>>>> uAPI. Such that QEMU at the destination may launch and pin all guest's
>>>>>>>>>>>> memory as needed without having to start the device, while awaiting any
>>>>>>>>>>>> incoming migration request. Though problem is, there's no existing vhost
>>>>>>>>>>>> uAPI that could properly serve as the vehicle for that. SET_OWNER /
>>>>>>>>>>>> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
>>>>>>>>>>>> introducing a new but clean vhost uAPI for pinning guest pages, subject
>>>>>>>>>>>> to guest's life cycle?
>>>>>>>>>>>>
>>>>>>>>>>> I think that to pin or not pin memory maps should be a kernel
>>>>>>>>>>> decision, not to be driven by qemu.
>>>>>>>>>> It's kernel decision for sure. I am with this part.
>>>>>>>>>>
>>>>>>>>>>> I'm not against it if needed, but
>>>>>>>>>>> let me know if the current "clean at close" address your concerns.
>>>>>>>>>> To better facilitate QEMU exec (live update) case, I propose we add new
>>>>>>>>>> vhost uAPI pair for explicit pinning request - which would live with
>>>>>>>>>> user mm's, or more precisely qemu instance's lifecycle.
>>>>>>>>>>
>>>>>>>>> Ok I see your problem better now, but I think it should be solved at
>>>>>>>>> kernel level. Does that live update need to forcefully unpin and pin
>>>>>>>>> the memory again,
>>>>>>>> No, it should avoid the unpin&pin cycle, otherwise it'd defeat the
>>>>>>>> downtime expectation. The exec(3)'d process should inherit the page
>>>>>>>> pinning and/or mlock accounting from the original QEMU process, while
>>>>>>>> keeping original page pinning intact. Physical page mappings for DMA can
>>>>>>>> be kept as is to avoid the need of reprogramming device, though in this
>>>>>>>> case the existing vhost iotlb entries should be updated to reflect the
>>>>>>>> new HVA in the exec(3)'d QEMU process.
>>>>>>>>
>>>>>>>>>       or that is just a consequence of how it works the
>>>>>>>>> memory listener right now?
>>>>>>>>>
>>>>>>>>> Why not extend the RESET_OWNER to the rest of devices? It seems the
>>>>>>>>> most natural way to me.
>>>>>>>> Not sure, I think RESET_OWNER might be too heavy weighted to implement
>>>>>>>> live update, and people are not clear what the exact semantics are by
>>>>>>>> using it (which part of the device state is being reset, and how
>>>>>>>> much)... In addition, people working on iommufd intended to make this a
>>>>>>>> "one-shot" ioctl e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:
>>>>>>>>
>>>>>>>> https://lore.kernel.org/kvm/Y5Ibvv9PNMifi0NF@ziepe.ca/
>>>>>>>>
>>>>>>>> New uAPI to just change ownership of mm seems a better fit to me...
>>>>>>>>
>>>>>>> I'm not sure about the right solution here, but there are other
>>>>>>> proposals to batch ioctls. But maybe creating a new one fits better.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> Another concern is the use_va stuff, originally it tags to the device
>>>>>>>>>>>> level and is made static at the time of device instantiation, which is
>>>>>>>>>>>> fine. But others to come just find a new home at per-group level or
>>>>>>>>>>>> per-vq level struct. Hard to tell whether or not pinning is actually
>>>>>>>>>>>> needed for the latter use_va friends, as they are essentially tied to
>>>>>>>>>>>> the virtio life cycle or feature negotiation. While guest / Qemu starts
>>>>>>>>>>>> way earlier than that. Perhaps just ignore those sub-device level use_va
>>>>>>>>>>>> usages? Presumably !use_va at the device level is sufficient to infer
>>>>>>>>>>>> the need of pinning for device?
>>>>>>>>>>>>
>>>>>>>>>>> I don't follow this. But I have the feeling that the subject of my
>>>>>>>>>>> original mail is way more accurate if I would have said just "memory
>>>>>>>>>>> maps".
>>>>>>>>>> I think the iotlb layer in vhost-vdpa just provides the abstraction for
>>>>>>>>>> mapping, not pinning. Although in some case mapping implicitly relies on
>>>>>>>>>> pinning for DMA purpose, it doesn't have to tie to that in uAPI
>>>>>>>>>> semantics. We can do explicit on-demand pinning for cases for e.g.
>>>>>>>>>> warming up device at live migration destination.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I still consider the way to fix it is to actually delegate that to the
>>>>>>>>>>> kernel vdpa, so it can choose if a particular ASID needs the pin or
>>>>>>>>>>> not. But let me know if I missed something.
>>>>>>>>>> You can disregard this for now. I will discuss that further with you
>>>>>>>>>> guys while bind_mm and per-group use_va stuffs are landed.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> -Siwei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> -Siwei
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> This has two main problems:
>>>>>>>>>>>>> * At this moment the reset semantics forces the vdpa device to unmap
>>>>>>>>>>>>> all the memory. So this change needs a vhost vdpa feature flag.
>>>>>>>>>>>>> * This may increase the initialization time. Maybe we can delay it if
>>>>>>>>>>>>> qemu is not the destination of a LM. Anyway I think this should be
>>>>>>>>>>>>> done as an optimization on top.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any ideas or comments in this regard?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Reducing vdpa migration downtime because of memory pin / maps
  2023-07-20  0:48                         ` Si-Wei Liu
@ 2023-08-02 12:42                           ` Eugenio Perez Martin
  0 siblings, 0 replies; 27+ messages in thread
From: Eugenio Perez Martin @ 2023-08-02 12:42 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Jason Wang, Michael Tsirkin, Longpeng, Gonglei (Arei),
	Eli Cohen, Parav Pandit, Juan Quintela, David Gilbert,
	Dragos Tatulea

On Thu, Jul 20, 2023 at 2:49 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 7/19/2023 3:40 AM, Eugenio Perez Martin wrote:
> > On Mon, Jul 17, 2023 at 9:57 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >> Hey,
> >>
> >> I am now back from the break. Sorry for the delayed response, please see
> >> in line.
> >>
> >> On 7/9/2023 11:04 PM, Eugenio Perez Martin wrote:
> >>> On Sat, Jul 8, 2023 at 11:14 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 7/5/2023 10:46 PM, Eugenio Perez Martin wrote:
> >>>>> On Thu, Jul 6, 2023 at 2:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>> On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>>>> On 6/7/23 01:08, Eugenio Perez Martin wrote:
> >>>>>>>>>>> On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>>>>>> Sorry for reviving this old thread, I lost the best timing to follow up
> >>>>>>>>>>>> on this while I was on vacation. I have been working on this and found
> >>>>>>>>>>>> out some discrepancy, please see below.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 4/5/23 04:37, Eugenio Perez Martin wrote:
> >>>>>>>>>>>>> Hi!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As mentioned in the last upstream virtio-networking meeting, one of
> >>>>>>>>>>>>> the factors that adds more downtime to migration is the handling of
> >>>>>>>>>>>>> the guest memory (pin, map, etc). At this moment this handling is
> >>>>>>>>>>>>> bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
> >>>>>>>>>>>>> destination device waits until all the guest memory / state is
> >>>>>>>>>>>>> migrated to start pinning all the memory.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The proposal is to bind it to the char device life cycle (open vs
> >>>>>>>>>>>>> close),
> >>>>>>>>>>>> Hmmm, really? If it's the life cycle for char device, the next guest /
> >>>>>>>>>>>> qemu launch on the same vhost-vdpa device node won't make it work.
> >>>>>>>>>>>>
> >>>>>>>>>>> Maybe my sentence was not accurate, but I think we're on the same page here.
> >>>>>>>>>>>
> >>>>>>>>>>> Two qemu instances opening the same char device at the same time are
> >>>>>>>>>>> not allowed, and vhost_vdpa_release clean all the maps. So the next
> >>>>>>>>>>> qemu that opens the char device should see a clean device anyway.
> >>>>>>>>>> I mean the pin can't be done at the time of char device open, where the
> >>>>>>>>>> user address space is not known/bound yet. The earliest point possible
> >>>>>>>>>> for pinning would be until the vhost_attach_mm() call from SET_OWNER is
> >>>>>>>>>> done.
> >>>>>>>>> Maybe we are deviating, let me start again.
> >>>>>>>>>
> >>>>>>>>> Using QEMU code, what I'm proposing is to modify the lifecycle of the
> >>>>>>>>> .listener member of struct vhost_vdpa.
> >>>>>>>>>
> >>>>>>>>> At this moment, the memory listener is registered at
> >>>>>>>>> vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
> >>>>>>>>> and is unregistered in both vhost_vdpa_reset_status and
> >>>>>>>>> vhost_vdpa_cleanup.
> >>>>>>>>>
> >>>>>>>>> My original proposal was just to move the memory listener registration
> >>>>>>>>> to the last vhost_vdpa_init, and remove the unregister from
> >>>>>>>>> vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
> >>>>>>>>> be the same, the device should not realize this change.
> >>>>>>>> This can address LM downtime latency for sure, but it won't help
> >>>>>>>> downtime during dynamic SVQ switch - which still needs to go through the
> >>>>>>>> full unmap/map cycle (that includes the slow part for pinning) from
> >>>>>>>> passthrough to SVQ mode. Be noted not every device could work with a
> >>>>>>>> separate ASID for SVQ descriptors. The fix should expect to work on
> >>>>>>>> normal vDPA vendor devices without a separate descriptor ASID, with
> >>>>>>>> platform IOMMU underneath or with on-chip IOMMU.
> >>>>>>>>
> >>>>>>> At this moment the SVQ switch is very inefficient mapping-wise, as it
> >>>>>>> unmap all the GPA->HVA maps and overrides it. In particular, SVQ is
> >>>>>>> allocated in low regions of the iova space, and then the guest memory
> >>>>>>> is allocated in this new IOVA region incrementally.
> >>>>>> Yep. The key to build this fast path for SVQ switching I think is to
> >>>>>> maintain the identity mapping for the passthrough queues so that QEMU
> >>>>>> can reuse the old mappings for guest memory (e.g. GIOVA identity mapped
> >>>>>> to GPA) while incrementally adding new mappings for SVQ vrings.
> >>>>>>
> >>>>>>> We can optimize that if we place SVQ in a free GPA area instead.
> >>>>>> Here's a question though: it might not be hard to find a free GPA range
> >>>>>> for the non-vIOMMU case (allocate iova from beyond the 48bit or 52bit
> >>>>>> ranges), but I'm not sure if easy to find a free GIOVA range for the
> >>>>>> vIOMMU case - particularly this has to work in the same entire 64bit
> >>>>>> IOVA address ranges that (for now) QEMU won't be able to "reserve" a
> >>>>>> specific IOVA ranges for SVQ from the vIOMMU. Do you foresee this can be
> >>>>>> done for every QEMU emulated vIOMMU (intel-iommu amd-iommu, arm smmu and
> >>>>>> virito-iommu) so that we can call it out as a generic means for SVQ
> >>>>>> switching optimization?
> >>>>>>
> >>>>> In the case vIOMMU allocates a new block we will use the same algorithm as now:
> >>>>> * Find a new free IOVA chunk of the same size
> >>>>> * Map this new SVQ IOVA, that may or may not be the same as SVQ
> >>>>>
> >>>>> Since we must go through the translation phase to sanitize guest's
> >>>>> available descriptors anyway, it has zero added cost.
> >>>> Not sure I followed, this can work but doesn't seem able to reuse the
> >>>> old host kernel mappings for guest memory, hence still requires remap of
> >>>> the entire host IOVA ranges when SVQ IOVA comes along. I think by
> >>>> maintaining 1:1 identity map on guest memory, we don't have to bother
> >>>> tearing down existing HVA->HPA mappings in kernel thus save the
> >>>> expensive pinning calls at large. I don't clearly see under this scheme,
> >>>> how the new SVQ IOVA may work with potential conflict on IOVA space from
> >>>> hotplugged memory - in this case the 1:1 IOVA->GPA identity guest memory
> >>>> mapping can't be kept.
> >>>>
> >>> There is no need to maintain the 1:1 for memory mapped after the
> >>> pinning. The bigger reason to maintain them is to reduce the downtime
> >>> because of pinning.
> >> Yes, if guest users don't care about SVQ switching downtime there's no
> >> need to maintain 1:1, and SVQ translation like today should work for
> >> them. However most live migration users who care about downtime during
> >> live migration also care about the downtime for SVQ switching - you
> >> don't want to say that brownout time like 300ms or so in the mid of live
> >> migration at the cost of complete service loss of 4 to 5 seconds at the
> >> start of migration is a win to them. What I said above has the
> >> presumption that we both are looking at (the possibility of) a
> >> generic/software way to optimize/reduce pinning overhead on the downtime
> >> - say what can be done at QEMU level or host kernel to avoid pinning (at
> >> SVQ switch) rather than put burden on every hardware vendor to implement
> >> a separate ASID for SVQ.
> >>
> > Your assumption is right, let's talk in the example as I think it
> > would be easier to continue this.
> >
> >>> After that, we can reuse the method we're using at
> >>> this moment, looking for a new free hole for the new map. ew only need
> >>> to pin the new map.
> >> Consider this sequence:
> >> - initially host device uses original GPA for guest memory mapping
> >> - live migration starts off, QEMU's iova tree maps guest memory GPA 1:1
> >> to IOVA
> > GPA is already mapped & pinned, there is no need to map it again. all
> > the remapping is done at the moment, and what I'm proposing is to not
> > to do so.
> Right, depending on the specific term but what I was saying is IOVA to
> GPA translation in QEMU's IOVA tree is 1:1 as is.
> >
> >> - SVQ new maps allocated from a free hole on iova tree that happen to
> >> fall just above the IOVA range for guest memory GPA
> >> - new memory hot plugged to guest while migration is still going on in
> >> source host. although this hot plugged memory region sits right above
> >> the guest memory blocks in guest memory space (GPA), it will get new map
> >> in a separate range (not 1:1 mapped to GPA any more!) from the QEMU iova
> >> tree. Since QEMU mediates and translates virtqueue access to memory via
> >> SVQ (all guest memory maps to the same IOVA for SVQ), so far so good
> >> - for some reason live migration fails before VM is able to be migrated
> >> to the destination host, VM has to resume from the source host immediately
> >> - since live migration had failed, QEMU will unmap SVQ vrings from the
> >> IOVA tree and then stop shadowing virtqueue access
> >> - the host vDPA device now has incorrect passthrough GPA mapping for the
> >> part of newly plugged guest memory, as that belongs to the IOVA range
> >> where SVQ translation is in use
> >>
> > I'm assuming we can also support <300ms downtime in that case of
> > memory hotplug, and that the pin / unpin of SVQ regions is bearable.
> >
> > As a first step, we can also move the SVQ vrings IOVA in the event of
> > hot-plugging memory. but that would require a restart of the device
> > unless RING_RESET is supported. Would that improve the situation?
> Hmmm, this may help somehow but restart happens to run conflict with our
> goal to remove any (device) reset in favor of suspend&resume for better
> downtime at LM migration source, and in this case SVQ vring movement as
> well. As mentioned in the meeting, device/ring reset usually falls on
> the hardware slow path, reset latency grows linearly with number of
> queues. It's still being evaluated what will work best on the specific
> hardware, but just want to let you aware that we'd like to get rid of
> any latent scalability blocker from the performance path.

So, summarizing the flow for switching to SVQ:
a. Suspend all the device, so we can fetch vq state etc.
b. Get the vq descriptor table group with the new
VHOST_VDPA_GET_VRING_DESC_GROUP.
c. Set the vq group of all descriptors table to ASID 1
d. Reset and set the new address of each vq with _F_RING_RESET. I
think the standard mandates to set all the parameters again, but only
the vq address will be different.
e. re-enable all the vqs
f. Resume the device.

To reduce the overhead per vq, maybe we can also be smarter in the
qemu side of this switch and not change vring IOVA addresses. That
would allow us to get rid of steps d and e. Only the ASID of the
groups of the descriptor tables should be changed, and they might be
just 1 group.

The question now is, it is allowed to change a group ASID after
DRIVER_OK? I guess yes, the same way we can map and unmap memory after
it.

As a first baby step, maybe we can just add the different vring asid,
even resetting the whole device. This would allow us to validate both
device and qemu in this aspect.

Thanks!

> > Another solution is to effectively be smarter in the case of a DMA
> > remap, and not unpin memory that is going to be pin anyway.
> Yes, that's the goal, though not sure what the "smarter" solution that
> could be. Does it look like what I said below with complex IOVA tree trick?
>
> Thanks,
> -Siwei
> >
> >> Although it might not be easy for SVQ vrings and plugged memory to
> >> collide on the same GPA/IOVA range, this is something we should prevent
> >> from happening in the first place I'd assume. You can say we only look
> >> from higher IOVA ranges to map SVQ so that the lower range can be
> >> reserved for latent hot plug memory, but this still needs complicated
> >> implementation to deal with IOVA range fragmentation and special
> >> collision prevention from breaking the 1:1 map to guest memory space.
> >>
> >> Regards,
> >> -Siwei
> >>
> >>>>> Another option would be to move the SVQ vring to a new region, but I
> >>>>> don't see any advantage on maintaining 1:1 mapping at that point.
> >>>> See above. For pinning avoidance point of view (i.e. QEMU software
> >>>> optimization on SVQ switching downtime), I think it's crucial to
> >>>> maintain 1:1 mapping and move SVQ vring to a specific region. But I'm
> >>>> not sure how complex it will be for QEMU to get it implemented in a
> >>>> clean way.
> >>>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>>>> If this QEMU/vIOMMU "hack" is not universally feasible, I would rather
> >>>>>> build a fast path in the kernel via a new vhost IOTLB command, say
> >>>>>> INVALIDATE_AND_UPDATE_ALL, to atomically flush all existing
> >>>>>> (passthrough) mappings and update to use the SVQ ones in a single batch,
> >>>>>> while keeping the pages for guest memory always pinned (the kernel will
> >>>>>> make this decision). This doesn't expose pinning to userspace, and can
> >>>>>> also fix downtime issue.
> >>>>>>
> >>>>>>>      All
> >>>>>>> of the "translations" still need to be done, to ensure the guest
> >>>>>>> doesn't have access to SVQ vring. That way, qemu will not send all the
> >>>>>>> unmaps & maps, only the new ones. And vhost/vdpa does not need to call
> >>>>>>> unpin_user_page / pin_user_pages for all the guest memory.
> >>>>>>>
> >>>>>>> More optimizations include the batching of the SVQ vrings.
> >>>>>> Nods.
> >>>>>>
> >>>>>>>>> One of the concerns was that it could delay VM initialization, and I
> >>>>>>>>> didn't profile it but I think that may be the case.
> >>>>>>>> Yes, that's the concern here - we should not introduce regression to
> >>>>>>>> normal VM boot process/time. In case of large VM it's very easy to see
> >>>>>>>> the side effect if we go this way.
> >>>>>>>>
> >>>>>>>>>       I'm not sure about
> >>>>>>>>> the right fix but I think the change is easy to profile. If that is
> >>>>>>>>> the case, we could:
> >>>>>>>>> * use a flag (listener->address_space ?) and only register the
> >>>>>>>>> listener in _init if waiting for a migration, do it in _start
> >>>>>>>>> otherwise.
> >>>>>>>> Just doing this alone won't help SVQ mode switch downtime, see the
> >>>>>>>> reason stated above.
> >>>>>>>>
> >>>>>>>>> * something like io_uring, where the map can be done in parallel and
> >>>>>>>>> we can fail _start if some of them fails.
> >>>>>>>> This can alleviate the problem somehow, but still sub-optimal and not
> >>>>>>>> scalable with larger size. I'd like zero or least pinning to be done at
> >>>>>>>> the SVQ switch or migration time.
> >>>>>>>>
> >>>>>>> To reduce even further the pinning at SVQ time we would need to
> >>>>>>> preallocate SVQ vrings before suspending the device.
> >>>>>> Yep. Preallocate SVQ vrings at the start of migration, but before
> >>>>>> suspending the device. This will work under the assumption that we can
> >>>>>> reserve or "steal" some ranges from the GPA or GIOVA space...
> >>>>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>
> >>>>>>>>>> Actually I think the counterpart vhost_detach_mm() only gets
> >>>>>>>>>> handled in vhost_vdpa_release() at device close time is a resulting
> >>>>>>>>>> artifact and amiss of today's vhost protocol - the opposite RESET_OWNER
> >>>>>>>>>> call is not made mandatory hence only seen implemented in vhost-net
> >>>>>>>>>> device today. One qemu instance could well exec(3) another new qemu
> >>>>>>>>>> instance to live upgrade itself while keeping all emulated devices and
> >>>>>>>>>> guest alive. The current vhost design simply prohibits this from happening.
> >>>>>>>>>>
> >>>>>>>>> Ok, I was not aware of this. Thanks for explaining it!
> >>>>>>>>>
> >>>>>>>>>>>>>         so all the guest memory can be pinned for all the guest / qemu
> >>>>>>>>>>>>> lifecycle.
> >>>>>>>>>>>> I think to tie pinning to guest / qemu process life cycle makes more
> >>>>>>>>>>>> sense. Essentially this pinning part needs to be decoupled from the
> >>>>>>>>>>>> iotlb mapping abstraction layer, and can / should work as a standalone
> >>>>>>>>>>>> uAPI. Such that QEMU at the destination may launch and pin all guest's
> >>>>>>>>>>>> memory as needed without having to start the device, while awaiting any
> >>>>>>>>>>>> incoming migration request. Though problem is, there's no existing vhost
> >>>>>>>>>>>> uAPI that could properly serve as the vehicle for that. SET_OWNER /
> >>>>>>>>>>>> SET_MEM_TABLE / RESET_OWNER seems a remote fit.. Any objection against
> >>>>>>>>>>>> introducing a new but clean vhost uAPI for pinning guest pages, subject
> >>>>>>>>>>>> to guest's life cycle?
> >>>>>>>>>>>>
> >>>>>>>>>>> I think that to pin or not pin memory maps should be a kernel
> >>>>>>>>>>> decision, not to be driven by qemu.
> >>>>>>>>>> It's kernel decision for sure. I am with this part.
> >>>>>>>>>>
> >>>>>>>>>>> I'm not against it if needed, but
> >>>>>>>>>>> let me know if the current "clean at close" address your concerns.
> >>>>>>>>>> To better facilitate QEMU exec (live update) case, I propose we add new
> >>>>>>>>>> vhost uAPI pair for explicit pinning request - which would live with
> >>>>>>>>>> user mm's, or more precisely qemu instance's lifecycle.
> >>>>>>>>>>
> >>>>>>>>> Ok I see your problem better now, but I think it should be solved at
> >>>>>>>>> kernel level. Does that live update need to forcefully unpin and pin
> >>>>>>>>> the memory again,
> >>>>>>>> No, it should avoid the unpin&pin cycle, otherwise it'd defeat the
> >>>>>>>> downtime expectation. The exec(3)'d process should inherit the page
> >>>>>>>> pinning and/or mlock accounting from the original QEMU process, while
> >>>>>>>> keeping original page pinning intact. Physical page mappings for DMA can
> >>>>>>>> be kept as is to avoid the need of reprogramming device, though in this
> >>>>>>>> case the existing vhost iotlb entries should be updated to reflect the
> >>>>>>>> new HVA in the exec(3)'d QEMU process.
> >>>>>>>>
> >>>>>>>>>       or that is just a consequence of how it works the
> >>>>>>>>> memory listener right now?
> >>>>>>>>>
> >>>>>>>>> Why not extend the RESET_OWNER to the rest of devices? It seems the
> >>>>>>>>> most natural way to me.
> >>>>>>>> Not sure, I think RESET_OWNER might be too heavy weighted to implement
> >>>>>>>> live update, and people are not clear what the exact semantics are by
> >>>>>>>> using it (which part of the device state is being reset, and how
> >>>>>>>> much)... In addition, people working on iommufd intended to make this a
> >>>>>>>> "one-shot" ioctl e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:
> >>>>>>>>
> >>>>>>>> https://lore.kernel.org/kvm/Y5Ibvv9PNMifi0NF@ziepe.ca/
> >>>>>>>>
> >>>>>>>> New uAPI to just change ownership of mm seems a better fit to me...
> >>>>>>>>
> >>>>>>> I'm not sure about the right solution here, but there are other
> >>>>>>> proposals to batch ioctls. But maybe creating a new one fits better.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -Siwei
> >>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>>> Another concern is the use_va stuff, originally it tags to the device
> >>>>>>>>>>>> level and is made static at the time of device instantiation, which is
> >>>>>>>>>>>> fine. But others to come just find a new home at per-group level or
> >>>>>>>>>>>> per-vq level struct. Hard to tell whether or not pinning is actually
> >>>>>>>>>>>> needed for the latter use_va friends, as they are essentially tied to
> >>>>>>>>>>>> the virtio life cycle or feature negotiation. While guest / Qemu starts
> >>>>>>>>>>>> way earlier than that. Perhaps just ignore those sub-device level use_va
> >>>>>>>>>>>> usages? Presumably !use_va at the device level is sufficient to infer
> >>>>>>>>>>>> the need of pinning for device?
> >>>>>>>>>>>>
> >>>>>>>>>>> I don't follow this. But I have the feeling that the subject of my
> >>>>>>>>>>> original mail is way more accurate if I would have said just "memory
> >>>>>>>>>>> maps".
> >>>>>>>>>> I think the iotlb layer in vhost-vdpa just provides the abstraction for
> >>>>>>>>>> mapping, not pinning. Although in some case mapping implicitly relies on
> >>>>>>>>>> pinning for DMA purpose, it doesn't have to tie to that in uAPI
> >>>>>>>>>> semantics. We can do explicit on-demand pinning for cases for e.g.
> >>>>>>>>>> warming up device at live migration destination.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> I still consider the way to fix it is to actually delegate that to the
> >>>>>>>>>>> kernel vdpa, so it can choose if a particular ASID needs the pin or
> >>>>>>>>>>> not. But let me know if I missed something.
> >>>>>>>>>> You can disregard this for now. I will discuss that further with you
> >>>>>>>>>> guys while bind_mm and per-group use_va stuffs are landed.
> >>>>>>>>>>
> >>>>>>>>>> Thanks!
> >>>>>>>>>> -Siwei
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Thanks!
> >>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> -Siwei
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> This has two main problems:
> >>>>>>>>>>>>> * At this moment the reset semantics forces the vdpa device to unmap
> >>>>>>>>>>>>> all the memory. So this change needs a vhost vdpa feature flag.
> >>>>>>>>>>>>> * This may increase the initialization time. Maybe we can delay it if
> >>>>>>>>>>>>> qemu is not the destination of a LM. Anyway I think this should be
> >>>>>>>>>>>>> done as an optimization on top.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Any ideas or comments in this regard?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>>
>



^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-08-02 12:44 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-05 11:37 Reducing vdpa migration downtime because of memory pin / maps Eugenio Perez Martin
2023-04-10  2:14 ` Jason Wang
2023-04-10  3:16   ` longpeng2--- via
2023-04-10  3:21     ` Jason Wang
2023-04-10  9:04       ` Eugenio Perez Martin
2023-04-11  2:25         ` Jason Wang
2023-04-11  6:28           ` Eugenio Perez Martin
2023-04-11  6:36             ` Jason Wang
2023-04-11 12:33 ` Eugenio Perez Martin
2023-04-12  5:56   ` Jason Wang
2023-04-12  6:18     ` Jason Wang
2023-04-13  7:27       ` Eugenio Perez Martin
2023-06-06 22:44 ` Si-Wei Liu
2023-06-07  8:08   ` Eugenio Perez Martin
2023-06-08 22:40     ` Si-Wei Liu
2023-06-09  3:18       ` Jason Wang
2023-06-09 14:32       ` Eugenio Perez Martin
2023-06-27  6:36         ` Si-Wei Liu
2023-07-05 18:03           ` Eugenio Perez Martin
2023-07-06  0:13             ` Si-Wei Liu
2023-07-06  5:46               ` Eugenio Perez Martin
2023-07-08  9:14                 ` Si-Wei Liu
2023-07-10  6:04                   ` Eugenio Perez Martin
2023-07-17 19:56                     ` Si-Wei Liu
2023-07-19 10:40                       ` Eugenio Perez Martin
2023-07-20  0:48                         ` Si-Wei Liu
2023-08-02 12:42                           ` Eugenio Perez Martin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.