qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Si-Wei Liu <si-wei.liu@oracle.com>
To: Eugenio Perez Martin <eperezma@redhat.com>, Eli Cohen <eli@mellanox.com>
Cc: qemu-devel@nongnu.org, Liuxiangdong <liuxiangdong5@huawei.com>,
	Zhu Lingshan <lingshan.zhu@intel.com>,
	"Gonglei (Arei)" <arei.gonglei@huawei.com>,
	alvaro.karsz@solid-run.com, Shannon Nelson <snelson@pensando.io>,
	Laurent Vivier <lvivier@redhat.com>,
	Harpreet Singh Anand <hanand@xilinx.com>,
	Gautam Dawar <gdawar@xilinx.com>,
	Stefano Garzarella <sgarzare@redhat.com>,
	Cornelia Huck <cohuck@redhat.com>, Cindy Lu <lulu@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Parav Pandit <parav@mellanox.com>
Subject: Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
Date: Fri, 3 Feb 2023 18:03:56 -0800	[thread overview]
Message-ID: <ab75ec37-763b-2055-01e9-10e87ef6e956@oracle.com> (raw)
In-Reply-To: <CAJaqyWf5mNxjeAskVjLTFu08LpPxiHffhrtknKB75aUAEayEcQ@mail.gmail.com>



On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote:
> On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 1/12/2023 9:24 AM, Eugenio Pérez wrote:
>>> This allows net to restart the device backend to configure SVQ on it.
>>>
>>> Ideally, these changes should not be net specific. However, the vdpa net
>>> backend is the one with enough knowledge to configure everything because
>>> of some reasons:
>>> * Queues might need to be shadowed or not depending on its kind (control
>>>     vs data).
>>> * Queues need to share the same map translations (iova tree).
>>>
>>> Because of that it is cleaner to restart the whole net backend and
>>> configure again as expected, similar to how vhost-kernel moves between
>>> userspace and passthrough.
>>>
>>> If more kinds of devices need dynamic switching to SVQ we can create a
>>> callback struct like VhostOps and move most of the code there.
>>> VhostOps cannot be reused since all vdpa backend share them, and to
>>> personalize just for networking would be too heavy.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 84 insertions(+)
>>>
>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>>> index 5d7ad6e4d7..f38532b1df 100644
>>> --- a/net/vhost-vdpa.c
>>> +++ b/net/vhost-vdpa.c
>>> @@ -26,6 +26,8 @@
>>>    #include <err.h>
>>>    #include "standard-headers/linux/virtio_net.h"
>>>    #include "monitor/monitor.h"
>>> +#include "migration/migration.h"
>>> +#include "migration/misc.h"
>>>    #include "migration/blocker.h"
>>>    #include "hw/virtio/vhost.h"
>>>
>>> @@ -33,6 +35,7 @@
>>>    typedef struct VhostVDPAState {
>>>        NetClientState nc;
>>>        struct vhost_vdpa vhost_vdpa;
>>> +    Notifier migration_state;
>>>        Error *migration_blocker;
>>>        VHostNetState *vhost_net;
>>>
>>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
>>>        return DO_UPCAST(VhostVDPAState, nc, nc0);
>>>    }
>>>
>>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
>>> +{
>>> +    struct vhost_vdpa *v = &s->vhost_vdpa;
>>> +    VirtIONet *n;
>>> +    VirtIODevice *vdev;
>>> +    int data_queue_pairs, cvq, r;
>>> +    NetClientState *peer;
>>> +
>>> +    /* We are only called on the first data vqs and only if x-svq is not set */
>>> +    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
>>> +        return;
>>> +    }
>>> +
>>> +    vdev = v->dev->vdev;
>>> +    n = VIRTIO_NET(vdev);
>>> +    if (!n->vhost_started) {
>>> +        return;
>>> +    }
>>> +
>>> +    if (enable) {
>>> +        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
>>> +    }
>>> +    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
>>> +    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
>>> +                                  n->max_ncs - n->max_queue_pairs : 0;
>>> +    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
>>> +
>>> +    peer = s->nc.peer;
>>> +    for (int i = 0; i < data_queue_pairs + cvq; i++) {
>>> +        VhostVDPAState *vdpa_state;
>>> +        NetClientState *nc;
>>> +
>>> +        if (i < data_queue_pairs) {
>>> +            nc = qemu_get_peer(peer, i);
>>> +        } else {
>>> +            nc = qemu_get_peer(peer, n->max_queue_pairs);
>>> +        }
>>> +
>>> +        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
>>> +        vdpa_state->vhost_vdpa.shadow_data = enable;
>>> +
>>> +        if (i < data_queue_pairs) {
>>> +            /* Do not override CVQ shadow_vqs_enabled */
>>> +            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
>>> +        }
>>> +    }
>>> +
>>> +    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
>> As the first revision, this method (vhost_net_stop followed by
>> vhost_net_start) should be fine for software vhost-vdpa backend for e.g.
>> vp_vdpa and vdpa_sim_net. However, I would like to get your attention
>> that this method implies substantial blackout time for mode switching on
>> real hardware - get a full cycle of device reset of getting memory
>> mappings torn down, unpin & repin same set of pages, and set up new
>> mapping would take very significant amount of time, especially for a
>> large VM. Maybe we can do:
>>
> Right, I think this is something that deserves optimization in the future.
>
> Note that we must replace the mappings anyway, with all passthrough
> queues stopped.
Yes, unmap and remap is needed indeed. I haven't checked, does shadow vq 
keep mapping to the same GPA where passthrough data virtqueues were 
associated with across switch (so that the mode switch is transparent to 
the guest)? For platform IOMMU the mapping and remapping cost is 
inevitable, though I wonder for the on-chip IOMMU case could it take 
some fast path to just replace IOVA in place without destroying and 
setting up all mapping entries, if the same GPA is going to be used for 
the data rings (copy Eli for his input).

>   This is because SVQ vrings are not in the guest space.
> The pin can be skipped though, I think that's a low hand fruit here.
Yes, that's right. For a large VM pining overhead usually overweighs the 
mapping cost. It would be a great amount of time saving if pin can be 
skipped.

>
> If any, we can track guest's IOVA and add SVQ vrings in a hole. If
> guest's IOVA layout changes, we can translate it then to a new
> location. That way we only need one map operation in the worst case.
> I'm omitting the lookup time here, but I still should be worth it.
>
> But as you mention I think it is not worth complicating this series
> and we can think about it on top.
Yes, agreed. I'll just let you aware of the need of this optimization 
for real hardware device.

>   We can start building it on top of
> your suggestions for sure.
>
>> 1) replace reset with the RESUME feature that was just added to the
>> vhost-vdpa ioctls in kernel
> We cannot change vring addresses just with a SUSPEND / RESUME.
I wonder if we can make SUSPEND (via some flag or new backend feature is 
fine) accept updating internal state like the vring addresses, while 
defer applying it to the device until RESUME? That way we don't lose a 
lot of other states that otherwise would need to re-instantiate at large 
with _F_RING_RESET or device reset.

>
> We could do it with the VIRTIO_F_RING_RESET feature though. Would it
> be advantageous to the device?
>
>> 2) add new vdpa ioctls to allow iova range rebound to new virtual
>> address for QEMU's shadow vq or back to device's vq
> Actually, if the device supports ASID we can allocate ASID 1 to that
> purpose. At this moment only CVQ vrings and control buffers are there
> when the device is passthrough.
Yep, we can get SVQ mapping pre-cooked in another ASID before dismantle 
the mapping for the passthrough VQ. This will help the general IOMMU case.

>
> But this doesn't solve the problem if we need to send all SVQ
> translation to the device on-chip IOMMU, doesn't it? We must clear all
> of it and send the new one to the device anyway.
>
>> 3) use a light-weighted sequence of suspend+rebind+resume to switch mode
>> on the fly instead of getting through the whole reset+restart cycle
>>
> I think this is the same as 1, isn't it?
I mean do all three together: 1,2 in kernel and 3 in QEMU.

>
>> I suspect the same idea could even be used to address high live
>> migration downtime seen on hardware vdpa device. What do you think?
>>
> I think this is a great start for sure! Some questions:
> a) Is the time on reprogramming on-chip IOMMU comparable to program
> regular IOMMU?
I would think this largely depends on the hardware implementation of 
on-chip IOMMU, the performance characteristics of which is very device 
specific. Some times driver software implementation and API for on-chip 
MMU also matters. Which would require vendor specific work to optimize 
based on the specific use case.

>   If it is the case it should be easier to find vdpa
> devices with support for _F_RESET soon.
> b) Not to merge on master, but it is possible to add an artificial
> delay on vdpa_sim that simulates the properties of the delay of IOMMU?
> In that line, have you observed if it is linear with the size of the
> memory, with the number of maps, other factors..?
As I said this is very device specific and hard to quantify, but I agree 
it's a good idea to simulate the delay and measure the effect. For the 
on-chip MMU device I'm looking, large proportion of the time was spent 
on software side in allocating a range of memory for hosting mapping 
entries (don't know how to quantify this part but the allocation time is 
not a constant nor linear to the size of memory), walking all iotlb 
entries passed down from vdpa layer and building corresponding memory 
key objects for a range of pages. For each iotlb entry the time to build 
memory mapping looks grow linearly with the size of memory. Not sure if 
there's room to improve, I'll let the owner to clarify.

Thanks,
-Siwei





>
> Thanks!
>
>> Thanks,
>> -Siwei
>>
>>> +    if (unlikely(r < 0)) {
>>> +        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
>>> +    }
>>> +}
>>> +
>>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
>>> +{
>>> +    MigrationState *migration = data;
>>> +    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
>>> +                                     migration_state);
>>> +
>>> +    switch (migration->state) {
>>> +    case MIGRATION_STATUS_SETUP:
>>> +        vhost_vdpa_net_log_global_enable(s, true);
>>> +        return;
>>> +
>>> +    case MIGRATION_STATUS_CANCELLING:
>>> +    case MIGRATION_STATUS_CANCELLED:
>>> +    case MIGRATION_STATUS_FAILED:
>>> +        vhost_vdpa_net_log_global_enable(s, false);
>>> +        return;
>>> +    };
>>> +}
>>> +
>>>    static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
>>>    {
>>>        struct vhost_vdpa *v = &s->vhost_vdpa;
>>>
>>> +    if (v->feature_log) {
>>> +        add_migration_state_change_notifier(&s->migration_state);
>>> +    }
>>> +
>>>        if (v->shadow_vqs_enabled) {
>>>            v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
>>>                                               v->iova_range.last);
>>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)
>>>
>>>        assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
>>>
>>> +    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
>>> +        remove_migration_state_change_notifier(&s->migration_state);
>>> +    }
>>> +
>>>        dev = s->vhost_vdpa.dev;
>>>        if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
>>>            g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
>>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
>>>        s->vhost_vdpa.device_fd = vdpa_device_fd;
>>>        s->vhost_vdpa.index = queue_pair_index;
>>>        s->always_svq = svq;
>>> +    s->migration_state.notify = vdpa_net_migration_state_notifier;
>>>        s->vhost_vdpa.shadow_vqs_enabled = svq;
>>>        s->vhost_vdpa.iova_range = iova_range;
>>>        s->vhost_vdpa.shadow_data = svq;



  reply	other threads:[~2023-02-04  2:05 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-12 17:24 [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Eugenio Pérez
2023-01-12 17:24 ` [RFC v2 01/13] vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check Eugenio Pérez
2023-01-13  3:12   ` Jason Wang
2023-01-13  6:42     ` Eugenio Perez Martin
2023-01-16  3:01       ` Jason Wang
2023-01-12 17:24 ` [RFC v2 02/13] vdpa net: move iova tree creation from init to start Eugenio Pérez
2023-01-13  3:53   ` Jason Wang
2023-01-13  7:28     ` Eugenio Perez Martin
2023-01-16  3:05       ` Jason Wang
2023-01-16  9:14         ` Eugenio Perez Martin
2023-01-17  4:30           ` Jason Wang
2023-01-12 17:24 ` [RFC v2 03/13] vdpa: copy cvq shadow_data from data vqs, not from x-svq Eugenio Pérez
2023-01-12 17:24 ` [RFC v2 04/13] vdpa: rewind at get_base, not set_base Eugenio Pérez
2023-01-13  4:09   ` Jason Wang
2023-01-13  7:40     ` Eugenio Perez Martin
2023-01-16  3:32       ` Jason Wang
2023-01-16  9:53         ` Eugenio Perez Martin
2023-01-17  4:38           ` Jason Wang
2023-01-17  6:57             ` Eugenio Perez Martin
2023-01-12 17:24 ` [RFC v2 05/13] vdpa net: add migration blocker if cannot migrate cvq Eugenio Pérez
2023-01-13  4:24   ` Jason Wang
2023-01-13  7:46     ` Eugenio Perez Martin
2023-01-16  3:34       ` Jason Wang
2023-01-16  5:23         ` Michael S. Tsirkin
2023-01-16  9:33           ` Eugenio Perez Martin
2023-01-17  5:42             ` Jason Wang
2023-01-12 17:24 ` [RFC v2 06/13] vhost: delay set_vring_ready after DRIVER_OK Eugenio Pérez
2023-01-13  4:36   ` Jason Wang
2023-01-13  8:19     ` Eugenio Perez Martin
2023-01-13  9:51       ` Stefano Garzarella
2023-01-13 10:03         ` Eugenio Perez Martin
2023-01-13 10:37           ` Stefano Garzarella
2023-01-17 15:15           ` Maxime Coquelin
2023-01-16  6:36       ` Jason Wang
2023-01-16 16:16         ` Eugenio Perez Martin
2023-01-17  5:36           ` Jason Wang
2023-01-12 17:24 ` [RFC v2 07/13] vdpa: " Eugenio Pérez
2023-01-12 17:24 ` [RFC v2 08/13] vdpa: Negotiate _F_SUSPEND feature Eugenio Pérez
2023-01-13  4:39   ` Jason Wang
2023-01-13  8:45     ` Eugenio Perez Martin
2023-01-16  6:48       ` Jason Wang
2023-01-16 16:17         ` Eugenio Perez Martin
2023-01-12 17:24 ` [RFC v2 09/13] vdpa: add feature_log parameter to vhost_vdpa Eugenio Pérez
2023-01-12 17:24 ` [RFC v2 10/13] vdpa net: allow VHOST_F_LOG_ALL Eugenio Pérez
2023-01-13  4:42   ` Jason Wang
2023-01-12 17:24 ` [RFC v2 11/13] vdpa: add vdpa net migration state notifier Eugenio Pérez
2023-01-13  4:54   ` Jason Wang
2023-01-13  9:00     ` Eugenio Perez Martin
2023-01-16  6:51       ` Jason Wang
2023-01-16 15:21         ` Eugenio Perez Martin
2023-01-17  9:58       ` Dr. David Alan Gilbert
2023-01-17 10:23         ` Eugenio Perez Martin
2023-01-17 12:54           ` Dr. David Alan Gilbert
2023-02-02  1:52   ` Si-Wei Liu
2023-02-02 15:28     ` Eugenio Perez Martin
2023-02-04  2:03       ` Si-Wei Liu [this message]
2023-02-13  9:47         ` Eugenio Perez Martin
2023-02-13 22:36           ` Si-Wei Liu
2023-02-14 18:51             ` Eugenio Perez Martin
2023-02-12 14:31     ` Eli Cohen
2023-01-12 17:24 ` [RFC v2 12/13] vdpa: preemptive kick at enable Eugenio Pérez
2023-01-13  2:31   ` Jason Wang
2023-01-13  3:25     ` Zhu, Lingshan
2023-01-13  3:39       ` Jason Wang
2023-01-13  9:06         ` Eugenio Perez Martin
2023-01-16  7:02           ` Jason Wang
2023-02-02 16:55             ` Eugenio Perez Martin
2023-02-02  0:56           ` Si-Wei Liu
2023-02-02 16:53             ` Eugenio Perez Martin
2023-02-04 11:04               ` Si-Wei Liu
2023-02-05 10:00                 ` Michael S. Tsirkin
2023-02-06  5:08                   ` Si-Wei Liu
2023-01-12 17:24 ` [RFC v2 13/13] vdpa: Conditionally expose _F_LOG in vhost_net devices Eugenio Pérez
2023-02-02  1:00 ` [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration Si-Wei Liu
2023-02-02 11:27   ` Eugenio Perez Martin
2023-02-03  5:08     ` Si-Wei Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ab75ec37-763b-2055-01e9-10e87ef6e956@oracle.com \
    --to=si-wei.liu@oracle.com \
    --cc=alvaro.karsz@solid-run.com \
    --cc=arei.gonglei@huawei.com \
    --cc=cohuck@redhat.com \
    --cc=eli@mellanox.com \
    --cc=eperezma@redhat.com \
    --cc=gdawar@xilinx.com \
    --cc=hanand@xilinx.com \
    --cc=jasowang@redhat.com \
    --cc=lingshan.zhu@intel.com \
    --cc=liuxiangdong5@huawei.com \
    --cc=lulu@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=mst@redhat.com \
    --cc=parav@mellanox.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=sgarzare@redhat.com \
    --cc=snelson@pensando.io \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).