From: Eugenio Perez Martin <eperezma@redhat.com>
To: Jonah Palmer <jonah.palmer@oracle.com>
Cc: qemu-devel@nongnu.org, mst@redhat.com, raphael@enfabrica.net,
	 kwolf@redhat.com, hreitz@redhat.com, jasowang@redhat.com,
	pbonzini@redhat.com,  fam@euphon.net, stefanha@redhat.com,
	qemu-block@nongnu.org,  schalla@marvell.com, leiyang@redhat.com,
	virtio-fs@lists.linux.dev,  si-wei.liu@oracle.com,
	boris.ostrovsky@oracle.com
Subject: Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support
Date: Tue, 26 Mar 2024 19:34:22 +0100
Message-ID: <CAJaqyWcHokNf97uwE0=S56CK9cBpB54sF8cdW7+BhFc5VzBRUQ@mail.gmail.com>
In-Reply-To: <d435919b-ec5e-41d5-8bbc-354d027f67d0@oracle.com>

On Tue, Mar 26, 2024 at 5:49 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 3/25/24 4:33 PM, Eugenio Perez Martin wrote:
> > On Mon, Mar 25, 2024 at 5:52 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>
> >>
> >>
> >> On 3/22/24 7:18 AM, Eugenio Perez Martin wrote:
> >>> On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>>>
> >>>> The goal of these patches is to add support to a variety of virtio and
> >>>> vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
> >>>> indicates that all buffers are used by the device in the same order in
> >>>> which they were made available by the driver.
> >>>>
> >>>> These patches attempt to implement a generalized, non-device-specific
> >>>> solution to support this feature.
> >>>>
> >>>> The core feature behind this solution is a buffer mechanism in the form
> >>>> of GLib's GHashTable. The decision behind using a hash table was to
> >>>> leverage its ability for quick lookup, insertion, and removal
> >>>> operations. Given that our keys are simply numbers of an ordered
> >>>> sequence, a hash table seemed like the best choice for a buffer
> >>>> mechanism.
> >>>>
> >>>> ---------------------
> >>>>
> >>>> The strategy behind this implementation is as follows:
> >>>>
> >>>> We know that popping buffers from the available ring and enqueuing them
> >>>> for further processing will always be done in the same order in which they
> >>>> were made available by the driver. Given this, we can note their order
> >>>> by assigning the resulting VirtQueueElement a key. This key is a number
> >>>> in a sequence that represents the order in which they were popped from
> >>>> the available ring, relative to the other VirtQueueElements.
> >>>>
> >>>> For example, given 3 "elements" that were popped from the available
> >>>> ring, we assign a key value to them which represents their order (elem0
> >>>> is popped first, then elem1, then lastly elem2):
> >>>>
> >>>>        elem2   --  elem1   --  elem0   ---> Enqueue for processing
> >>>>       (key: 2)    (key: 1)    (key: 0)
> >>>>
> >>>> Then these elements are enqueued for further processing by the host.
> >>>>
> >>>> While most devices will return these completed elements in the same
> >>>> order in which they were enqueued, some devices may not (e.g.
> >>>> virtio-blk). To guarantee that these elements are put on the used ring
> >>>> in the same order in which they were enqueued, we can use a buffering
> >>>> mechanism that keeps track of the next expected sequence number of an
> >>>> element.
> >>>>
> >>>> In other words, if the completed element does not have a key value that
> >>>> matches the next expected sequence number, then we know this element is
> >>>> not in-order and we must stash it away in a hash table until an order
> >>>> can be made. The element's key value is used as the key for placing it
> >>>> in the hash table.
> >>>>
> >>>> If the completed element has a key value that matches the next expected
> >>>> sequence number, then we know this element is in-order and we can push
> >>>> it on the used ring. Then we increment the next expected sequence number
> >>>> and check if the hash table contains an element at this key location.
> >>>>
> >>>> If so, we retrieve this element, push it to the used ring, delete the
> >>>> key-value pair from the hash table, increment the next expected sequence
> >>>> number, and check the hash table again for an element at this new key
> >>>> location. This process is repeated until we're unable to find an element
> >>>> in the hash table to continue the order.
> >>>>
> >>>> So, for example, say the 3 elements we enqueued were completed in the
> >>>> following order: elem1, elem2, elem0. The next expected sequence number
> >>>> is 0:
> >>>>
> >>>>       exp-seq-num = 0:
> >>>>
> >>>>        elem1   --> elem1.key == exp-seq-num ? --> No, stash it
> >>>>       (key: 1)                                         |
> >>>>                                                        |
> >>>>                                                        v
> >>>>                                                  ================
> >>>>                                                  |key: 1 - elem1|
> >>>>                                                  ================
> >>>>       ---------------------
> >>>>       exp-seq-num = 0:
> >>>>
> >>>>        elem2   --> elem2.key == exp-seq-num ? --> No, stash it
> >>>>       (key: 2)                                         |
> >>>>                                                        |
> >>>>                                                        v
> >>>>                                                  ================
> >>>>                                                  |key: 1 - elem1|
> >>>>                                                  |--------------|
> >>>>                                                  |key: 2 - elem2|
> >>>>                                                  ================
> >>>>       ---------------------
> >>>>       exp-seq-num = 0:
> >>>>
> >>>>        elem0   --> elem0.key == exp-seq-num ? --> Yes, push to used ring
> >>>>       (key: 0)
> >>>>
> >>>>       exp-seq-num = 1:
> >>>>
> >>>>       lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
> >>>>                                                remove elem from table
> >>>>                                                        |
> >>>>                                                        v
> >>>>                                                  ================
> >>>>                                                  |key: 2 - elem2|
> >>>>                                                  ================
> >>>>
> >>>>       exp-seq-num = 2:
> >>>>
> >>>>       lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
> >>>>                                                remove elem from table
> >>>>                                                        |
> >>>>                                                        v
> >>>>                                                  ================
> >>>>                                                  |   *empty*    |
> >>>>                                                  ================
> >>>>
> >>>>       exp-seq-num = 3:
> >>>>
> >>>>       lookup(table, exp-seq-num) != NULL ? --> No, done
> >>>>       ---------------------
> >>>>
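(For reference, the stash/drain flow described above boils down to
something like the untested sketch below. The in_order_complete() name,
the seq_num field, the use of elem->len to remember the written length,
and the vq->in_order_ht / vq->next_used_seq members are made up for
illustration; they are not the names used in the patches.)

    /* Sketch only: complete one element in driver order, stashing
     * out-of-order completions in a GHashTable keyed by sequence number
     * and draining consecutive entries once the gap is closed.  Assumes
     * the table was created with g_direct_hash/g_direct_equal and no
     * value destroy notifier. */
    static void in_order_complete(VirtQueue *vq, VirtQueueElement *elem,
                                  unsigned int len)
    {
        if (elem->seq_num != vq->next_used_seq) {
            /* Out of order: remember the written length and stash the
             * element until its turn comes. */
            elem->len = len;
            g_hash_table_insert(vq->in_order_ht,
                                GUINT_TO_POINTER(elem->seq_num), elem);
            return;
        }

        /* In order: push it, then drain any consecutive stashed elements. */
        virtqueue_push(vq, elem, len);
        vq->next_used_seq++;

        while ((elem = g_hash_table_lookup(vq->in_order_ht,
                           GUINT_TO_POINTER(vq->next_used_seq)))) {
            g_hash_table_remove(vq->in_order_ht,
                                GUINT_TO_POINTER(vq->next_used_seq));
            virtqueue_push(vq, elem, elem->len);
            vq->next_used_seq++;
        }
    }
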
> >>>
> >>> I think using a hashtable to handle this has an important drawback:
> >>> it hurts the performance of devices that use in-order correctly,
> >>> because of the hash computation, in order to benefit devices that use
> >>> it badly by completing descriptors out of order. We should use data
> >>> structures that are as cheap as possible for the former, and we don't
> >>> care about worsening the experience of devices that enable in_order
> >>> when they shouldn't.
> >>>
> >>
> >> Right, because if descriptors are coming in in-order, we still search
> >> the (empty) hash table.
> >>
> >> Hmm... what if we introduced a flag to see if we actually should bother
> >> searching the hash table? That way we avoid the cost of searching when
> >> we really don't need to.
> >>
> >>> So I suggest reusing the vq->used_elems array. At each used descriptor
> >>> written in the used ring, you know the next head is elem->index +
> >>> elem->ndescs, so you can check if that element has been filled or not.
> >>> If used, it needs to be flushed too. If not used, just return.
> >>>
> >>> Of course virtqueue_flush also needs to take this into account.
> >>>
> >>> What do you think, does it make sense to you?
> >>>
> >>
> >> I'm having a bit of trouble understanding the suggestion here. Would you
> >> mind elaborating a bit more for me on this?
> >>
> >> For example, say elem0, elem1, and elem2 were enqueued in-order (elem0
> >> being first, elem2 last) and then elem2 finishes first, elem1 second,
> >> and elem0 third. Given that these elements finish out-of-order, how
> >> would you handle these out-of-order elements using your suggestion?
> >>
> >
> > virtqueue_fill is called first with elem2. So vq->used_elems[2 %
> > vq->num] is filled with the needed information of the descriptor:
> > index, len and ndescs. idx function parameter is ignored.
> >
> > Optionally, virtqueue_push is called. It checks whether
> > vq->used_elems[vq->used_idx] is valid. "Valid" can be elem->in_num +
> > elem->out_num > 0, resetting both on every used ring write. If it is
> > not valid, this is a no-op. At this point in the example, it is not valid.
> >
> > Same process for elem1.
> >
> > virtqueue_fill is the same for elem0. But now virtqueue_flush gets
> > interesting, as it detects vq->used_elems[0] is used. It scans for the
> > first not-used element, and it finds it is vq->used_elems[3]. So it
> > needs to write a used elem with id = 2 and the corresponding length.
> >
> > It might be interesting to implement ways to speed up the lookup of the
> > last used descriptor, but if so I'd go for a bitmap, and always on top
> > of the base series.
> >
> > The algorithm has not been tested, so maybe I've missed something.
> >
> > Thanks!
> >
>
> Thank you for taking the time to clarify this for me, I appreciate it.
>
> I spent some time yesterday and this morning working this over in my
> head. I believe I understand what you're trying to do here and it makes
> more sense than employing a data structure like a hash table for this
> kind of job. However, I have a few questions regarding this implementation.
>
> So, one question is on the reuse of the VirtQueue's used_elems array.
> Won't reusing this array cause issues with packed VQ operations, since it
> also uses this array? If we want to stick with using this array
> specifically, perhaps we may need to rewrite its logic if the device has
> negotiated the in_order feature? E.g.
>
> virtqueue_packed_flush (...) {
>     if (virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER)) {
>         // new logic
>     } else {
>         // current logic
>     }
> }
> -----------
>

That's right.

> Regarding this paragraph:
>
> "virtqueue_fill is called first with elem2. So vq->used_elems[2 %
> vq->num] is filled with the needed information of the descriptor:
> index, len and ndescs. idx function parameter is ignored."
>
> This looks exactly like virtqueue_packed_fill except for the idx
> parameter we'd pass in (sequence_num % vq->vring.num).
>
> In any case, regardless of whether this element being passed in is
> considered to be in-order or not, we still add this element to
> vq->used_elems in virtqueue_fill. Ok, got it.
>
> Then you say "Optionally, virtqueue_push is called". I assume by
> "optionally" you mean we need to know if this is a single-shot operation
> or a batched operation. A single-shot operation would call for
> virtqueue_push whereas a batched operation would just use
> virtqueue_fill. If this is what you meant by that then ok, I understand
> that too.
>

Totally correct.

> However, I think before we start considering whether or not we need to
> call virtqueue_push or continue with virtqueue_fill, we first should
> know whether or not this element is in-order. And I think to do that we
> should use the check you mentioned:
>
> if (vq->used_elems[vq->used_idx].in_num +
> vq->used_elems[vq->used_idx].out_num > 0)
>
> or perhaps:
>
> if (vq->used_elems[vq->used_idx] != NULL)
>
> If the element is found not to be in-order, I assume we return and we
> are done with the handling of this element for now.
>
> Now my confusion with this part comes from calling virtqueue_push inside
> of the virtqueue_fill function. Wouldn't calling virtqueue_push inside
> of virtqueue_fill present some kind of recursive execution path? Unless
> I'm missing something here, this probably isn't something we need to do,
> right?

Maybe I explained something wrong, but virtqueue_fill should not call
virtqueue_push. It's up to the caller (virtio-net, virtio-blk, etc.) to
call one or the other. Can you elaborate?
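
To make the split concrete, the two entry points are meant to be used
from the device model roughly like this (process_one / process_batch are
made-up callers for illustration; the virtqueue_* calls are the existing
API):

    /* Single completion: virtqueue_push() is fill + flush in one call. */
    static void process_one(VirtQueue *vq, VirtQueueElement *elem,
                            unsigned int written)
    {
        virtqueue_push(vq, elem, written);
    }

    /* Batched completions: fill each element, then flush the batch once. */
    static void process_batch(VirtQueue *vq, VirtQueueElement **elems,
                              unsigned int *written, unsigned int n)
    {
        for (unsigned int i = 0; i < n; i++) {
            virtqueue_fill(vq, elems[i], written[i], i);
        }
        virtqueue_flush(vq, n);
    }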

> -----------
>
> Lastly, when execution reaches virtqueue_flush, what would define an
> element as unused? Perhaps...
>
> if (vq->used_elems[i] == NULL)
>

used_elems is not an array of pointers but an array of
VirtQueueElement, so we cannot do it this way.

> or
>
> if (vq->used_elems[i].in_num + vq->used_elems[i].out_num > 0)
>

Right, I propose to reset both in_num = out_num = 0.
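
With that marker, the flush side would look roughly like the untested
sketch below. virtqueue_write_used_elem() is a hypothetical helper
standing in for the actual used ring write, and the used_idx /
vring.num bookkeeping is simplified:

    /* Sketch only: starting from the next expected head, consume every
     * element that has already been filled (in_num + out_num > 0),
     * resetting both counters as each slot is accounted for. */
    static void virtqueue_in_order_flush(VirtQueue *vq)
    {
        VirtQueueElement *e = &vq->used_elems[vq->used_idx % vq->vring.num];
        VirtQueueElement *last = NULL;

        while (e->in_num + e->out_num > 0) {
            last = e;
            e->in_num = e->out_num = 0;    /* mark the slot as consumed */
            vq->used_idx += e->ndescs;     /* advance to the next head */
            e = &vq->used_elems[vq->used_idx % vq->vring.num];
        }

        if (last) {
            /* With IN_ORDER, one used entry for the last element of the
             * batch (id = last->index, len = last->len) is enough. */
            virtqueue_write_used_elem(vq, last);
        }
    }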

Thanks!

> Thanks Eugenio!
>
> >> Thanks :)
> >>
> >>> Thanks!
> >>>
> >>>
> >>>> Jonah Palmer (8):
> >>>>     virtio: Define InOrderVQElement
> >>>>     virtio: Create/destroy/reset VirtQueue In-Order hash table
> >>>>     virtio: Define order variables
> >>>>     virtio: Implement in-order handling for virtio devices
> >>>>     virtio-net: in-order handling
> >>>>     vhost-svq: in-order handling
> >>>>     vhost/vhost-user: Add VIRTIO_F_IN_ORDER to vhost feature bits
> >>>>     virtio: Add VIRTIO_F_IN_ORDER property definition
> >>>>
> >>>>    hw/block/vhost-user-blk.c          |   1 +
> >>>>    hw/net/vhost_net.c                 |   2 +
> >>>>    hw/net/virtio-net.c                |   6 +-
> >>>>    hw/scsi/vhost-scsi.c               |   1 +
> >>>>    hw/scsi/vhost-user-scsi.c          |   1 +
> >>>>    hw/virtio/vhost-shadow-virtqueue.c |  15 ++++-
> >>>>    hw/virtio/vhost-user-fs.c          |   1 +
> >>>>    hw/virtio/vhost-user-vsock.c       |   1 +
> >>>>    hw/virtio/virtio.c                 | 103 ++++++++++++++++++++++++++++-
> >>>>    include/hw/virtio/virtio.h         |  20 +++++-
> >>>>    net/vhost-vdpa.c                   |   1 +
> >>>>    11 files changed, 145 insertions(+), 7 deletions(-)
> >>>>
> >>>> --
> >>>> 2.39.3
> >>>>
> >>>
> >>
> >
>

