* Re: packed ring layout proposal v2
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
@ 2017-02-08 13:37 ` Christian Borntraeger
  2017-02-09 17:43   ` Michael S. Tsirkin
       [not found]   ` <20170209181955-mutt-send-email-mst@kernel.org>
  2017-02-08 17:41 ` [virtio-dev] " Paolo Bonzini
                   ` (18 subsequent siblings)
  19 siblings, 2 replies; 92+ messages in thread
From: Christian Borntraeger @ 2017-02-08 13:37 UTC (permalink / raw)
  To: Michael S. Tsirkin, virtio-dev; +Cc: virtualization

On 02/08/2017 04:20 AM, Michael S. Tsirkin wrote:
[...]
> * Prototype
> 
> A partial prototype can be found under
> tools/virtio/ringtest/ring.c
> 
> Test run:
> [mst@tuck ringtest]$ time ./ring 
> real    0m0.399s
> user    0m0.791s
> sys     0m0.000s
> [mst@tuck ringtest]$ time ./virtio_ring_0_9
> real    0m0.503s
> user    0m0.999s
> sys     0m0.000s

I see similar improvements on s390, so I think this would be a very nice 
improvement.

[...] 
> Note: should this proposal be accepted and approved, one or more
>       claims disclosed to the TC admin and listed on the Virtio TC
>       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
>       might become Essential Claims.
> 

I have trouble parsing that legal stuff. Do I read that right, that
these claims can be implemented as part of a virtio implementation without
any worries (e.g. non open source HW implementation or non open source
hypervisor)?


* Re: [virtio-dev] packed ring layout proposal v2
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
  2017-02-08 13:37 ` packed ring layout proposal v2 Christian Borntraeger
@ 2017-02-08 17:41 ` Paolo Bonzini
  2017-02-08 19:59   ` Michael S. Tsirkin
       [not found]   ` <20170208214435-mutt-send-email-mst@kernel.org>
  2017-02-22  4:27 ` packed ring layout proposal - todo list Michael S. Tsirkin
                   ` (17 subsequent siblings)
  19 siblings, 2 replies; 92+ messages in thread
From: Paolo Bonzini @ 2017-02-08 17:41 UTC (permalink / raw)
  To: Michael S. Tsirkin, virtio-dev; +Cc: virtualization



On 08/02/2017 04:20, Michael S. Tsirkin wrote:
> * Scatter/gather support
> 
> We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:
> 
> /* This marks a buffer as continuing via the next field. */
> #define VRING_DESC_F_NEXT       1
> 
> Unlike virtio 1.0, all descriptors must have distinct ID values.
> 
> Also unlike virtio 1.0, use of this flag will be an optional feature
> (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.

I would still prefer that we had  _either_ single-direct or
multiple-indirect descriptors, i.e. no VRING_DESC_F_NEXT.  I can propose
my idea for this in a separate message.

> * Batching descriptors:
> 
> virtio 1.0 allows passing a batch of descriptors in both directions, by
> incrementing the used/avail index by values > 1.  We can support this by
> chaining a list of descriptors through a bit in the flags field.
> To allow use together with s/g, a different bit will be used.
> 
> #define VRING_DESC_F_BATCH_NEXT 0x0010
> 
> Batching works for both driver and device descriptors.

I'm still not sure how this would be useful.  It cannot be mandatory to
set the bit, I think, because you don't know when the host/guest is
going to read descriptors.  So both host and guest always have to look
ahead one element in any case.

> * Non power-of-2 ring sizes
> 
> As the ring simply wraps around, there's no reason to
> require ring size to be power of two.
> It can be made a separate feature though.

Power of 2 ring sizes are required in order to ignore the high bits of
the indices.  With non-power-of-2 sizes you are forced to keep the
indices less than the ring size.  Alternatively you can do this:

> * Event index would be in the range 0 to 2 * Queue Size
> (to detect wrap arounds) and wrap to 0 after that.
> 
> The assumption is that each side maintains an internal
> descriptor counter 0 to 2 * Queue Size that wraps to 0.
> In that case, interrupt triggers when counter reaches
> the given value.

but it seems more complicated than just forcing power-of-2 and ignoring
the high bits.
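
To make the two options concrete, here is a minimal sketch, not taken from any implementation. It assumes a free-running 16-bit index for the power-of-2 case and, for arbitrary sizes, a counter kept in the half-open range [0, 2 * ring_size), which is one reading of the "0 to 2 * Queue Size" wording above. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Power-of-2 case: the slot is just the low bits of a free-running
 * index, so the index itself never needs to be range-checked. */
static inline uint16_t ring_slot(uint16_t free_running_idx, uint16_t ring_size)
{
	/* The mask trick is only valid for power-of-two sizes. */
	assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);
	return free_running_idx & (uint16_t)(ring_size - 1);
}

/* Arbitrary-size case (assumed interpretation): the counter lives in
 * [0, 2 * ring_size) and wraps to 0, which costs a compare per step. */
static inline uint32_t counter_next(uint32_t counter, uint32_t ring_size)
{
	assert(counter < 2 * ring_size);
	counter++;
	return counter == 2 * ring_size ? 0 : counter;
}

/* Map the wrapping counter to a ring slot without a division. */
static inline uint32_t counter_slot(uint32_t counter, uint32_t ring_size)
{
	assert(counter < 2 * ring_size);
	return counter >= ring_size ? counter - ring_size : counter;
}
```

The first helper is a single AND; the other two each need a comparison, which is the extra complexity being weighed here.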

Thanks,

Paolo


* Re: [virtio-dev] packed ring layout proposal v2
  2017-02-08 17:41 ` [virtio-dev] " Paolo Bonzini
@ 2017-02-08 19:59   ` Michael S. Tsirkin
       [not found]   ` <20170208214435-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-02-08 19:59 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: virtio-dev, virtualization

On Wed, Feb 08, 2017 at 06:41:40PM +0100, Paolo Bonzini wrote:
> 
> 
> On 08/02/2017 04:20, Michael S. Tsirkin wrote:
> > * Scatter/gather support
> > 
> > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:
> > 
> > /* This marks a buffer as continuing via the next field. */
> > #define VRING_DESC_F_NEXT       1
> > 
> > Unlike virtio 1.0, all descriptors must have distinct ID values.
> > 
> > Also unlike virtio 1.0, use of this flag will be an optional feature
> > (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.
> 
> I would still prefer that we had  _either_ single-direct or
> multiple-indirect descriptors, i.e. no VRING_DESC_F_NEXT.  I can propose
> my idea for this in a separate message.

All it costs us spec-wise is a single bit :) 

The cost of indirect is an extra cache miss.

We couldn't decide what's better for everyone in 1.0 days and I doubt
we'll be able to now, but yes, benchmarking is needed to make
sure it's required. Very easy to remove or not to use/support in
drivers/devices though.

> > * Batching descriptors:
> > 
> > virtio 1.0 allows passing a batch of descriptors in both directions, by
> > incrementing the used/avail index by values > 1.  We can support this by
> > chaining a list of descriptors through a bit in the flags field.
> > To allow use together with s/g, a different bit will be used.
> > 
> > #define VRING_DESC_F_BATCH_NEXT 0x0010
> > 
> > Batching works for both driver and device descriptors.
> 
> I'm still not sure how this would be useful.


So this is used at least by virtio-net mergeable buffers to combine
many buffers into a single packet.

Similarly, on transmit Linux sometimes supplies packets in batches
(XMIT_MORE flag). If the other side processes them, it seems nice to tell
it: there's more to come soon, so if you see this it is wise to poll now.

That's why I kind of felt it's better as a standard bit.

>  It cannot be mandatory to
> set the bit, I think, because you don't know when the host/guest is
> going to read descriptors.  So both host and guest always have to look
> ahead one element in any case.

Right but the point is what to do if you find nothing there?
If you saw VRING_DESC_F_BATCH_NEXT it's a hint that
you should poll, there's more to come soon.
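
As a rough illustration of that hint, the sketch below shows one way a consumer might act on it. Only the VRING_DESC_F_BATCH_NEXT value comes from the proposal; the enum, the function names and the poll-versus-wait policy are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VRING_DESC_F_BATCH_NEXT 0x0010 /* value from the proposal */

/* Hypothetical consumer policy for an empty ring. */
enum empty_action {
	EMPTY_POLL, /* more descriptors were hinted: keep polling briefly */
	EMPTY_WAIT, /* no hint: re-enable notifications and wait */
};

/* last_flags: flags of the most recently consumed descriptor. */
static enum empty_action on_ring_empty(uint16_t last_flags)
{
	return (last_flags & VRING_DESC_F_BATCH_NEXT) ? EMPTY_POLL : EMPTY_WAIT;
}

/* Length of the first batch among `avail` visible descriptors: every
 * member but the last carries BATCH_NEXT. Returns 0 if the terminating
 * descriptor has not shown up yet. */
static size_t batch_len(const uint16_t *flags, size_t avail)
{
	assert(flags != NULL || avail == 0);
	for (size_t i = 0; i < avail; i++) {
		if (!(flags[i] & VRING_DESC_F_BATCH_NEXT))
			return i + 1; /* batch complete */
	}
	return 0; /* batch incomplete so far */
}
```

The consumer still has to look ahead one element either way; the bit only changes what it does when that element is not there yet.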

> > * Non power-of-2 ring sizes
> > 
> > As the ring simply wraps around, there's no reason to
> > require ring size to be power of two.
> > It can be made a separate feature though.
> 
> Power of 2 ring sizes are required in order to ignore the high bits of
> the indices.  With non-power-of-2 sizes you are forced to keep the
> indices less than the ring size.

Right. So

	if (unlikely(++idx >= size))
		idx = 0;

OTOH ring size that's twice larger than necessary
because of power of two requirements wastes cache.

>  Alternatively you can do this:
> 
> > * Event index would be in the range 0 to 2 * Queue Size
> > (to detect wrap arounds) and wrap to 0 after that.
> > 
> > The assumption is that each side maintains an internal
> > descriptor counter 0 to 2 * Queue Size that wraps to 0.
> > In that case, interrupt triggers when counter reaches
> > the given value.
> 
> but it seems more complicated than just forcing power-of-2 and ignoring
> the high bits.
> 
> Thanks,
> 
> Paolo

Absolutely, power of 2 lets you save a branch.
At this stage I'm just recording all the ideas
and then as a next step we can micro-benchmark prototypes
and compare.

-- 
MST


* Re: [virtio-dev] packed ring layout proposal v2
       [not found]   ` <20170208214435-mutt-send-email-mst@kernel.org>
@ 2017-02-09 15:48     ` Paolo Bonzini
  2017-02-09 16:11       ` Cornelia Huck
                         ` (3 more replies)
  0 siblings, 4 replies; 92+ messages in thread
From: Paolo Bonzini @ 2017-02-09 15:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization



On 08/02/2017 20:59, Michael S. Tsirkin wrote:
> We couldn't decide what's better for everyone in 1.0 days and I doubt
> we'll be able to now, but yes, benchmarking is needed to make
> sure it's required. Very easy to remove or not to use/support in
> drivers/devices though.

Fair enough, but of course then we must specify that devices MUST
support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers
SHOULD support both (or use neither).

The drivers' part adds to the implementation burden, which is why I
wanted to remove it.  Alternatively we can say that indirect is
mandatory for both devices and drivers (and save a feature bit), while
VRING_DESC_F_NEXT is optional.

>>> * Non power-of-2 ring sizes
>>>
>>> As the ring simply wraps around, there's no reason to
>>> require ring size to be power of two.
>>> It can be made a separate feature though.
>>
>> Power of 2 ring sizes are required in order to ignore the high bits of
>> the indices.  With non-power-of-2 sizes you are forced to keep the
>> indices less than the ring size.
> 
> Right. So
> 
> 	if (unlikely(++idx >= size))
> 		idx = 0;
> 
> OTOH ring size that's twice larger than necessary
> because of power of two requirements wastes cache.

I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
the complication and the gratuitous difference with 1.0.

If batching is mostly advisory (with exceptions such as mrg-rxbuf) I
don't have any problem with it.

Paolo


* Re: [virtio-dev] packed ring layout proposal v2
  2017-02-09 15:48     ` Paolo Bonzini
@ 2017-02-09 16:11       ` Cornelia Huck
  2017-02-09 18:24       ` Michael S. Tsirkin
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 92+ messages in thread
From: Cornelia Huck @ 2017-02-09 16:11 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: virtio-dev, virtualization, Michael S. Tsirkin

On Thu, 9 Feb 2017 16:48:53 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 08/02/2017 20:59, Michael S. Tsirkin wrote:
> > We couldn't decide what's better for everyone in 1.0 days and I doubt
> > we'll be able to now, but yes, benchmarking is needed to make
> > sure it's required. Very easy to remove or not to use/support in
> > drivers/devices though.
> 
> Fair enough, but of course then we must specify that devices MUST
> support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers
> SHOULD support both (or use neither).
> 
> The drivers' part adds to the implementation burden, which is why I
> wanted to remove it.  Alternatively we can say that indirect is
> mandatory for both devices and drivers (and save a feature bit), while
> VRING_DESC_F_NEXT is optional.

I think this (INDIRECT mandatory, NEXT optional) makes sense. But we
really need some benchmarking.

> 
> >>> * Non power-of-2 ring sizes
> >>>
> >>> As the ring simply wraps around, there's no reason to
> >>> require ring size to be power of two.
> >>> It can be made a separate feature though.
> >>
> >> Power of 2 ring sizes are required in order to ignore the high bits of
> >> the indices.  With non-power-of-2 sizes you are forced to keep the
> >> indices less than the ring size.
> > 
> > Right. So
> > 
> > 	if (unlikely(++idx >= size))
> > 		idx = 0;
> > 
> > OTOH ring size that's twice larger than necessary
> > because of power of two requirements wastes cache.
> 
> I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> the complication and the gratuitous difference with 1.0.

I agree. I don't think dropping the power of 2 requirement buys us so
much that it makes up for the added complexity.


* Re: packed ring layout proposal v2
  2017-02-08 13:37 ` packed ring layout proposal v2 Christian Borntraeger
@ 2017-02-09 17:43   ` Michael S. Tsirkin
       [not found]   ` <20170209181955-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-02-09 17:43 UTC (permalink / raw)
  To: Christian Borntraeger; +Cc: virtio-dev, virtualization

On Wed, Feb 08, 2017 at 02:37:57PM +0100, Christian Borntraeger wrote:
> On 02/08/2017 04:20 AM, Michael S. Tsirkin wrote:
> [...]
> > * Prototype
> > 
> > A partial prototype can be found under
> > tools/virtio/ringtest/ring.c
> > 
> > Test run:
> > [mst@tuck ringtest]$ time ./ring 
> > real    0m0.399s
> > user    0m0.791s
> > sys     0m0.000s
> > [mst@tuck ringtest]$ time ./virtio_ring_0_9
> > real    0m0.503s
> > user    0m0.999s
> > sys     0m0.000s
> 
> I see similar improvements on s390, so I think this would be a very nice 
> improvement.
> 
> [...] 
> > Note: should this proposal be accepted and approved, one or more
> >       claims disclosed to the TC admin and listed on the Virtio TC
> >       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
> >       might become Essential Claims.
> > 
> 
> I have trouble parsing that legal stuff. Do I read that right, that
> these claims can be implemented as part of a virtio implementation without
> any worries (e.g. non open source HW implementation or non open source
> hypervisor)?

By that legal stuff do you mean the IPR or the Note?

Not representing Red Hat here, and definitely not legal advice, below is
just my personal understanding of the IPR requirements.

Virtio TC operates under the Non-Assertion Mode of the OASIS IPR Policy:
https://www.oasis-open.org/policies-guidelines/ipr#Non-Assertion-Mode

As far as I can tell, the hardware and software question is covered
by that policy, since it states:

	Covered Product - includes only those specific portions of a
	product (hardware, software or combinations thereof)

Also as far as I can tell IPR does not mention source at all, so I think
virtio IPR would apply to open and closed source software equally.

The Note is included to satisfy the disclosure requirements.

Does this answer the question?

-- 
MST


* Re: [virtio-dev] packed ring layout proposal v2
  2017-02-09 15:48     ` Paolo Bonzini
  2017-02-09 16:11       ` Cornelia Huck
@ 2017-02-09 18:24       ` Michael S. Tsirkin
       [not found]       ` <20170209202203-mutt-send-email-mst@kernel.org>
       [not found]       ` <20170209171105.075a9d9c.cornelia.huck@de.ibm.com>
  3 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-02-09 18:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: virtio-dev, virtualization

On Thu, Feb 09, 2017 at 04:48:53PM +0100, Paolo Bonzini wrote:
> 
> 
> On 08/02/2017 20:59, Michael S. Tsirkin wrote:
> > We couldn't decide what's better for everyone in 1.0 days and I doubt
> > we'll be able to now, but yes, benchmarking is needed to make
> > sure it's required. Very easy to remove or not to use/support in
> > drivers/devices though.
> 
> Fair enough, but of course then we must specify that devices MUST
> support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers
> SHOULD support both (or use neither).
> 
> The drivers' part adds to the implementation burden, which is why I
> wanted to remove it.  Alternatively we can say that indirect is
> mandatory for both devices and drivers (and save a feature bit), while
> VRING_DESC_F_NEXT is optional.
> 
> >>> * Non power-of-2 ring sizes
> >>>
> >>> As the ring simply wraps around, there's no reason to
> >>> require ring size to be power of two.
> >>> It can be made a separate feature though.
> >>
> >> Power of 2 ring sizes are required in order to ignore the high bits of
> >> the indices.  With non-power-of-2 sizes you are forced to keep the
> >> indices less than the ring size.
> > 
> > Right. So
> > 
> > 	if (unlikely(++idx >= size))
> > 		idx = 0;
> > 
> > OTOH ring size that's twice larger than necessary
> > because of power of two requirements wastes cache.
> 
> I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> the complication and the gratuitous difference with 1.0.

I thought originally there's a reason 1.0 rings had to be powers of two
but now I don't see why. OK, we can make it a feature flag later if we
want to.

> If batching is mostly advisory (with exceptions such as mrg-rxbuf) I
> don't have any problem with it.
> 
> Paolo


* Re: packed ring layout proposal v2
       [not found]   ` <20170209181955-mutt-send-email-mst@kernel.org>
@ 2017-02-09 18:27     ` Christian Borntraeger
  0 siblings, 0 replies; 92+ messages in thread
From: Christian Borntraeger @ 2017-02-09 18:27 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

On 02/09/2017 06:43 PM, Michael S. Tsirkin wrote:
[...]
>> [...] 
>>> Note: should this proposal be accepted and approved, one or more
>>>       claims disclosed to the TC admin and listed on the Virtio TC
>>>       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
>>>       might become Essential Claims.
>>>
>>
>> I have trouble parsing that legal stuff. Do I read that right, that
>> these claims can be implemented as part of a virtio implementation without
>> any worries (e.g. non open source HW implementation or non open source
>> hypervisor)?
> 
> By that legal stuff do you mean the IPR or the Note?

Both in combination.
 
> Not representing Red Hat here, and definitely not legal advice, below is
> just my personal understanding of the IPR requirements.
> 
> Virtio TC operates under the Non-Assertion Mode of the OASIS IPR Policy:
> https://www.oasis-open.org/policies-guidelines/ipr#Non-Assertion-Mode
> 
> As far as I can tell, the hardware and software question is covered
> by that policy, since it states:
> 
> 	Covered Product - includes only those specific portions of a
> 	product (hardware, software or combinations thereof)
> 
> Also as far as I can tell IPR does not mention source at all, so I think
> virtio IPR would apply to open and closed source software equally.
> 
> The Note is included to satisfy the disclosure requirements.
> 
> Does this answer the question?

Yes, thanks.


* Re: [virtio-dev] packed ring layout proposal v2
       [not found]       ` <20170209202203-mutt-send-email-mst@kernel.org>
@ 2017-02-10 11:32         ` Paolo Bonzini
       [not found]         ` <c229269b-1702-ffec-62e8-002c7c142904@redhat.com>
  1 sibling, 0 replies; 92+ messages in thread
From: Paolo Bonzini @ 2017-02-10 11:32 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization



On 09/02/2017 19:24, Michael S. Tsirkin wrote:
>> I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
>> the complication and the gratuitous difference with 1.0.
>
> I thought originally there's a reason 1.0 rings had to be powers of two
> but now I don't see why. OK, we can make it a feature flag later if we
> want to.

The reason is that it allows indices to be free running.  This is an 
example of QEMU code that requires that:

            nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
            /* Check it isn't doing strange things with descriptor numbers. */
            if (nheads > vdev->vq[i].vring.num) {
                error_report("VQ %d size 0x%x Guest index 0x%x "
                             "inconsistent with Host index 0x%x: delta 0x%x",
                             i, vdev->vq[i].vring.num,
                             vring_avail_idx(&vdev->vq[i]),
                             vdev->vq[i].last_avail_idx, nheads);
                return -1;
            }

Paolo


* Re: [virtio-dev] packed ring layout proposal v2
       [not found]         ` <c229269b-1702-ffec-62e8-002c7c142904@redhat.com>
@ 2017-02-10 15:20           ` Michael S. Tsirkin
  2017-02-10 16:17             ` Paolo Bonzini
  0 siblings, 1 reply; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-02-10 15:20 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: virtio-dev, virtualization

On Fri, Feb 10, 2017 at 12:32:49PM +0100, Paolo Bonzini wrote:
> 
> 
> On 09/02/2017 19:24, Michael S. Tsirkin wrote:
> >> I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> >> the complication and the gratuitous difference with 1.0.
> >
> > I thought originally there's a reason 1.0 rings had to be powers of two
> > but now I don't see why. OK, we can make it a feature flag later if we
> > want to.
> 
> The reason is that it allows indices to be free running.

Well what I meant is that with qsize not a power of 2 you can still do
this but have to do everything mod N*qsize as opposed to mod 2^16. So
you need a branch there - easiest to do if you do signed math.

int nheads = avail - last_avail;
/*Check and handle index wrap-around */
if (unlikely(nheads < 0)) {
	nheads += N_qsize;
}

if (nheads < 0 || nheads > vdev->vq[i].vring.num) {
	error_report(...);
	return -1;
}

This can only catch bugs if N > 1
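
Spelled out as a self-contained sketch (helper names and the explicit N parameter are illustrative assumptions, with indices kept in [0, N * qsize)):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Advance an index that lives in [0, n_qsize), n_qsize = N * qsize. */
static inline uint32_t idx_advance(uint32_t idx, uint32_t n_qsize)
{
	assert(idx < n_qsize);
	return (idx + 1 == n_qsize) ? 0 : idx + 1;
}

/* Signed distance from last_avail to avail, corrected for wrap-around. */
static int outstanding_heads(uint32_t avail, uint32_t last_avail, uint32_t n_qsize)
{
	int nheads = (int)avail - (int)last_avail;

	if (nheads < 0)
		nheads += (int)n_qsize;
	return nheads;
}

/* Sanity check: at most qsize heads may be outstanding. */
static bool heads_are_sane(uint32_t avail, uint32_t last_avail,
			   uint32_t qsize, uint32_t n)
{
	return outstanding_heads(avail, last_avail, n * qsize) <= (int)qsize;
}
```

With N == 1 the corrected distance always lands in [0, qsize), so the check can never fail; with N == 2 and qsize == 3, avail = 4 against last_avail = 0 is flagged.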

>  This is an 
> example of QEMU code that requires that:
> 
>             nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
>             /* Check it isn't doing strange things with descriptor numbers. */
>             if (nheads > vdev->vq[i].vring.num) {
>                 error_report("VQ %d size 0x%x Guest index 0x%x "
>                              "inconsistent with Host index 0x%x: delta 0x%x",
>                              i, vdev->vq[i].vring.num,
>                              vring_avail_idx(&vdev->vq[i]),
>                              vdev->vq[i].last_avail_idx, nheads);
>                 return -1;
>             }
> 
> Paolo

Same thing here, this never triggers if vring.num == 2^16

-- 
MST


* Re: [virtio-dev] packed ring layout proposal v2
  2017-02-10 15:20           ` Michael S. Tsirkin
@ 2017-02-10 16:17             ` Paolo Bonzini
  0 siblings, 0 replies; 92+ messages in thread
From: Paolo Bonzini @ 2017-02-10 16:17 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization



----- Original Message -----
> From: "Michael S. Tsirkin" <mst@redhat.com>
> To: "Paolo Bonzini" <pbonzini@redhat.com>
> Cc: virtio-dev@lists.oasis-open.org, virtualization@lists.linux-foundation.org
> Sent: Friday, February 10, 2017 4:20:17 PM
> Subject: Re: [virtio-dev] packed ring layout proposal v2
> 
> On Fri, Feb 10, 2017 at 12:32:49PM +0100, Paolo Bonzini wrote:
> > 
> > 
> > On 09/02/2017 19:24, Michael S. Tsirkin wrote:
> > >> I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> > >> the complication and the gratuitous difference with 1.0.
> > >
> > > I thought originally there's a reason 1.0 rings had to be powers of two
> > > but now I don't see why. OK, we can make it a feature flag later if we
> > > want to.
> > 
> > The reason is that it allows indices to be free running.
> 
> Well what I meant is that with qsize not a power of 2 you can still do
> this but have to do everything mod N*qsize as opposed to mod 2^16. So
> you need a branch there - easiest to do if you do signed math.
> 
> int nheads = avail - last_avail;
> /*Check and handle index wrap-around */
> if (unlikely(nheads < 0)) {
> 	nheads += N_qsize;
> }
> 
> if (nheads < 0 || nheads > vdev->vq[i].vring.num) {
> 	error_report(...);
> 	return -1;
> }
> 
> This can only catch bugs if N > 1

Agreed.

> > This is an example of QEMU code that requires that:
> 
> Same thing here, this never triggers if vring.num == 2^16

Free running indices require the counter range to be bigger than the
size of the vring (otherwise you confuse an empty vring with a full
one), so the maximum size of the vring in virtio <=1.0 is 2^15.
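
A toy model makes the empty/full ambiguity easy to see. It uses a deliberately tiny counter range instead of 2^16, and the function name is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Entries in flight, given producer and consumer indices that are free
 * running modulo counter_range (2^16 in virtio 1.0). */
static uint32_t in_flight(uint32_t producer, uint32_t consumer,
			  uint32_t counter_range)
{
	assert(producer < counter_range && consumer < counter_range);
	return (producer + counter_range - consumer) % counter_range;
}
```

Take a ring of 8 entries that has just been filled from empty. If the counter range is also 8, the producer index has wrapped back onto the consumer index and in_flight() reports 0, the same as an empty ring. If the counter range is 16, it reports 8 and the two states stay distinguishable.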

Paolo


* packed ring layout proposal - todo list
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
  2017-02-08 13:37 ` packed ring layout proposal v2 Christian Borntraeger
  2017-02-08 17:41 ` [virtio-dev] " Paolo Bonzini
@ 2017-02-22  4:27 ` Michael S. Tsirkin
       [not found] ` <20170222054336-mutt-send-email-mst@kernel.org>
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-02-22  4:27 UTC (permalink / raw)
  To: virtio-dev; +Cc: virtualization

Here is an attempt to summarise the list of things
we need to do with respect to this proposal.


Stage 1: update prototype and finalize proposal

At this stage we need to update the prototype under
tools/virtio/ringtest/ring.c to make it match
latest proposal, adding in various options under discussion.
Then we will measure performance. While more ideas are welcome
they won't be useful without the ability to test!

Tasks:

- update tools/virtio/ringtest/ring.c to proposal, adding
  options as listed below.
- compare performance with and without indirect buffers
  issue: how to estimate cost of memory allocations?
- batching descriptors - is it worth it?
- three ways to suppress interrupts/exits
  which works better?
  issue: how to estimate cost of interrupts/exits?


More ideas:

- current tests all indicate cache synchronizations due to r/w descriptors
  cost as much as read-only/write-only descriptors,
  and there are less of these synchronizations.
  Disagree? Write a patch and benchmark.
- some devices could use smaller descriptors. For example,
  if you don't need length, id (e.g. using in-order, fixed length) or flags, you can
  use a single 64 bit pointer as a descriptor. This can sometimes work
  e.g. for networking rx rings.
  Not clear whether this gains/costs us anything. Disagree? Write a patch and benchmark.
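
To put numbers on the smaller-descriptor idea, here is a sketch comparing the 16-byte layout from the v2 proposal with a hypothetical pointer-only descriptor. The compact struct, the helper and the 64-byte cache line are assumptions for illustration; whether the denser layout helps in practice is exactly what would need benchmarking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* v2 proposal layout, with fixed-width types standing in for the
 * __le64/__le32/__le16 annotations. */
struct desc {
	uint64_t addr;
	uint32_t len;
	uint16_t index;
	uint16_t flags;
};

/* Hypothetical compact form for in-order, fixed-length buffers. */
struct desc_compact {
	uint64_t addr;
};

#define CACHE_LINE_BYTES 64 /* assumption: common cache line size */

static_assert(sizeof(struct desc) == 16, "full descriptor is 16 bytes");
static_assert(sizeof(struct desc_compact) == 8, "compact descriptor is 8 bytes");

/* How many descriptors of a given size share one cache line. */
static inline size_t descs_per_line(size_t desc_size)
{
	assert(desc_size != 0);
	return CACHE_LINE_BYTES / desc_size;
}
```

Under these assumptions four full descriptors or eight compact ones fit in a cache line.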

Stage 2: prototype guest/host drivers

At this stage we need real guest and host drivers
to be able to test real life performance.
I suggest dpdk drivers + minimal hack in qemu to
pass features around.

Tasks:

- implement vhost-user support in dpdk
- implement virtio support in dpdk
- implement minimal stub in qemu
- test performance. Possibly revisit questions from stage 1
  if any are left open


Stage 3: complete guest/host drivers

At this stage we need to add linux support in virtio and vhost,
and complete qemu support.

Tasks:

- implement vhost support in kernel
- implement virtio support in kernel
- complete virtio support in qemu
- test performance. Possibly revisit questions from stage 2
  if any are left open



Stage X: Finalizing the spec

When are we ready to put out a spec draft? Surely not before Stage 1 is
done. Surely no later than stage 3.  A driver could be the wish of some
party to productize an implementation.

Interested? Join TC and start discussion on timing and which
features should be included.


-- 
MST


* RE: [virtio-dev] packed ring layout proposal - todo list
       [not found] ` <20170222054336-mutt-send-email-mst@kernel.org>
@ 2017-02-22  9:19   ` Gray, Mark D
       [not found]   ` <738D45BC1F695740A983F43CFE1B7EA94E93CA7E@IRSMSX108.ger.corp.intel.com>
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 92+ messages in thread
From: Gray, Mark D @ 2017-02-22  9:19 UTC (permalink / raw)
  To: Michael S. Tsirkin, virtio-dev; +Cc: virtualization

> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-
> open.org] On Behalf Of Michael S. Tsirkin
> 
> Here is an attempt to summarise the list of things we need to do with respect
> to this proposal.

Thanks for outlining this, it makes it a lot clearer.

> 
> 
> Stage 1: update prototype and finalize proposal
> 
> At this stage we need to update the prototype under
> tools/virtio/ringtest/ring.c to make it match latest proposal, adding in various

Would a DPDK based prototype be good as well or would you rather
everything be prototyped here?

> options under discussion.
> Then we will measure performance. While more ideas are welcome they
> won't be useful without the ability to test!
> 
> Tasks:
> 
> - update tools/virtio/ringtest/ring.c to proposal, adding
>   options as listed below.
> - compare performance with and without indirect buffers
>   issue: how to estimate cost of memory allocations?
> - batching descriptors - is it worth it?
> - three ways to suppress interrupts/exits
>   which works better?
>   issue: how to estimate cost of interrupts/exits?
> 
> 

We need to ensure we have a proposal that is suitable for
implementation in hardware.

> More ideas:
> 
> - current tests all indicate cache synchronizations due to r/w descriptors
>   cost as much as read-only/write-only descriptors,
>   and there are less of these synchronizations.
>   Disagree? Write a patch and benchmark.
> - some devices could use smaller descriptors. For example,
>   if you don't need length, id (e.g. using in-order, fixed length) or flags, you
> can
>   use a single 64 bit pointer as a descriptor. This can sometimes work
>   e.g. for networking rx rings.
>   Not clear whether this gains/costs us anything. Disagree? Write a patch and
> benchmark.
> 
> Stage 2: prototype guest/host drivers
> 
> At this stage we need real guest and host drivers to be able to test real life
> performance.
> I suggest dpdk drivers + minimal hack in qemu to pass features around.
> 
> Tasks:
> 
> - implement vhost-user support in dpdk
> - implement virtio support in dpdk
> - implement minimal stub in qemu
> - test performance. Possibly revisit questions from stage 1
>   if any are left open
> 
> 
> Stage 3: complete guest/host drivers
> 
> At this stage we need to add linux support in virtio and vhost, and complete
> qemu support.
> 
> Tasks:
> 
> - implement vhost support in kernel
> - implement virtio support in kernel
> - complete virtio support in qemu
> - test performance. Possibly revisit questions from stage 2
>   if any are left open
> 
> 
> 
> Stage X: Finalizing the spec
> 
> When are we ready to put out a spec draft? Surely not before Stage 1 is
> done. Surely no later than stage 3.  A driver could be the wish of some party to
> productize an implementation.
> 
> Interested? Join TC and start discussion on timing and which features should
> be included.
> 
> 
> --
> MST
> 


* RE: [virtio-dev] packed ring layout proposal v2
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
                   ` (3 preceding siblings ...)
       [not found] ` <20170222054336-mutt-send-email-mst@kernel.org>
@ 2017-02-22 14:46 ` Chien, Roger S
  2017-02-28  5:02 ` Yuanhan Liu
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 92+ messages in thread
From: Chien, Roger S @ 2017-02-22 14:46 UTC (permalink / raw)
  To: virtio-dev; +Cc: virtualization

Here is some feedback on the virtio v1.1 proposal from an FPGA implementation viewpoint.

1) As Gary presented, depending on flags to determine ownership of descriptors is not efficient. On the QPI FPGA platform it may also lead to a race condition, because the current design only supports whole cache line reads/writes, with no partial byte enables. Even with partial byte enables, it will cause the CPU/FPGA to invalidate the whole cache line even when you just toggle one bit (for RX, returning the descriptor to SW...).

2) S/G is a nightmare for HW, because the next descriptor can only be located once you have the current one. So the HW is unable to parallelize descriptor fetching and buffer fetching. An indirect table is better (compared to an indirect linked list). But it would be even better if we could put chained descriptors inside the original descriptor table and use a flag to indicate that their buffers belong to one large piece (though this may conflict with the flexibility to increase/decrease descriptors at run time).

3) Better to keep power-of-2 ring sizes from a HW viewpoint, but it is not so painful if we need to support non-power-of-2 ring sizes (single AND mask versus comparator).
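
To make point 3 concrete, here is a minimal sketch of the two wrap strategies. The function names are hypothetical and not taken from any virtio code; the first variant assumes size is a power of 2, and both assume idx < size on entry.

```c
#include <assert.h>
#include <stdint.h>

/* Power-of-2 ring: the wrap is a single AND with (size - 1). */
static inline uint16_t next_idx_pow2(uint16_t idx, uint16_t size)
{
        return (uint16_t)((idx + 1) & (size - 1));
}

/* Arbitrary ring size: the wrap needs a comparator. */
static inline uint16_t next_idx_any(uint16_t idx, uint16_t size)
{
        idx++;
        return idx >= size ? 0 : idx;
}
```

In hardware the first form is just wiring, while the second costs a comparator per index update, which is the trade-off described above.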


-----Original Message-----
From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org] On Behalf Of Michael S. Tsirkin
Sent: Wednesday, February 8, 2017 11:20 AM
To: virtio-dev@lists.oasis-open.org
Cc: virtualization@lists.linux-foundation.org
Subject: [virtio-dev] packed ring layout proposal v2

This is an update from v1 version.
Changes:
- update event suppression mechanism
- separate options for indirect and direct s/g
- lots of new features

---

Performance analysis of this is in my kvm forum 2016 presentation.
The idea is to have a r/w descriptor in a ring structure, replacing the used and available ring, index and descriptor buffer.

* Descriptor ring:

Guest adds descriptors with unique index values and DESC_HW set in flags.
Host overwrites used descriptors with correct len, index, and DESC_HW clear.  Flags are always set/cleared last.

#define DESC_HW 0x0080

struct desc {
        __le64 addr;
        __le32 len;
        __le16 index;
        __le16 flags;
};

When DESC_HW is set, descriptor belongs to device. When it is clear, it belongs to the driver.
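
A minimal sketch of this ownership handoff, under stated assumptions: C11 atomics stand in for whatever barriers a real driver or device would use, the fields are host-endian rather than __le types, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DESC_HW 0x0080

/* Host-endian stand-in for struct desc; a real ring would use __le types. */
struct desc_sketch {
        uint64_t addr;
        uint32_t len;
        uint16_t index;
        _Atomic uint16_t flags;
};

/* Driver side: fill the payload fields, then hand over by writing flags last. */
static void driver_publish(struct desc_sketch *d, uint64_t addr,
                           uint32_t len, uint16_t index)
{
        d->addr = addr;
        d->len = len;
        d->index = index;
        atomic_store_explicit(&d->flags, DESC_HW, memory_order_release);
}

/* Device side: returns 1 and copies the fields out if DESC_HW is set. */
static int device_try_consume(struct desc_sketch *d, uint64_t *addr,
                              uint32_t *len, uint16_t *index)
{
        uint16_t flags = atomic_load_explicit(&d->flags, memory_order_acquire);

        if (!(flags & DESC_HW))
                return 0;
        *addr = d->addr;
        *len = d->len;
        *index = d->index;
        return 1;
}

/* Device side: write back the used length, then clear DESC_HW last. */
static void device_complete(struct desc_sketch *d, uint32_t used_len)
{
        uint16_t flags = atomic_load_explicit(&d->flags, memory_order_relaxed);

        d->len = used_len;
        atomic_store_explicit(&d->flags, (uint16_t)(flags & ~DESC_HW),
                              memory_order_release);
}
```

The release store on flags is what makes "flags are always set/cleared last" visible to the other side: a reader that observes the new flags value with an acquire load also observes the other fields.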

We can use 1 bit to set direction
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE      2

* Scatter/gather support

We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:

/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT       1

Unlike virtio 1.0, all descriptors must have distinct ID values.

Also unlike virtio 1.0, use of this flag will be an optional feature (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.

* Indirect buffers

Can be marked like in virtio 1.0:

/* This means the buffer contains a table of buffer descriptors. */
#define VRING_DESC_F_INDIRECT   4

Unlike virtio 1.0, this is a table, not a list:
struct indirect_descriptor_table {
        /* The actual descriptors (16 bytes each) */
        struct virtq_desc desc[len / 16];
};

The first descriptor is located at start of the indirect descriptor table, additional indirect descriptors come immediately afterwards.
DESC_F_WRITE is the only valid flag for descriptors in the indirect table. Others should be set to 0 and are ignored.  id is also set to 0 and should be ignored.

virtio 1.0 seems to allow a s/g entry followed by an indirect descriptor. This does not seem useful, so we do not allow that anymore.

This support would be an optional feature, same as in virtio 1.0

* Batching descriptors:

virtio 1.0 allows passing a batch of descriptors in both directions, by incrementing the used/avail index by values > 1.  We can support this by chaining a list of descriptors through a bit in the flags field.
To allow use together with s/g, a different bit will be used.

#define VRING_DESC_F_BATCH_NEXT 0x0010

Batching works for both driver and device descriptors.



* Processing descriptors in and out of order

Device processing all descriptors in order can simply flip the DESC_HW bit as it is done with descriptors.

Device can write descriptors out in order as they are used, overwriting descriptors that are there.

Device must not use a descriptor until DESC_HW is set.
It is only required to look at the first descriptor submitted.

Driver must not overwrite a descriptor until DESC_HW is clear.
It is only required to look at the first descriptor submitted.

* Device specific descriptor flags
We have a lot of unused space in the descriptor.  This can be put to good use by reserving some flag bits for device use.
For example, network device can set a bit to request that header in the descriptor is suppressed (in case it's all 0s anyway). This reduces cache utilization.

Note: this feature can be supported in virtio 1.0 as well, as we have unused bits in both descriptor and used ring there.

* Descriptor length in device descriptors

virtio 1.0 places strict requirements on descriptor length. For example it must be 0 in used ring of TX VQ of a network device since nothing is written.  In practice guests do not seem to use this, so we can simplify devices a bit by removing this requirement - if length is unused it should be ignored by driver.

Some devices use identically-sized buffers in all descriptors.
Ignoring length for driver descriptors there could be an option too.

* Writing at an offset

Some devices might want to write into some descriptors at an offset, the offset would be in config space, and a descriptor flag could indicate this:

#define VRING_DESC_F_OFFSET 0x0020

How exactly to use the offset would be device specific, for example it can be used to align beginning of packet in the 1st buffer for mergeable buffers to cache line boundary while also aligning rest of buffers.

* Non power-of-2 ring sizes

As the ring simply wraps around, there's no reason to require ring size to be power of two.
It can be made a separate feature though.


* Interrupt/event suppression

virtio 1.0 has two mechanisms for suppression but only one can be used at a time. We pack them together in a structure - one for interrupts, one for notifications:

struct event {
	__le16 idx;
	__le16 flags;
};

Both fields would be optional, with a feature bit:
VIRTIO_F_EVENT_IDX
VIRTIO_F_EVENT_FLAGS

* Flags can be used like in virtio 1.0, by storing a special value there:

#define VRING_F_EVENT_ENABLE  0x0

#define VRING_F_EVENT_DISABLE 0x1

* Event index would be in the range 0 to 2 * Queue Size (to detect wrap arounds) and wrap to 0 after that.

The assumption is that each side maintains an internal descriptor counter 0 to 2 * Queue Size that wraps to 0.
In that case, interrupt triggers when counter reaches the given value.

* If both features are negotiated, a special flags value can be used to switch to event idx:

#define VRING_F_EVENT_IDX     0x2


* Prototype

A partial prototype can be found under
tools/virtio/ringtest/ring.c

Test run:
[mst@tuck ringtest]$ time ./ring 
real    0m0.399s
user    0m0.791s
sys     0m0.000s
[mst@tuck ringtest]$ time ./virtio_ring_0_9
real    0m0.503s
user    0m0.999s
sys     0m0.000s

It is planned to update it to this interface. Future changes and enhancements can (and should) be tested against this prototype.

* Feature sets
In particular are we going overboard with feature bits?  It becomes hard to support all combinations in drivers and devices.  Maybe we should document reasonable feature sets to be supported for each device.

* Known issues/ideas

This layout is optimized for host/guest communication, in a sense even more aggressively than virtio 1.0.
It might be suboptimal for PCI hardware implementations.
However, one notes that current virtio pci drivers are unlikely to work with PCI hardware implementations anyway (e.g. due to use of SMP barriers for ordering).

Suggestions for improving this are welcome but need to be tested to make sure our main use case does not regress.  Of course some improvements might be made optional, but if we add too many of these the driver becomes unmanageable.

---

Note: should this proposal be accepted and approved, one or more
      claims disclosed to the TC admin and listed on the Virtio TC
      IPR page https://www.oasis-open.org/committees/virtio/ipr.php
      might become Essential Claims.

--
MST


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal - todo list
       [not found]   ` <738D45BC1F695740A983F43CFE1B7EA94E93CA7E@IRSMSX108.ger.corp.intel.com>
@ 2017-02-22 15:13     ` Michael S. Tsirkin
  0 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-02-22 15:13 UTC (permalink / raw)
  To: Gray, Mark D; +Cc: virtio-dev, virtualization

On Wed, Feb 22, 2017 at 09:19:41AM +0000, Gray, Mark D wrote:
> > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-
> > open.org] On Behalf Of Michael S. Tsirkin
> > 
> > Here is an attempt to summarise the list of things we need to do with respect
> > to this proposal.
> 
> Thanks for outlining this, it makes it a lot clearer.
> 
> > 
> > 
> > Stage 1: update prototype and finalize proposal
> > 
> > At this stage we need to update the prototype under
> > tools/virtio/ringtest/ring.c to make it match latest proposal, adding in various
> 
> Would a DPDK based prototype be good as well or would you rather
> everything be prototyped here?

DPDK is good.

You would then
- code up current proposal
- code up an alternative you are suggesting
- compare

tools/virtio/ringtest/ is only there to make prototyping
easier. I like keeping it alive for this purpose but it
is not a must.



> > options under discussion.
> > Then we will measure performance. While more ideas are welcome they
> > won't be useful without ability to test!
> > 
> > Tasks:
> > 
> > - update tools/virtio/ringtest/ring.c to proposal, adding
> >   options as listed below.
> > - compare performance with and without indirect buffers
> >   issue: how to estimate cost of memory allocations?
> > - batching descriptors - is it worth it?
> > - three ways to suppress interrupts/exits
> >   which works better?
> >   issue: how to estimate cost of interrupts/exits?
> > 
> > 
> 
> We need to ensure we have a proposal that is suitable for
> implementation in hardware.

Right. Obviously the system has to work well as a whole though, e.g. if
you save PCI Express bandwidth speeding up hardware but pay with cache
misses slowing down software, it's not clear you have won anything at all.


> > More ideas:
> > 
> > - current tests all indicate cache synchronizations due to r/w descriptors
> >   cost as much as read-only/write-only descriptors,
> >   and there are fewer of these synchronizations.
> >   Disagree? Write a patch and benchmark.
> > - some devices could use smaller descriptors. For example,
> >   if you don't need length, id (e.g. using in-order, fixed length) or flags, you
> > can
> >   use a single 64 bit pointer as a descriptor. This can sometimes work
> >   e.g. for networking rx rings.
> >   Not clear whether this gains/costs us anything. Disagree? Write a patch and
> > benchmark.
> > 
> > Stage 2: prototype guest/host drivers
> > 
> > At this stage we need real guest and host drivers to be able to test real life
> > performance.
> > I suggest dpdk drivers + minimal hack in qemu to pass features around.
> > 
> > Tasks:
> > 
> > - implement vhost-user support in dpdk
> > - implement virtio support in dpdk
> > - implement minimal stub in qemu
> > - test performance. Possibly revisit questions from stage 2
> >   if any are left open
> > 
> > 
> > Stage 3: complete guest/host drivers
> > 
> > At this stage we need to add linux support in virtio and vhost, and complete
> > qemu support.
> > 
> > Tasks:
> > 
> > - implement vhost support in kernel
> > - implement virtio support in kernel
> > - complete virtio support in qemu
> > - test performance. Possibly revisit questions from stage 2
> >   if any are left open
> > 
> > 
> > 
> > Stage X: Finalizing the spec
> > 
> > When are we ready to put out a spec draft? Surely not before Stage 1 is
> > done. Surely no later than stage 3.  A driver could be the wish of some party to
> > productize an implementation.
> > 
> > Interested? Join TC and start discussion on timing and which features should
> > be included.
> > 
> > 
> > --
> > MST
> > 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v2
       [not found]       ` <20170209171105.075a9d9c.cornelia.huck@de.ibm.com>
@ 2017-02-22 16:43         ` Michael S. Tsirkin
       [not found]         ` <20170222181333-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-02-22 16:43 UTC (permalink / raw)
  To: Cornelia Huck; +Cc: Paolo Bonzini, virtualization, virtio-dev

On Thu, Feb 09, 2017 at 05:11:05PM +0100, Cornelia Huck wrote:
> > >>> * Non power-of-2 ring sizes
> > >>>
> > >>> As the ring simply wraps around, there's no reason to
> > >>> require ring size to be power of two.
> > >>> It can be made a separate feature though.
> > >>
> > >> Power of 2 ring sizes are required in order to ignore the high bits of
> > >> the indices.  With non-power-of-2 sizes you are forced to keep the
> > >> indices less than the ring size.
> > > 
> > > Right. So
> > > 
> > > 	if (unlikely(++idx >= size))
> > > 		idx = 0;
> > > 
> > > OTOH ring size that's twice larger than necessary
> > > because of power of two requirements wastes cache.
> > 
> > I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> > the complication and the gratuitous difference with 1.0.
> 
> I agree. I don't think dropping the power of 2 requirement buys us so
> much that it makes up for the added complexity.

I recalled why I came up with this. The issue is cache associativity.
Recall that besides the ring we have event suppression
structures - if we are lucky and things run at the same speed
everything can work by polling keeping events disabled, then
event suppression structures are never written to, they are read-only.

However, if ring and event suppression share a cache set, ring accesses
have a chance to push the event suppression out of cache, causing
misses on read.

This can happen if they are at the same offset in the set.
E.g. with L1 cache 4Kbyte sets are common, so same offset
within a 4K page.

We can fix this by making event suppression adjacent in memory, e.g.:


[interrupt suppress]
[descriptor ring]
[kick suppress]

If this whole structure fits in a single set, ring accesses will
not push kick or interrupt suppress out of cache.
Specific layout can be left for drivers, but as set size is
a power of two this might require a non-power of two ring size.

I conclude that this is an optimization that needs to be
benchmarked.
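
As a worked example of why this can lead to a non-power-of-two ring size, here is a small sketch. The numbers are assumptions for illustration only: a 4096-byte span mapping to distinct sets, 64-byte cache lines, each suppression structure padded to its own cache line, and 16-byte descriptors as in the proposal.

```c
#include <assert.h>
#include <stddef.h>

#define SET_BYTES   4096u  /* assumed span that maps to distinct cache sets */
#define CACHE_LINE  64u    /* assumed cache line size */
#define DESC_BYTES  16u    /* sizeof(struct desc) in the proposal */

/*
 * Largest descriptor count such that
 *   [interrupt suppress][descriptor ring][kick suppress]
 * fits in SET_BYTES, with each suppression structure on its own line.
 */
static size_t max_ring_entries(void)
{
        return (SET_BYTES - 2 * CACHE_LINE) / DESC_BYTES;
}
```

With these assumptions the ring holds (4096 - 128) / 16 = 248 descriptors, which is not a power of two; rounding down to 128 would leave almost half of the span unused.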

I also note that the generic description does not have to force
powers of two *even if devices actually require it*.
I would be inclined to word the text in a way that makes
relaxing the restriction easier.

For example, we can say "free running 16 bit index" and this forces a
power of two, but we can also say "free running index wrapping to 0
after (N*queue-size - 1) with N chosen such that the value fits in 16
bit" and this is exactly the same if queue size is a power of 2.

So we can add text saying "ring size MUST be a power of two"
and later it will be easy to relax just by adding a feature bit.
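
A minimal sketch of the "free running index wrapping to 0 after (N*queue-size - 1)" wording, with hypothetical helper names and assuming queue_size is nonzero. For a power-of-two queue size it behaves exactly like a free running 16 bit index.

```c
#include <assert.h>
#include <stdint.h>

/*
 * N * queue_size for the largest N such that (N * queue_size - 1) fits
 * in 16 bits. The counter wraps to 0 after reaching this value minus 1.
 */
static uint32_t wrap_limit(uint16_t queue_size)
{
        return (65536u / queue_size) * queue_size;
}

/* Advance the counter by one descriptor, wrapping at wrap_limit(). */
static uint16_t counter_next(uint16_t counter, uint16_t queue_size)
{
        uint32_t next = (uint32_t)counter + 1;

        return next == wrap_limit(queue_size) ? 0 : (uint16_t)next;
}

/* Ring slot that a counter value refers to. */
static uint16_t counter_to_slot(uint16_t counter, uint16_t queue_size)
{
        return (uint16_t)(counter % queue_size);
}
```

Because the limit is a multiple of the queue size, the slot sequence stays continuous across the wrap: the last counter value maps to slot queue_size - 1 and the next one maps to slot 0.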



-- 
MST

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal - todo list
       [not found] ` <20170222054336-mutt-send-email-mst@kernel.org>
  2017-02-22  9:19   ` [virtio-dev] " Gray, Mark D
       [not found]   ` <738D45BC1F695740A983F43CFE1B7EA94E93CA7E@IRSMSX108.ger.corp.intel.com>
@ 2017-02-28  4:29   ` Yuanhan Liu
       [not found]   ` <20170228042943.GH18844@yliu-dev.sh.intel.com>
  3 siblings, 0 replies; 92+ messages in thread
From: Yuanhan Liu @ 2017-02-28  4:29 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

Hi Michael,

Again, as usual, sorry for being late :/

On Wed, Feb 22, 2017 at 06:27:11AM +0200, Michael S. Tsirkin wrote:
> Stage 2: prototype guest/host drivers
> 
> At this stage we need real guest and host drivers
> to be able to test real life performance.
> I suggest dpdk drivers + minimal hack in qemu to
> pass features around.
> 

I already did that last November. I made a very rough (and hacky)
version (Tx path only) in one day while accompanying my wife in the
hospital.

If anyone is interested, I could share the code soon. I could
even clean up the code a bit if necessary.


> Tasks:
> 
> - implement vhost-user support in dpdk
> - implement virtio support in dpdk
> - implement minimal stub in qemu

I didn't hack QEMU; instead, I hacked DPDK virtio-user, yet
another virtio-net emulation. It's simpler and quicker for me.

And here is my plan on virtio 1.1:

- Look deeper inside the virtio net performance issues (WIP)

  It's basically a job of digging deeper into the DPDK vhost/virtio
  code: how exactly the cache behaves while Tx/Rx'ing pkts, what can
  be optimized by the implementation, and what could be improved
  with the help of a spec extension.

  Please note that I often get interrupted on this task: it hasn't
  gone as smoothly as I expected.


- Try to accelerate vhost/virtio with vector instructions.

  Something I will look into when the above item is done. Currently,
  I can think of two items that may help the vector implementation:

  * what kind of vring and desc layout could make the vector
    implementation easier.

  * what kind of hint we need from the virtio spec for (dynamically)
    enabling the vector path.

  Besides that, I don't have much of a clue yet.

	--yliu

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v2
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
                   ` (4 preceding siblings ...)
  2017-02-22 14:46 ` [virtio-dev] packed ring layout proposal v2 Chien, Roger S
@ 2017-02-28  5:02 ` Yuanhan Liu
  2017-02-28  5:47 ` [RFC] packed (virtio-net) headers Yuanhan Liu
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 92+ messages in thread
From: Yuanhan Liu @ 2017-02-28  5:02 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

On Wed, Feb 08, 2017 at 05:20:14AM +0200, Michael S. Tsirkin wrote:
> This is an update from v1 version.
> Changes:
> - update event suppression mechanism
> - separate options for indirect and direct s/g
> - lots of new features
> 
> ---
> 
> Performance analysis of this is in my kvm forum 2016 presentation.
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.
> 
> * Descriptor ring:
> 
> Guest adds descriptors with unique index values and DESC_HW set in flags.
> Host overwrites used descriptors with correct len, index, and DESC_HW
> clear.  Flags are always set/cleared last.

May I know what's the index intended for? Back referencing a pkt buffer?

> #define DESC_HW 0x0080
> 
> struct desc {
>         __le64 addr;
>         __le32 len;
>         __le16 index;
>         __le16 flags;
> };

...
> * Batching descriptors:
> 
> virtio 1.0 allows passing a batch of descriptors in both directions, by
> incrementing the used/avail index by values > 1.  We can support this by
> chaining a list of descriptors through a bit the flags field.
> To allow use together with s/g, a different bit will be used.
> 
> #define VRING_DESC_F_BATCH_NEXT 0x0010
> 
> Batching works for both driver and device descriptors.

Honestly, I don't think it's an efficient way for batching. Neither the
DESC_F_NEXT nor the BATCH_NEXT tells us how many new descs are there
for processing: it's just a hint that there is one more. And you have
to follow the link one by one.

I was thinking, maybe we could sub-divide some fields of desc, thus
we could introduce more. For example, 32 bits "len" may be too much,
at least for virtio-net: the biggest pkt size is 64K, which is 16 bits.
If we use 16 bits for len, we could use the extra 16 bits for telling
the batch number.

	struct desc {
	        __le64 addr;
	        __le16 len;
	        __le16 batch;
	        __le16 index;
	        __le16 flags;
	};

Only the head desc needs to set the batch count and the DESC_HW flag.
With the two used together, we don't have to set/clear the DESC_HW flag
for each desc on driver/device.

If 64K is too small for other devices (say, BLK), we may re-allocate
the bits to something like "24 : 8", with 24 for len (16M at most)
and 8 for batch.  OTOH, 8 bits of batch should be plenty, given
that the default vring size is 256 and a typical burst size is normally
way less than that.

That said, if it's "16 : 16" and we use only 8 bits for batch, we
could still have another 8 bits for anything else, say the number of
descs for a single pkt. With that, the num_buffers of the mergeable Rx
header could be replaced.  More importantly, we could save a cache
line write if no offloads are supported in the mergeable Rx path.

	struct desc {
	        __le64 addr;
	        __le16 len;
	        u8     batch;
	        u8     num_buffers;
	        __le16 index;
	        __le16 flags;
	};
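
A rough sketch of how a consumer could use such a layout, with several assumptions: fields are host-endian stand-ins, the batch count is taken to include the head desc, and batch_available() is a hypothetical helper, not an existing API.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_HW 0x0080

/* Host-endian stand-in for the 16/8/8 split above. */
struct desc_batch {
        uint64_t addr;
        uint16_t len;
        uint8_t  batch;        /* descs in this batch, set on the head only */
        uint8_t  num_buffers;  /* descs used by a single pkt */
        uint16_t index;
        uint16_t flags;
};

/* The split must not grow the descriptor beyond 16 bytes. */
_Static_assert(sizeof(struct desc_batch) == 16,
               "descriptor must stay 16 bytes");

/* Returns how many descs starting at head are ready, or 0 if none. */
static unsigned int batch_available(const struct desc_batch *ring,
                                    uint16_t head)
{
        if (!(ring[head].flags & DESC_HW))
                return 0;
        return ring[head].batch;
}
```

The point is that one read of the head desc tells the consumer how many descs to process, instead of following NEXT bits one by one.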

> * Device specific descriptor flags
> We have a lot of unused space in the descriptor.  This can be put to
> good use by reserving some flag bits for device use.
> For example, network device can set a bit to request
> that header in the descriptor is suppressed
> (in case it's all 0s anyway). This reduces cache utilization.

Good proposal. But I think it could only work with Tx, where the driver
knows whether the headers are all 0s while filling the desc. But for
Rx, where the driver has to pre-fill the desc, it doesn't know. Thus
it still requires filling a header desc for storing it.

Maybe we could introduce a global feature? When that's negotiated, no
header desc needs to be filled and processed? I'm thinking this could
also help the vector implementation I mentioned in another email.

> Note: this feature can be supported in virtio 1.0 as well,
> as we have unused bits in both descriptor and used ring there.

Agreed.

	--yliu

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [RFC] packed (virtio-net) headers
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
                   ` (5 preceding siblings ...)
  2017-02-28  5:02 ` Yuanhan Liu
@ 2017-02-28  5:47 ` Yuanhan Liu
       [not found] ` <20170228050218.GI18844@yliu-dev.sh.intel.com>
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 92+ messages in thread
From: Yuanhan Liu @ 2017-02-28  5:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

Hi,

For virtio-net, we use 2 descs for representing a (small) pkt: one for
the virtio-net header and another one for the pkt data. This has two issues:

- the desc buffer for storing pkt data is halved

  Though we later introduced 2 more options to overcome this: ANY_LAYOUT
  and indirect desc. The indirect desc has another issue: it introduces
  an extra cache line visit.

- virtio-net header could be scattered

  Assume the ANY_LAYOUT case, where the header is prepended before
  each mbuf (or skb in kernel). In DPDK, a burst receive in vhost pmd
  means 32 different cache visits for the virtio header.

  For the legacy layout and indirect desc, the cache issue could somehow
  be diminished a bit: we could arrange the virtio headers in the same
  memory block and let each header desc point to the right one.

  But it's still not good enough: the virtio-net headers aren't accessed
  in batch: they have to be accessed one by one (by reading the desc).
  That said, it's still not that good for cache utilization.


So I'm proposing packed headers:

- put all virtio-net headers in one memory block.

  A burst of 32 pkts then needs to access only (32 * 12) / 64 = 6 cache
  lines, whereas before it could be 32 cache lines.

- introduce a header desc to reference the above memory block.

  desc->addr = starting addr of the net headers mem block
  desc->len  = size of all virtio-net headers (burst size * header size)

Thus, for a burst size of 32, we only need 33 descs: one for the headers
and the others for storing the corresponding pkt data. More importantly,
we could use the "len" field for computing the batch size. We could then
load the virtio-net headers at once; we could also prefetch all the descs
at once.
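
The arithmetic above can be sketched as follows, assuming a 12-byte virtio-net header, 64-byte cache lines, and a header block that starts on a cache line boundary. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NET_HDR_BYTES 12u   /* header size used in the estimate above */
#define CACHE_LINE    64u   /* assumed cache line size */

/* Batch size recovered from the header desc's len field. */
static uint32_t batch_from_hdr_len(uint32_t hdr_desc_len)
{
        return hdr_desc_len / NET_HDR_BYTES;
}

/* Cache lines covered by a contiguous, line-aligned block of headers. */
static uint32_t hdr_cache_lines(uint32_t burst)
{
        return (burst * NET_HDR_BYTES + CACHE_LINE - 1) / CACHE_LINE;
}

/* Descs needed: one for the header block plus one per pkt. */
static uint32_t descs_needed(uint32_t burst)
{
        return burst + 1;
}
```

For a burst of 32 this gives a header desc len of 384 bytes, 6 cache lines for all headers, and 33 descs in total.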

Note it could also be adapted to virtio 0.95 and 1.0. I also made a simple
prototype with DPDK (yet again, Tx path only), and I saw an impressive
boost (about 30%) in a micro benchmark.

I think such a proposal should also help other devices, if they
also have a small header for each piece of data.

Thoughts?

	--yliu

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v2
       [not found] ` <20170228050218.GI18844@yliu-dev.sh.intel.com>
@ 2017-03-01  1:02   ` Michael S. Tsirkin
       [not found]   ` <20170301024951-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-03-01  1:02 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: virtio-dev, virtualization

On Tue, Feb 28, 2017 at 01:02:18PM +0800, Yuanhan Liu wrote:
> On Wed, Feb 08, 2017 at 05:20:14AM +0200, Michael S. Tsirkin wrote:
> > This is an update from v1 version.
> > Changes:
> > - update event suppression mechanism
> > - separate options for indirect and direct s/g
> > - lots of new features
> > 
> > ---
> > 
> > Performance analysis of this is in my kvm forum 2016 presentation.
> > The idea is to have a r/w descriptor in a ring structure,
> > replacing the used and available ring, index and descriptor
> > buffer.
> > 
> > * Descriptor ring:
> > 
> > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > Host overwrites used descriptors with correct len, index, and DESC_HW
> > clear.  Flags are always set/cleared last.
> 
> May I know what's the index intended for? Back referencing a pkt buffer?

Yes and generally identify which descriptor completed. Recall that
even though vhost net completes in order at the moment,
virtio rings serve devices (e.g. storage) that complete out of order.

> > #define DESC_HW 0x0080
> > 
> > struct desc {
> >         __le64 addr;
> >         __le32 len;
> >         __le16 index;
> >         __le16 flags;
> > };
> 
> ...
> > * Batching descriptors:
> > 
> > virtio 1.0 allows passing a batch of descriptors in both directions, by
> > incrementing the used/avail index by values > 1.  We can support this by
> > chaining a list of descriptors through a bit the flags field.
> > To allow use together with s/g, a different bit will be used.
> > 
> > #define VRING_DESC_F_BATCH_NEXT 0x0010
> > 
> > Batching works for both driver and device descriptors.
> 
> Honestly, I don't think it's an efficient way for batching. Neither the
> DESC_F_NEXT nor the BATCH_NEXT tells us how many new descs are there
> for processing: it's just a hint that there is one more. And you have
> to follow the link one by one.
> 
> I was thinking, maybe we could sub-divide some fields of desc, thus
> we could introduce more. For example, 32 bits "len" may be too much,
> at least for virtio-net: the biggest pkt size is 64K, which is 16 bits.
> If we use 16 bits for len, we could use the extra 16 bits for telling
> the batch number.
> 
> 	struct desc {
> 	        __le64 addr;
> 	        __le16 len;
> 	        __le16 batch;
> 	        __le16 index;
> 	        __le16 flags;
> 	};
> 
> Only the head desc needs to set the batch count and the DESC_HW flag.
> With the two used together, we don't have to set/clear the DESC_HW flag
> for each desc on driver/device.
> If 64K is too small for other devices (say, BLK), we may re-allocate
> the bits to something like "24 : 8", with 24 for len (16M at most)
> and 8 for batch.  OTOH, 8 bits of batch should be plenty, given
> that the default vring size is 256 and a typical burst size is normally
> way less than that.
> 
> That said, if it's "16 : 16" and we use only 8 bits for batch, we
> could still have another 8 bits for anything else, say the number of
> descs for a single pkt. With that, the num_buffers of the mergeable Rx
> header could be replaced.  More importantly, we could save a cache
> line write if no offloads are supported in the mergeable Rx path.

Why do you bother with mergeable Rx without offloads?  Make each buffer
MTU sized and it'll fit without merging.  Linux used not to, it only
started doing this to save memory aggressively. I don't think
DPDK cares about this.

> 
> 	struct desc {
> 	        __le64 addr;
> 	        __le16 len;
> 	        u8     batch;
> 	        u8     num_buffers;
> 	        __le16 index;
> 	        __le16 flags;
> 	};

Interesting. How about a benchmark for these ideas?

> 
> > * Device specific descriptor flags
> > We have a lot of unused space in the descriptor.  This can be put to
> > good use by reserving some flag bits for device use.
> > For example, network device can set a bit to request
> > that header in the descriptor is suppressed
> > (in case it's all 0s anyway). This reduces cache utilization.
> 
> Good proposal. But I think it could only work with Tx, where the driver
> knows whether the headers are all 0s while filling the desc. But for
> Rx, where the driver has to pre-fill the desc, it doesn't know. Thus
> it still requires filling a header desc for storing it.

I don't see why - I don't think drivers pre-fill buffers in header for RX
right now. Why would they start?

> Maybe we could introduce a global feature? When that's negotiated, no
> header desc needs to be filled and processed? I'm thinking this could
> also help the vector implementation I mentioned in another email.

It's possible of course - it's a subset of what I said.
Though it makes it less useful in the general case.

> > Note: this feature can be supported in virtio 1.0 as well,
> > as we have unused bits in both descriptor and used ring there.
> 
> Agreed.
> 
> 	--yliu

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal - todo list
       [not found]   ` <20170228042943.GH18844@yliu-dev.sh.intel.com>
@ 2017-03-01  1:07     ` Michael S. Tsirkin
  2017-03-08  7:09       ` Yuanhan Liu
       [not found]       ` <20170308070948.GC18844@yliu-dev.sh.intel.com>
  0 siblings, 2 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-03-01  1:07 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: virtio-dev, virtualization

On Tue, Feb 28, 2017 at 12:29:43PM +0800, Yuanhan Liu wrote:
> Hi Michael,
> 
> Again, as usual, sorry for being late :/
> 
> On Wed, Feb 22, 2017 at 06:27:11AM +0200, Michael S. Tsirkin wrote:
> > Stage 2: prototype guest/host drivers
> > 
> > At this stage we need real guest and host drivers
> > to be able to test real life performance.
> > I suggest dpdk drivers + minimal hack in qemu to
> > pass features around.
> > 
> 
> I already did that last November. I made a very rough (and hacky)
> version (Tx path only) in one day while accompanying my wife in the
> hospital.

Any performance data?

> If anyone is interested, I could share the code soon. I could
> even clean up the code a bit if necessary.

Especially if you don't have time to benchmark, I think sharing it
might help.

> > Tasks:
> > 
> > - implement vhost-user support in dpdk
> > - implement virtio support in dpdk
> > - implement minimal stub in qemu
> 
> I didn't hack QEMU; instead, I hacked DPDK virtio-user, yet
> another virtio-net emulation. It's simpler and quicker for me.

Sure, I merely meant a stub for negotiating new feature bits
between host and guest. But I guess an environment set up
the same way in host and guest would serve too.

> And here is my plan on virtio 1.1:
> 
> - Look deeper inside the virtio net performance issues (WIP)
> 
>   It's basically a matter of digging deeper into the DPDK vhost/virtio
>   code: how exactly the cache behaves during Tx/Rx of pkts, what can
>   be optimized in the implementation, and what could be improved with
>   the help of a spec extension.
> 
>   Please note that I often get interrupted on this task: it hasn't
>   gone as smoothly as I expected.
> 
> 
> - Try to accelerate vhost/virtio with vector instructions.

That's interesting. What kind of optimizations would you say
vector instructions enable, and why?


>   Something I will look into when the above item is done. Currently,
>   I can think of two items that may help the vector implementation:
> 
>   * what kind of vring and desc layout could make the vector
>     implementation easier.
> 
>   * what kind of hint we need from virtio spec for (dynamically)
>     enabling the vector path.
> 
>   Besides that, I don't have much of a clue yet.
> 
> 	--yliu

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC] packed (virtio-net) headers
       [not found] ` <20170228054719.GJ18844@yliu-dev.sh.intel.com>
@ 2017-03-01  1:28   ` Michael S. Tsirkin
  0 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-03-01  1:28 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: virtio-dev, virtualization

On Tue, Feb 28, 2017 at 01:47:19PM +0800, Yuanhan Liu wrote:
> Hi,
> 
> For virtio-net, we use 2 descs to represent a (small) pkt: one for the
> virtio-net header and another one for the pkt data. This has two issues:
> 
> - the desc buffer for storing pkt data is halved
> 
>   Though we later introduced 2 more options to overcome this: ANY_LAYOUT
>   and indirect desc. The indirect desc has another issue: it introduces
>   an extra cache line visit.

So if we don't care about this part, we could maybe just add
a descriptor flag that puts the whole header in the descriptor.

> - virtio-net header could be scattered
> 
>   Assume the ANY_LAYOUT case, where the header is prepended before
>   each mbuf (or skb in kernel). In DPDK, a burst receive in vhost pmd
>   means 32 different cache visits for the virtio header.
> 
>   For the legacy layout and indirect desc, the cache issue could be
>   diminished somewhat: we could arrange the virtio headers in the same
>   memory block and let each header desc point to the right one.
> 
>   But it's still not good enough: the virtio-net headers aren't accessed
>   in batch: they have to be accessed one by one (by reading the desc).
>   So it's still not that good for cache utilization.
> 
> 
> And I'm proposing a packed header:
> 
> - put all virtio-net headers in one memory block.
> 
>   A burst of 32 pkts needs to access only (32 * 12) / 64 = 6 cache lines,
>   where before it could be 32 cache lines.
> 
> - introduce a header desc to reference the above memory block.
> 
>   desc->addr = starting addr of the net headers mem block
>   desc->len  = size of all virtio-net headers (burst size * header size)
> 
> Thus, in a burst size of 32, we only need 33 descs: one for the headers and
> the others for storing the corresponding pkt data. More importantly, we could
> use the "len" field for computing the batch size. We could then load the
> virtio net headers at once; we could also prefetch all the descs at once.
> 
> Note it could also be adapted to virtio 0.95 and 1.0. I also made a simple
> prototype with DPDK (yet again, it's Tx path only), and I saw an impressive
> boost (about 30%) in a micro benchmark.
> 
> I think such a proposal should help other devices, too, if they
> also have a small header for each piece of data.
> 
> Thoughts?
> 
> 	--yliu

That's great. An alternative might be to add an array of headers parallel
to the array of descriptors and indexed by head. A bit in the descriptor
would then be enough to mark such a header as valid.

It's also an alternative way to pass in batches for virtio 1.1.

This has the advantage that it helps non-batched workloads as well
if enough packets end up in the ring, but maybe the CPU
predicts this access pattern less well. Worth benchmarking?

I hope above thoughts are helpful, but -

code walks - if you can show real gains I'd be inclined
to say let's go with it. You don't necessarily need to implement and
benchmark all possible ideas others can come up with :)
(though that's just me not speaking for anyone else -
 we'll have to put it on the TC ballot of course)
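
For reference, the arithmetic behind the packed-header estimate quoted above can be sketched in C. This is only an illustration, not code from the prototype: the helper names are made up here, the 12-byte header and 64-byte cache line are the figures used in the estimate, and the header block is assumed to start on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64u /* assumption: 64-byte cache lines */
#define NET_HDR_SIZE    12u /* header size used in the estimate above */

/* Cache lines touched when `burst` headers sit back to back in one
 * cache-line-aligned block (hypothetical helper). */
static size_t packed_hdr_cache_lines(size_t burst)
{
    return (burst * NET_HDR_SIZE + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE;
}

/* Batch size recovered from the header descriptor's len field,
 * where len = burst size * header size. */
static size_t batch_from_len(size_t hdr_desc_len)
{
    return hdr_desc_len / NET_HDR_SIZE;
}

/* Descriptors needed: one for the header block plus one per packet. */
static size_t descs_needed(size_t burst)
{
    return burst + 1;
}
```

With a burst of 32 this gives 6 cache lines for the headers and 33 descriptors, matching the numbers in the proposal.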

--  
MST

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v2
       [not found]   ` <20170301024951-mutt-send-email-mst@kernel.org>
@ 2017-03-01  3:57     ` Yuanhan Liu
       [not found]     ` <20170301035715.GP18844@yliu-dev.sh.intel.com>
  1 sibling, 0 replies; 92+ messages in thread
From: Yuanhan Liu @ 2017-03-01  3:57 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

On Wed, Mar 01, 2017 at 03:02:29AM +0200, Michael S. Tsirkin wrote:
> > > * Descriptor ring:
> > > 
> > > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > > Host overwrites used descriptors with correct len, index, and DESC_HW
> > > clear.  Flags are always set/cleared last.
> > 
> > May I know what's the index intended for? Back referencing a pkt buffer?
> 
> Yes and generally identify which descriptor completed. Recall that
> even though vhost net completes in order at the moment,
> virtio rings serve devices (e.g. storage) that complete out of order.

I see, and thanks.
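
To check my understanding of the DESC_HW handoff, here is a rough single-process sketch. Everything in it is an assumption for illustration: the bit value of DESC_HW, the field widths, and the use of C11 atomics to model "flags are always set/cleared last". It ignores endianness and models guest and host as two functions in one address space.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag bit: the descriptor is owned by the host while set. */
#define DESC_HW 0x0080u

/* Illustrative descriptor; field widths are an assumption. */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t index;
    _Atomic uint16_t flags;
};

/* Guest side: fill the payload fields first, publish flags last. */
static void guest_add(struct desc *d, uint64_t addr, uint32_t len,
                      uint16_t index)
{
    d->addr = addr;
    d->len = len;
    d->index = index;
    /* Release store: the fields above become visible before DESC_HW. */
    atomic_store_explicit(&d->flags, DESC_HW, memory_order_release);
}

/* Host side: returns 1 and marks the descriptor used if it was
 * available, 0 if the guest still owns it. */
static int host_complete(struct desc *d, uint32_t written_len)
{
    uint16_t flags = atomic_load_explicit(&d->flags, memory_order_acquire);

    if (!(flags & DESC_HW))
        return 0;
    d->len = written_len; /* index stays as the guest set it */
    /* Clear DESC_HW last so the guest sees a complete used descriptor. */
    atomic_store_explicit(&d->flags, (uint16_t)(flags & ~DESC_HW),
                          memory_order_release);
    return 1;
}
```

The index field is what lets the guest map a completed descriptor back to its buffer when completions arrive out of order.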

> > That said, if it's "16: 16" and if we use only 8 bits for batch, we
> > could still have another 8 bit for anything else, say the number of
> > desc for a single pkt. With that, the num_buffers of mergeable Rx
> > header could be replaced.  More importantly, we could reduce a cache
> > line write if non offload are supported in mergeable Rx path. 
> 
> Why do you bother with mergeable Rx without offloads?

Oh, my bad. I actually meant "without offloads __being used__". Just
assume the pkt size is 64B and no offloads are used. When mergeable
Rx is negotiated (which is the default case), num_buffers has to be
set to 1. That means an extra cache line write. For the non-mergeable
case, the cache line write could be avoided by a trick like the one
in the following patch:

   http://dpdk.org/browse/dpdk/commit/?id=c9ea670c1dc7e3f111d8139f915082b60c9c1ffe

It basically tries to avoid writing 0 if the value is already 0:
the case when no offloads are used.

So while writing this email, I was thinking maybe we could skip setting
num_buffers to 1 when there is only one desc (let it be 0 and let
num_buffers == 0 imply num_buffers = 1). I wasn't sure we can
do that now, so I checked the DPDK code and found it's okay.

    seg_num = header->num_buffers;

    if (seg_num == 0)
            seg_num = 1;


I then also checked linux kernel code, and found it's not okay as
the code depends on the value being set correctly:

==> u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
    struct page *page = virt_to_head_page(buf);
    int offset = buf - page_address(page);
    unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));

    struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
                                           truesize);
    struct sk_buff *curr_skb = head_skb;

    if (unlikely(!curr_skb))
            goto err_skb;
==> while (--num_buf) {

That means, if we want to do that, it needs an extra feature flag
(either a global feature flag or a desc flag), something like
ALLOW_ZERO_NUM_BUFFERS. Or we could even allow it outright in virtio 1.1
(virtio 0.95/1.0 won't benefit from it though).

Does it make sense to you?
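
To make the difference concrete, here is a small standalone sketch (not taken from DPDK or the kernel; the helper names are made up). It shows the "0 means 1" reading, the "skip the store if the value is already there" trick, and what a `while (--num_buf)` loop on a 16-bit counter does when it is handed 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: treat 0 as "one buffer", as the DPDK snippet does. */
static uint16_t effective_num_buffers(uint16_t num_buffers)
{
    return num_buffers == 0 ? 1 : num_buffers;
}

/* Hypothetical helper for the "avoid the write" trick: only store when
 * the value changes, so an already-correct cache line is not dirtied.
 * Returns 1 if a store happened. */
static int store_if_changed(uint16_t *field, uint16_t value)
{
    if (*field == value)
        return 0;
    *field = value;
    return 1;
}

/* Counts the iterations of a `while (--num_buf)` loop on a 16-bit
 * counter, i.e. how many extra buffers such a loop would try to fetch. */
static uint32_t extra_buffers_walked(uint16_t num_buf)
{
    uint32_t n = 0;

    while (--num_buf)
        n++;
    return n;
}
```

With num_buf == 1 the loop body never runs, but with num_buf == 0 the counter wraps and the loop runs 65535 times, which is why a receiver written this way cannot accept 0 without a new flag.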

> Make each buffer
> MTU sized and it'll fit without merging.  Linux used not to, it only
> started doing this to save memory aggressively. I don't think
> DPDK cares about this.
> 
> > 
> > 	struct desc {
> > 	        __le64 addr;
> > 	        __le16 len;
> > 	        __le8  batch;
> > 	        __le8  num_buffers;
> > 	        __le16 index;
> > 	        __le16 flags;
> > 	};
> 
> Interesting. How about a benchmark for these ideas?

Sure, I would like to benchmark it. It won't take me long. But
currently, I am still focusing on analysing the performance behaviour
of virtio 0.95/1.0 (when I get some time), to see what's not good for
performance and what can be improved.

Besides that, as said, I often get interrupted. Moreover, the DPDK v17.05
code freeze deadline is coming. So this is just a reminder that I may not
have time for it soon. Sorry for that.
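
In the meantime, the layout quoted above can at least be sanity-checked at compile time. This is a sketch only: the struct name is made up, fixed-width C types stand in for the `__le*` notation, and all fields are assumed to be little-endian in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout only, mirroring the struct quoted above. */
struct packed_desc {
    uint64_t addr;
    uint16_t len;
    uint8_t  batch;
    uint8_t  num_buffers;
    uint16_t index;
    uint16_t flags;
};

/* The fields pack into 16 bytes with no padding. */
static_assert(sizeof(struct packed_desc) == 16,
              "descriptor should be 16 bytes");
static_assert(offsetof(struct packed_desc, flags) == 14,
              "flags should be the last two bytes");

/* How many such descriptors share one cache line. */
static size_t descs_per_cache_line(size_t cache_line_size)
{
    return cache_line_size / sizeof(struct packed_desc);
}
```

So four descriptors fit in a 64-byte cache line. Note that a 16-bit len caps a single buffer at 65535 bytes in this variant.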

> > > * Device specific descriptor flags
> > > We have a lot of unused space in the descriptor.  This can be put to
> > > good use by reserving some flag bits for device use.
> > > For example, network device can set a bit to request
> > > that header in the descriptor is suppressed
> > > (in case it's all 0s anyway). This reduces cache utilization.
> > 
> > Good proposal. But I think it could only work with Tx, where the driver
> > knows whether the headers are all 0s while filling the desc. But for
> > Rx, whereas the driver has to pre-fill the desc, it doesn't know. Thus
> > it still requires filling an header desc for storing it.
> 
> I don't see why - I don't think drivers pre-fill buffers in header for RX
> right now. Why would they start?

Again, my bad, I meant "prepare", not "pre-fill". Let me try to explain
it again. I'm thinking:

- For Tx, when the header is all 0s, the header could be discarded. We
  could use one desc for transferring a packet (with a NO_HEADER or
  HEADER_ALL_ZERO flag bit set)

- For Rx, the header is filled in on the device (or vhost) side. And the
  driver has to prepare the header desc for each pkt, because the Rx
  driver has no idea whether it will be all 0s.

  That means the header could not be discarded.

If such a global feature is negotiated, we could discard the header
desc as well.

	--yliu

> > Maybe we could introduce a global feature? When that's negotiated, no
> > header desc need filled and processed? I'm thinking this could also
> > help the vector implementation I mentioned in another email.
> 
> It's possible of course - it's a subset of what I said.
> Though it makes it less useful in the general case.
> 
> > > Note: this feature can be supported in virtio 1.0 as well,
> > > as we have unused bits in both descriptor and used ring there.
> > 
> > Agreed.
> > 
> > 	--yliu
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v2
       [not found]     ` <20170301035715.GP18844@yliu-dev.sh.intel.com>
@ 2017-03-01  4:14       ` Michael S. Tsirkin
  2017-03-01  4:57         ` Yuanhan Liu
  0 siblings, 1 reply; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-03-01  4:14 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: virtio-dev, virtualization

On Wed, Mar 01, 2017 at 11:57:15AM +0800, Yuanhan Liu wrote:
> On Wed, Mar 01, 2017 at 03:02:29AM +0200, Michael S. Tsirkin wrote:
> > > > * Descriptor ring:
> > > > 
> > > > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > > > Host overwrites used descriptors with correct len, index, and DESC_HW
> > > > clear.  Flags are always set/cleared last.
> > > 
> > > May I know what's the index intended for? Back referencing a pkt buffer?
> > 
> > Yes and generally identify which descriptor completed. Recall that
> > even though vhost net completes in order at the moment,
> > virtio rings serve devices (e.g. storage) that complete out of order.
> 
> I see, and thanks.
> 
> > > That said, if it's "16: 16" and if we use only 8 bits for batch, we
> > > could still have another 8 bit for anything else, say the number of
> > > desc for a single pkt. With that, the num_buffers of mergeable Rx
> > > header could be replaced.  More importantly, we could reduce a cache
> > > line write if non offload are supported in mergeable Rx path. 
> > 
> > Why do you bother with mergeable Rx without offloads?
> 
> Oh, my bad. I actually meant "without offloads __being used__". Just
> assume the pkt size is 64B and no offloads are used. When mergeable
> Rx is negotiated (which is the default case), num_buffers has to be
> set to 1. That means an extra cache line write. For the non-mergeable
> case, the cache line write could be avoided by a trick like the one
> in the following patch:
> 
>    http://dpdk.org/browse/dpdk/commit/?id=c9ea670c1dc7e3f111d8139f915082b60c9c1ffe
> 
> It basically tries to avoid writing 0 if the value is already 0:
> the case when no offloads are used.
> So while writing this email, I was thinking maybe we could not set
> num_buffers to 1 when there is only one desc (let it be 0 and let
> num_buffers == 0 imply num_buffers = 1). I'm not quite sure we can
> do that now, thus I checked the DPDK code and found it's Okay.
> 
>     seg_num = header->num_buffers;
> 
>     if (seg_num == 0)
>             seg_num = 1;
> 
> 
> I then also checked linux kernel code, and found it's not okay as
> the code depends on the value being set correctly:
> 
> ==> u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
>     struct page *page = virt_to_head_page(buf);
>     int offset = buf - page_address(page);
>     unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
> 
>     struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
>                                            truesize);
>     struct sk_buff *curr_skb = head_skb;
> 
>     if (unlikely(!curr_skb))
>             goto err_skb;
> ==> while (--num_buf) {
> 
> That means, if we want to do that, it needs an extra feature flag
> (either a global feature flag or a desc flag), something like
> ALLOW_ZERO_NUM_BUFFERS. Or we could even allow it outright in virtio 1.1
> (virtio 0.95/1.0 won't benefit from it though).
> 
> Does it make sense to you?

Right and then we could use a descriptor flag "header is all 0s".
For virtio 1.0 we could put these in the used ring instead.

> 
> > Make each buffer
> > MTU sized and it'll fit without merging.  Linux used not to, it only
> > started doing this to save memory aggressively. I don't think
> > DPDK cares about this.
> > 
> > > 
> > > 	struct desc {
> > > 	        __le64 addr;
> > > 	        __le16 len;
> > > 	        __le8  batch;
> > > 	        __le8  num_buffers;
> > > 	        __le16 index;
> > > 	        __le16 flags;
> > > 	};
> > 
> > Interesting. How about a benchmark for these ideas?
> 
> Sure, I would like to benchmark it. It won't take me long. But
> currently, I am still focusing on analysing the performance behaviour
> of virtio 0.95/1.0 (when I get some time), to see what's not good for
> performance and what can be improved.
> 
> Besides that, as said, I often get interrupted. Moreover, the DPDK v17.05
> code freeze deadline is coming. So this is just a reminder that I may not
> have time for it soon. Sorry for that.
> 
> > > > * Device specific descriptor flags
> > > > We have a lot of unused space in the descriptor.  This can be put to
> > > > good use by reserving some flag bits for device use.
> > > > For example, network device can set a bit to request
> > > > that header in the descriptor is suppressed
> > > > (in case it's all 0s anyway). This reduces cache utilization.
> > > 
> > > Good proposal. But I think it could only work with Tx, where the driver
> > > knows whether the headers are all 0s while filling the desc. But for
> > > Rx, whereas the driver has to pre-fill the desc, it doesn't know. Thus
> > > it still requires filling an header desc for storing it.
> > 
> > I don't see why - I don't think drivers pre-fill buffers in header for RX
> > right now. Why would they start?
> 
> Again, my bad, I meant "prepare", not "pre-fill". Let me try to explain
> it again. I'm thinking:
> 
> - For Tx, when the header is all 0s, the header could be discarded. We
>   could use one desc for transferring a packet (with a NO_HEADER or
>   HEADER_ALL_ZERO flag bit set)
> 
> - For Rx, the header is filled in on the device (or vhost) side. And the
>   driver has to prepare the header desc for each pkt, because the Rx
>   driver has no idea whether it will be all 0s.
> 
>   That means the header could not be discarded.
> 
> If such a global feature is negotiated, we could discard the header
> desc as well.
> 
> 	--yliu

Right and again, flags could be added to the used ring to pass extra
info.

> > > Maybe we could introduce a global feature? When that's negotiated, no
> > > header desc need filled and processed? I'm thinking this could also
> > > help the vector implementation I mentioned in another email.
> > 
> > It's possible of course - it's a subset of what I said.
> > Though it makes it less useful in the general case.
> > 
> > > > Note: this feature can be supported in virtio 1.0 as well,
> > > > as we have unused bits in both descriptor and used ring there.
> > > 
> > > Agreed.
> > > 
> > > 	--yliu
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v2
  2017-03-01  4:14       ` Michael S. Tsirkin
@ 2017-03-01  4:57         ` Yuanhan Liu
  0 siblings, 0 replies; 92+ messages in thread
From: Yuanhan Liu @ 2017-03-01  4:57 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

On Wed, Mar 01, 2017 at 06:14:21AM +0200, Michael S. Tsirkin wrote:
> > > > That said, if it's "16: 16" and if we use only 8 bits for batch, we
> > > > could still have another 8 bit for anything else, say the number of
> > > > desc for a single pkt. With that, the num_buffers of mergeable Rx
> > > > header could be replaced.  More importantly, we could reduce a cache
> > > > line write if non offload are supported in mergeable Rx path. 
> > > 
> > > Why do you bother with mergeable Rx without offloads?
> > 
> > Oh, my bad. I actually meant "without offloads __being used__". Just
> > assume the pkt size is 64B and no offloads are used. When mergeable
> > Rx is negotiated (which is the default case), num_buffers has to be
> > set to 1. That means an extra cache line write. For the non-mergeable
> > case, the cache line write could be avoided by a trick like the one
> > in the following patch:
> > 
> >    http://dpdk.org/browse/dpdk/commit/?id=c9ea670c1dc7e3f111d8139f915082b60c9c1ffe
> > 
> > It basically tries to avoid writing 0 if the value is already 0:
> > the case when no offloads are used.
> > So while writing this email, I was thinking maybe we could not set
> > num_buffers to 1 when there is only one desc (let it be 0 and let
> > num_buffers == 0 imply num_buffers = 1). I'm not quite sure we can
> > do that now, thus I checked the DPDK code and found it's Okay.
> > 
> >     seg_num = header->num_buffers;
> > 
> >     if (seg_num == 0)
> >             seg_num = 1;
> > 
> > 
> > I then also checked linux kernel code, and found it's not okay as
> > the code depends on the value being set correctly:
> > 
> > ==> u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
> >     struct page *page = virt_to_head_page(buf);
> >     int offset = buf - page_address(page);
> >     unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
> > 
> >     struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
> >                                            truesize);
> >     struct sk_buff *curr_skb = head_skb;
> > 
> >     if (unlikely(!curr_skb))
> >             goto err_skb;
> > ==> while (--num_buf) {
> > 
> > That means, if we want to do that, it needs an extra feature flag
> > (either a global feature flag or a desc flag), something like
> > ALLOW_ZERO_NUM_BUFFERS. Or we could even allow it outright in virtio 1.1
> > (virtio 0.95/1.0 won't benefit from it though).
> > 
> > Does it make sense to you?
> 
> Right and then we could use a descriptor flag "header is all 0s".
> For virtio 1.0 we could put these in the used ring instead.

Great.

	--yliu

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v2
       [not found]         ` <20170222181333-mutt-send-email-mst@kernel.org>
@ 2017-03-07 15:53           ` Cornelia Huck
  2017-03-07 20:33             ` Michael S. Tsirkin
       [not found]             ` <20170307223057-mutt-send-email-mst@kernel.org>
  0 siblings, 2 replies; 92+ messages in thread
From: Cornelia Huck @ 2017-03-07 15:53 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Paolo Bonzini, virtualization, virtio-dev

On Wed, 22 Feb 2017 18:43:05 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Feb 09, 2017 at 05:11:05PM +0100, Cornelia Huck wrote:
> > > >>> * Non power-of-2 ring sizes
> > > >>>
> > > >>> As the ring simply wraps around, there's no reason to
> > > >>> require ring size to be power of two.
> > > >>> It can be made a separate feature though.
> > > >>
> > > >> Power of 2 ring sizes are required in order to ignore the high bits of
> > > >> the indices.  With non-power-of-2 sizes you are forced to keep the
> > > >> indices less than the ring size.
> > > > 
> > > > Right. So
> > > > 
> > > > 	if (unlikely(idx++ > size))
> > > > 		idx = 0;
> > > > 
> > > > OTOH ring size that's twice larger than necessary
> > > > because of power of two requirements wastes cache.
> > > 
> > > I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> > > the complication and the gratuitous difference with 1.0.
> > 
> > I agree. I don't think dropping the power of 2 requirement buys us so
> > much that it makes up for the added complexity.
> 
> I recalled why I came up with this. The issue is cache associativity.
> Recall that besides the ring we have event suppression
> structures - if we are lucky and things run at the same speed
> everything can work by polling keeping events disabled, then
> event suppression structures are never written to, they are read-only.
> 
> However if ring and event suppression share a cache line ring accesses
> have a chance to push the event suppression out of cache, causing
> misses on read.
> 
> This can happen if they are at the same offset in the set.
> E.g. with L1 cache 4Kbyte sets are common, so same offset
> within a 4K page.
> 
> We can fix this by making event suppression adjacent in memory, e.g.:
> 
> 
> [interrupt suppress]
> [descriptor ring]
> [kick suppress]
> 
> If this whole structure fits in a single set, ring accesses will
> not push kick or interrupt suppress out of cache.
> Specific layout can be left for drivers, but as set size is
> a power of two this might require a non-power of two ring size.
> 
> I conclude that this is an optimization that needs to be
> benchmarked.

This makes sense. But wouldn't the optimum layout depend on the
platform?

> 
> I also note that the generic description does not have to force
> powers of two *even if devices actually require it*.
> I would be inclined to word the text in a way that makes
> relaxing the restriction easier.
> 
> For example, we can say "free running 16 bit index" and this forces a
> power of two, but we can also say "free running index wrapping to 0
> after (N*queue-size - 1) with N chosen such that the value fits in 16
> bit" and this is exactly the same if queue size is a power of 2.
> 
> So we can add text saying "ring size MUST be a power of two"
> and later it will be easy to relax just by adding a feature bit.

A later feature bit sounds good.
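
A small sketch of the "free running index wrapping to 0 after (N*queue-size - 1)" wording, to convince myself it really degenerates to a plain 16-bit counter for power-of-two sizes. The function names are made up for illustration, and queue_size is assumed to be non-zero.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap limit N * queue_size, with N the largest value such that every
 * index (0 .. limit - 1) still fits in 16 bits. queue_size must be != 0. */
static uint32_t index_wrap_limit(uint16_t queue_size)
{
    return (65536u / queue_size) * queue_size;
}

/* Advance the free-running index, wrapping to 0 at the limit. */
static uint16_t index_next(uint16_t idx, uint16_t queue_size)
{
    uint32_t next = (uint32_t)idx + 1;

    if (next >= index_wrap_limit(queue_size))
        next = 0;
    return (uint16_t)next;
}

/* Ring slot addressed by a free-running index. */
static uint16_t ring_slot(uint16_t idx, uint16_t queue_size)
{
    return (uint16_t)(idx % queue_size);
}
```

For queue_size 256 the limit is 65536, so the index wraps exactly where a plain 16-bit counter would. For queue_size 100 the limit is 65500: index 65499 maps to slot 99 and the next index is 0, so slots stay contiguous across the wrap.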

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v2
  2017-03-07 15:53           ` Cornelia Huck
@ 2017-03-07 20:33             ` Michael S. Tsirkin
       [not found]             ` <20170307223057-mutt-send-email-mst@kernel.org>
  1 sibling, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-03-07 20:33 UTC (permalink / raw)
  To: Cornelia Huck; +Cc: Paolo Bonzini, virtualization, virtio-dev

On Tue, Mar 07, 2017 at 04:53:53PM +0100, Cornelia Huck wrote:
> On Wed, 22 Feb 2017 18:43:05 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Feb 09, 2017 at 05:11:05PM +0100, Cornelia Huck wrote:
> > > > >>> * Non power-of-2 ring sizes
> > > > >>>
> > > > >>> As the ring simply wraps around, there's no reason to
> > > > >>> require ring size to be power of two.
> > > > >>> It can be made a separate feature though.
> > > > >>
> > > > >> Power of 2 ring sizes are required in order to ignore the high bits of
> > > > >> the indices.  With non-power-of-2 sizes you are forced to keep the
> > > > >> indices less than the ring size.
> > > > > 
> > > > > Right. So
> > > > > 
> > > > > 	if (unlikely(idx++ > size))
> > > > > 		idx = 0;
> > > > > 
> > > > > OTOH ring size that's twice larger than necessary
> > > > > because of power of two requirements wastes cache.
> > > > 
> > > > I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> > > > the complication and the gratuitous difference with 1.0.
> > > 
> > > I agree. I don't think dropping the power of 2 requirement buys us so
> > > much that it makes up for the added complexity.
> > 
> > I recalled why I came up with this. The issue is cache associativity.
> > Recall that besides the ring we have event suppression
> > structures - if we are lucky and things run at the same speed
> > everything can work by polling keeping events disabled, then
> > event suppression structures are never written to, they are read-only.
> > 
> > However if ring and event suppression share a cache line ring accesses
> > have a chance to push the event suppression out of cache, causing
> > misses on read.
> > 
> > This can happen if they are at the same offset in the set.
> > E.g. with L1 cache 4Kbyte sets are common, so same offset
> > within a 4K page.
> > 
> > We can fix this by making event suppression adjacent in memory, e.g.:
> > 
> > 
> > [interrupt suppress]
> > [descriptor ring]
> > [kick suppress]
> > 
> > If this whole structure fits in a single set, ring accesses will
> > not push kick or interrupt suppress out of cache.
> > Specific layout can be left for drivers, but as set size is
> > a power of two this might require a non-power of two ring size.
> > 
> > I conclude that this is an optimization that needs to be
> > benchmarked.
> 
> This makes sense. But wouldn't the optimum layout depend on the
> platform?

There's generally a tradeoff between performance and portability.
Whether it's worth it would need to be tested.
Further, it might be better to have platform-specific optimization
tied to a given platform rather than a feature bit.

> > 
> > I also note that the generic description does not have to force
> > powers of two *even if devices actually require it*.
> > I would be inclined to word the text in a way that makes
> > relaxing the restriction easier.
> > 
> > For example, we can say "free running 16 bit index" and this forces a
> > power of two, but we can also say "free running index wrapping to 0
> > after (N*queue-size - 1) with N chosen such that the value fits in 16
> > bit" and this is exactly the same if queue size is a power of 2.
> > 
> > So we can add text saying "ring size MUST be a power of two"
> > and later it will be easy to relax just by adding a feature bit.
> 
> A later feature bit sounds good.

No need to delay benchmarking if someone has the time though :)
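
As a starting point for such a benchmark, here is a rough sketch of how the adjacent layout quoted above leads to a non-power-of-two ring size. The 16-byte descriptor size, the helper names, and the idea of padding each suppression structure to a full cache line are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define DESC_SIZE 16u /* assumption: 16-byte descriptors */

/* Largest ring such that
 *   [interrupt suppress][descriptor ring][kick suppress]
 * fits in span_bytes, with each suppression structure taking
 * suppress_bytes. */
static size_t max_ring_entries(size_t span_bytes, size_t suppress_bytes)
{
    if (span_bytes < 2 * suppress_bytes)
        return 0;
    return (span_bytes - 2 * suppress_bytes) / DESC_SIZE;
}

static int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}
```

For a 4K span with each suppression structure padded to 64 bytes this gives 248 descriptors, which is not a power of two; rounding down to 128 would leave almost half of the span unused.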

-- 
MST

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal - todo list
  2017-03-01  1:07     ` Michael S. Tsirkin
@ 2017-03-08  7:09       ` Yuanhan Liu
       [not found]       ` <20170308070948.GC18844@yliu-dev.sh.intel.com>
  1 sibling, 0 replies; 92+ messages in thread
From: Yuanhan Liu @ 2017-03-08  7:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, Maxime Coquelin, virtualization

On Wed, Mar 01, 2017 at 03:07:29AM +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 28, 2017 at 12:29:43PM +0800, Yuanhan Liu wrote:
> > Hi Michael,
> > 
> > Again, as usual, sorry for being late :/
> > 
> > On Wed, Feb 22, 2017 at 06:27:11AM +0200, Michael S. Tsirkin wrote:
> > > Stage 2: prototype guest/host drivers
> > > 
> > > At this stage we need real guest and host drivers
> > > to be able to test real life performance.
> > > I suggest dpdk drivers + minimal hack in qemu to
> > > pass features around.
> > > 
> > 
> > I already did that last Nov. I made a very rough (and hacky)
> > version (Tx path only) in one day while accompanying my wife in
> > hospital.
> 
> Any performance data?

A straightforward implementation brings only a 10% performance boost in a
txonly micro benchmark. But I'm sure there is still plenty of room
for improvement.

> > If someone is interested, I could share the code soon. I could
> > even clean up the code a bit if necessary.
> 
> Especially if you don't have time to benchmark, I think sharing it
> might help.

Here it is (check the README-virtio-1.1 for howto):

    git://fridaybit.com/git/dpdk.git  virtio-1.1-v0.1

> > - Try to accelerate vhost/virtio with vector instructions.
> 
> That's interesting. What kind of optimizations would you say
> do vector instructions enable, and why?

Once the cache impact has been minimized, what is left to optimize
is the instruction cycle count. SIMD instructions (like AVX)
should then help with this.

	--yliu

> >   Something I will look into when the above item is done. Currently,
> >   I can think of two items that may help the vector implementation:
> > 
> >   * what kind of vring and desc layout could make the vector
> >     implementation easier.
> > 
> >   * what kind of hint we need from virtio spec for (dynamically)
> >     enabling the vector path.
> > 
> >   Besides that, I don't have much of a clue yet.
> > 
> > 	--yliu

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal - todo list
       [not found]       ` <20170308070948.GC18844@yliu-dev.sh.intel.com>
@ 2017-03-08  7:56         ` Yuanhan Liu
       [not found]         ` <20170308075624.GF18844@yliu-dev.sh.intel.com>
  1 sibling, 0 replies; 92+ messages in thread
From: Yuanhan Liu @ 2017-03-08  7:56 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, Maxime Coquelin, virtualization

On Wed, Mar 08, 2017 at 03:09:48PM +0800, Yuanhan Liu wrote:
> On Wed, Mar 01, 2017 at 03:07:29AM +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 28, 2017 at 12:29:43PM +0800, Yuanhan Liu wrote:
> > > Hi Michael,
> > > 
> > > Again, as usual, sorry for being late :/
> > > 
> > > On Wed, Feb 22, 2017 at 06:27:11AM +0200, Michael S. Tsirkin wrote:
> > > > Stage 2: prototype guest/host drivers
> > > > 
> > > > At this stage we need real guest and host drivers
> > > > to be able to test real life performance.
> > > > I suggest dpdk drivers + minimal hack in qemu to
> > > > pass features around.
> > > > 
> > > 
> > > I already did that last Nov. I made a very rough (and hacky)
> > > version (Tx path only) in one day while accompanying my wife in
> > > hospital.
> > 
> > Any performance data?
> 
> A straightforward implementation brings only a 10% performance boost in a
> txonly micro benchmark. But I'm sure there is still plenty of room
> for improvement.
> 
> > > If someone are interested in, I could share the code soon. I could
> > > even cleanup the code a bit if necessary.
> > 
> > Especially if you don't have time to benchmark, I think sharing it
> > might help.
> 
> Here it is (check the README-virtio-1.1 for howto):
> 
>     git://fridaybit.com/git/dpdk.git  virtio-1.1-v0.1

Well, I was told it may not be proper to share code this way, so
this channel is closed. I will look for a proper way. Sorry
for the inconvenience!

	--yliu

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal - todo list
       [not found]         ` <20170308075624.GF18844@yliu-dev.sh.intel.com>
@ 2017-03-29 12:39           ` Michael S. Tsirkin
  2017-04-01  7:30             ` Yuanhan Liu
  0 siblings, 1 reply; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-03-29 12:39 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: virtio-dev, Maxime Coquelin, virtualization

On Wed, Mar 08, 2017 at 03:56:24PM +0800, Yuanhan Liu wrote:
> On Wed, Mar 08, 2017 at 03:09:48PM +0800, Yuanhan Liu wrote:
> > On Wed, Mar 01, 2017 at 03:07:29AM +0200, Michael S. Tsirkin wrote:
> > > On Tue, Feb 28, 2017 at 12:29:43PM +0800, Yuanhan Liu wrote:
> > > > Hi Michael,
> > > > 
> > > > Again, as usual, sorry for being late :/
> > > > 
> > > > On Wed, Feb 22, 2017 at 06:27:11AM +0200, Michael S. Tsirkin wrote:
> > > > > Stage 2: prototype guest/host drivers
> > > > > 
> > > > > At this stage we need real guest and host drivers
> > > > > to be able to test real life performance.
> > > > > I suggest dpdk drivers + minimal hack in qemu to
> > > > > pass features around.
> > > > > 
> > > > 
> > > > I already did that last Nov. I made a very rough (yet hacky)
> > > > version (only with Tx path) in one day while accompanying my wife in
> > > > hospital.
> > > 
> > > Any performance data?
> > 
> > A straightforward implementation only brings a 10% performance boost in a
> > txonly micro benchmark. But I'm sure there is still plenty of room
> > for improvement.
> > 
> > > > If anyone is interested, I could share the code soon. I could
> > > > even clean up the code a bit if necessary.
> > > 
> > > Especially if you don't have time to benchmark, I think sharing it
> > > might help.
> > 
> > Here it is (check the README-virtio-1.1 for howto):
> > 
> >     git://fridaybit.com/git/dpdk.git  virtio-1.1-v0.1
> 
> Well, I was told it may not be proper to share code this way, so
> this channel is closed. I will look for a proper way. Sorry
> for the inconvenience!
> 
> 	--yliu

Where are you going to re-post it?
I don't see what the issue could be frankly.
Care to elaborate?

-- 
MST

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal - todo list
  2017-03-29 12:39           ` Michael S. Tsirkin
@ 2017-04-01  7:30             ` Yuanhan Liu
  0 siblings, 0 replies; 92+ messages in thread
From: Yuanhan Liu @ 2017-04-01  7:30 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, Maxime Coquelin, virtualization

On Wed, Mar 29, 2017 at 03:39:20PM +0300, Michael S. Tsirkin wrote:
> On Wed, Mar 08, 2017 at 03:56:24PM +0800, Yuanhan Liu wrote:
> > On Wed, Mar 08, 2017 at 03:09:48PM +0800, Yuanhan Liu wrote:
> > > On Wed, Mar 01, 2017 at 03:07:29AM +0200, Michael S. Tsirkin wrote:
> > > > On Tue, Feb 28, 2017 at 12:29:43PM +0800, Yuanhan Liu wrote:
> > > > > Hi Michael,
> > > > > 
> > > > > Again, as usual, sorry for being late :/
> > > > > 
> > > > > On Wed, Feb 22, 2017 at 06:27:11AM +0200, Michael S. Tsirkin wrote:
> > > > > > Stage 2: prototype guest/host drivers
> > > > > > 
> > > > > > At this stage we need real guest and host drivers
> > > > > > to be able to test real life performance.
> > > > > > I suggest dpdk drivers + minimal hack in qemu to
> > > > > > pass features around.
> > > > > > 
> > > > > 
> > > > > I already did that last Nov. I made a very rough (yet hacky)
> > > > > version (only with Tx path) in one day while accompanying my wife in
> > > > > hospital.
> > > > 
> > > > Any performance data?
> > > 
> > > A straightforward implementation only brings a 10% performance boost in a
> > > txonly micro benchmark. But I'm sure there is still plenty of room
> > > for improvement.
> > > 
> > > > > If anyone is interested, I could share the code soon. I could
> > > > > even clean up the code a bit if necessary.
> > > > 
> > > > Especially if you don't have time to benchmark, I think sharing it
> > > > might help.
> > > 
> > > Here it is (check the README-virtio-1.1 for howto):
> > > 
> > >     git://fridaybit.com/git/dpdk.git  virtio-1.1-v0.1
> > 
> > Well, I was told it may not be proper to share code this way, so
> > this channel is closed. I will look for a proper way. Sorry
> > for the inconvenience!
> > 
> > 	--yliu
> 
> Where are you going to re-post it?

It's at

    git://dpdk.org/next/dpdk-next-virtio    for-testing

> I don't see what the issue could be frankly.
> Care to elaborate?

Honestly, I don't know either.

	--yliu

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v2
       [not found]             ` <20170307223057-mutt-send-email-mst@kernel.org>
  2017-07-10 16:27               ` Amnon Ilan
@ 2017-07-10 16:27               ` Amnon Ilan
  1 sibling, 0 replies; 92+ messages in thread
From: Amnon Ilan @ 2017-07-10 16:27 UTC (permalink / raw)
  To: Michael S. Tsirkin, Lior Narkis
  Cc: Cornelia Huck, Paolo Bonzini, virtio-dev, virtualization


+Lior

----- Original Message -----
> From: "Michael S. Tsirkin" <mst@redhat.com>
> To: "Cornelia Huck" <cornelia.huck@de.ibm.com>
> Cc: "Paolo Bonzini" <pbonzini@redhat.com>, virtualization@lists.linux-foundation.org, virtio-dev@lists.oasis-open.org
> Sent: Tuesday, March 7, 2017 10:33:57 PM
> Subject: Re: [virtio-dev] packed ring layout proposal v2
> 
> On Tue, Mar 07, 2017 at 04:53:53PM +0100, Cornelia Huck wrote:
> > On Wed, 22 Feb 2017 18:43:05 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Thu, Feb 09, 2017 at 05:11:05PM +0100, Cornelia Huck wrote:
> > > > > >>> * Non power-of-2 ring sizes
> > > > > >>>
> > > > > >>> As the ring simply wraps around, there's no reason to
> > > > > >>> require ring size to be power of two.
> > > > > >>> It can be made a separate feature though.
> > > > > >>
> > > > > >> Power of 2 ring sizes are required in order to ignore the high
> > > > > >> bits of
> > > > > >> the indices.  With non-power-of-2 sizes you are forced to keep the
> > > > > >> indices less than the ring size.
> > > > > > 
> > > > > > Right. So
> > > > > > 
> > > > > > > 	if (unlikely(++idx >= size))
> > > > > > > 		idx = 0;
> > > > > > 
> > > > > > OTOH ring size that's twice larger than necessary
> > > > > > because of power of two requirements wastes cache.
> > > > > 
> > > > > I don't know.  Power of 2 ring size is pretty standard, I'd rather
> > > > > avoid
> > > > > the complication and the gratuitous difference with 1.0.
> > > > 
> > > > I agree. I don't think dropping the power of 2 requirement buys us so
> > > > much that it makes up for the added complexity.
> > > 
> > > I recalled why I came up with this. The issue is cache associativity.
> > > Recall that besides the ring we have event suppression
> > > structures - if we are lucky and things run at the same speed
> > > everything can work by polling keeping events disabled, then
> > > event suppression structures are never written to, they are read-only.
> > > 
> > > However if ring and event suppression share a cache line ring accesses
> > > have a chance to push the event suppression out of cache, causing
> > > misses on read.
> > > 
> > > This can happen if they are at the same offset in the set.
> > > E.g. with L1 cache 4Kbyte sets are common, so same offset
> > > within a 4K page.
> > > 
> > > We can fix this by making event suppression adjacent in memory, e.g.:
> > > 
> > > 
> > > [interrupt suppress]
> > > [descriptor ring]
> > > [kick suppress]
> > > 
> > > If this whole structure fits in a single set, ring accesses will
> > > not push kick or interrupt suppress out of cache.
> > > Specific layout can be left for drivers, but as set size is
> > > a power of two this might require a non-power of two ring size.
> > > 
> > > I conclude that this is an optimization that needs to be
> > > benchmarked.
> > 
> > > This makes sense. But wouldn't the optimum layout depend on the
> > > platform?
> 
> There's generally a tradeoff between performance and portability.
> Whether it's worth it would need to be tested.
> Further, it might be better to have platform-specific optimization
> tied to a given platform rather than a feature bit.
> 
> > > 
> > > I also note that the generic description does not have to force
> > > powers of two *even if devices actually require it*.
> > > I would be inclined to word the text in a way that makes
> > > relaxing the restriction easier.
> > > 
> > > For example, we can say "free running 16 bit index" and this forces a
> > > power of two, but we can also say "free running index wrapping to 0
> > > after (N*queue-size - 1) with N chosen such that the value fits in 16
> > > bit" and this is exactly the same if queue size is a power of 2.
> > > 
> > > So we can add text saying "ring size MUST be a power of two"
> > > and later it will be easy to relax just by adding a feature bit.
> > 
> > A later feature bit sounds good.
> 
> No need to delay benchmarking if someone has the time though :)
> 
> --
> MST
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
> 
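
For reference, a rough sketch of the index handling discussed in the
quoted text. This is not code from the proposal; the helper names
(slot_pow2, next_slot, counter_limit, next_counter) are made up for
illustration. It shows masking for power-of-two sizes, compare-and-reset
for arbitrary sizes, and a free-running counter that wraps to 0 after
(N * queue-size - 1) with N picked so the value fits in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Power-of-two ring: the index runs freely and the high bits are masked off. */
static inline uint16_t slot_pow2(uint16_t free_running_idx, uint16_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return free_running_idx & (uint16_t)(size - 1);
}

/* Arbitrary ring size: the index must stay below size, so reset on wrap. */
static inline uint16_t next_slot(uint16_t idx, uint16_t size)
{
	assert(size != 0 && idx < size);
	return (uint16_t)(idx + 1) >= size ? 0 : (uint16_t)(idx + 1);
}

/*
 * Wrap point N * size of a free-running counter, where N is the largest
 * multiplier such that the maximum counter value (N * size - 1) still
 * fits in 16 bits.
 */
static inline uint32_t counter_limit(uint16_t size)
{
	assert(size != 0);
	return (65536u / size) * size;
}

/* Advance the free-running counter, wrapping to 0 at counter_limit(). */
static inline uint16_t next_counter(uint16_t counter, uint16_t size)
{
	uint32_t next = (uint32_t)counter + 1;

	return next >= counter_limit(size) ? 0 : (uint16_t)next;
}
```

With a power-of-two size the wrap point is 65536, so next_counter()
behaves exactly like a plain 16-bit increment; for other sizes the slot
is counter % size and stays continuous across the wrap.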

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [virtio-dev] packed ring layout proposal v2
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
                   ` (8 preceding siblings ...)
       [not found] ` <20170228054719.GJ18844@yliu-dev.sh.intel.com>
@ 2017-07-16  6:00 ` Lior Narkis
  2017-07-18 16:23     ` Michael S. Tsirkin
  2017-07-16  6:00 ` Lior Narkis
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 92+ messages in thread
From: Lior Narkis @ 2017-07-16  6:00 UTC (permalink / raw)
  To: Michael S. Tsirkin, virtio-dev; +Cc: virtualization

> -----Original Message-----
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> On Behalf Of Michael S. Tsirkin
> Sent: Wednesday, February 08, 2017 5:20 AM
> To: virtio-dev@lists.oasis-open.org
> Cc: virtualization@lists.linux-foundation.org
> Subject: [virtio-dev] packed ring layout proposal v2
> 
> This is an update from v1 version.
> Changes:
> - update event suppression mechanism
> - separate options for indirect and direct s/g
> - lots of new features
> 
> ---
> 
> Performance analysis of this is in my kvm forum 2016 presentation.
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.
> 
> * Descriptor ring:
> 
> Guest adds descriptors with unique index values and DESC_HW set in flags.
> Host overwrites used descriptors with correct len, index, and DESC_HW
> clear.  Flags are always set/cleared last.
> 
> #define DESC_HW 0x0080
> 
> struct desc {
>         __le64 addr;
>         __le32 len;
>         __le16 index;
>         __le16 flags;
> };
> 
> When DESC_HW is set, descriptor belongs to device. When it is clear,
> it belongs to the driver.
> 
> We can use 1 bit to set direction
> /* This marks a buffer as write-only (otherwise read-only). */
> #define VRING_DESC_F_WRITE      2
> 

A valid bit per descriptor does not let the device prefetch efficiently.
An alternative is to use a producer index (PI).
Using the PI posted by the driver and the consumer index (CI) maintained
by the device, the device knows how much work is outstanding, so it can
prefetch accordingly.
There are a few options for the device to acquire the PI.
The most efficient is to write the PI in the doorbell together with the
queue number.

I would like to raise the need for a Completion Queue (CQ).
Multiple Work Queues (WQs, which hold the work descriptors) can be
connected to a single CQ.
When the device completes the work on a descriptor, it writes a
Completion Queue Entry (CQE) to the CQ.
CQEs are contiguous in memory, so prefetching by the driver is efficient
even though the device might complete work descriptors out of order.
The interrupt handler is connected to the CQ, so it is possible to
allocate a single CQ per core with a single interrupt handler, even
though this core might be using multiple WQs.

One application for multiple WQs with a single CQ is Quality of Service (QoS).
A user can open a WQ per QoS value (a PCP value, for example), and the
device will schedule the work accordingly.
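
To make the PI/CI and CQ idea a bit more concrete, here is a minimal
sketch. Everything in it is an assumption made for illustration: the
doorbell encoding (queue number in the high 16 bits, PI in the low 16
bits), the 16-bit free-running indices and the CQE field layout are not
defined anywhere in the proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical doorbell value: queue number in bits 31..16, PI in bits 15..0. */
static inline uint32_t doorbell_encode(uint16_t queue, uint16_t pi)
{
	return ((uint32_t)queue << 16) | pi;
}

static inline uint16_t doorbell_queue(uint32_t db)
{
	return (uint16_t)(db >> 16);
}

static inline uint16_t doorbell_pi(uint32_t db)
{
	return (uint16_t)(db & 0xffff);
}

/*
 * With free-running 16-bit PI and CI, the difference is the amount of
 * outstanding work; truncating to 16 bits keeps it correct across wrap.
 */
static inline uint16_t outstanding(uint16_t pi, uint16_t ci)
{
	return (uint16_t)(pi - ci);
}

/*
 * Hypothetical completion queue entry. Since one CQ may serve several
 * WQs, each entry names the WQ and the descriptor it completes.
 */
struct cqe {
	uint16_t wq;	/* work queue the descriptor was posted on */
	uint16_t index;	/* descriptor index within that WQ */
	uint32_t len;	/* bytes written by the device */
};

static_assert(sizeof(struct cqe) == 8, "CQEs are meant to pack densely");
```

The device can size its prefetch from outstanding(pi, ci) without
reading any descriptor flags first, which is the point of posting the PI
in the doorbell.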

> * Scatter/gather support
> 
> We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:
> 
> /* This marks a buffer as continuing via the next field. */
> #define VRING_DESC_F_NEXT       1
> 
> Unlike virtio 1.0, all descriptors must have distinct ID values.
> 
> Also unlike virtio 1.0, use of this flag will be an optional feature
> (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.
> 
> * Indirect buffers
> 
> Can be marked like in virtio 1.0:
> 
> /* This means the buffer contains a table of buffer descriptors. */
> #define VRING_DESC_F_INDIRECT   4
> 
> Unlike virtio 1.0, this is a table, not a list:
> struct indirect_descriptor_table {
>         /* The actual descriptors (16 bytes each) */
>         struct virtq_desc desc[len / 16];
> };
> 
> The first descriptor is located at start of the indirect descriptor
> table, additional indirect descriptors come immediately afterwards.
> DESC_F_WRITE is the only valid flag for descriptors in the indirect
> table. Others should be set to 0 and are ignored.  id is also set to 0
> and should be ignored.
> 
> virtio 1.0 seems to allow a s/g entry followed by
> an indirect descriptor. This does not seem useful,
> so we do not allow that anymore.
> 
> This support would be an optional feature, same as in virtio 1.0
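
A small sketch of walking such an indirect table, under stated
assumptions: plain integer types stand in for the __le types (so a
little-endian CPU is assumed), the struct and helper names are
hypothetical, and treating a length that is not a multiple of 16 as
invalid is my own choice, since the quoted text only says len / 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VRING_DESC_F_WRITE	2
#define INDIRECT_DESC_SIZE	16u

/* Same 16-byte layout as the ring descriptor in the proposal. */
struct indirect_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t index;	/* set to 0 and ignored inside the table */
	uint16_t flags;	/* only VRING_DESC_F_WRITE is meaningful here */
};

/*
 * Number of entries in a table of table_len bytes, or 0 if the length
 * is zero or not a multiple of the descriptor size (an assumption).
 */
static inline size_t indirect_count(uint32_t table_len)
{
	if (table_len == 0 || table_len % INDIRECT_DESC_SIZE != 0)
		return 0;
	return table_len / INDIRECT_DESC_SIZE;
}

/* Total number of bytes the device is allowed to write via this table. */
static inline uint64_t indirect_writable_bytes(const struct indirect_desc *table,
					       uint32_t table_len)
{
	size_t n = indirect_count(table_len);
	uint64_t total = 0;

	assert(table != NULL);
	for (size_t i = 0; i < n; i++)
		if (table[i].flags & VRING_DESC_F_WRITE)
			total += table[i].len;
	return total;
}
```

Because the table is a flat array rather than a linked list, the entry
count is known up front and no next pointers have to be chased.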
> 
> * Batching descriptors:
> 
> virtio 1.0 allows passing a batch of descriptors in both directions, by
> incrementing the used/avail index by values > 1.  We can support this by
> chaining a list of descriptors through a bit in the flags field.
> To allow use together with s/g, a different bit will be used.
> 
> #define VRING_DESC_F_BATCH_NEXT 0x0010
> 
> Batching works for both driver and device descriptors.
> 
> 
> 
> * Processing descriptors in and out of order
> 
> Device processing all descriptors in order can simply flip
> the DESC_HW bit as it is done with descriptors.
> 
> Device can write descriptors out in order as they are used, overwriting
> descriptors that are there.
> 
> Device must not use a descriptor until DESC_HW is set.
> It is only required to look at the first descriptor
> submitted.
> 
> Driver must not overwrite a descriptor until DESC_HW is clear.
> It is only required to look at the first descriptor
> submitted.
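
The ownership rules above could be sketched roughly as follows. This is
only an illustration under assumptions: C11 atomics stand in for
whatever barriers a real driver or device would use, plain integers
replace the __le types (little-endian CPU assumed), and the helper names
are made up.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_HW			0x0080
#define VRING_DESC_F_WRITE	2

/* Descriptor layout from the proposal. */
struct desc {
	uint64_t addr;
	uint32_t len;
	uint16_t index;
	_Atomic uint16_t flags;
};

/* Driver side: fill in the descriptor, then set DESC_HW last. */
static inline void driver_publish(struct desc *d, uint64_t addr, uint32_t len,
				  uint16_t index, uint16_t flags)
{
	assert((flags & DESC_HW) == 0);
	d->addr = addr;
	d->len = len;
	d->index = index;
	/* Release: the fields above become visible before DESC_HW does. */
	atomic_store_explicit(&d->flags, (uint16_t)(flags | DESC_HW),
			      memory_order_release);
}

/* Device side: a descriptor may be used only once DESC_HW is observed. */
static inline bool device_owns(struct desc *d)
{
	return atomic_load_explicit(&d->flags, memory_order_acquire) & DESC_HW;
}

/* Device side: write back len and index, then clear DESC_HW last. */
static inline void device_complete(struct desc *d, uint32_t written_len,
				   uint16_t index)
{
	uint16_t flags = atomic_load_explicit(&d->flags, memory_order_relaxed);

	d->len = written_len;
	d->index = index;
	atomic_store_explicit(&d->flags, (uint16_t)(flags & ~DESC_HW),
			      memory_order_release);
}
```

The driver would poll with the mirror image of device_owns(), that is,
wait for DESC_HW to be clear before reusing the slot.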
> 
> * Device specific descriptor flags
> We have a lot of unused space in the descriptor.  This can be put to
> good use by reserving some flag bits for device use.
> For example, network device can set a bit to request
> that header in the descriptor is suppressed
> (in case it's all 0s anyway). This reduces cache utilization.
> 
> Note: this feature can be supported in virtio 1.0 as well,
> as we have unused bits in both descriptor and used ring there.
> 
> * Descriptor length in device descriptors
> 
> virtio 1.0 places strict requirements on descriptor length. For example
> it must be 0 in used ring of TX VQ of a network device since nothing is
> written.  In practice guests do not seem to use this, so we can simplify
> devices a bit by removing this requirement - if length is unused it
> should be ignored by driver.
> 
> Some devices use identically-sized buffers in all descriptors.
> Ignoring length for driver descriptors there could be an option too.
> 
> * Writing at an offset
> 
> Some devices might want to write into some descriptors
> at an offset, the offset would be in config space,
> and a descriptor flag could indicate this:
> 
> #define VRING_DESC_F_OFFSET 0x0020
> 
> How exactly to use the offset would be device specific,
> for example it can be used to align beginning of packet
> in the 1st buffer for mergeable buffers to cache line boundary
> while also aligning rest of buffers.
> 
> * Non power-of-2 ring sizes
> 
> As the ring simply wraps around, there's no reason to
> require ring size to be power of two.
> It can be made a separate feature though.
> 
> 
> * Interrupt/event suppression
> 
> virtio 1.0 has two mechanisms for suppression but only
> one can be used at a time. We pack them together
> in a structure - one for interrupts, one for notifications:
> 
> struct event {
> 	__le16 idx;
> 	__le16 flags;
> };
> 
> Both fields would be optional, with a feature bit:
> VIRTIO_F_EVENT_IDX
> VIRTIO_F_EVENT_FLAGS
> 
> * Flags can be used like in virtio 1.0, by storing a special
> value there:
> 
> #define VRING_F_EVENT_ENABLE  0x0
> 
> #define VRING_F_EVENT_DISABLE 0x1
> 
> * Event index would be in the range 0 to 2 * Queue Size
> (to detect wrap arounds) and wrap to 0 after that.
> 
> The assumption is that each side maintains an internal
> descriptor counter 0 to 2 * Queue Size that wraps to 0.
> In that case, interrupt triggers when counter reaches
> the given value.
> 
> * If both features are negotiated, a special flags value
> can be used to switch to event idx:
> 
> #define VRING_F_EVENT_IDX     0x2
> 
> 
> * Prototype
> 
> A partial prototype can be found under
> tools/virtio/ringtest/ring.c
> 
> Test run:
> [mst@tuck ringtest]$ time ./ring
> real    0m0.399s
> user    0m0.791s
> sys     0m0.000s
> [mst@tuck ringtest]$ time ./virtio_ring_0_9
> real    0m0.503s
> user    0m0.999s
> sys     0m0.000s
> 
> It is planned to update it to this interface. Future changes and
> enhancements can (and should) be tested against this prototype.
> 
> * Feature sets
> In particular are we going overboard with feature bits?  It becomes hard
> to support all combinations in drivers and devices.  Maybe we should
> document reasonable feature sets to be supported for each device.
> 
> * Known issues/ideas
> 
> This layout is optimized for host/guest communication,
> in a sense even more aggressively than virtio 1.0.
> It might be suboptimal for PCI hardware implementations.
> However, one notes that current virtio pci drivers are
> unlikely to work with PCI hardware implementations anyway
> (e.g. due to use of SMP barriers for ordering).
> 
> Suggestions for improving this are welcome but need to be tested to make
> sure our main use case does not regress.  Of course some improvements
> might be made optional, but if we add too many of these driver becomes
> unmanageable.
> 
> ---
> 
> Note: should this proposal be accepted and approved, one or more
>       claims disclosed to the TC admin and listed on the Virtio TC
>       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
>       might become Essential Claims.
> 
> --
> MST
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v2
  2017-07-16  6:00 ` [virtio-dev] packed ring layout proposal v2 Lior Narkis
@ 2017-07-18 16:23     ` Michael S. Tsirkin
  0 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-07-18 16:23 UTC (permalink / raw)
  To: Lior Narkis; +Cc: virtio-dev, virtualization

On Sun, Jul 16, 2017 at 06:00:45AM +0000, Lior Narkis wrote:
> > -----Original Message-----
> > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> > On Behalf Of Michael S. Tsirkin
> > Sent: Wednesday, February 08, 2017 5:20 AM
> > To: virtio-dev@lists.oasis-open.org
> > Cc: virtualization@lists.linux-foundation.org
> > Subject: [virtio-dev] packed ring layout proposal v2
> > 
> > This is an update from v1 version.
> > Changes:
> > - update event suppression mechanism
> > - separate options for indirect and direct s/g
> > - lots of new features
> > 
> > ---
> > 
> > Performance analysis of this is in my kvm forum 2016 presentation.
> > The idea is to have a r/w descriptor in a ring structure,
> > replacing the used and available ring, index and descriptor
> > buffer.
> > 
> > * Descriptor ring:
> > 
> > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > Host overwrites used descriptors with correct len, index, and DESC_HW
> > clear.  Flags are always set/cleared last.
> > 
> > #define DESC_HW 0x0080
> > 
> > struct desc {
> >         __le64 addr;
> >         __le32 len;
> >         __le16 index;
> >         __le16 flags;
> > };
> > 
> > When DESC_HW is set, descriptor belongs to device. When it is clear,
> > it belongs to the driver.
> > 
> > We can use 1 bit to set direction
> > /* This marks a buffer as write-only (otherwise read-only). */
> > #define VRING_DESC_F_WRITE      2
> > 
> 
> A valid bit per descriptor does not let the device do an efficient prefetch.
> An alternative is to use a producer index(PI).
> Using the PI posted by the driver, and the Consumer Index(CI) maintained by the device, the device knows how much work it has outstanding, so it can do the prefetch accordingly. 
> There are few options for the device to acquire the PI.
> Most efficient will be to write the PI in the doorbell together with the queue number.

Right. This was suggested in "Fwd: Virtio-1.1 Ring Layout".
Or just the PI if we don't need the queue number.
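
As a rough sketch of the doorbell idea, assuming a 32-bit doorbell
register with the queue number in the upper 16 bits and a free-running
16-bit PI in the lower 16 bits. The layout and all function names here
are hypothetical and not part of any virtio spec:

```c
#include <stdint.h>

/* Hypothetical doorbell layout: queue number in bits 31..16, PI in bits 15..0. */
static inline uint32_t doorbell_encode(uint16_t queue, uint16_t pi)
{
	return ((uint32_t)queue << 16) | pi;
}

static inline uint16_t doorbell_queue(uint32_t db)
{
	return (uint16_t)(db >> 16);
}

static inline uint16_t doorbell_pi(uint32_t db)
{
	return (uint16_t)(db & 0xffff);
}

/*
 * Work posted by the driver but not yet consumed by the device.
 * Both indices are assumed to be free-running 16-bit counters,
 * so the unsigned subtraction stays correct across wrap-around.
 */
static inline uint16_t outstanding(uint16_t pi, uint16_t ci)
{
	return (uint16_t)(pi - ci);
}
```

With this, the device learns from a single doorbell write how many
descriptors it can prefetch for that queue.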

> I would like to raise the need for a Completion Queue(CQ).
> Multiple Work Queues(hold the work descriptors, WQ in short) can be connected to a single CQ.
> So when the device completes the work on the descriptor, it writes a Completion Queue Entry (CQE) to the CQ.

This seems similar to the design we currently have with a separate used
ring. It seems to underperform writing into the available ring
at least on a microbenchmark. The reason seems to be that
more cache lines need to get invalidated and re-fetched.

> CQEs are continuous in memory so prefetching by the driver is efficient, although the device might complete work descriptors out of order.

That's not different from the proposal you are replying to - descriptors
can be used and written out in any order, and they are contiguous
so the driver can prefetch. I'd like to add that attempts to
add prefetch e.g. for the virtio used ring never showed any
conclusive gains - some workloads would get a minor speedup, others
lose a bit of performance. I do not think anyone looked
deeply into why that was the case. It would be easy for you to add this
code to current virtio drivers and/or devices and try it out.
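
To illustrate why contiguous descriptors are easy for the driver to
walk, here is a minimal sketch of a driver collecting used descriptors
from the ring. It assumes the driver tracks how many descriptors it has
outstanding (the `outstanding` parameter is hypothetical), since a clear
DESC_HW alone does not tell a used descriptor from one that was never
submitted. Memory barriers and endianness conversion are left out:

```c
#include <stdint.h>

#define DESC_HW 0x0080

struct desc {
	uint64_t addr;
	uint32_t len;
	uint16_t index;
	uint16_t flags;
};

/*
 * Walk the ring from *next and collect the index of every descriptor
 * the device has handed back (DESC_HW clear), stopping at the first
 * descriptor still owned by the device. Returns the number collected.
 */
static unsigned reap_used(const struct desc *ring, unsigned size,
			  unsigned *next, unsigned outstanding,
			  uint16_t *ids)
{
	unsigned n = 0;

	while (n < outstanding && !(ring[*next].flags & DESC_HW)) {
		ids[n++] = ring[*next].index;
		*next = (*next + 1) % size;
	}
	return n;
}
```

The loop only ever touches consecutive ring slots, which is what makes
a prefetch of the following slots possible.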

> The interrupt handler is connected to the CQ, so an allocation of a single CQ per core, with a single interrupt handler is possible although this core might be using multiple WQs.

Sending a single interrupt from multiple rings might save some
cycles. event index/interrupt disable are currently in
RAM so access is very cheap for the guest.
If you are going to share, just disable all interrupts
when you start processing.
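
A minimal sketch of that, assuming the `struct event` and flag values
from the proposal and a hypothetical array of per-ring event structures
that share one interrupt. Plain uint16_t stands in for __le16 (so a
little-endian host is assumed), and barriers are omitted:

```c
#include <stddef.h>
#include <stdint.h>

#define VRING_F_EVENT_ENABLE  0x0
#define VRING_F_EVENT_DISABLE 0x1

struct event {
	uint16_t idx;   /* __le16 in the proposal */
	uint16_t flags; /* __le16 in the proposal */
};

/* Write the same flags value into the event structure of every ring. */
static void set_all_events(struct event **ev, size_t n, uint16_t flags)
{
	for (size_t i = 0; i < n; i++)
		ev[i]->flags = flags;
}
```

The shared handler would call this with VRING_F_EVENT_DISABLE on entry
and with VRING_F_EVENT_ENABLE once all rings have been drained.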

In fact, I was wondering how people want to do this in hardware.
Are you going to read the event index after each descriptor?

It might make sense to move event index/flags into device memory. And
once you do this, writing these out might become slower, and then some
kind of sharing of the event index might help.

No one suggested anything like this so far - maybe it's less of an issue
than I think, e.g. because of interrupt coalescing (which virtio also
does not have, but could be added if there is interest) or maybe people
just mostly care about polling performance.

> One application for multiple WQs with a single CQ is Quality of Service(QoS).
> A user can open a WQ per QoS value(pcp value for example), and the device will schedule the work accordingly.

A ring per QOS level might make sense e.g. in a network device. I don't
see why it's helpful to write out completed entries into a separate
ring for that though.

As we don't have QOS support in virtio net at all right now,
would that be best discussed once we have a working prototype
and everyone can see what the problem is?


> > * Scatter/gather support
> > 
> > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:
> > 
> > /* This marks a buffer as continuing via the next field. */
> > #define VRING_DESC_F_NEXT       1
> > 
> > Unlike virtio 1.0, all descriptors must have distinct ID values.
> > 
> > Also unlike virtio 1.0, use of this flag will be an optional feature
> > (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.
> > 
> > * Indirect buffers
> > 
> > Can be marked like in virtio 1.0:
> > 
> > /* This means the buffer contains a table of buffer descriptors. */
> > #define VRING_DESC_F_INDIRECT   4
> > 
> > Unlike virtio 1.0, this is a table, not a list:
> > struct indirect_descriptor_table {
> >         /* The actual descriptors (16 bytes each) */
> >         struct virtq_desc desc[len / 16];
> > };
> > 
> > The first descriptor is located at start of the indirect descriptor
> > table, additional indirect descriptors come immediately afterwards.
> > DESC_F_WRITE is the only valid flag for descriptors in the indirect
> > table. Others should be set to 0 and are ignored.  id is also set to 0
> > and should be ignored.
> > 
> > virtio 1.0 seems to allow a s/g entry followed by
> > an indirect descriptor. This does not seem useful,
> > so we do not allow that anymore.
> > 
> > This support would be an optional feature, same as in virtio 1.0
> > 
> > * Batching descriptors:
> > 
> > virtio 1.0 allows passing a batch of descriptors in both directions, by
> > incrementing the used/avail index by values > 1.  We can support this by
> > chaining a list of descriptors through a bit in the flags field.
> > To allow use together with s/g, a different bit will be used.
> > 
> > #define VRING_DESC_F_BATCH_NEXT 0x0010
> > 
> > Batching works for both driver and device descriptors.
> > 
> > 
> > 
> > * Processing descriptors in and out of order
> > 
> > Device processing all descriptors in order can simply flip
> > the DESC_HW bit as it is done with descriptors.
> > 
> > Device can write descriptors out in order as they are used, overwriting
> > descriptors that are there.
> > 
> > Device must not use a descriptor until DESC_HW is set.
> > It is only required to look at the first descriptor
> > submitted.
> > 
> > Driver must not overwrite a descriptor until DESC_HW is clear.
> > It is only required to look at the first descriptor
> > submitted.
> > 
> > * Device specific descriptor flags
> > We have a lot of unused space in the descriptor.  This can be put to
> > good use by reserving some flag bits for device use.
> > For example, network device can set a bit to request
> > that header in the descriptor is suppressed
> > (in case it's all 0s anyway). This reduces cache utilization.
> > 
> > Note: this feature can be supported in virtio 1.0 as well,
> > as we have unused bits in both descriptor and used ring there.
> > 
> > * Descriptor length in device descriptors
> > 
> > virtio 1.0 places strict requirements on descriptor length. For example
> > it must be 0 in used ring of TX VQ of a network device since nothing is
> > written.  In practice guests do not seem to use this, so we can simplify
> > devices a bit by removing this requirement - if length is unused it
> > should be ignored by driver.
> > 
> > Some devices use identically-sized buffers in all descriptors.
> > Ignoring length for driver descriptors there could be an option too.
> > 
> > * Writing at an offset
> > 
> > Some devices might want to write into some descriptors
> > at an offset, the offset would be in config space,
> > and a descriptor flag could indicate this:
> > 
> > #define VRING_DESC_F_OFFSET 0x0020
> > 
> > How exactly to use the offset would be device specific,
> > for example it can be used to align beginning of packet
> > in the 1st buffer for mergeable buffers to cache line boundary
> > while also aligning rest of buffers.
> > 
> > * Non power-of-2 ring sizes
> > 
> > As the ring simply wraps around, there's no reason to
> > require ring size to be power of two.
> > It can be made a separate feature though.
> > 
> > 
> > * Interrupt/event suppression
> > 
> > virtio 1.0 has two mechanisms for suppression but only
> > one can be used at a time. We pack them together
> > in a structure - one for interrupts, one for notifications:
> > 
> > struct event {
> > 	__le16 idx;
> > 	__le16 flags;
> > };
> > 
> > Both fields would be optional, with a feature bit:
> > VIRTIO_F_EVENT_IDX
> > VIRTIO_F_EVENT_FLAGS
> > 
> > * Flags can be used like in virtio 1.0, by storing a special
> > value there:
> > 
> > #define VRING_F_EVENT_ENABLE  0x0
> > 
> > #define VRING_F_EVENT_DISABLE 0x1
> > 
> > * Event index would be in the range 0 to 2 * Queue Size
> > (to detect wrap arounds) and wrap to 0 after that.
> > 
> > The assumption is that each side maintains an internal
> > descriptor counter 0 to 2 * Queue Size that wraps to 0.
> > In that case, interrupt triggers when counter reaches
> > the given value.
> > 
> > * If both features are negotiated, a special flags value
> > can be used to switch to event idx:
> > 
> > #define VRING_F_EVENT_IDX     0x2
> > 
> > 
> > * Prototype
> > 
> > A partial prototype can be found under
> > tools/virtio/ringtest/ring.c
> > 
> > Test run:
> > [mst@tuck ringtest]$ time ./ring
> > real    0m0.399s
> > user    0m0.791s
> > sys     0m0.000s
> > [mst@tuck ringtest]$ time ./virtio_ring_0_9
> > real    0m0.503s
> > user    0m0.999s
> > sys     0m0.000s
> > 
> > It is planned to update it to this interface. Future changes and
> > enhancements can (and should) be tested against this prototype.
> > 
> > * Feature sets
> > In particular are we going overboard with feature bits?  It becomes hard
> > to support all combinations in drivers and devices.  Maybe we should
> > document reasonable feature sets to be supported for each device.
> > 
> > * Known issues/ideas
> > 
> > This layout is optimized for host/guest communication,
> > in a sense even more aggressively than virtio 1.0.
> > It might be suboptimal for PCI hardware implementations.
> > However, one notes that current virtio pci drivers are
> > unlikely to work with PCI hardware implementations anyway
> > (e.g. due to use of SMP barriers for ordering).
> > 
> > Suggestions for improving this are welcome but need to be tested to make
> > sure our main use case does not regress.  Of course some improvements
> > might be made optional, but if we add too many of these driver becomes
> > unmanageable.
> > 
> > ---
> > 
> > Note: should this proposal be accepted and approved, one or more
> >       claims disclosed to the TC admin and listed on the Virtio TC
> >       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
> >       might become Essential Claims.
> > 
> > --
> > MST

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [virtio-dev] packed ring layout proposal v2
  2017-07-18 16:23     ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-07-19  7:41     ` Lior Narkis
  -1 siblings, 0 replies; 92+ messages in thread
From: Lior Narkis @ 2017-07-19  7:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization



> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@redhat.com]
> Sent: Tuesday, July 18, 2017 7:23 PM
> To: Lior Narkis <liorn@mellanox.com>
> Cc: virtio-dev@lists.oasis-open.org; virtualization@lists.linux-foundation.org
> Subject: Re: [virtio-dev] packed ring layout proposal v2
> 
> On Sun, Jul 16, 2017 at 06:00:45AM +0000, Lior Narkis wrote:
> > > -----Original Message-----
> > > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> > > On Behalf Of Michael S. Tsirkin
> > > Sent: Wednesday, February 08, 2017 5:20 AM
> > > To: virtio-dev@lists.oasis-open.org
> > > Cc: virtualization@lists.linux-foundation.org
> > > Subject: [virtio-dev] packed ring layout proposal v2
> > >
> > > This is an update from v1 version.
> > > Changes:
> > > - update event suppression mechanism
> > > - separate options for indirect and direct s/g
> > > - lots of new features
> > >
> > > ---
> > >
> > > Performance analysis of this is in my kvm forum 2016 presentation.
> > > The idea is to have a r/w descriptor in a ring structure,
> > > replacing the used and available ring, index and descriptor
> > > buffer.
> > >
> > > * Descriptor ring:
> > >
> > > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > > Host overwrites used descriptors with correct len, index, and DESC_HW
> > > clear.  Flags are always set/cleared last.
> > >
> > > #define DESC_HW 0x0080
> > >
> > > struct desc {
> > >         __le64 addr;
> > >         __le32 len;
> > >         __le16 index;
> > >         __le16 flags;
> > > };
> > >
> > > When DESC_HW is set, descriptor belongs to device. When it is clear,
> > > it belongs to the driver.
> > >
> > > We can use 1 bit to set direction
> > > /* This marks a buffer as write-only (otherwise read-only). */
> > > #define VRING_DESC_F_WRITE      2
> > >
> >
> > A valid bit per descriptor does not let the device do an efficient prefetch.
> > An alternative is to use a producer index(PI).
> > Using the PI posted by the driver, and the Consumer Index(CI) maintained by the device, the device knows how much work it has outstanding, so it can do the prefetch accordingly.
> > There are few options for the device to acquire the PI.
> > Most efficient will be to write the PI in the doorbell together with the queue number.
> 
> Right. This was suggested in "Fwd: Virtio-1.1 Ring Layout".
> Or just the PI if we don't need the queue number.
> 
> > I would like to raise the need for a Completion Queue(CQ).
> > Multiple Work Queues(hold the work descriptors, WQ in short) can be connected to a single CQ.
> > So when the device completes the work on the descriptor, it writes a Completion Queue Entry (CQE) to the CQ.
> 
> This seems similar to the design we currently have with a separate used
> ring. It seems to underperform writing into the available ring
> at least on a microbenchmark. The reason seems to be that
> more cache lines need to get invalidated and re-fetched.


A few points on that:
Each PCIe write will cause invalidation of a cache line.
Writing less than a cache line is inefficient, so it is better to put all metadata together and allocate a cache line for it.
Putting the metadata in the data buffer means two writes of less than a cache line each, both of which will be accessed by the driver, so potentially two misses.
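
To put numbers on this, a small sketch that counts how many cache
lines a device write touches. The 64-byte line size and the helper name
are assumptions for illustration only:

```c
#include <stddef.h>

#define CACHE_LINE 64u /* assumed cache line size in bytes */

/* Number of cache lines covered by a write of len bytes at byte offset. */
static size_t lines_touched(size_t offset, size_t len)
{
	if (len == 0)
		return 0;
	return (offset + len - 1) / CACHE_LINE - offset / CACHE_LINE + 1;
}
```

Under these assumptions a 16-byte descriptor write invalidates one line
while filling only a quarter of it, and two such writes to separate
buffers invalidate two lines, whereas 64 bytes of metadata packed
together still cost a single line.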


> 
> > CQEs are continuous in memory so prefetching by the driver is efficient, although the device might complete work descriptors out of order.
> 
> That's not different from proposal you are replying to - descriptors
> can be used and written out in any order, they are contigious
> so driver can prefetch. 

The point is that if descriptors 1, 2, 4 complete in that order, then in the proposed layout the completion indications are placed at entries 1, 2, 4 of the virtq desc buffer.
With a CQ, the completions for 1, 2, 4 are placed at CQ indexes 1, 2, 3.
This is why a CQ is better for prefetching.
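
To make the CQ side of this concrete, here is a minimal sketch. It is not taken from any driver or from the proposal: struct cqe, struct cq, cq_post() and CQ_SIZE are made-up names, and the entry layout is only an assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CQ_SIZE 8u  /* hypothetical CQ depth, chosen only for the example */

/* Hypothetical completion entry: which WQ and which descriptor finished. */
struct cqe {
        uint16_t wq;
        uint16_t desc_id;
};

struct cq {
        struct cqe ring[CQ_SIZE];
        uint32_t pi;  /* producer index, advanced by the device */
};

/* Device side: append a completion at the next free CQ slot, return the slot. */
static uint32_t cq_post(struct cq *cq, uint16_t wq, uint16_t desc_id)
{
        uint32_t slot = cq->pi % CQ_SIZE;

        cq->ring[slot].wq = wq;
        cq->ring[slot].desc_id = desc_id;
        cq->pi++;
        return slot;
}
```

Posting completions for descriptors 1, 2, 4 in that order fills three consecutive slots (0, 1, 2 when counting from zero), so the driver reads them back to back regardless of which descriptors they refer to.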

> I'd like to add that attempts to
> add prefetch e.g. for the virtio used ring never showed any
> conclusive gains - some workloads would get minor a speedup, others
> lose a bit of performance. I do not think anyone looked
> deeply into why that was the case. It would be easy for you to add this
> code to current virtio drivers and/or devices and try it out.

Noted.
I will say though that mlx5 uses prefetch and gets good performance because of it...

> 
> > The interrupt handler is connected to the CQ, so an allocation of a single CQ
> per core, with a single interrupt handler is possible although this core might be
> using multiple WQs.
> 
> Sending a single interrupt from multiple rings might save some
> cycles. event index/interrupt disable are currently in
> RAM so access is very cheap for the guest.
> If you are going to share, just disable all interrupts
> when you start processing.
> 
> I was wondering how do people want to do this in hardware
> in fact. Are you going to read event index after each descriptor?

Not sure I got you here.
Are you asking how the device decides to write an MSI-X interrupt, and how interrupt moderation might work?

> 
> It might make sense to move event index/flags into device memory. And
> once you do this, writing these out might become slower, and then some
> kind of sharing of the event index might help.
> 
> No one suggested anything like this so far - maybe it's less of an issue
> than I think, e.g. because of interrupt coalescing (which virtio also
> does not have, but could be added if there is interest) or maybe people
> just mostly care about polling performance.
> 
> > One application for multiple WQs with a single CQ is Quality of Service(QoS).
> > A user can open a WQ per QoS value(pcp value for example), and the device
> will schedule the work accordingly.
> 
> A ring per QOS level might make sense e.g. in a network device. I don't
> see why it's helpful to write out completed entries into a separate
> ring for that though.

I would like to add that an RDMA device has many queues (QPs), and finding out which QP completed work by traversing all QPs is not efficient.

Another advantage of having a CQ connected to multiple WQs is that interrupt moderation can be based on this single CQ,
so the decision whether to raise an interrupt is based on all the aggregated work that was completed on that CQ.

> 
> As we don't have QOS support in virtio net at all right now,
> would that be best discussed once we have a working prototype
> and everyone can see what the problem is?

Understood.
Although, I think the layout should not change frequently.

> 
> 
> > > * Scatter/gather support
> > >
> > > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:
> > >
> > > /* This marks a buffer as continuing via the next field. */
> > > #define VRING_DESC_F_NEXT       1
> > >
> > > Unlike virtio 1.0, all descriptors must have distinct ID values.
> > >
> > > Also unlike virtio 1.0, use of this flag will be an optional feature
> > > (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.
> > >
> > > * Indirect buffers
> > >
> > > Can be marked like in virtio 1.0:
> > >
> > > /* This means the buffer contains a table of buffer descriptors. */
> > > #define VRING_DESC_F_INDIRECT   4
> > >
> > > Unlike virtio 1.0, this is a table, not a list:
> > > struct indirect_descriptor_table {
> > >         /* The actual descriptors (16 bytes each) */
> > >         struct virtq_desc desc[len / 16];
> > > };
> > >
> > > The first descriptor is located at start of the indirect descriptor
> > > table, additional indirect descriptors come immediately afterwards.
> > > DESC_F_WRITE is the only valid flag for descriptors in the indirect
> > > table. Others should be set to 0 and are ignored.  id is also set to 0
> > > and should be ignored.
> > >
> > > virtio 1.0 seems to allow a s/g entry followed by
> > > an indirect descriptor. This does not seem useful,
> > > so we do not allow that anymore.
> > >
> > > This support would be an optional feature, same as in virtio 1.0
> > >
> > > * Batching descriptors:
> > >
> > > virtio 1.0 allows passing a batch of descriptors in both directions, by
> > > incrementing the used/avail index by values > 1.  We can support this by
> > > chaining a list of descriptors through a bit the flags field.
> > > To allow use together with s/g, a different bit will be used.
> > >
> > > #define VRING_DESC_F_BATCH_NEXT 0x0010
> > >
> > > Batching works for both driver and device descriptors.
> > >
> > >
> > >
> > > * Processing descriptors in and out of order
> > >
> > > Device processing all descriptors in order can simply flip
> > > the DESC_HW bit as it is done with descriptors.
> > >
> > > Device can write descriptors out in order as they are used, overwriting
> > > descriptors that are there.
> > >
> > > Device must not use a descriptor until DESC_HW is set.
> > > It is only required to look at the first descriptor
> > > submitted.
> > >
> > > Driver must not overwrite a descriptor until DESC_HW is clear.
> > > It is only required to look at the first descriptor
> > > submitted.
> > >
> > > * Device specific descriptor flags
> > > We have a lot of unused space in the descriptor.  This can be put to
> > > good use by reserving some flag bits for device use.
> > > For example, network device can set a bit to request
> > > that header in the descriptor is suppressed
> > > (in case it's all 0s anyway). This reduces cache utilization.
> > >
> > > Note: this feature can be supported in virtio 1.0 as well,
> > > as we have unused bits in both descriptor and used ring there.
> > >
> > > * Descriptor length in device descriptors
> > >
> > > virtio 1.0 places strict requirements on descriptor length. For example
> > > it must be 0 in used ring of TX VQ of a network device since nothing is
> > > written.  In practice guests do not seem to use this, so we can simplify
> > > devices a bit by removing this requirement - if length is unused it
> > > should be ignored by driver.
> > >
> > > Some devices use identically-sized buffers in all descriptors.
> > > Ignoring length for driver descriptors there could be an option too.
> > >
> > > * Writing at an offset
> > >
> > > Some devices might want to write into some descriptors
> > > at an offset, the offset would be in config space,
> > > and a descriptor flag could indicate this:
> > >
> > > #define VRING_DESC_F_OFFSET 0x0020
> > >
> > > How exactly to use the offset would be device specific,
> > > for example it can be used to align beginning of packet
> > > in the 1st buffer for mergeable buffers to cache line boundary
> > > while also aligning rest of buffers.
> > >
> > > * Non power-of-2 ring sizes
> > >
> > > As the ring simply wraps around, there's no reason to
> > > require ring size to be power of two.
> > > It can be made a separate feature though.
> > >
> > >
> > > * Interrupt/event suppression
> > >
> > > virtio 1.0 has two mechanisms for suppression but only
> > > one can be used at a time. we pack them together
> > > in a structure - one for interrupts, one for notifications:
> > >
> > > struct event {
> > > 	__le16 idx;
> > > 	__le16 flags;
> > > }
> > >
> > > Both fields would be optional, with a feature bit:
> > > VIRTIO_F_EVENT_IDX
> > > VIRTIO_F_EVENT_FLAGS
> > >
> > > * Flags can be used like in virtio 1.0, by storing a special
> > > value there:
> > >
> > > #define VRING_F_EVENT_ENABLE  0x0
> > >
> > > #define VRING_F_EVENT_DISABLE 0x1
> > >
> > > * Event index would be in the range 0 to 2 * Queue Size
> > > (to detect wrap arounds) and wrap to 0 after that.
> > >
> > > The assumption is that each side maintains an internal
> > > descriptor counter 0 to 2 * Queue Size that wraps to 0.
> > > In that case, interrupt triggers when counter reaches
> > > the given value.
> > >
> > > * If both features are negotiated, a special flags value
> > > can be used to switch to event idx:
> > >
> > > #define VRING_F_EVENT_IDX     0x2
> > >
> > >
> > > * Prototype
> > >
> > > A partial prototype can be found under
> > > tools/virtio/ringtest/ring.c
> > >
> > > Test run:
> > > [mst@tuck ringtest]$ time ./ring
> > > real    0m0.399s
> > > user    0m0.791s
> > > sys     0m0.000s
> > > [mst@tuck ringtest]$ time ./virtio_ring_0_9
> > > real    0m0.503s
> > > user    0m0.999s
> > > sys     0m0.000s
> > >
> > > It is planned to update it to this interface. Future changes and
> > > enhancements can (and should) be tested against this prototype.
> > >
> > > * Feature sets
> > > In particular are we going overboard with feature bits?  It becomes hard
> > > to support all combinations in drivers and devices.  Maybe we should
> > > document reasonable feature sets to be supported for each device.
> > >
> > > * Known issues/ideas
> > >
> > > This layout is optimized for host/guest communication,
> > > in a sense even more aggressively than virtio 1.0.
> > > It might be suboptimal for PCI hardware implementations.
> > > However, one notes that current virtio pci drivers are
> > > unlikely to work with PCI hardware implementations anyway
> > > (e.g. due to use of SMP barriers for ordering).
> > >
> > > Suggestions for improving this are welcome but need to be tested to make
> > > sure our main use case does not regress.  Of course some improvements
> > > might be made optional, but if we add too many of these driver becomes
> > > unmanageable.
> > >
> > > ---
> > >
> > > Note: should this proposal be accepted and approved, one or more
> > >       claims disclosed to the TC admin and listed on the Virtio TC
> > >       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
> > >       might become Essential Claims.
> > >
> > > --
> > > MST
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [virtio-dev] packed ring layout proposal v2
  2017-07-18 16:23     ` Michael S. Tsirkin
  (?)
@ 2017-07-19  7:41     ` Lior Narkis
  2017-07-20 13:06         ` Michael S. Tsirkin
  -1 siblings, 1 reply; 92+ messages in thread
From: Lior Narkis @ 2017-07-19  7:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization



> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@redhat.com]
> Sent: Tuesday, July 18, 2017 7:23 PM
> To: Lior Narkis <liorn@mellanox.com>
> Cc: virtio-dev@lists.oasis-open.org; virtualization@lists.linux-foundation.org
> Subject: Re: [virtio-dev] packed ring layout proposal v2
> 
> On Sun, Jul 16, 2017 at 06:00:45AM +0000, Lior Narkis wrote:
> > > -----Original Message-----
> > > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-
> open.org]
> > > On Behalf Of Michael S. Tsirkin
> > > Sent: Wednesday, February 08, 2017 5:20 AM
> > > To: virtio-dev@lists.oasis-open.org
> > > Cc: virtualization@lists.linux-foundation.org
> > > Subject: [virtio-dev] packed ring layout proposal v2
> > >
> > > This is an update from v1 version.
> > > Changes:
> > > - update event suppression mechanism
> > > - separate options for indirect and direct s/g
> > > - lots of new features
> > >
> > > ---
> > >
> > > Performance analysis of this is in my kvm forum 2016 presentation.
> > > The idea is to have a r/w descriptor in a ring structure,
> > > replacing the used and available ring, index and descriptor
> > > buffer.
> > >
> > > * Descriptor ring:
> > >
> > > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > > Host overwrites used descriptors with correct len, index, and DESC_HW
> > > clear.  Flags are always set/cleared last.
> > >
> > > #define DESC_HW 0x0080
> > >
> > > struct desc {
> > >         __le64 addr;
> > >         __le32 len;
> > >         __le16 index;
> > >         __le16 flags;
> > > };
> > >
> > > When DESC_HW is set, descriptor belongs to device. When it is clear,
> > > it belongs to the driver.
> > >
> > > We can use 1 bit to set direction
> > > /* This marks a buffer as write-only (otherwise read-only). */
> > > #define VRING_DESC_F_WRITE      2
> > >
> >
> > A valid bit per descriptor does not let the device do an efficient prefetch.
> > An alternative is to use a producer index(PI).
> > Using the PI posted by the driver, and the Consumer Index(CI) maintained
> by the device, the device knows how much work it has outstanding, so it can
> do the prefetch accordingly.
> > There are few options for the device to acquire the PI.
> > Most efficient will be to write the PI in the doorbell together with the queue
> number.
> 
> Right. This was suggested in "Fwd: Virtio-1.1 Ring Layout".
> Or just the PI if we don't need the queue number.
> 
> > I would like to raise the need for a Completion Queue(CQ).
> > Multiple Work Queues(hold the work descriptors, WQ in short) can be
> connected to a single CQ.
> > So when the device completes the work on the descriptor, it writes a
> Completion Queue Entry (CQE) to the CQ.
> 
> This seems similar to the design we currently have with a separate used
> ring. It seems to underperform writing into the available ring
> at least on a microbenchmark. The reason seems to be that
> more cache lines need to get invalidated and re-fetched.


A few points on that:
Each PCIe write invalidates a cache line.
Writing less than a cache line is inefficient, so it is better to put all metadata together and allocate a cache line for it.
Putting the metadata in the data buffer means two writes of less than a cache line each, both of which the driver will access, so potentially two cache misses.


> 
> > CQEs are continuous in memory so prefetching by the driver is efficient,
> although the device might complete work descriptors out of order.
> 
> That's not different from proposal you are replying to - descriptors
> can be used and written out in any order, they are contigious
> so driver can prefetch. 

The point is that if descriptors 1, 2, 4 complete in that order, then in the proposed layout the completion indications are placed at entries 1, 2, 4 of the virtq desc buffer.
With a CQ, the completions for 1, 2, 4 are placed at CQ indexes 1, 2, 3.
This is why a CQ is better for prefetching.

> I'd like to add that attempts to
> add prefetch e.g. for the virtio used ring never showed any
> conclusive gains - some workloads would get minor a speedup, others
> lose a bit of performance. I do not think anyone looked
> deeply into why that was the case. It would be easy for you to add this
> code to current virtio drivers and/or devices and try it out.

Noted.
I will say though that mlx5 uses prefetch and gets good performance because of it...

> 
> > The interrupt handler is connected to the CQ, so an allocation of a single CQ
> per core, with a single interrupt handler is possible although this core might be
> using multiple WQs.
> 
> Sending a single interrupt from multiple rings might save some
> cycles. event index/interrupt disable are currently in
> RAM so access is very cheap for the guest.
> If you are going to share, just disable all interrupts
> when you start processing.
> 
> I was wondering how do people want to do this in hardware
> in fact. Are you going to read event index after each descriptor?

Not sure I got you here.
Are you asking how the device decides to write an MSI-X interrupt, and how interrupt moderation might work?

> 
> It might make sense to move event index/flags into device memory. And
> once you do this, writing these out might become slower, and then some
> kind of sharing of the event index might help.
> 
> No one suggested anything like this so far - maybe it's less of an issue
> than I think, e.g. because of interrupt coalescing (which virtio also
> does not have, but could be added if there is interest) or maybe people
> just mostly care about polling performance.
> 
> > One application for multiple WQs with a single CQ is Quality of Service(QoS).
> > A user can open a WQ per QoS value(pcp value for example), and the device
> will schedule the work accordingly.
> 
> A ring per QOS level might make sense e.g. in a network device. I don't
> see why it's helpful to write out completed entries into a separate
> ring for that though.

I would like to add that an RDMA device has many queues (QPs), and finding out which QP completed work by traversing all QPs is not efficient.

Another advantage of having a CQ connected to multiple WQs is that interrupt moderation can be based on this single CQ,
so the decision whether to raise an interrupt is based on all the aggregated work that was completed on that CQ.
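
As a rough sketch of what moderation on a shared CQ could look like: the code below is count-based only and is an assumption for illustration. struct cq_moder, cq_moder_account() and the threshold knob are hypothetical names, and real hardware would typically combine this with a timer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical moderation state, kept once per CQ rather than per WQ. */
struct cq_moder {
        uint32_t pending;    /* completions written since the last interrupt */
        uint32_t threshold;  /* assumed knob: completions per interrupt */
};

/*
 * Called by the device after it writes n completions to the CQ,
 * no matter which WQ they came from.
 * Returns true when an interrupt should be raised.
 */
static bool cq_moder_account(struct cq_moder *m, uint32_t n)
{
        m->pending += n;
        if (m->pending < m->threshold)
                return false;
        m->pending = 0;
        return true;
}
```

Because the counter lives in the CQ, one completion from each of several WQs adds up towards a single interrupt instead of each WQ keeping its own count.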

> 
> As we don't have QOS support in virtio net at all right now,
> would that be best discussed once we have a working prototype
> and everyone can see what the problem is?

Understood.
Although, I think the layout should not change frequently.

> 
> 
> > > * Scatter/gather support
> > >
> > > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:
> > >
> > > /* This marks a buffer as continuing via the next field. */
> > > #define VRING_DESC_F_NEXT       1
> > >
> > > Unlike virtio 1.0, all descriptors must have distinct ID values.
> > >
> > > Also unlike virtio 1.0, use of this flag will be an optional feature
> > > (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.
> > >
> > > * Indirect buffers
> > >
> > > Can be marked like in virtio 1.0:
> > >
> > > /* This means the buffer contains a table of buffer descriptors. */
> > > #define VRING_DESC_F_INDIRECT   4
> > >
> > > Unlike virtio 1.0, this is a table, not a list:
> > > struct indirect_descriptor_table {
> > >         /* The actual descriptors (16 bytes each) */
> > >         struct virtq_desc desc[len / 16];
> > > };
> > >
> > > The first descriptor is located at start of the indirect descriptor
> > > table, additional indirect descriptors come immediately afterwards.
> > > DESC_F_WRITE is the only valid flag for descriptors in the indirect
> > > table. Others should be set to 0 and are ignored.  id is also set to 0
> > > and should be ignored.
> > >
> > > virtio 1.0 seems to allow a s/g entry followed by
> > > an indirect descriptor. This does not seem useful,
> > > so we do not allow that anymore.
> > >
> > > This support would be an optional feature, same as in virtio 1.0
> > >
> > > * Batching descriptors:
> > >
> > > virtio 1.0 allows passing a batch of descriptors in both directions, by
> > > incrementing the used/avail index by values > 1.  We can support this by
> > > chaining a list of descriptors through a bit the flags field.
> > > To allow use together with s/g, a different bit will be used.
> > >
> > > #define VRING_DESC_F_BATCH_NEXT 0x0010
> > >
> > > Batching works for both driver and device descriptors.
> > >
> > >
> > >
> > > * Processing descriptors in and out of order
> > >
> > > Device processing all descriptors in order can simply flip
> > > the DESC_HW bit as it is done with descriptors.
> > >
> > > Device can write descriptors out in order as they are used, overwriting
> > > descriptors that are there.
> > >
> > > Device must not use a descriptor until DESC_HW is set.
> > > It is only required to look at the first descriptor
> > > submitted.
> > >
> > > Driver must not overwrite a descriptor until DESC_HW is clear.
> > > It is only required to look at the first descriptor
> > > submitted.
> > >
> > > * Device specific descriptor flags
> > > We have a lot of unused space in the descriptor.  This can be put to
> > > good use by reserving some flag bits for device use.
> > > For example, network device can set a bit to request
> > > that header in the descriptor is suppressed
> > > (in case it's all 0s anyway). This reduces cache utilization.
> > >
> > > Note: this feature can be supported in virtio 1.0 as well,
> > > as we have unused bits in both descriptor and used ring there.
> > >
> > > * Descriptor length in device descriptors
> > >
> > > virtio 1.0 places strict requirements on descriptor length. For example
> > > it must be 0 in used ring of TX VQ of a network device since nothing is
> > > written.  In practice guests do not seem to use this, so we can simplify
> > > devices a bit by removing this requirement - if length is unused it
> > > should be ignored by driver.
> > >
> > > Some devices use identically-sized buffers in all descriptors.
> > > Ignoring length for driver descriptors there could be an option too.
> > >
> > > * Writing at an offset
> > >
> > > Some devices might want to write into some descriptors
> > > at an offset, the offset would be in config space,
> > > and a descriptor flag could indicate this:
> > >
> > > #define VRING_DESC_F_OFFSET 0x0020
> > >
> > > How exactly to use the offset would be device specific,
> > > for example it can be used to align beginning of packet
> > > in the 1st buffer for mergeable buffers to cache line boundary
> > > while also aligning rest of buffers.
> > >
> > > * Non power-of-2 ring sizes
> > >
> > > As the ring simply wraps around, there's no reason to
> > > require ring size to be power of two.
> > > It can be made a separate feature though.
> > >
> > >
> > > * Interrupt/event suppression
> > >
> > > virtio 1.0 has two mechanisms for suppression but only
> > > one can be used at a time. we pack them together
> > > in a structure - one for interrupts, one for notifications:
> > >
> > > struct event {
> > > 	__le16 idx;
> > > 	__le16 flags;
> > > }
> > >
> > > Both fields would be optional, with a feature bit:
> > > VIRTIO_F_EVENT_IDX
> > > VIRTIO_F_EVENT_FLAGS
> > >
> > > * Flags can be used like in virtio 1.0, by storing a special
> > > value there:
> > >
> > > #define VRING_F_EVENT_ENABLE  0x0
> > >
> > > #define VRING_F_EVENT_DISABLE 0x1
> > >
> > > * Event index would be in the range 0 to 2 * Queue Size
> > > (to detect wrap arounds) and wrap to 0 after that.
> > >
> > > The assumption is that each side maintains an internal
> > > descriptor counter 0 to 2 * Queue Size that wraps to 0.
> > > In that case, interrupt triggers when counter reaches
> > > the given value.
> > >
> > > * If both features are negotiated, a special flags value
> > > can be used to switch to event idx:
> > >
> > > #define VRING_F_EVENT_IDX     0x2
> > >
> > >
> > > * Prototype
> > >
> > > A partial prototype can be found under
> > > tools/virtio/ringtest/ring.c
> > >
> > > Test run:
> > > [mst@tuck ringtest]$ time ./ring
> > > real    0m0.399s
> > > user    0m0.791s
> > > sys     0m0.000s
> > > [mst@tuck ringtest]$ time ./virtio_ring_0_9
> > > real    0m0.503s
> > > user    0m0.999s
> > > sys     0m0.000s
> > >
> > > It is planned to update it to this interface. Future changes and
> > > enhancements can (and should) be tested against this prototype.
> > >
> > > * Feature sets
> > > In particular are we going overboard with feature bits?  It becomes hard
> > > to support all combinations in drivers and devices.  Maybe we should
> > > document reasonable feature sets to be supported for each device.
> > >
> > > * Known issues/ideas
> > >
> > > This layout is optimized for host/guest communication,
> > > in a sense even more aggressively than virtio 1.0.
> > > It might be suboptimal for PCI hardware implementations.
> > > However, one notes that current virtio pci drivers are
> > > unlikely to work with PCI hardware implementations anyway
> > > (e.g. due to use of SMP barriers for ordering).
> > >
> > > Suggestions for improving this are welcome but need to be tested to make
> > > sure our main use case does not regress.  Of course some improvements
> > > might be made optional, but if we add too many of these driver becomes
> > > unmanageable.
> > >
> > > ---
> > >
> > > Note: should this proposal be accepted and approved, one or more
> > >       claims disclosed to the TC admin and listed on the Virtio TC
> > >       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
> > >       might become Essential Claims.
> > >
> > > --
> > > MST
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v2
  2017-07-19  7:41     ` Lior Narkis
@ 2017-07-20 13:06         ` Michael S. Tsirkin
  0 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-07-20 13:06 UTC (permalink / raw)
  To: Lior Narkis; +Cc: virtio-dev, virtualization

On Wed, Jul 19, 2017 at 07:41:55AM +0000, Lior Narkis wrote:
> 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Tuesday, July 18, 2017 7:23 PM
> > To: Lior Narkis <liorn@mellanox.com>
> > Cc: virtio-dev@lists.oasis-open.org; virtualization@lists.linux-foundation.org
> > Subject: Re: [virtio-dev] packed ring layout proposal v2
> > 
> > On Sun, Jul 16, 2017 at 06:00:45AM +0000, Lior Narkis wrote:
> > > > -----Original Message-----
> > > > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-
> > open.org]
> > > > On Behalf Of Michael S. Tsirkin
> > > > Sent: Wednesday, February 08, 2017 5:20 AM
> > > > To: virtio-dev@lists.oasis-open.org
> > > > Cc: virtualization@lists.linux-foundation.org
> > > > Subject: [virtio-dev] packed ring layout proposal v2
> > > >
> > > > This is an update from v1 version.
> > > > Changes:
> > > > - update event suppression mechanism
> > > > - separate options for indirect and direct s/g
> > > > - lots of new features
> > > >
> > > > ---
> > > >
> > > > Performance analysis of this is in my kvm forum 2016 presentation.
> > > > The idea is to have a r/w descriptor in a ring structure,
> > > > replacing the used and available ring, index and descriptor
> > > > buffer.
> > > >
> > > > * Descriptor ring:
> > > >
> > > > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > > > Host overwrites used descriptors with correct len, index, and DESC_HW
> > > > clear.  Flags are always set/cleared last.
> > > >
> > > > #define DESC_HW 0x0080
> > > >
> > > > struct desc {
> > > >         __le64 addr;
> > > >         __le32 len;
> > > >         __le16 index;
> > > >         __le16 flags;
> > > > };
> > > >
> > > > When DESC_HW is set, descriptor belongs to device. When it is clear,
> > > > it belongs to the driver.
> > > >
> > > > We can use 1 bit to set direction
> > > > /* This marks a buffer as write-only (otherwise read-only). */
> > > > #define VRING_DESC_F_WRITE      2
> > > >
> > >
> > > A valid bit per descriptor does not let the device do an efficient prefetch.
> > > An alternative is to use a producer index(PI).
> > > Using the PI posted by the driver, and the Consumer Index(CI) maintained
> > by the device, the device knows how much work it has outstanding, so it can
> > do the prefetch accordingly.
> > > There are few options for the device to acquire the PI.
> > > Most efficient will be to write the PI in the doorbell together with the queue
> > number.
> > 
> > Right. This was suggested in "Fwd: Virtio-1.1 Ring Layout".
> > Or just the PI if we don't need the queue number.
> > 
> > > I would like to raise the need for a Completion Queue(CQ).
> > > Multiple Work Queues(hold the work descriptors, WQ in short) can be
> > connected to a single CQ.
> > > So when the device completes the work on the descriptor, it writes a
> > Completion Queue Entry (CQE) to the CQ.
> > 
> > This seems similar to the design we currently have with a separate used
> > ring. It seems to underperform writing into the available ring
> > at least on a microbenchmark. The reason seems to be that
> > more cache lines need to get invalidated and re-fetched.
> 
> 
> Few points on that:
> Each PCIe write will cause invalidation to a cache line.
> Writing less than a cache line is inefficient, so it is better to put all metadata together and allocate a cache line for it.
> Putting the metadata in the data buffer means two writes of less than a cache line each, both will be accessed by the driver, so potential two misses.
> 

I'm not sure how this is related to your suggestion. Sorry about being
dense.  You suggested a separate used ring that can also be shared with
multiple available rings. Current design in effect makes used and
available rings overlap instead. One side effect is each entry is bigger
now (16 bytes as opposed to 8 bytes previously) so it should be easier
to move metadata from e.g. virtio net header to the descriptor entry.
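
For reference, a minimal sketch of the overlap, using the struct desc and DESC_HW definitions from v2. Plain fixed-width types stand in for __le64/__le32/__le16, the helper names (driver_post, device_mark_used, desc_owned_by_device) are made up for illustration, and memory barriers are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_HW 0x0080  /* from v2: set means the device owns the entry */

/* Layout from v2; host-endian types stand in for the __le* types. */
struct desc {
        uint64_t addr;
        uint32_t len;
        uint16_t index;
        uint16_t flags;
};

_Static_assert(sizeof(struct desc) == 16, "one ring entry is 16 bytes");

static bool desc_owned_by_device(const struct desc *d)
{
        return (d->flags & DESC_HW) != 0;
}

/* Driver side: fill the entry, then set DESC_HW last.
 * A real driver needs a write barrier before the flags store. */
static void driver_post(struct desc *d, uint64_t addr, uint32_t len, uint16_t id)
{
        d->addr = addr;
        d->len = len;
        d->index = id;
        d->flags = DESC_HW;
}

/* Device side: overwrite the same entry with the used length and id,
 * then clear DESC_HW last. */
static void device_mark_used(struct desc *d, uint32_t written, uint16_t id)
{
        d->len = written;
        d->index = id;
        d->flags &= (uint16_t)~DESC_HW;
}
```

The same 16-byte entry carries the request on the way to the device and the used information on the way back, which is the sense in which the two rings overlap.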


> > 
> > > CQEs are continuous in memory so prefetching by the driver is efficient,
> > although the device might complete work descriptors out of order.
> > 
> > That's not different from proposal you are replying to - descriptors
> > can be used and written out in any order, they are contigious
> > so driver can prefetch. 
> 
> Point is that if descriptors 1, 2, 4 are being completed in that order, in the proposed layout the completion indication will be placed at 1, 2, 4 in the virtq desc buffer.

Note: I think you mean used, not completed. Let's use the standard virtio terminology.
The answer is: not necessarily. The device can write them out at entries 1, 2, 3. The only
requirement is really that eventually all entries 1-4 are switched to
driver ownership.

v2 says:
	Device can write descriptors out in order as they are used, overwriting
	descriptors that are there.

I think it would be clearer if it said:

	Device can write descriptors out in the order in which they are used, overwriting
	descriptors that are there.
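
A minimal sketch of that reading, under stated assumptions: struct device_state, device_write_used() and RING_SIZE are made-up names, the device is assumed to have already read every descriptor it later overwrites, and barriers are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_HW   0x0080  /* from v2 */
#define RING_SIZE 4u      /* hypothetical ring size for the example */

/* Layout from v2; host-endian types stand in for the __le* types. */
struct desc {
        uint64_t addr;
        uint32_t len;
        uint16_t index;
        uint16_t flags;
};

struct device_state {
        uint32_t used_slot;  /* next ring entry the device will overwrite */
};

/*
 * Write a used descriptor into the next ring entry, whatever its id is.
 * Returns the slot that was overwritten.
 */
static uint32_t device_write_used(struct desc *ring, struct device_state *dev,
                                  uint16_t id, uint32_t len)
{
        uint32_t slot = dev->used_slot;

        ring[slot].len = len;
        ring[slot].index = id;
        ring[slot].flags &= (uint16_t)~DESC_HW;  /* flags are written last */
        dev->used_slot = (slot + 1) % RING_SIZE;
        return slot;
}
```

With ids 1-4 posted in slots 0-3 and ids 1, 2, 4 used in that order, the writes land in slots 0, 1, 2. Slot 3 stays device-owned until id 3 is eventually used and written out there.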


> With a CQ, completions on 1, 2, 4 will be placed at 1, 2, 3 CQ indexes.
> This is why it is better for prefetching.

Looks like a misunderstanding then.

> > I'd like to add that attempts to
> > add prefetch e.g. for the virtio used ring never showed any
> > conclusive gains - some workloads would get minor a speedup, others
> > lose a bit of performance. I do not think anyone looked
> > deeply into why that was the case. It would be easy for you to add this
> > code to current virtio drivers and/or devices and try it out.
> 
> Noted.
> I will say though that mlx5 uses prefetch and gets good performance because of it...

It is pretty popular with drivers, worth revisiting if someone has the
time.


> > 
> > > The interrupt handler is connected to the CQ, so an allocation of a single CQ
> > per core, with a single interrupt handler is possible although this core might be
> > using multiple WQs.
> > 
> > Sending a single interrupt from multiple rings might save some
> > cycles. event index/interrupt disable are currently in
> > RAM so access is very cheap for the guest.
> > If you are going to share, just disable all interrupts
> > when you start processing.
> > 
> > I was wondering how do people want to do this in hardware
> > in fact. Are you going to read event index after each descriptor?
> 
> Not sure I got you here.
> Do you ask about how the device decides to write MSIX? And how interrupt moderation might work?

virtio has a flags/event index based interrupt suppression. It relies on
device reading flags/index after writing out a batch of descriptors.
Is this too costly and we should switch to driver writing the index,
or ok since it's only once per batch?

> > 
> > It might make sense to move event index/flags into device memory. And
> > once you do this, writing these out might become slower, and then some
> > kind of sharing of the event index might help.
> > 
> > No one suggested anything like this so far - maybe it's less of an issue
> > than I think, e.g. because of interrupt coalescing (which virtio also
> > does not have, but could be added if there is interest) or maybe people
> > just mostly care about polling performance.
> > 
> > > One application for multiple WQs with a single CQ is Quality of Service(QoS).
> > > A user can open a WQ per QoS value(pcp value for example), and the device
> > will schedule the work accordingly.
> > 
> > A ring per QOS level might make sense e.g. in a network device. I don't
> > see why it's helpful to write out completed entries into a separate
> > ring for that though.
> 
> I would like to add that for rdma device there are many queues (QPs), understanding which QP completed work by traversing all QPs in not efficient.
> 
> Another advantage of having a CQ connected to multiple WQs is that the interrupt moderation can be based on this single CQ, 
> So the criteria if to write interrupt or not is based on all the aggregated work that was completed on that CQ.
> 
> > 
> > As we don't have QOS support in virtio net at all right now,
> > would that be best discussed once we have a working prototype
> > and everyone can see what the problem is?
> 
> Understood.
> Although, I think the layout should not change frequently.
> 
> > 
> > 
> > > > * Scatter/gather support
> > > >
> > > > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:
> > > >
> > > > /* This marks a buffer as continuing via the next field. */
> > > > #define VRING_DESC_F_NEXT       1
> > > >
> > > > Unlike virtio 1.0, all descriptors must have distinct ID values.
> > > >
> > > > Also unlike virtio 1.0, use of this flag will be an optional feature
> > > > (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.
> > > >
> > > > * Indirect buffers
> > > >
> > > > Can be marked like in virtio 1.0:
> > > >
> > > > /* This means the buffer contains a table of buffer descriptors. */
> > > > #define VRING_DESC_F_INDIRECT   4
> > > >
> > > > Unlike virtio 1.0, this is a table, not a list:
> > > > struct indirect_descriptor_table {
> > > >         /* The actual descriptors (16 bytes each) */
> > > >         struct virtq_desc desc[len / 16];
> > > > };
> > > >
> > > > The first descriptor is located at start of the indirect descriptor
> > > > table, additional indirect descriptors come immediately afterwards.
> > > > DESC_F_WRITE is the only valid flag for descriptors in the indirect
> > > > table. Others should be set to 0 and are ignored.  id is also set to 0
> > > > and should be ignored.
> > > >
> > > > virtio 1.0 seems to allow a s/g entry followed by
> > > > an indirect descriptor. This does not seem useful,
> > > > so we do not allow that anymore.
> > > >
> > > > This support would be an optional feature, same as in virtio 1.0
> > > >
> > > > * Batching descriptors:
> > > >
> > > > virtio 1.0 allows passing a batch of descriptors in both directions, by
> > > > incrementing the used/avail index by values > 1.  We can support this by
> > > > chaining a list of descriptors through a bit the flags field.
> > > > To allow use together with s/g, a different bit will be used.
> > > >
> > > > #define VRING_DESC_F_BATCH_NEXT 0x0010
> > > >
> > > > Batching works for both driver and device descriptors.
> > > >
> > > >
> > > >
> > > > * Processing descriptors in and out of order
> > > >
> > > > Device processing all descriptors in order can simply flip
> > > > the DESC_HW bit as it is done with descriptors.
> > > >
> > > > Device can write descriptors out in order as they are used, overwriting
> > > > descriptors that are there.
> > > >
> > > > Device must not use a descriptor until DESC_HW is set.
> > > > It is only required to look at the first descriptor
> > > > submitted.
> > > >
> > > > Driver must not overwrite a descriptor until DESC_HW is clear.
> > > > It is only required to look at the first descriptor
> > > > submitted.
> > > >
> > > > * Device specific descriptor flags
> > > > We have a lot of unused space in the descriptor.  This can be put to
> > > > good use by reserving some flag bits for device use.
> > > > For example, network device can set a bit to request
> > > > that header in the descriptor is suppressed
> > > > (in case it's all 0s anyway). This reduces cache utilization.
> > > >
> > > > Note: this feature can be supported in virtio 1.0 as well,
> > > > as we have unused bits in both descriptor and used ring there.
> > > >
> > > > * Descriptor length in device descriptors
> > > >
> > > > virtio 1.0 places strict requirements on descriptor length. For example
> > > > it must be 0 in used ring of TX VQ of a network device since nothing is
> > > > written.  In practice guests do not seem to use this, so we can simplify
> > > > devices a bit by removing this requirement - if length is unused it
> > > > should be ignored by driver.
> > > >
> > > > Some devices use identically-sized buffers in all descriptors.
> > > > Ignoring length for driver descriptors there could be an option too.
> > > >
> > > > * Writing at an offset
> > > >
> > > > Some devices might want to write into some descriptors
> > > > at an offset, the offset would be in config space,
> > > > and a descriptor flag could indicate this:
> > > >
> > > > #define VRING_DESC_F_OFFSET 0x0020
> > > >
> > > > How exactly to use the offset would be device specific,
> > > > for example it can be used to align beginning of packet
> > > > in the 1st buffer for mergeable buffers to cache line boundary
> > > > while also aligning rest of buffers.
> > > >
> > > > * Non power-of-2 ring sizes
> > > >
> > > > As the ring simply wraps around, there's no reason to
> > > > require ring size to be power of two.
> > > > It can be made a separate feature though.
> > > >
> > > >
> > > > * Interrupt/event suppression
> > > >
> > > > virtio 1.0 has two mechanisms for suppression but only
> > > > one can be used at a time. we pack them together
> > > > in a structure - one for interrupts, one for notifications:
> > > >
> > > > struct event {
> > > > 	__le16 idx;
> > > > 	__le16 flags;
> > > > }
> > > >
> > > > Both fields would be optional, with a feature bit:
> > > > VIRTIO_F_EVENT_IDX
> > > > VIRTIO_F_EVENT_FLAGS
> > > >
> > > > * Flags can be used like in virtio 1.0, by storing a special
> > > > value there:
> > > >
> > > > #define VRING_F_EVENT_ENABLE  0x0
> > > >
> > > > #define VRING_F_EVENT_DISABLE 0x1
> > > >
> > > > * Event index would be in the range 0 to 2 * Queue Size
> > > > (to detect wrap arounds) and wrap to 0 after that.
> > > >
> > > > The assumption is that each side maintains an internal
> > > > descriptor counter 0 to 2 * Queue Size that wraps to 0.
> > > > In that case, interrupt triggers when counter reaches
> > > > the given value.
> > > >
> > > > * If both features are negotiated, a special flags value
> > > > can be used to switch to event idx:
> > > >
> > > > #define VRING_F_EVENT_IDX     0x2
> > > >
> > > >
> > > > * Prototype
> > > >
> > > > A partial prototype can be found under
> > > > tools/virtio/ringtest/ring.c
> > > >
> > > > Test run:
> > > > [mst@tuck ringtest]$ time ./ring
> > > > real    0m0.399s
> > > > user    0m0.791s
> > > > sys     0m0.000s
> > > > [mst@tuck ringtest]$ time ./virtio_ring_0_9
> > > > real    0m0.503s
> > > > user    0m0.999s
> > > > sys     0m0.000s
> > > >
> > > > It is planned to update it to this interface. Future changes and
> > > > enhancements can (and should) be tested against this prototype.
> > > >
> > > > * Feature sets
> > > > In particular are we going overboard with feature bits?  It becomes hard
> > > > to support all combinations in drivers and devices.  Maybe we should
> > > > document reasonable feature sets to be supported for each device.
> > > >
> > > > * Known issues/ideas
> > > >
> > > > This layout is optimized for host/guest communication,
> > > > in a sense even more aggressively than virtio 1.0.
> > > > It might be suboptimal for PCI hardware implementations.
> > > > However, one notes that current virtio pci drivers are
> > > > unlikely to work with PCI hardware implementations anyway
> > > > (e.g. due to use of SMP barriers for ordering).
> > > >
> > > > Suggestions for improving this are welcome but need to be tested to make
> > > > sure our main use case does not regress.  Of course some improvements
> > > > might be made optional, but if we add too many of these driver becomes
> > > > unmanageable.
> > > >
> > > > ---
> > > >
> > > > Note: should this proposal be accepted and approved, one or more
> > > >       claims disclosed to the TC admin and listed on the Virtio TC
> > > >       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
> > > >       might become Essential Claims.
> > > >
> > > > --
> > > > MST
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v2
@ 2017-07-20 13:06         ` Michael S. Tsirkin
  0 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-07-20 13:06 UTC (permalink / raw)
  To: Lior Narkis; +Cc: virtio-dev, virtualization

On Wed, Jul 19, 2017 at 07:41:55AM +0000, Lior Narkis wrote:
> 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Tuesday, July 18, 2017 7:23 PM
> > To: Lior Narkis <liorn@mellanox.com>
> > Cc: virtio-dev@lists.oasis-open.org; virtualization@lists.linux-foundation.org
> > Subject: Re: [virtio-dev] packed ring layout proposal v2
> > 
> > On Sun, Jul 16, 2017 at 06:00:45AM +0000, Lior Narkis wrote:
> > > > -----Original Message-----
> > > > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-
> > open.org]
> > > > On Behalf Of Michael S. Tsirkin
> > > > Sent: Wednesday, February 08, 2017 5:20 AM
> > > > To: virtio-dev@lists.oasis-open.org
> > > > Cc: virtualization@lists.linux-foundation.org
> > > > Subject: [virtio-dev] packed ring layout proposal v2
> > > >
> > > > This is an update from v1 version.
> > > > Changes:
> > > > - update event suppression mechanism
> > > > - separate options for indirect and direct s/g
> > > > - lots of new features
> > > >
> > > > ---
> > > >
> > > > Performance analysis of this is in my kvm forum 2016 presentation.
> > > > The idea is to have a r/w descriptor in a ring structure,
> > > > replacing the used and available ring, index and descriptor
> > > > buffer.
> > > >
> > > > * Descriptor ring:
> > > >
> > > > Guest adds descriptors with unique index values and DESC_HW set in flags.
> > > > Host overwrites used descriptors with correct len, index, and DESC_HW
> > > > clear.  Flags are always set/cleared last.
> > > >
> > > > #define DESC_HW 0x0080
> > > >
> > > > struct desc {
> > > >         __le64 addr;
> > > >         __le32 len;
> > > >         __le16 index;
> > > >         __le16 flags;
> > > > };
> > > >
> > > > When DESC_HW is set, descriptor belongs to device. When it is clear,
> > > > it belongs to the driver.
> > > >
> > > > We can use 1 bit to set direction
> > > > /* This marks a buffer as write-only (otherwise read-only). */
> > > > #define VRING_DESC_F_WRITE      2
> > > >
> > >
> > > A valid bit per descriptor does not let the device do an efficient prefetch.
> > > An alternative is to use a producer index(PI).
> > > Using the PI posted by the driver, and the Consumer Index(CI) maintained
> > by the device, the device knows how much work it has outstanding, so it can
> > do the prefetch accordingly.
> > > There are few options for the device to acquire the PI.
> > > Most efficient will be to write the PI in the doorbell together with the queue
> > number.
> > 
> > Right. This was suggested in "Fwd: Virtio-1.1 Ring Layout".
> > Or just the PI if we don't need the queue number.
> > 
> > > I would like to raise the need for a Completion Queue(CQ).
> > > Multiple Work Queues(hold the work descriptors, WQ in short) can be
> > connected to a single CQ.
> > > So when the device completes the work on the descriptor, it writes a
> > Completion Queue Entry (CQE) to the CQ.
> > 
> > This seems similar to the design we currently have with a separate used
> > ring. It seems to underperform writing into the available ring
> > at least on a microbenchmark. The reason seems to be that
> > more cache lines need to get invalidated and re-fetched.
> 
> 
> Few points on that:
> Each PCIe write will cause invalidation to a cache line.
> Writing less than a cache line is inefficient, so it is better to put all metadata together and allocate a cache line for it.
> Putting the metadata in the data buffer means two writes of less than a cache line each, both will be accessed by the driver, so potential two misses.
> 

I'm not sure how this is related to your suggestion. Sorry about being
dense.  You suggested a separate used ring that can also be shared with
multiple available rings. Current design in effect makes used and
available rings overlap instead. One side effect is each entry is bigger
now (16 bytes as opposed to 8 bytes previously) so it should be easier
to move metadata from e.g. virtio net header to the descriptor entry.


> > 
> > > CQEs are continuous in memory so prefetching by the driver is efficient,
> > although the device might complete work descriptors out of order.
> > 
> > That's not different from proposal you are replying to - descriptors
> > can be used and written out in any order, they are contigious
> > so driver can prefetch. 
> 
> Point is that if descriptors 1, 2, 4 are being completed in that order, in the proposed layout the completion indication will be placed at 1, 2, 4 in the virtq desc buffer.

Note: I think you mean used, not completed. Let's use the standard virtio terminology.
The answer is - not necessarily. Device can write them out at entries 1,2,3. The only
requirement is really that eventually all entries 1-4 are switched to
driver ownership.

v2 says:
	Device can write descriptors out in order as they are used, overwriting
	descriptors that are there.

I think it would be clearer if it said:

	Device can write descriptors out in the order in which they are used, overwriting
	descriptors that are there.


> With a CQ, completions on 1, 2, 4 will be placed at 1, 2, 3 CQ indexes.
> This is why it is better for prefetching.

Looks like a misunderstanding then.

> > I'd like to add that attempts to
> > add prefetch e.g. for the virtio used ring never showed any
> > conclusive gains - some workloads would get minor a speedup, others
> > lose a bit of performance. I do not think anyone looked
> > deeply into why that was the case. It would be easy for you to add this
> > code to current virtio drivers and/or devices and try it out.
> 
> Noted.
> I will say though that mlx5 uses prefetch and gets good performance because of it...

It is pretty popular with drivers, worth revisiting if someone has the
time.


> > 
> > > The interrupt handler is connected to the CQ, so an allocation of a single CQ
> > per core, with a single interrupt handler is possible although this core might be
> > using multiple WQs.
> > 
> > Sending a single interrupt from multiple rings might save some
> > cycles. event index/interrupt disable are currently in
> > RAM so access is very cheap for the guest.
> > If you are going to share, just disable all interrupts
> > when you start processing.
> > 
> > I was wondering how do people want to do this in hardware
> > in fact. Are you going to read event index after each descriptor?
> 
> Not sure I got you here.
> Do you ask about how the device decides to write MSIX? And how interrupt moderation might work?

virtio has a flags/event index based interrupt suppression. It relies on
device reading flags/index after writing out a batch of descriptors.
Is this too costly and we should switch to driver writing the index,
or ok since it's only once per batch?

> > 
> > It might make sense to move event index/flags into device memory. And
> > once you do this, writing these out might become slower, and then some
> > kind of sharing of the event index might help.
> > 
> > No one suggested anything like this so far - maybe it's less of an issue
> > than I think, e.g. because of interrupt coalescing (which virtio also
> > does not have, but could be added if there is interest) or maybe people
> > just mostly care about polling performance.
> > 
> > > One application for multiple WQs with a single CQ is Quality of Service(QoS).
> > > A user can open a WQ per QoS value(pcp value for example), and the device
> > will schedule the work accordingly.
> > 
> > A ring per QOS level might make sense e.g. in a network device. I don't
> > see why it's helpful to write out completed entries into a separate
> > ring for that though.
> 
> I would like to add that for rdma device there are many queues (QPs), understanding which QP completed work by traversing all QPs in not efficient.
> 
> Another advantage of having a CQ connected to multiple WQs is that the interrupt moderation can be based on this single CQ, 
> So the criteria if to write interrupt or not is based on all the aggregated work that was completed on that CQ.
> 
> > 
> > As we don't have QOS support in virtio net at all right now,
> > would that be best discussed once we have a working prototype
> > and everyone can see what the problem is?
> 
> Understood.
> Although, I think the layout should not change frequently.
> 
> > 
> > 
> > > > * Scatter/gather support
> > > >
> > > > We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:
> > > >
> > > > /* This marks a buffer as continuing via the next field. */
> > > > #define VRING_DESC_F_NEXT       1
> > > >
> > > > Unlike virtio 1.0, all descriptors must have distinct ID values.
> > > >
> > > > Also unlike virtio 1.0, use of this flag will be an optional feature
> > > > (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.
> > > >
> > > > * Indirect buffers
> > > >
> > > > Can be marked like in virtio 1.0:
> > > >
> > > > /* This means the buffer contains a table of buffer descriptors. */
> > > > #define VRING_DESC_F_INDIRECT   4
> > > >
> > > > Unlike virtio 1.0, this is a table, not a list:
> > > > struct indirect_descriptor_table {
> > > >         /* The actual descriptors (16 bytes each) */
> > > >         struct virtq_desc desc[len / 16];
> > > > };
> > > >
> > > > The first descriptor is located at start of the indirect descriptor
> > > > table, additional indirect descriptors come immediately afterwards.
> > > > DESC_F_WRITE is the only valid flag for descriptors in the indirect
> > > > table. Others should be set to 0 and are ignored.  id is also set to 0
> > > > and should be ignored.
> > > >
> > > > virtio 1.0 seems to allow a s/g entry followed by
> > > > an indirect descriptor. This does not seem useful,
> > > > so we do not allow that anymore.
> > > >
> > > > This support would be an optional feature, same as in virtio 1.0
> > > >
> > > > * Batching descriptors:
> > > >
> > > > virtio 1.0 allows passing a batch of descriptors in both directions, by
> > > > incrementing the used/avail index by values > 1.  We can support this by
> > > > chaining a list of descriptors through a bit the flags field.
> > > > To allow use together with s/g, a different bit will be used.
> > > >
> > > > #define VRING_DESC_F_BATCH_NEXT 0x0010
> > > >
> > > > Batching works for both driver and device descriptors.
> > > >
> > > >
> > > >
> > > > * Processing descriptors in and out of order
> > > >
> > > > Device processing all descriptors in order can simply flip
> > > > the DESC_HW bit as it is done with descriptors.
> > > >
> > > > Device can write descriptors out in order as they are used, overwriting
> > > > descriptors that are there.
> > > >
> > > > Device must not use a descriptor until DESC_HW is set.
> > > > It is only required to look at the first descriptor
> > > > submitted.
> > > >
> > > > Driver must not overwrite a descriptor until DESC_HW is clear.
> > > > It is only required to look at the first descriptor
> > > > submitted.
> > > >
> > > > * Device specific descriptor flags
> > > > We have a lot of unused space in the descriptor.  This can be put to
> > > > good use by reserving some flag bits for device use.
> > > > For example, network device can set a bit to request
> > > > that header in the descriptor is suppressed
> > > > (in case it's all 0s anyway). This reduces cache utilization.
> > > >
> > > > Note: this feature can be supported in virtio 1.0 as well,
> > > > as we have unused bits in both descriptor and used ring there.
> > > >
> > > > * Descriptor length in device descriptors
> > > >
> > > > virtio 1.0 places strict requirements on descriptor length. For example
> > > > it must be 0 in used ring of TX VQ of a network device since nothing is
> > > > written.  In practice guests do not seem to use this, so we can simplify
> > > > devices a bit by removing this requirement - if length is unused it
> > > > should be ignored by driver.
> > > >
> > > > Some devices use identically-sized buffers in all descriptors.
> > > > Ignoring length for driver descriptors there could be an option too.
> > > >
> > > > * Writing at an offset
> > > >
> > > > Some devices might want to write into some descriptors
> > > > at an offset, the offset would be in config space,
> > > > and a descriptor flag could indicate this:
> > > >
> > > > #define VRING_DESC_F_OFFSET 0x0020
> > > >
> > > > How exactly to use the offset would be device specific,
> > > > for example it can be used to align beginning of packet
> > > > in the 1st buffer for mergeable buffers to cache line boundary
> > > > while also aligning rest of buffers.
> > > >
> > > > * Non power-of-2 ring sizes
> > > >
> > > > As the ring simply wraps around, there's no reason to
> > > > require ring size to be power of two.
> > > > It can be made a separate feature though.
> > > >
> > > >
> > > > * Interrupt/event suppression
> > > >
> > > > virtio 1.0 has two mechanisms for suppression but only
> > > > one can be used at a time. we pack them together
> > > > in a structure - one for interrupts, one for notifications:
> > > >
> > > > struct event {
> > > > 	__le16 idx;
> > > > 	__le16 flags;
> > > > }
> > > >
> > > > Both fields would be optional, with a feature bit:
> > > > VIRTIO_F_EVENT_IDX
> > > > VIRTIO_F_EVENT_FLAGS
> > > >
> > > > * Flags can be used like in virtio 1.0, by storing a special
> > > > value there:
> > > >
> > > > #define VRING_F_EVENT_ENABLE  0x0
> > > >
> > > > #define VRING_F_EVENT_DISABLE 0x1
> > > >
> > > > * Event index would be in the range 0 to 2 * Queue Size
> > > > (to detect wrap arounds) and wrap to 0 after that.
> > > >
> > > > The assumption is that each side maintains an internal
> > > > descriptor counter 0 to 2 * Queue Size that wraps to 0.
> > > > In that case, interrupt triggers when counter reaches
> > > > the given value.
> > > >
> > > > * If both features are negotiated, a special flags value
> > > > can be used to switch to event idx:
> > > >
> > > > #define VRING_F_EVENT_IDX     0x2
> > > >
> > > >
> > > > * Prototype
> > > >
> > > > A partial prototype can be found under
> > > > tools/virtio/ringtest/ring.c
> > > >
> > > > Test run:
> > > > [mst@tuck ringtest]$ time ./ring
> > > > real    0m0.399s
> > > > user    0m0.791s
> > > > sys     0m0.000s
> > > > [mst@tuck ringtest]$ time ./virtio_ring_0_9
> > > > real    0m0.503s
> > > > user    0m0.999s
> > > > sys     0m0.000s
> > > >
> > > > It is planned to update it to this interface. Future changes and
> > > > enhancements can (and should) be tested against this prototype.
> > > >
> > > > * Feature sets
> > > > In particular are we going overboard with feature bits?  It becomes hard
> > > > to support all combinations in drivers and devices.  Maybe we should
> > > > document reasonable feature sets to be supported for each device.
> > > >
> > > > * Known issues/ideas
> > > >
> > > > This layout is optimized for host/guest communication,
> > > > in a sense even more aggressively than virtio 1.0.
> > > > It might be suboptimal for PCI hardware implementations.
> > > > However, one notes that current virtio pci drivers are
> > > > unlikely to work with PCI hardware implementations anyway
> > > > (e.g. due to use of SMP barriers for ordering).
> > > >
> > > > Suggestions for improving this are welcome but need to be tested to make
> > > > sure our main use case does not regress.  Of course some improvements
> > > > might be made optional, but if we add too many of these driver becomes
> > > > unmanageable.
> > > >
> > > > ---
> > > >
> > > > Note: should this proposal be accepted and approved, one or more
> > > >       claims disclosed to the TC admin and listed on the Virtio TC
> > > >       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
> > > >       might become Essential Claims.
> > > >
> > > > --
> > > > MST
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org



^ permalink raw reply	[flat|nested] 92+ messages in thread

* [virtio-dev] packed ring layout proposal v3
@ 2017-09-10  5:06 Michael S. Tsirkin
  2017-02-08 13:37 ` packed ring layout proposal v2 Christian Borntraeger
                   ` (19 more replies)
  0 siblings, 20 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-09-10  5:06 UTC (permalink / raw)
  To: virtio-dev; +Cc: virtualization

This is an update from v2 version.
Changes:
- update event suppression mechanism
- add wrap counter: DESC_WRAP flag in addition to
  DESC_DRIVER flag used for validity so device does not have to
  write out all used descriptors.
- more options especially helpful for hardware implementations
- in-order processing option due to popular demand
- list of TODO items to consider as a follow-up, only two are
  open questions we need to decide now, marked as blocker,
  others are future enhancement ideas.

---

Performance analysis of this is in my kvm forum 2016 presentation.
The idea is to have a r/w descriptor in a ring structure,
replacing the used and available ring, index and descriptor
buffer.

Note: the following mode of operation will actually work
without changes when descriptor rings do not overlap, with driver
writing out available entries in a read-only driver descriptor ring,
device writing used entries in a write-only device descriptor ring.

TODO: does this have any value for some? E.g. as a security feature?


* Descriptor ring:

Driver writes descriptors with unique index values and DESC_DRIVER set in flags.
Descriptors are written in a ring order: from start to end of ring,
wrapping around to the beginning.
Device writes used descriptors with correct len, index, and DESC_DRIVER clear.
Again descriptors are written in ring order. This might not be the same
order as the driver descriptors, and not all descriptors have to be written out.

Driver and device are expected to maintain (internally) a wrap-around
bit, starting at 0 and changing value each time they start writing out
descriptors at the beginning of the ring. This bit is passed as
DESC_WRAP bit in the flags field.

Flags are always set/cleared last.

Note that driver can write descriptors out in any order, but device
will not execute descriptor X+1 until descriptor X has been
read as valid.
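
A minimal sketch of the wrap-around bookkeeping described above. The
names (struct ring_pos, ring_pos_init, ring_pos_advance) are made up
for illustration and are not part of the proposal. It assumes the first
pass through the ring uses wrap bit 0 and the bit flips each time
writing restarts at slot 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-side state; each of driver and device keeps its own. */
struct ring_pos {
	uint16_t next;  /* next ring slot to write */
	uint16_t size;  /* number of slots; need not be a power of two */
	bool wrap;      /* current wrap-around bit, assumed to start at 0 */
};

static void ring_pos_init(struct ring_pos *p, uint16_t size)
{
	assert(size > 0);
	p->next = 0;
	p->size = size;
	p->wrap = false;
}

/* Advance by one slot; flip the wrap bit when returning to slot 0. */
static void ring_pos_advance(struct ring_pos *p)
{
	if (++p->next == p->size) {
		p->next = 0;
		p->wrap = !p->wrap;
	}
}
```

The current value of wrap is what would go into DESC_WRAP when the next
descriptor is written.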

Driver operation:

Driver makes descriptors available to device by writing out descriptors
in the descriptor ring. Once ring is full, driver waits for device to
use some descriptors before making more available.

Descriptors can be used by device in any order, but must be read from
ring in-order, and must be read completely before starting use.  Thus,
once a descriptor is used, driver can over-write both this descriptor
and any descriptors which preceded it in the ring.

Driver can detect use of descriptor either by device specific means
(e.g. detect a buffer data change by device) or in a generic way
by detecting that a used buffer has been written out by device.


Driver writes out available scatter/gather descriptor entries in guest
descriptor format:


#define DESC_WRAP 0x0040
#define DESC_DRIVER 0x0080

struct desc {
        __le64 addr;
        __le32 len;
        __le16 index;
        __le16 flags;
};

Fields:

addr - physical address of a s/g entry
len - length of an entry
index - unique index.  The low $\lceil log(N) \rceil - 1$
      bits of the index is a driver-supplied value which can have any value
      (under driver control).  The high bits are reserved and should be
      set to 0.

flags - descriptor flags.

Descriptors written by driver have DESC_DRIVER set.

Writing out this field makes the descriptor available for the device to use,
so all other fields must be valid when it is written.

DESC_WRAP - device uses this field to detect descriptor change by driver.

Driver can use 1 bit to set direction
/* This marks a descriptor as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE      2
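
A rough sketch of how a driver could make one descriptor available,
writing flags last as required above. driver_make_available() is a
hypothetical helper, plain uint*_t fields stand in for the __le types
(so this assumes a little-endian host), and the C11 release fence is
only a stand-in for whatever write barrier a real driver would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_DESC_F_WRITE 2
#define DESC_WRAP          0x0040
#define DESC_DRIVER        0x0080

/* Same layout as struct desc in the proposal. */
struct desc {
	uint64_t addr;
	uint32_t len;
	uint16_t index;
	uint16_t flags;
};

static_assert(sizeof(struct desc) == 16, "descriptor is 16 bytes");

/* Hypothetical helper: fill in one driver descriptor, flags last. */
static void driver_make_available(struct desc *d, uint64_t addr, uint32_t len,
				  uint16_t index, bool device_writes, bool wrap)
{
	uint16_t flags = DESC_DRIVER;

	if (wrap)
		flags |= DESC_WRAP;
	if (device_writes)
		flags |= VRING_DESC_F_WRITE;

	d->addr = addr;
	d->len = len;
	d->index = index;
	/* All other fields must be visible before flags is written. */
	atomic_thread_fence(memory_order_release);
	d->flags = flags;
}
```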


Device operation (using descriptors):

Device is looking for descriptors in ring order. After detecting that
the flags value has changed with DESC_DRIVER set and DESC_WRAP matching
the wrap-around counter, it can start using the descriptors.
Descriptors can be used in any order, but must be read from ring
in-order.  In other words device must not read descriptor X after it
started using descriptor X+1.  Further, all buffer descriptors must be
read completely before device starts using the buffer.

This is because once a descriptor is used, driver can over-write both this
descriptor and any preceding descriptors in ring.
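
The device-side check could look roughly like this. desc_is_available()
and device_scan() are illustrative names only, and the sketch assumes
the device keeps its own wrap bit that flips whenever its read position
returns to slot 0. A stale descriptor left over from the previous pass
still has DESC_DRIVER set but carries the old DESC_WRAP value, so it is
not mistaken for a new one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_WRAP   0x0040
#define DESC_DRIVER 0x0080

struct desc {
	uint64_t addr;
	uint32_t len;
	uint16_t index;
	uint16_t flags;
};

/* Available means: written by driver, and on the pass the device expects. */
static bool desc_is_available(uint16_t flags, bool device_wrap)
{
	bool wrap = (flags & DESC_WRAP) != 0;

	return (flags & DESC_DRIVER) && wrap == device_wrap;
}

/* Walk the ring in order from *slot; return how many descriptors were found. */
static unsigned device_scan(const struct desc *ring, uint16_t size,
			    uint16_t *slot, bool *wrap)
{
	unsigned n = 0;

	assert(size > 0 && *slot < size);
	while (n < size && desc_is_available(ring[*slot].flags, *wrap)) {
		/* Read the remaining fields only after flags was seen valid. */
		atomic_thread_fence(memory_order_acquire);
		n++;
		if (++*slot == size) {
			*slot = 0;
			*wrap = !*wrap;
		}
	}
	return n;
}
```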

To help driver detect use of descriptors and to pass extra meta-data
to driver, device writes out used entries in device descriptor format:


#define DESC_WRAP 0x0040
#define DESC_DRIVER 0x0080

struct desc {
        __le64 reserved;
        __le32 len;
        __le16 index;
        __le16 flags;
};

Fields:

reserved - can be any value, ignored by driver
len - length written by device. Only valid if VRING_DESC_F_WRITE is set.
      len bytes of data from beginning of buffer are assumed to have been updated.
index - unique index copied from the driver descriptor that has been used.
flags - descriptor flags.

Descriptors written by device have DESC_DRIVER clear.

Writing out this field notifies the driver that it can re-use the
descriptor id. It is also a signal that driver can over-write the
relevant descriptor (with the supplied id), and any other
descriptors which precede it in the ring.

DESC_WRAP - driver uses this field to detect descriptor change by device.

If device has used a buffer containing a write descriptor, it sets this bit:
#define VRING_DESC_F_WRITE      2
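
A sketch of the used-descriptor path under the same assumptions as
before (host-endian types, made-up helper names, C11 fence as a
barrier stand-in). It also assumes the driver only polls slots it has
previously made available, and that the device's DESC_WRAP value is
its own wrap-around bit for used entries; the text does not spell
either of these out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VRING_DESC_F_WRITE      2
#define DESC_WRAP   0x0040
#define DESC_DRIVER 0x0080

/* Device (used) descriptor layout; host-endian stand-ins. */
struct used_desc {
        uint64_t reserved;
        uint32_t len;
        uint16_t index;
        uint16_t flags;
};

/* Device side: write out one used entry, flags last. */
static void dev_write_used(struct used_desc *slot, uint16_t index,
                           uint32_t written, bool has_write_desc, bool wrap)
{
        uint16_t flags = 0;                     /* DESC_DRIVER clear */

        assert(slot != NULL);
        if (wrap)
                flags |= DESC_WRAP;
        if (has_write_desc)
                flags |= VRING_DESC_F_WRITE;    /* len is valid */

        slot->index = index;                    /* copied from driver descriptor */
        slot->len = has_write_desc ? written : 0;
        atomic_thread_fence(memory_order_release);
        slot->flags = flags;
}

/* Driver side: has the device written a used entry into this slot? */
static bool drv_desc_is_used(uint16_t flags, bool wrap)
{
        return !(flags & DESC_DRIVER) && ((flags & DESC_WRAP) != 0) == wrap;
}
```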

* Driver scatter/gather support

Driver can use 1 bit to chain s/g entries in a request, similar to virtio 1.0:

/* This marks a buffer as continuing in the next ring entry. */
#define VRING_DESC_F_NEXT       1

When driver descriptors are chained in this way, multiple
descriptors are treated as a part of a single transaction
containing an optional write buffer followed by an optional read buffer.
All descriptors in the chain must have the same ID.

Unlike virtio 1.0, use of this flag will be an optional feature
so both devices and drivers can opt out of it.
If they do, they can either negotiate indirect descriptors or use
single-descriptor entries exclusively for buffers.

Device might detect a partial descriptor chain (VRING_DESC_F_NEXT
set but next descriptor not valid). In that case it must not
use any parts of the chain - it will later be completed by driver,
but device is allowed to store the valid parts of the chain as
driver is not allowed to change them anymore.

Two options are available:

Device can write out the same number of descriptors for the chain,
setting VRING_DESC_F_NEXT for all but the last descriptor.
Driver will ignore all used descriptors with VRING_DESC_F_NEXT bit set.

Device only writes out a single descriptor for the whole chain.
However, to keep device and driver in sync, it then skips a number of
descriptors corresponding to the length of the chain before writing out
the next used descriptor.
After detecting a used descriptor driver must find out the length of the
chain that it built in order to know where to look for the next
device descriptor.
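
For the second option, the driver-side accounting might look like the
sketch below. Everything here is an assumption for illustration: the
cursor struct, the per-id chain_len table (filled in by the driver
when it builds a chain) and the MAX_IDS bound are not part of the
proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_IDS 256     /* hypothetical bound on descriptor ids */

/* Hypothetical driver-side cursor over used descriptors. */
struct drv_used_cursor {
        uint16_t next;                  /* slot of the next used descriptor */
        uint16_t size;                  /* number of slots in the ring */
        bool wrap;                      /* wrap-around bit expected there */
        uint16_t chain_len[MAX_IDS];    /* chain length recorded per id */
};

/* After seeing a used descriptor for 'id', skip the whole chain. */
static void drv_skip_chain(struct drv_used_cursor *c, uint16_t id)
{
        uint32_t pos;

        assert(id < MAX_IDS);
        assert(c->chain_len[id] > 0 && c->chain_len[id] <= c->size);
        pos = (uint32_t)c->next + c->chain_len[id];
        if (pos >= c->size) {           /* at most one wrap is possible */
                pos -= c->size;
                c->wrap = !c->wrap;
        }
        c->next = (uint16_t)pos;
}
```

With out-of-order use the driver needs a lookup like chain_len[] keyed
by id; with in-order use it could instead keep the lengths in a simple
FIFO.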

* Indirect buffers

Indirect buffer descriptors are an optional feature.
These are always written by driver, not the device.
Indirect buffers have a special flag bit set - like in virtio 1.0:

/* This means the buffer contains a table of buffer descriptors. */
#define VRING_DESC_F_INDIRECT   4

VRING_DESC_F_WRITE and VRING_DESC_F_NEXT are always clear.

len specifies the length of the indirect descriptor buffer in bytes
and must be a multiple of 16.

Unlike virtio 1.0, the buffer pointed to is a table, not a list:
struct indirect_descriptor_table {
        /* The actual descriptors (16 bytes each) */
        struct indirect_desc desc[len / 16];
};

The first descriptor is located at start of the indirect descriptor
table, additional indirect descriptors come immediately afterwards.

struct indirect_desc {
        __le64 addr;
        __le32 len;
        __le16 reserved;
        __le16 flags;
};


VRING_DESC_F_WRITE is the only valid flag for descriptors in the indirect
table. Other flags should be set to 0 and are ignored.  The reserved field
should also be set to 0 and is ignored.
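
As a small illustration of the table layout, the sketch below derives
the entry count from len and walks the table. The function names are
invented for the example and host-endian types replace the __le types.

```c
#include <assert.h>
#include <stdint.h>

#define VRING_DESC_F_WRITE      2

struct indirect_desc {
        uint64_t addr;
        uint32_t len;
        uint16_t reserved;
        uint16_t flags;
};

static_assert(sizeof(struct indirect_desc) == 16,
              "indirect descriptors are 16 bytes each");

/* Number of entries in a table of table_len bytes, or -1 if invalid. */
static int indirect_entry_count(uint32_t table_len)
{
        if (table_len % 16 != 0)
                return -1;              /* len must be a multiple of 16 */
        return (int)(table_len / 16);
}

/* Total bytes the device may write; other flag bits are ignored. */
static uint64_t indirect_writable_bytes(const struct indirect_desc *table,
                                        uint32_t table_len)
{
        int n = indirect_entry_count(table_len);
        uint64_t total = 0;

        for (int i = 0; i < n; i++)     /* no iterations if len was invalid */
                if (table[i].flags & VRING_DESC_F_WRITE)
                        total += table[i].len;
        return total;
}
```

Since the table is an array rather than a linked list, the device can
fetch all len / 16 entries in one read without following next pointers.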

TODO (blocker): virtio 1.0 allows a s/g entry followed by
      an indirect descriptor. Is this useful?

This support would be an optional feature, same as in virtio 1.0

* Batching descriptors:

virtio 1.0 allows passing a batch of descriptors in both directions, by
incrementing the used/avail index by values > 1.
At the moment only batching of used descriptors is used.

We can support this by chaining a list of device descriptors through
VRING_DESC_F_MORE flag. Device sets this bit to signal
driver that this is part of a batch of used descriptors
which are all part of a single transaction.

Driver might detect a partial descriptor chain (VRING_DESC_F_MORE
set but next descriptor not valid). In that case it must not
use any parts of the chain - it will later be completed by device,
but driver is allowed to store the valid parts of the chain as
device is not allowed to change them anymore.

Descriptor should not have both VRING_DESC_F_MORE and
VRING_DESC_F_NEXT set.

* Using descriptors in order

Some devices can guarantee that descriptors are used in
the order in which they were made available.
This allows driver optimizations and can be negotiated through
a feature bit.

* Per ring flags

It is useful to support features for some rings but not others.
E.g. it's reasonable to use single buffers for RX rings but
sg or indirect for TX rings of the network device.
Generic configuration space will be extended so features can
be negotiated per ring.

* Selective use of descriptors

As described above, descriptors with NEXT bit set are part of a
scatter/gather chain and so do not have to cause device to write a used
descriptor out.

Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to
signal to device that it does not have to write out the used descriptor
as it is part of a batch of descriptors. Device has two options (similar
to VRING_DESC_F_NEXT):

Device can write out the same number of descriptors for the batch,
setting VRING_DESC_F_MORE for all but the last descriptor.
Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set.

Device only writes out a single descriptor for the whole batch.
However, to keep device and driver in sync, it then skips a number of
descriptors corresponding to the length of the batch before writing out
the next used descriptor.
After detecting a used descriptor driver must find out the length of the
batch that it built in order to know where to look for the next
device descriptor.


TODO (blocker): skipping descriptors for selective and
scatter/gather use seems to be only requested with in-order right now. Let's
require in-order for this skipping?  This will simplify the accounting
by driver.


* Interrupt/event suppression

virtio 1.0 has two mechanisms for suppression but only
one can be used at a time. We pack them together
in a structure - one for interrupts, one for notifications:

struct event {
	__le16 idx;
	__le16 flags;
};

Both fields would be optional, with a feature bit:
VIRTIO_F_EVENT_IDX
VIRTIO_F_EVENT_FLAGS

Flags can be used like in virtio 1.0, by storing a special
value there:

#define VRING_F_EVENT_ENABLE  0x0

#define VRING_F_EVENT_DISABLE 0x1

Event index includes the index of the descriptor
which should trigger the event, and the wrap counter
in the high bit.

In that case, interrupt triggers when descriptor is written at a given
location in the ring (or skipped in case of NEXT/MORE).

If both features are negotiated, a special flags value
can be used to switch to event idx:

#define VRING_F_EVENT_IDX     0x2
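
A sketch of how the event structure could be evaluated when a single
descriptor is written at a given ring offset. Assumptions: the helper
names are invented, the offset is taken to occupy the low 15 bits of
idx, unknown flags values are treated as "enabled", and the
skipped-descriptor case (NEXT/MORE) is not handled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_F_EVENT_ENABLE  0x0
#define VRING_F_EVENT_DISABLE 0x1
#define VRING_F_EVENT_IDX     0x2

struct event {
        uint16_t idx;
        uint16_t flags;
};

/* Ring offset in the low 15 bits, wrap counter in the high bit. */
static uint16_t event_idx_pack(uint16_t offset, bool wrap)
{
        assert(offset < 0x8000);
        return (uint16_t)(offset | (wrap ? 0x8000 : 0));
}

/* Should writing a descriptor at (offset, wrap) trigger an event? */
static bool event_should_trigger(const struct event *ev, uint16_t offset,
                                 bool wrap)
{
        switch (ev->flags) {
        case VRING_F_EVENT_DISABLE:
                return false;
        case VRING_F_EVENT_IDX:
                return ev->idx == event_idx_pack(offset, wrap);
        case VRING_F_EVENT_ENABLE:
        default:                        /* assumption: unknown means enabled */
                return true;
        }
}
```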

* Available notification

Driver currently writes out the queue number to device to
kick off ring processing.

As queue number is currently 16 bit, we can extend that
to additionally include the offset within ring of the descriptor
which triggered the kick event in bits 16 to 30,
and the wrap counter in the high bit (31).

Device is allowed to pre-fetch descriptors beyond the specified
offset but is not required to do so.
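
The bit layout described above can be written down as follows; the
function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* bits 0-15: queue number, bits 16-30: ring offset, bit 31: wrap counter */
static uint32_t notify_pack(uint16_t queue, uint16_t offset, bool wrap)
{
        assert(offset < 0x8000);        /* offset has only 15 bits */
        return (uint32_t)queue |
               ((uint32_t)offset << 16) |
               ((uint32_t)wrap << 31);
}

static uint16_t notify_queue(uint32_t v)
{
        return (uint16_t)(v & 0xffff);
}

static uint16_t notify_offset(uint32_t v)
{
        return (uint16_t)((v >> 16) & 0x7fff);
}

static bool notify_wrap(uint32_t v)
{
        return (v >> 31) != 0;
}
```

For example, queue 2 with offset 5 and wrap counter 1 packs to
0x80050002. Note that 15 bits of offset limit this scheme to rings of
at most 32768 entries.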



* TODO: interrupt coalescing

Does it make sense just for networking or for all devices?
If the latter, should we make it a per ring or a global feature?


* TODO: event index/flags in device memory?

Should we move the event index/flags to device memory?
Might be helpful for hardware configuration so they do not
need to do DMA reads to check whether interrupt is needed.
OTOH maybe interrupt coalescing is sufficient for this.


* TODO: Device specific descriptor flags

We have a lot of unused space in the descriptor.  This can be put to
good use by reserving some flag bits for device use.
For example, network device can set a bit to request
that header in the descriptor is suppressed
(in case it's all 0s anyway). This reduces cache utilization.

Note: this feature can be supported in virtio 1.0 as well,
as we have unused bits in both descriptor and used ring there.

* TODO: Descriptor length in device descriptors

Some devices use identically-sized buffers in all descriptors.
Ignoring length for driver descriptors there could be an option too.

* TODO: Writing at an offset

Some devices might want to write into some descriptors
at an offset, the offset would be in reserved field in the descriptor,
possibly a descriptor flag could indicate this:

#define VRING_DESC_F_OFFSET 0x0020

How exactly to use the offset would be device specific,
for example it can be used to align beginning of packet
in the 1st buffer for mergeable buffers to cache line boundary
while also aligning rest of buffers.

* TODO: Non power-of-2 ring sizes

As the ring simply wraps around, there's no reason to
require ring size to be power of two.
It can be made a separate feature though.
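
To illustrate why a power of two is not needed: advancing the ring
position can use a compare instead of a mask. The helper below is a
hypothetical sketch, not part of the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Advance one slot in a ring of arbitrary size, flipping the
 * wrap-around bit when the position returns to the start. */
static uint16_t ring_next(uint16_t pos, uint16_t size, bool *wrap)
{
        assert(size > 0 && pos < size);
        if (++pos == size) {
                pos = 0;
                *wrap = !*wrap;
        }
        return pos;
}
```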


TODO: limits on buffer alignment/size

Seems to be useful for RX for networking.
Is there a need to negotiate above in a generic way
or is this a networking specific optimization?

TODO: expose wrap counters to device for debugging

TODO: expose last avail/used offsets to device/driver for debugging

TODO: ability to reset individual rings

---

Note: should this proposal be accepted and approved, one or more
      claims disclosed to the TC admin and listed on the Virtio TC
      IPR page https://www.oasis-open.org/committees/virtio/ipr.php
      might become Essential Claims.
Note: the page above is unfortunately out of date and out of
      my hands. I'm in the process of updating ipr disclosures
      in github instead.  Will make sure all is in place before
      this proposal is put to vote. As usual this TC operates under the
      Non-Assertion Mode of the OASIS IPR Policy, which protects
      anyone implementing the virtio spec.

-- 
MST


* Re: packed ring layout proposal v3
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
                   ` (11 preceding siblings ...)
  2017-09-11  7:47 ` [virtio-dev] Re: packed ring layout proposal v3 Jason Wang
@ 2017-09-11  7:47 ` Jason Wang
  2017-09-12 16:20 ` [virtio-dev] " Willem de Bruijn
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 92+ messages in thread
From: Jason Wang @ 2017-09-11  7:47 UTC (permalink / raw)
  To: Michael S. Tsirkin, virtio-dev; +Cc: virtualization



On 2017-09-10 13:06, Michael S. Tsirkin wrote:
> This is an update from v2 version.
> Changes:
> - update event suppression mechanism
> - add wrap counter: DESC_WRAP flag in addition to
>    DESC_DRIVER flag used for validity so device does not have to
>    write out all used descriptors.

Do we have benchmark results showing the advantage of DESC_DRIVER over
e.g. the avail/used index?

> - more options especially helpful for hardware implementations
> - in-order processing option due to popular demand
> - list of TODO items to consider as a follow-up, only two are
>    open questions we need to decide now, marked as blocker,
>    others are future enhancement ideas.
>
> ---
>
> Performance analysis of this is in my kvm forum 2016 presentation.
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.
>
> Note: the following mode of operation will actually work
> without changes when descriptor rings do not overlap, with driver
> writing out available entries in a read-only driver descriptor ring,
> device writing used entries in a write-only device descriptor ring.
>
> TODO: does this have any value for some? E.g. as a security feature?
>
>
> * Descriptor ring:
>
> Driver writes descriptors with unique index values and DESC_DRIVER set in flags.

You probably mean DESC_HW here?

> Descriptors are written in a ring order: from start to end of ring,
> wrapping around to the beginning.
> Device writes used descriptors with correct len, index, and DESC_HW clear.
> Again descriptors are written in ring order. This might not be the same
> order of driver descriptors, and not all descriptors have to be written out.
>
> Driver and device are expected to maintain (internally) a wrap-around
> bit, starting at 0 and changing value each time they start writing out
> descriptors at the beginning of the ring. This bit is passed as
> DESC_WRAP bit in the flags field.

I'm not sure this is really needed; I think it could be done by
checking vq.num?

>
> Flags are always set/cleared last.
>
> Note that driver can write descriptors out in any order, but device
> will not execute descriptor X+1 until descriptor X has been
> read as valid.
>
> Driver operation:
>
> Driver makes descriptors available to device by writing out descriptors
> in the descriptor ring. Once ring is full, driver waits for device to
> use some descriptors before making more available.
>
> Descriptors can be used by device in any order, but must be read from
> ring in-order, and must be read completely before starting use.  Thus,
> once a descriptor is used, driver can over-write both this descriptor
> and any descriptors which preceded it in the ring.
>
> Driver can detect use of descriptor either by device specific means
> (e.g. detect a buffer data change by device) or in a generic way
> by detecting that a used buffer has been written out by device.
>
>
> Driver writes out available scatter/gather descriptor entries in guest
> descriptor format:
>
>
> #define DESC_WRAP 0x0040
> #define DESC_DRIVER 0x0080
>
> struct desc {
>          __le64 addr;
>          __le32 len;
>          __le16 index;
>          __le16 flags;
> };
>
> Fields:
>
> addr - physical address of a s/g entry
> len - length of an entry
> index - unique index.  The low $\lceil log(N) \rceil - 1$
>        bits of the index is a driver-supplied value which can have any value
>        (under driver control).  The high bits are reserved and should be
>        set to 0.

Drivers usually have their own information ring, so I'm not sure
exposing such flexibility is really needed. For completion, the only
information meaningful to hardware is the index of the descriptor. And
DESC_WRAP could be checked implicitly through index in desc < index of
this descriptor. (Though I'm still not quite sure DESC_WRAP is needed.)

>
> flags - descriptor flags.
>
> Descriptors written by driver have DESC_DRIVER set.
>
> Writing out this field makes the descriptor available for the device to use,
> so all other fields must be valid when it is written.
>
> DESC_WRAP - device uses this field to detect descriptor change by driver.

This looks a little confusing; the device can in fact check this through
DESC_HW (or DESC_DRIVER) too?

>
> Driver can use 1 bit to set direction
> /* This marks a descriptor as write-only (otherwise read-only). */
> #define VRING_DESC_F_WRITE      2
>
>
> Device operation (using descriptors):
>
> Device is looking for descriptors in ring order. After detecting that
> the flags value has changed with DESC_DRIVER set and DESC_WRAP matching
> the wrap-around counter, it can start using the descriptors.
> Descriptors can be used in any order, but must be read from ring
> in-order.  In other words device must not read descriptor X after it
> started using descriptor X+1.  Further, all buffer descriptors must be
> read completely before device starts using the buffer.
>
> This is because once a descriptor is used, driver can over-write both this
> descriptor and any preceding descriptors in ring.
>
> To help driver detect use of descriptors and to pass extra meta-data
> to driver, device writes out used entries in device descriptor format:
>
>
> #define DESC_WRAP 0x0040
> #define DESC_DRIVER 0x0080
>
> struct desc {
>          __le64 reserved;
>          __le32 len;
>          __le16 index;
>          __le16 flags;
> };
>
> Fields:
>
> reserved - can be any value, ignored by driver
> len - length written by device. only valid if VRING_DESC_F_WRITE is set
>        len bytes of data from beginning of buffer are assumed to have been updated
> index - unique index copied from the driver descriptor that has been used.
> flags - descriptor flags.
>
> Descriptors written by device have DESC_DRIVER clear.
>
> Writing out this field notifies the driver that it can re-use the
> descriptor id. It is also a signal that driver can over-write the
> relevant descriptor (with the supplied id), and any other
>
> DESC_WRAP - driver uses this field to detect descriptor change by device.
>
> If device has used a buffer containing a write descriptor, it sets this bit:
> #define VRING_DESC_F_WRITE      2
>
> * Driver scatter/gather support
>
> Driver can use 1 bit to chain s/g entries in a request, similar to virtio 1.0:
>
> /* This marks a buffer as continuing in the next ring entry. */
> #define VRING_DESC_F_NEXT       1
>
> When driver descriptors are chained in this way, multiple
> descriptors are treated as a part of a single transaction
> containing an optional write buffer followed by an optional read buffer.
> All descriptors in the chain must have the same ID.
>
> Unlike virtio 1.0, use of this flag will be an optional feature
> so both devices and drivers can opt out of it.
> If they do, they can either negotiate indirect descriptors or use
> single-descriptor entries exclusively for buffers.
>
> Device might detect a partial descriptor chain (VRING_DESC_F_NEXT
> set but next descriptor not valid). In that case it must not
> use any parts of the chain - it will later be completed by driver,
> but device is allowed to store the valid parts of the chain as
> driver is not allowed to change them anymore.

Does this mean that e.g. the device needs to busy-wait for the complete
chain if it finds an incomplete one? That looks suboptimal.

>
> Two options are available:
>
> Device can write out the same number of descriptors for the chain,
> setting VRING_DESC_F_NEXT for all but the last descriptor.
> Driver will ignore all used descriptors with VRING_DESC_F_NEXT bit set.
>
> Device only writes out a single descriptor for the whole chain.
> However, to keep device and driver in sync, it then skips a number of
> descriptors corresponding to the length of the chain before writing out
> the next used descriptor.
> After detecting a used descriptor driver must find out the length of the
> chain that it built in order to know where to look for the next
> device descriptor.
>
> * Indirect buffers
>
> Indirect buffer descriptors are an optional feature.
> These are always written by driver, not the device.
> Indirect buffers have a special flag bit set - like in virtio 1.0:
>
> /* This means the buffer contains a table of buffer descriptors. */
> #define VRING_DESC_F_INDIRECT   4
>
> VRING_DESC_F_WRITE and VRING_DESC_F_NEXT are always clear.
>
> len specifies the length of the indirect descriptor buffer in bytes
> and must be a multiple of 16.
>
> Unlike virtio 1.0, the buffer pointed to is a table, not a list:
> struct indirect_descriptor_table {
>          /* The actual descriptors (16 bytes each) */
>          struct indirect_desc desc[len / 16];
> };
>
> The first descriptor is located at start of the indirect descriptor
> table, additional indirect descriptors come immediately afterwards.
>
> struct indirect_desc {
>          __le64 addr;
>          __le32 len;
>          __le16 reserved;
>          __le16 flags;
> };
>
>
> DESC_F_WRITE is the only valid flag for descriptors in the indirect
> table. Others should be set to 0 and are ignored.  reserved field is
> also set to 0 and should be ignored.
>
> TODO (blocker): virtio 1.0 allows a s/g entry followed by
>        an indirect descriptor. Is this useful?
>
> This support would be an optional feature, same as in virtio 1.0
>
> * Batching descriptors:
>
> virtio 1.0 allows passing a batch of descriptors in both directions, by
> incrementing the used/avail index by values > 1.
> At the moment only batching of used descriptors is used.
>
> We can support this by chaining a list of device descriptors through
> VRING_DESC_F_MORE flag. Device sets this bit to signal
> driver that this is part of a batch of used descriptors
> which are all part of a single transaction.

If this is part of a single transaction, I don't see an obvious
difference from DESC_F_NEXT. I thought that for batching, each
descriptor is independent and the descriptors should belong to several
different transactions (e.g. for net, each descriptor could be an
independent packet).

>
> Driver might detect a partial descriptor chain (VRING_DESC_F_MORE
> set but next descriptor not valid). In that case it must not
> use any parts of the chain - it will later be completed by device,
> but driver is allowed to store the valid parts of the chain as
> device is not allowed to change them anymore.
>
> Descriptor should not have both VRING_DESC_F_MORE and
> VRING_DESC_F_NEXT set.
>
> * Using descriptors in order
>
> Some devices can guarantee that descriptors are used in
> the order in which they were made available.
> This allows driver optimizations and can be negotiated through
> a feature bit.
>
> * Per ring flags
>
> It is useful to support features for some rings but not others.
> E.g. it's reasonable to use single buffers for RX rings but
> sg or indirect for TX rings of the network device.
> Generic configuration space will be extended so features can
> be negotiated per ring.
>
> * Selective use of descriptors
>
> As described above, descriptors with NEXT bit set are part of a
> scatter/gather chain and so do not have to cause device to write a used
> descriptor out.
>
> Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to
> signal to device that it does not have to write out the used descriptor
> as it is part of a batch of descriptors. Device has two options (similar
> to VRING_DESC_F_NEXT):
>
> Device can write out the same number of descriptors for the batch,
> setting VRING_DESC_F_MORE for all but the last descriptor.
> Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set.
>
> Device only writes out a single descriptor for the whole batch.
> However, to keep device and driver in sync, it then skips a number of
> descriptors corresponding to the length of the batch before writing out
> the next used descriptor.
> After detecting a used descriptor driver must find out the length of the
> batch that it built in order to know where to look for the next
> device descriptor.

A silly question: how can the driver find out the length of the batch
efficiently?  It looks like it can only scan the ring until it finds a
descriptor that has DESC_HW cleared?

>
>
> TODO (blocker): skipping descriptors for selective and
> scatter/gather seem to be only requested with in-order right now. Let's
> require in-order for this skipping?  This will simplify the accounting
> by driver.
>
>
> * Interrupt/event suppression
>
> virtio 1.0 has two mechanisms for suppression but only
> one can be used at a time. we pack them together
> in a structure - one for interrupts, one for notifications:
>
> struct event {
> 	__le16 idx;
> 	__le16 flags;
> }
>
> Both fields would be optional, with a feature bit:
> VIRTIO_F_EVENT_IDX
> VIRTIO_F_EVENT_FLAGS
>
> Flags can be used like in virtio 1.0, by storing a special
> value there:
>
> #define VRING_F_EVENT_ENABLE  0x0
>
> #define VRING_F_EVENT_DISABLE 0x1
>
> Event index includes the index of the descriptor
> which should trigger the event, and the wrap counter
> in the high bit.

Not specific to v3, but it looks like with event index we can't achieve
interrupt-less or exit-less operation, considering that idx may wrap.

>
> In that case, interrupt triggers when descriptor is written at a given
> location in the ring (or skipped in case of NEXT/MORE).
>
> If both features are negotiated, a special flags value
> can be used to switch to event idx:
>
> #define VRING_F_EVENT_IDX     0x2
>
> * Available notification
>
> Driver currently writes out the queue number to device to
> kick off ring processing.
>
> As queue number is currently 16 bit, we can extend that
> to additionally include the offset within ring of the descriptor
> which triggered the kick event in bits 16 to 30,
> and the wrap counter in the high bit (31).
>
> Device is allowed to pre-fetch descriptors beyond the specified
> offset but is not required to do so.

With DESC_HW or another flag, I think prefetching may introduce extra
overhead, since the device needs to keep scanning descriptors until it
finds one without DESC_HW set?

>
>
>
> * TODO: interrupt coalescing
>
> Does it make sense just for networking or for all devices?
> If the latter, should we make it a per ring or a global feature?
>
>
> * TODO: event index/flags in device memory?
>
> Should we move the event index/flags to device memory?
> Might be helpful for hardware configuration so they do not
> need to do DMA reads to check whether interrupt is needed.
> OTOH maybe interrupt coalescing is sufficient for this.
>
>
> * TODO: Device specific descriptor flags
>
> We have a lot of unused space in the descriptor.  This can be put to
> good use by reserving some flag bits for device use.
> For example, network device can set a bit to request
> that header in the descriptor is suppressed
> (in case it's all 0s anyway). This reduces cache utilization.
>
> Note: this feature can be supported in virtio 1.0 as well,
> as we have unused bits in both descriptor and used ring there.

I think we should at least try packing the virtio-net header into the
descriptor ring.

>
> * TODO: Descriptor length in device descriptors
>
> Some devices use identically-sized buffers in all descriptors.
> Ignoring length for driver descriptors there could be an option too.
>
> * TODO: Writing at an offset
>
> Some devices might want to write into some descriptors
> at an offset, the offset would be in reserved field in the descriptor,
> possibly a descriptor flag could indicate this:
>
> #define VRING_DESC_F_OFFSET 0x0020
>
> How exactly to use the offset would be device specific,
> for example it can be used to align beginning of packet
> in the 1st buffer for mergeable buffers to cache line boundary
> while also aligning rest of buffers.

Maybe even more, e.g. NET_SKB_PAD; then we could use build_skb() in
Linux drivers.

>
> * TODO: Non power-of-2 ring sizes
>
> As the ring simply wraps around, there's no reason to
> require ring size to be power of two.
> It can be made a separate feature though.
>
>
> TODO: limits on buffer alignment/size
>
> Seems to be useful for RX for networking.
> Is there a need to negotiate above in a generic way
> or is this a networking specific optimization?
>
> TODO: expose wrap counters to device for debugging
>
> TODO: expose last avail/used offsets to device/driver for debugging
>
> TODO: ability to reset individual rings

Any actual usage of this?

Thanks

>
> ---
>
> Note: should this proposal be accepted and approved, one or more
>        claims disclosed to the TC admin and listed on the Virtio TC
>        IPR page https://www.oasis-open.org/committees/virtio/ipr.php
>        might become Essential Claims.
> Note: the page above is unfortunately out of date and out of
>        my hands. I'm in the process of updating ipr disclosures
>        in github instead.  Will make sure all is in place before
>        this proposal is put to vote. As usual this TC operates under the
>        Non-Assertion Mode of the OASIS IPR Policy, which protects
>        anyone implementing the virtio spec.
>


* [virtio-dev] Re: packed ring layout proposal v3
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
                   ` (10 preceding siblings ...)
  2017-07-16  6:00 ` Lior Narkis
@ 2017-09-11  7:47 ` Jason Wang
  2017-09-12 16:23   ` Willem de Bruijn
  2017-09-12 16:23   ` Willem de Bruijn
  2017-09-11  7:47 ` Jason Wang
                   ` (7 subsequent siblings)
  19 siblings, 2 replies; 92+ messages in thread
From: Jason Wang @ 2017-09-11  7:47 UTC (permalink / raw)
  To: Michael S. Tsirkin, virtio-dev; +Cc: virtualization



On 2017-09-10 13:06, Michael S. Tsirkin wrote:
> This is an update from v2 version.
> Changes:
> - update event suppression mechanism
> - add wrap counter: DESC_WRAP flag in addition to
>    DESC_DRIVER flag used for validity so device does not have to
>    write out all used descriptors.

Do we have benchmark results showing the advantage of DESC_DRIVER over
e.g. the avail/used index?

> - more options especially helpful for hardware implementations
> - in-order processing option due to popular demand
> - list of TODO items to consider as a follow-up, only two are
>    open questions we need to decide now, marked as blocker,
>    others are future enhancement ideas.
>
> ---
>
> Performance analysis of this is in my kvm forum 2016 presentation.
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.
>
> Note: the following mode of operation will actually work
> without changes when descriptor rings do not overlap, with driver
> writing out available entries in a read-only driver descriptor ring,
> device writing used entries in a write-only device descriptor ring.
>
> TODO: does this have any value for some? E.g. as a security feature?
>
>
> * Descriptor ring:
>
> Driver writes descriptors with unique index values and DESC_DRIVER set in flags.

You probably mean DESC_HW here?

> Descriptors are written in a ring order: from start to end of ring,
> wrapping around to the beginning.
> Device writes used descriptors with correct len, index, and DESC_HW clear.
> Again descriptors are written in ring order. This might not be the same
> order of driver descriptors, and not all descriptors have to be written out.
>
> Driver and device are expected to maintain (internally) a wrap-around
> bit, starting at 0 and changing value each time they start writing out
> descriptors at the beginning of the ring. This bit is passed as
> DESC_WRAP bit in the flags field.

I'm not sure this is really needed; I think it could be done by
checking vq.num?

>
> Flags are always set/cleared last.
>
> Note that driver can write descriptors out in any order, but device
> will not execute descriptor X+1 until descriptor X has been
> read as valid.
>
> Driver operation:
>
> Driver makes descriptors available to device by writing out descriptors
> in the descriptor ring. Once ring is full, driver waits for device to
> use some descriptors before making more available.
>
> Descriptors can be used by device in any order, but must be read from
> ring in-order, and must be read completely before starting use.  Thus,
> once a descriptor is used, driver can over-write both this descriptor
> and any descriptors which preceded it in the ring.
>
> Driver can detect use of descriptor either by device specific means
> (e.g. detect a buffer data change by device) or in a generic way
> by detecting that a used buffer has been written out by device.
>
>
> Driver writes out available scatter/gather descriptor entries in guest
> descriptor format:
>
>
> #define DESC_WRAP 0x0040
> #define DESC_DRIVER 0x0080
>
> struct desc {
>          __le64 addr;
>          __le32 len;
>          __le16 index;
>          __le16 flags;
> };
>
> Fields:
>
> addr - physical address of a s/g entry
> len - length of an entry
> index - unique index.  The low $\lceil log(N) \rceil - 1$
>        bits of the index is a driver-supplied value which can have any value
>        (under driver control).  The high bits are reserved and should be
>        set to 0.

Drivers usually have their own information ring, so I'm not sure
exposing such flexibility is really needed. For completion, the only
information meaningful to hardware is the index of the descriptor. And
DESC_WRAP could be checked implicitly through index in desc < index of
this descriptor. (Though I'm still not quite sure DESC_WRAP is needed.)

>
> flags - descriptor flags.
>
> Descriptors written by driver have DESC_DRIVER set.
>
> Writing out this field makes the descriptor available for the device to use,
> so all other fields must be valid when it is written.
>
> DESC_WRAP - device uses this field to detect descriptor change by driver.

This looks a little confusing: the device can in fact check this through 
DESC_HW (or DESC_DRIVER) too?

>
> Driver can use 1 bit to set direction
> /* This marks a descriptor as write-only (otherwise read-only). */
> #define VRING_DESC_F_WRITE      2
>
>
> Device operation (using descriptors):
>
> Device is looking for descriptors in ring order. After detecting that
> the flags value has changed with DESC_DRIVER set and DESC_WRAP matching
> the wrap-around counter, it can start using the descriptors.
> Descriptors can be used in any order, but must be read from ring
> in-order.  In other words device must not read descriptor X after it
> started using descriptor X+1.  Further, all buffer descriptors must be
> read completely before device starts using the buffer.
>
> This because once a descriptor is used, driver can over-write both this
> descriptor and any preceding descriptors in ring.
>
> To help driver detect use of descriptors and to pass extra meta-data
> to driver, device writes out used entries in device descriptor format:
>
>
> #define DESC_WRAP 0x0040
> #define DESC_DRIVER 0x0080
>
> struct desc {
>          __le64 reserved;
>          __le32 len;
>          __le16 index;
>          __le16 flags;
> };
>
> Fields:
>
> reserved - can be any value, ignored by driver
> len - length written by device. only valid if VRING_DESC_F_WRITE is set
>        len bytes of data from beginning of buffer are assumed to have been updated
> index - unique index copied from the driver descriptor that has been used.
> flags - descriptor flags.
>
> Descriptors written by device have DESC_DRIVER clear.
>
> Writing out this field notifies the driver that it can re-use the
> descriptor id. It is also a signal that driver can over-write the
> relevant descriptor (with the supplied id), and any other
>
> DESC_WRAP - driver uses this field to detect descriptor change by device.
>
> If device has used a buffer containing a write descriptor, it sets this bit:
> #define VRING_DESC_F_WRITE      2
>
> * Driver scatter/gather support
>
> Driver can use 1 bit to chain s/g entries in a request, similar to virtio 1.0:
>
> /* This marks a buffer as continuing in the next ring entry. */
> #define VRING_DESC_F_NEXT       1
>
> When driver descriptors are chained in this way, multiple
> descriptors are treated as a part of a single transaction
> containing an optional write buffer followed by an optional read buffer.
> All descriptors in the chain must have the same ID.
>
> Unlike virtio 1.0, use of this flag will be an optional feature
> so both devices and drivers can opt out of it.
> If they do, they can either negotiate indirect descriptors or use
> single-descriptor entries exclusively for buffers.
>
> Device might detect a partial descriptor chain (VRING_DESC_F_NEXT
> set but next descriptor not valid). In that case it must not
> use any parts of the chain - it will later be completed by driver,
> but device is allowed to store the valid parts of the chain as
> driver is not allowed to change them anymore.

Does it mean that e.g. the device needs to busy wait for the complete 
chain if it finds an incomplete one? Looks suboptimal.
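
To make the question concrete, here is a rough sketch of a device-side
check that reports an incomplete chain instead of spinning on it, so the
caller could go do other work and retry later (for example on the next
notification). The function name is hypothetical, the sketch assumes
chains occupy consecutive ring slots, and it leaves out the memory
barriers a real device would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_WRAP         0x0040
#define DESC_DRIVER       0x0080
#define VRING_DESC_F_NEXT 1

struct desc {
	uint64_t addr;
	uint32_t len;
	uint16_t index;
	uint16_t flags;
};

/*
 * Returns the number of descriptors in the chain starting at head,
 * or 0 if some descriptor of the chain has not been made available yet.
 * wrap is the device's wrap-around bit for the slot at head.
 */
static unsigned chain_len_if_complete(const struct desc *ring, uint16_t size,
				      uint16_t head, bool wrap)
{
	unsigned n = 0;
	uint16_t pos = head;

	assert(size > 0 && head < size);
	while (n < size) {
		uint16_t flags = ring[pos].flags;
		bool avail = (flags & DESC_DRIVER) &&
			     (!!(flags & DESC_WRAP) == wrap);

		if (!avail)
			return 0;       /* partial chain: do not use any of it */
		n++;
		if (!(flags & VRING_DESC_F_NEXT))
			return n;       /* last descriptor of the chain */
		if (++pos == size) {    /* chain crosses the end of the ring */
			pos = 0;
			wrap = !wrap;
		}
	}
	return 0;                       /* longer than the ring: malformed */
}
```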

>
> Two options are available:
>
> Device can write out the same number of descriptors for the chain,
> setting VRING_DESC_F_NEXT for all but the last descriptor.
> Driver will ignore all used descriptors with VRING_DESC_F_NEXT bit set.
>
> Device only writes out a single descriptor for the whole chain.
> However, to keep device and driver in sync, it then skips a number of
> descriptors corresponding to the length of the chain before writing out
> the next used descriptor.
> After detecting a used descriptor driver must find out the length of the
> chain that it built in order to know where to look for the next
> device descriptor.
>
> * Indirect buffers
>
> Indirect buffer descriptors is an optional feature.
> These are always written by driver, not the device.
> Indirect buffers have a special flag bit set - like in virtio 1.0:
>
> /* This means the buffer contains a table of buffer descriptors. */
> #define VRING_DESC_F_INDIRECT   4
>
> VRING_DESC_F_WRITE and VRING_DESC_F_NEXT are always clear.
>
> len specifies the length of the indirect descriptor buffer in bytes
> and must be a multiple of 16.
>
> Unlike virtio 1.0, the buffer pointed to is a table, not a list:
> struct indirect_descriptor_table {
>          /* The actual descriptors (16 bytes each) */
>          struct indirect_desc desc[len / 16];
> };
>
> The first descriptor is located at start of the indirect descriptor
> table, additional indirect descriptors come immediately afterwards.
>
> struct indirect_desc {
>          __le64 addr;
>          __le32 len;
>          __le16 reserved;
>          __le16 flags;
> };
>
>
> DESC_F_WRITE is the only valid flag for descriptors in the indirect
> table. Others should be set to 0 and are ignored.  reserved field is
> also set to 0 and should be ignored.
>
> TODO (blocker): virtio 1.0 allows a s/g entry followed by
>        an indirect descriptor. Is this useful?
>
> This support would be an optional feature, same as in virtio 1.0
>
> * Batching descriptors:
>
> virtio 1.0 allows passing a batch of descriptors in both directions, by
> incrementing the used/avail index by values > 1.
> At the moment only batching of used descriptors is used.
>
> We can support this by chaining a list of device descriptors through
> VRING_DESC_F_MORE flag. Device sets this bit to signal
> driver that this is part of a batch of used descriptors
> which are all part of a single transaction.

If this is part of a single transaction, I don't see an obvious 
difference from DESC_F_NEXT. I thought that for batching, each 
descriptor is independent and should belong to a different transaction 
(e.g. for net, each descriptor could be an independent packet).

>
> Driver might detect a partial descriptor chain (VRING_DESC_F_MORE
> set but next descriptor not valid). In that case it must not
> use any parts of the chain - it will later be completed by device,
> but driver is allowed to store the valid parts of the chain as
> device is not allowed to change them anymore.
>
> Descriptor should not have both VRING_DESC_F_MORE and
> VRING_DESC_F_NEXT set.
>
> * Using descriptors in order
>
> Some devices can guarantee that descriptors are used in
> the order in which they were made available.
> This allows driver optimizations and can be negotiated through
> a feature bit.
>
> * Per ring flags
>
> It is useful to support features for some rings but not others.
> E.g. it's reasonable to use single buffers for RX rings but
> sg or indirect for TX rings of the network device.
> Generic configuration space will be extended so features can
> be negotiated per ring.
>
> * Selective use of descriptors
>
> As described above, descriptors with NEXT bit set are part of a
> scatter/gather chain and so do not have to cause device to write a used
> descriptor out.
>
> Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to
> signal to device that it does not have to write out the used descriptor
> as it is part of a batch of descriptors. Device has two options (similar
> to VRING_DESC_F_NEXT):
>
> Device can write out the same number of descriptors for the batch,
> setting VRING_DESC_F_MORE for all but the last descriptor.
> Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set.
>
> Device only writes out a single descriptor for the whole batch.
> However, to keep device and driver in sync, it then skips a number of
> descriptors corresponding to the length of the batch before writing out
> the next used descriptor.
> After detecting a used descriptor driver must find out the length of the
> batch that it built in order to know where to look for the next
> device descriptor.

A silly question, how can driver find out the length of the batch 
effectively?  Looks like it can only scan the ring until one that has 
DESC_HW cleared?
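
One alternative to scanning, sketched here under assumptions, is for the
driver to remember how many ring slots it consumed for each id when it
built the chain or batch, and to look that up when a used descriptor
with that id shows up. The struct driver_state type, MAX_IDS, and the
helper names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_IDS 256  /* hypothetical bound on descriptor ids */

/* Hypothetical driver-private state; nothing here is shared with device. */
struct driver_state {
	uint16_t slots_by_id[MAX_IDS]; /* ring slots used by each id */
	uint16_t used_pos;             /* where the next used desc is expected */
	uint16_t size;                 /* number of slots in the ring */
	bool used_wrap;                /* wrap-around bit for used_pos */
};

/* Called when the driver writes a chain or batch of nslots descriptors. */
static void driver_note_chain(struct driver_state *s, uint16_t id,
			      uint16_t nslots)
{
	assert(id < MAX_IDS && nslots > 0 && nslots <= s->size);
	s->slots_by_id[id] = nslots;
}

/* Called after seeing a used descriptor for id: skip the whole chain. */
static void driver_skip_used(struct driver_state *s, uint16_t id)
{
	uint32_t pos;

	assert(id < MAX_IDS);
	pos = (uint32_t)s->used_pos + s->slots_by_id[id];
	if (pos >= s->size) {
		pos -= s->size;
		s->used_wrap = !s->used_wrap;
	}
	s->used_pos = (uint16_t)pos;
}
```

This costs one array write per chain and one array read per completion,
with no ring scan.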

>
>
> TODO (blocker): skipping descriptors for selective and
> scatter/gather seem to be only requested with in-order right now. Let's
> require in-order for this skipping?  This will simplify the accounting
> by driver.
>
>
> * Interrupt/event suppression
>
> virtio 1.0 has two mechanisms for suppression but only
> one can be used at a time. we pack them together
> in a structure - one for interrupts, one for notifications:
>
> struct event {
> 	__le16 idx;
> 	__le16 flags;
> }
>
> Both fields would be optional, with a feature bit:
> VIRTIO_F_EVENT_IDX
> VIRTIO_F_EVENT_FLAGS
>
> Flags can be used like in virtio 1.0, by storing a special
> value there:
>
> #define VRING_F_EVENT_ENABLE  0x0
>
> #define VRING_F_EVENT_DISABLE 0x1
>
> Event index includes the index of the descriptor
> which should trigger the event, and the wrap counter
> in the high bit.

Not specific to v3, but it looks like with event index we can't achieve 
interruptless or exitless operation, considering that idx may wrap.
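
For reference, the idx encoding described in the quoted text (descriptor
offset plus the wrap counter in the high bit of a 16-bit field) could
look roughly like this. The helper names are made up for illustration,
and the sketch assumes bits 0 to 14 carry the offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack a ring offset (bits 0..14) and the wrap counter (bit 15). */
static uint16_t event_idx_pack(uint16_t offset, bool wrap)
{
	assert(offset < 0x8000);
	return (uint16_t)(offset | (wrap ? 0x8000 : 0));
}

/* Ring offset of the descriptor that should trigger the event. */
static uint16_t event_idx_offset(uint16_t v)
{
	return v & 0x7fff;
}

/* Wrap counter value that goes with that offset. */
static bool event_idx_wrap(uint16_t v)
{
	return (v & 0x8000) != 0;
}
```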

>
> In that case, interrupt triggers when descriptor is written at a given
> location in the ring (or skipped in case of NEXT/MORE).
>
> If both features are negotiated, a special flags value
> can be used to switch to event idx:
>
> #define VRING_F_EVENT_IDX     0x2
>
> * Available notification
>
> Driver currently writes out the queue number to device to
> kick off ring processing.
>
> As queue number is currently 16 bit, we can extend that
> to additionally include the offset within ring of the descriptor
> which triggered the kick event in bits 16 to 30,
> and the wrap counter in the high bit (31).
>
> Device is allowed to pre-fetch descriptors beyond the specified
> offset but is not required to do so.

With DESC_HW or another flag, prefetching may introduce extra overhead, 
I think, since the device needs to keep scanning descriptors until it 
finds one without DESC_HW set?
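
As a sketch of the extended kick value from the quoted text (queue
number in the low 16 bits, ring offset in bits 16 to 30, wrap counter in
bit 31), with hypothetical helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the 32-bit value the driver would write to kick the device. */
static uint32_t kick_pack(uint16_t queue, uint16_t offset, bool wrap)
{
	assert(offset < 0x8000);        /* only 15 bits are available */
	return (uint32_t)queue |
	       ((uint32_t)offset << 16) |
	       ((uint32_t)wrap << 31);
}

static uint16_t kick_queue(uint32_t k)  { return (uint16_t)(k & 0xffff); }
static uint16_t kick_offset(uint32_t k) { return (uint16_t)((k >> 16) & 0x7fff); }
static bool     kick_wrap(uint32_t k)   { return (k >> 31) != 0; }
```

With the offset in hand, a device might bound how far it needs to look
before it reaches the slot that triggered the kick, though whether that
removes the scanning cost in practice is an open question here.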

>
>
>
> * TODO: interrupt coalescing
>
> Does it make sense just for networking or for all devices?
> If the latter, should we make it a per ring or a global feature?
>
>
> * TODO: event index/flags in device memory?
>
> Should we move the event index/flags to device memory?
> Might be helpful for hardware configuration so they do not
> need to do DMA reads to check whether interrupt is needed.
> OTOH maybe interrupt coalescing is sufficient for this.
>
>
> * TODO: Device specific descriptor flags
>
> We have a lot of unused space in the descriptor.  This can be put to
> good use by reserving some flag bits for device use.
> For example, network device can set a bit to request
> that header in the descriptor is suppressed
> (in case it's all 0s anyway). This reduces cache utilization.
>
> Note: this feature can be supported in virtio 1.0 as well,
> as we have unused bits in both descriptor and used ring there.

I think we need to at least try packing the virtio-net header in the 
descriptor ring.

>
> * TODO: Descriptor length in device descriptors
>
> Some devices use identically-sized buffers in all descriptors.
> Ignoring length for driver descriptors there could be an option too.
>
> * TODO: Writing at an offset
>
> Some devices might want to write into some descriptors
> at an offset, the offset would be in reserved field in the descriptor,
> possibly a descriptor flag could indicate this:
>
> #define VRING_DESC_F_OFFSET 0x0020
>
> How exactly to use the offset would be device specific,
> for example it can be used to align beginning of packet
> in the 1st buffer for mergeable buffers to cache line boundary
> while also aligning rest of buffers.

Maybe even more, e.g. NET_SKB_PAD; then we could use build_skb() for 
Linux drivers.

>
> * TODO: Non power-of-2 ring sizes
>
> As the ring simply wraps around, there's no reason to
> require ring size to be power of two.
> It can be made a separate feature though.
>
>
> TODO: limits on buffer alignment/size
>
> Seems to be useful for RX for networking.
> Is there a need to negotiate above in a generic way
> or is this a networking specific optimization?
>
> TODO: expose wrap counters to device for debugging
>
> TODO: expose last avail/used offsets to device/driver for debugging
>
> TODO: ability to reset individual rings

Any actual usage of this?

Thanks

>
> ---
>
> Note: should this proposal be accepted and approved, one or more
>        claims disclosed to the TC admin and listed on the Virtio TC
>        IPR page https://www.oasis-open.org/committees/virtio/ipr.php
>        might become Essential Claims.
> Note: the page above is unfortunately out of date and out of
>        my hands. I'm in the process of updating ipr disclosures
>        in github instead.  Will make sure all is in place before
>        this proposal is put to vote. As usual this TC operates under the
>        Non-Assertion Mode of the OASIS IPR Policy, which protects
>        anyone implementing the virtio spec.
>


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v3
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
                   ` (13 preceding siblings ...)
  2017-09-12 16:20 ` [virtio-dev] " Willem de Bruijn
@ 2017-09-12 16:20 ` Willem de Bruijn
  2017-09-14  8:23 ` Ilya Lesokhin
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 92+ messages in thread
From: Willem de Bruijn @ 2017-09-12 16:20 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

On Sun, Sep 10, 2017 at 1:06 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> This is an update from v2 version.
> Changes:
> - update event suppression mechanism
> - add wrap counter: DESC_WRAP flag in addition to
>   DESC_DRIVER flag used for validity so device does not have to
>   write out all used descriptors.
> - more options especially helpful for hardware implementations
> - in-order processing option due to popular demand
> - list of TODO items to consider as a follow-up, only two are
>   open questions we need to decide now, marked as blocker,
>   others are future enhancement ideas.

Perhaps this would make a good topic for a BoF session at the upcoming
netdev. A new ring structure can be useful elsewhere, too, such as for
af_packet v4.

>
> ---
>
> Performance analysis of this is in my kvm forum 2016 presentation.
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.
>
> Note: the following mode of operation will actually work
> without changes when descriptor rings do not overlap, with driver
> writing out available entries in a read-only driver descriptor ring,
> device writing used entries in a write-only device descriptor ring.

The ring is always read-write, as the consumer has to toggle the
DESC_DRIVER flag, right? Which mode are you referring to?

> TODO: does this have any value for some? E.g. as a security feature?
>
>
> * Descriptor ring:
>
> Driver writes descriptors with unique index values and DESC_DRIVER set in flags.
> Descriptors are written in a ring order: from start to end of ring,
> wrapping around to the beginning.
> Device writes used descriptors with correct len, index, and DESC_HW clear.
> Again descriptors are written in ring order. This might not be the same
> order of driver descriptors, and not all descriptors have to be written out.

Virtio rings are used in both directions between guest device driver and
host device, as well as in increasingly many situations outside vm i/o.

I suggest using producer and consumer instead of driver and device
when describing ring operations.

When working on the virtio-net tx path, having to invert all the documentation
in my head, because it is written from an rx point of view, was a bit tricky ;-)

I would then also convert DESC_DRIVER to DESC_VALID or so.

> Driver and device are expected to maintain (internally) a wrap-around
> bit, starting at 0 and changing value each time they start writing out
> descriptors at the beginning of the ring. This bit is passed as
> DESC_WRAP bit in the flags field.

So, the flag effectively doubles the namespace of the id from 16 bit to
17 bit? Instead, how about using a larger identifier, such as 32 bit?

This also future proofs the design for cases where the ring may grow
to exceed 65536 entries. Doing so is not a short term change, but it
would avoid the need for indirect descriptors and give greater room
for out of order acknowledgement.
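
For comparison, this is roughly how I read the flag-based check on the
reading side: a slot counts as newly written only if DESC_DRIVER is set
and DESC_WRAP equals the reader's own wrap counter, which is what tells
a fresh descriptor apart from a stale one left over from the previous
pass. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_WRAP   0x0040
#define DESC_DRIVER 0x0080

static_assert((DESC_WRAP & DESC_DRIVER) == 0, "flag bits must not overlap");

/* True if flags mark a driver descriptor written during the current pass. */
static bool desc_is_available(uint16_t flags, bool wrap)
{
	bool driver = (flags & DESC_DRIVER) != 0;
	bool desc_wrap = (flags & DESC_WRAP) != 0;

	return driver && desc_wrap == wrap;
}
```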

> Flags are always set/cleared last.
>
> Note that driver can write descriptors out in any order, but device
> will not execute descriptor X+1 until descriptor X has been
> read as valid.

Why this constraint on the ring?

> Driver operation:
>
> Driver makes descriptors available to device by writing out descriptors
> in the descriptor ring. Once ring is full, driver waits for device to
> use some descriptors before making more available.
>
> Descriptors can be used by device in any order, but must be read from
> ring in-order, and must be read completely before starting use.  Thus,
> once a descriptor is used, driver can over-write both this descriptor
> and any descriptors which preceded it in the ring.

Does this mean that completing a descriptor by the consumer implicitly
completes all descriptors that precede it in the ring?

> Driver can detect use of descriptor either by device specific means
> (e.g. detect a buffer data change by device) or in a generic way
> by detecting that a used buffer has been written out by device.

I don't quite follow this.

> Driver writes out available scatter/gather descriptor entries in guest
> descriptor format:
>
>
> #define DESC_WRAP 0x0040
> #define DESC_DRIVER 0x0080
>
> struct desc {
>         __le64 addr;
>         __le32 len;
>         __le16 index;
>         __le16 flags;
> };
>
> Fields:
>
> addr - physical address of a s/g entry
> len - length of an entry

Is this ever larger than 16-bit? If not, then reducing to 16-bit allows
growing index to 32-bit.

> index - unique index.  The low $\lceil log(N) \rceil - 1$
>       bits of the index is a driver-supplied value which can have any value
>       (under driver control).  The high bits are reserved and should be
>       set to 0.
>
> flags - descriptor flags.
>
> Descriptors written by driver have DESC_DRIVER set.
>
> Writing out this field makes the descriptor available for the device to use,
> so all other fields must be valid when it is written.
>
> DESC_WRAP - device uses this field to detect descriptor change by driver.
>
> Driver can use 1 bit to set direction
> /* This marks a descriptor as write-only (otherwise read-only). */
> #define VRING_DESC_F_WRITE      2

This is a per-ring flag, as opposed to the per-descriptor flags DESC_*.
Please make that explicit and state in which structure it is set.
>
>
> Device operation (using descriptors):
>
> Device is looking for descriptors in ring order. After detecting that
> the flags value has changed with DESC_DRIVER set and DESC_WRAP matching
> the wrap-around counter, it can start using the descriptors.
> Descriptors can be used in any order, but must be read from ring
> in-order.  In other words device must not read descriptor X after it
> started using descriptor X+1.

Why? This is the same question as above, really. This seems like a
device constraint, not necessarily a constraint to impose on the ring
format.

> Further, all buffer descriptors must be
> read completely before device starts using the buffer.
>
> This because once a descriptor is used, driver can over-write both this
> descriptor and any preceding descriptors in ring.

This does explain the above constraint. I guess that I just do not
understand the reason for this behavior.

> To help driver detect use of descriptors and to pass extra meta-data
> to driver, device writes out used entries in device descriptor format:
>
>
> #define DESC_WRAP 0x0040
> #define DESC_DRIVER 0x0080
>
> struct desc {
>         __le64 reserved;
>         __le32 len;
>         __le16 index;
>         __le16 flags;
> };
>
> Fields:
>
> reserved - can be any value, ignored by driver
> len - length written by device. only valid if VRING_DESC_F_WRITE is set
>       len bytes of data from beginning of buffer are assumed to have been updated
> index - unique index copied from the driver descriptor that has been used.
> flags - descriptor flags.
>
> Descriptors written by device have DESC_DRIVER clear.
>
> Writing out this field notifies the driver that it can re-use the
> descriptor id. It is also a signal that driver can over-write the
> relevant descriptor (with the supplied id), and any other
>
> DESC_WRAP - driver uses this field to detect descriptor change by device.
>
> If device has used a buffer containing a write descriptor, it sets this bit:
> #define VRING_DESC_F_WRITE      2
>
> * Driver scatter/gather support
>
> Driver can use 1 bit to chain s/g entries in a request, similar to virtio 1.0:
>
> /* This marks a buffer as continuing in the next ring entry. */
> #define VRING_DESC_F_NEXT       1

Isn't this a descriptor flag, so DESC_NEXT?

>
> When driver descriptors are chained in this way, multiple
> descriptors are treated as a part of a single transaction
> containing an optional write buffer followed by an optional read buffer.

Can you elaborate on the last part about optional write and read buffer?

> All descriptors in the chain must have the same ID.

If so, then the explicit flag is not needed?

> Unlike virtio 1.0, use of this flag will be an optional feature
> so both devices and drivers can opt out of it.
> If they do, they can either negotiate indirect descriptors or use
> single-descriptor entries exclusively for buffers.
>
> Device might detect a partial descriptor chain (VRING_DESC_F_NEXT
> set but next descriptor not valid).

This can be forbidden, by requiring the producer to set the
DESC_DRIVER bit on the first descriptor only after the entire
chain has been written.

Do chains have to consist of consecutive descriptors?
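
A sketch of the ordering I have in mind, assuming chains do occupy
consecutive slots: fill the chain from tail to head and store the head's
flags last with release semantics, so a consumer that reads the ring in
order never observes a valid head with an unwritten tail. The function
name and argument layout are hypothetical, and C11 atomics stand in for
whatever write barrier a real driver would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_WRAP         0x0040
#define DESC_DRIVER       0x0080
#define VRING_DESC_F_NEXT 1

struct desc {
	uint64_t addr;
	uint32_t len;
	uint16_t index;
	_Atomic uint16_t flags;
};

/* Write an n-entry chain starting at head; head's flags are stored last. */
static void publish_chain(struct desc *ring, uint16_t size, uint16_t head,
			  bool wrap, const uint64_t *addrs,
			  const uint32_t *lens, unsigned n, uint16_t id)
{
	assert(n > 0 && n <= size && head < size);
	for (unsigned i = n; i-- > 0; ) {
		uint32_t pos = (uint32_t)head + i;
		bool w = wrap;
		uint16_t flags;

		if (pos >= size) {      /* this entry lands after the wrap */
			pos -= size;
			w = !w;
		}
		flags = (uint16_t)(DESC_DRIVER | (w ? DESC_WRAP : 0) |
				   (i + 1 < n ? VRING_DESC_F_NEXT : 0));
		ring[pos].addr = addrs[i];
		ring[pos].len = lens[i];
		ring[pos].index = id;   /* same id for the whole chain */
		/* Release on the head makes the whole chain visible first. */
		atomic_store_explicit(&ring[pos].flags, flags,
				      i == 0 ? memory_order_release
					     : memory_order_relaxed);
	}
}
```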

> In that case it must not
> use any parts of the chain - it will later be completed by driver,
> but device is allowed to store the valid parts of the chain as
> driver is not allowed to change them anymore.
>
> Two options are available:
>
> Device can write out the same number of descriptors for the chain,
> setting VRING_DESC_F_NEXT for all but the last descriptor.
> Driver will ignore all used descriptors with VRING_DESC_F_NEXT bit set.
>
> Device only writes out a single descriptor for the whole chain.
> However, to keep device and driver in sync, it then skips a number of
> descriptors corresponding to the length of the chain before writing out
> the next used descriptor.
> After detecting a used descriptor driver must find out the length of the
> chain that it built in order to know where to look for the next
> device descriptor.
>
> * Indirect buffers
>
> Indirect buffer descriptors is an optional feature.
> These are always written by driver, not the device.
> Indirect buffers have a special flag bit set - like in virtio 1.0:
>
> /* This means the buffer contains a table of buffer descriptors. */
> #define VRING_DESC_F_INDIRECT   4
>
> VRING_DESC_F_WRITE and VRING_DESC_F_NEXT are always clear.
>
> len specifies the length of the indirect descriptor buffer in bytes
> and must be a multiple of 16.

Multiple of sizeof(struct indirect_desc).

Also, struct indirect_desc is identical to struct desc. No need for a
separate struct definition?
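
Something along these lines, so the 16 is derived from the struct rather
than hard-coded. The helper name is hypothetical, and treating a zero
length as invalid is my assumption, not something the proposal states.

```c
#include <assert.h>
#include <stdint.h>

struct indirect_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t reserved;
	uint16_t flags;
};

static_assert(sizeof(struct indirect_desc) == 16,
	      "indirect descriptors are expected to be 16 bytes");

/* Number of entries in an indirect table of len bytes, or -1 if invalid. */
static int indirect_table_entries(uint32_t len)
{
	if (len == 0 || len % sizeof(struct indirect_desc) != 0)
		return -1;
	return (int)(len / sizeof(struct indirect_desc));
}
```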

>
> Unlike virtio 1.0, the buffer pointed to is a table, not a list:

This is just a linear array, right?

> struct indirect_descriptor_table {
>         /* The actual descriptors (16 bytes each) */
>         struct indirect_desc desc[len / 16];
> };
>
> The first descriptor is located at start of the indirect descriptor
> table, additional indirect descriptors come immediately afterwards.
>
> struct indirect_desc {
>         __le64 addr;
>         __le32 len;
>         __le16 reserved;
>         __le16 flags;
> };
>
>
> DESC_F_WRITE is the only valid flag for descriptors in the indirect
> table. Others should be set to 0 and are ignored.  reserved field is
> also set to 0 and should be ignored.
>
> TODO (blocker): virtio 1.0 allows a s/g entry followed by
>       an indirect descriptor. Is this useful?

Sounds like unnecessary complexity.

> This support would be an optional feature, same as in virtio 1.0
>
> * Batching descriptors:
>
> virtio 1.0 allows passing a batch of descriptors in both directions, by
> incrementing the used/avail index by values > 1.
> At the moment only batching of used descriptors is used.
>
> We can support this by chaining a list of device descriptors through
> VRING_DESC_F_MORE flag. Device sets this bit to signal
> driver that this is part of a batch of used descriptors
> which are all part of a single transaction.
>
> Driver might detect a partial descriptor chain (VRING_DESC_F_MORE
> set but next descriptor not valid). In that case it must not
> use any parts of the chain - it will later be completed by device,
> but driver is allowed to store the valid parts of the chain as
> device is not allowed to change them anymore.
>
> Descriptor should not have both VRING_DESC_F_MORE and
> VRING_DESC_F_NEXT set.
>
> * Using descriptors in order
>
> Some devices can guarantee that descriptors are used in
> the order in which they were made available.
> This allows driver optimizations and can be negotiated through
> a feature bit.
>
> * Per ring flags
>
> It is useful to support features for some rings but not others.
> E.g. it's reasonable to use single buffers for RX rings but
> sg or indirect for TX rings of the network device.
> Generic configuration space will be extended so features can
> be negotiated per ring.
>
> * Selective use of descriptors
>
> As described above, descriptors with NEXT bit set are part of a
> scatter/gather chain and so do not have to cause device to write a used
> descriptor out.
>
> Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to
> signal to device that it does not have to write out the used descriptor
> as it is part of a batch of descriptors. Device has two options (similar
> to VRING_DESC_F_NEXT):
>
> Device can write out the same number of descriptors for the batch,
> setting VRING_DESC_F_MORE for all but the last descriptor.
> Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set.
>
> Device only writes out a single descriptor for the whole batch.
> However, to keep device and driver in sync, it then skips a number of
> descriptors corresponding to the length of the batch before writing out
> the next used descriptor.
> After detecting a used descriptor driver must find out the length of the
> batch that it built in order to know where to look for the next
> device descriptor.
>
>
> TODO (blocker): skipping descriptors for selective and
> scatter/gather seem to be only requested with in-order right now. Let's
> require in-order for this skipping?  This will simplify the accounting
> by driver.
>
>
> * Interrupt/event suppression
>
> virtio 1.0 has two mechanisms for suppression but only
> one can be used at a time. we pack them together
> in a structure - one for interrupts, one for notifications:
>
> struct event {
>         __le16 idx;
>         __le16 flags;
> }
>
> Both fields would be optional, with a feature bit:
> VIRTIO_F_EVENT_IDX
> VIRTIO_F_EVENT_FLAGS
>
> Flags can be used like in virtio 1.0, by storing a special
> value there:
>
> #define VRING_F_EVENT_ENABLE  0x0
>
> #define VRING_F_EVENT_DISABLE 0x1
>
> Event index includes the index of the descriptor
> which should trigger the event, and the wrap counter
> in the high bit.
>
> In that case, interrupt triggers when descriptor is written at a given
> location in the ring (or skipped in case of NEXT/MORE).
>
> If both features are negotiated, a special flags value
> can be used to switch to event idx:
>
> #define VRING_F_EVENT_IDX     0x2
>
> * Available notification
>
> Driver currently writes out the queue number to device to
> kick off ring processing.
>
> As queue number is currently 16 bit, we can extend that
> to additionally include the offset within ring of the descriptor
> which triggered the kick event in bits 16 to 30,
> and the wrap counter in the high bit (31).
>
> Device is allowed to pre-fetch descriptors beyond the specified
> offset but is not required to do so.
>
>
>
> * TODO: interrupt coalescing
>
> Does it make sense just for networking or for all devices?
> If the latter, should we make it a per ring or a global feature?
>
>
> * TODO: event index/flags in device memory?
>
> Should we move the event index/flags to device memory?
> Might be helpful for hardware configuration so they do not
> need to do DMA reads to check whether interrupt is needed.

Agreed. This also resembles physical devices more closely.

> OTOH maybe interrupt coalescing is sufficient for this.
>
>
> * TODO: Device specific descriptor flags
>
> We have a lot of unused space in the descriptor.  This can be put to
> good use by reserving some flag bits for device use.
> For example, network device can set a bit to request
> that header in the descriptor is suppressed
> (in case it's all 0s anyway). This reduces cache utilization.
>
> Note: this feature can be supported in virtio 1.0 as well,
> as we have unused bits in both descriptor and used ring there.
>
> * TODO: Descriptor length in device descriptors
>
> Some devices use identically-sized buffers in all descriptors.
> Ignoring length for driver descriptors there could be an option too.
>
> * TODO: Writing at an offset
>
> Some devices might want to write into some descriptors
> at an offset, the offset would be in reserved field in the descriptor,
> possibly a descriptor flag could indicate this:
>
> #define VRING_DESC_F_OFFSET 0x0020
>
> How exactly to use the offset would be device specific,
> for example it can be used to align beginning of packet
> in the 1st buffer for mergeable buffers to cache line boundary
> while also aligning rest of buffers.
>
> * TODO: Non power-of-2 ring sizes
>
> As the ring simply wraps around, there's no reason to
> require ring size to be power of two.
> It can be made a separate feature though.
>
>
> TODO: limits on buffer alignment/size
>
> Seems to be useful for RX for networking.
> Is there a need to negotiate above in a generic way
> or is this a networking specific optimization?
>
> TODO: expose wrap counters to device for debugging
>
> TODO: expose last avail/used offsets to device/driver for debugging
>
> TODO: ability to reset individual rings
>
> ---
>
> Note: should this proposal be accepted and approved, one or more
>       claims disclosed to the TC admin and listed on the Virtio TC
>       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
>       might become Essential Claims.
> Note: the page above is unfortunately out of date and out of
>       my hands. I'm in the process of updating ipr disclosures
>       in github instead.  Will make sure all is in place before
>       this proposal is put to vote. As usual this TC operates under the
>       Non-Assertion Mode of the OASIS IPR Policy, which protects
>       anyone implementing the virtio spec.
>
> --
> MST
>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v3
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
                   ` (12 preceding siblings ...)
  2017-09-11  7:47 ` Jason Wang
@ 2017-09-12 16:20 ` Willem de Bruijn
  2017-09-12 16:20 ` Willem de Bruijn
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 92+ messages in thread
From: Willem de Bruijn @ 2017-09-12 16:20 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

On Sun, Sep 10, 2017 at 1:06 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> This is an update from v2 version.
> Changes:
> - update event suppression mechanism
> - add wrap counter: DESC_WRAP flag in addition to
>   DESC_DRIVER flag used for validity so device does not have to
>   write out all used descriptors.
> - more options especially helpful for hardware implementations
> - in-order processing option due to popular demand
> - list of TODO items to consider as a follow-up, only two are
>   open questions we need to descide now, marked as blocker,
>   others are future enhancement ideas.

Perhaps this would make a good topic for a BoF session at the upcoming
netdev. A new ring structure can be useful elsewhere, too, such as for
af_packet v4.

>
> ---
>
> Performance analysis of this is in my kvm forum 2016 presentation.
> The idea is to have a r/w descriptor in a ring structure,
> replacing the used and available ring, index and descriptor
> buffer.
>
> Note: the following mode of operation will actually work
> without changes when descriptor rings do not overlap, with driver
> writing out available entries in a read-only driver descriptor ring,
> device writing used entries in a write-only device descriptor ring.

The ring is always read-write, as the consumer has to toggle the
DESC_DRIVER flag, right? Which mode are you referring to?

> TODO: does this have any value for some? E.g. as a security feature?
>
>
> * Descriptor ring:
>
> Driver writes descriptors with unique index values and DESC_DRIVER set in flags.
> Descriptors are written in a ring order: from start to end of ring,
> wrapping around to the beginning.
> Device writes used descriptors with correct len, index, and DESC_HW clear.
> Again descriptors are written in ring order. This might not be the same
> order of driver descriptors, and not all descriptors have to be written out.

Virtio rings are used in both directions between guest device driver and
host device, as well as in increasingly many situations outside vm i/o.

I suggest using producer and consumer instead of driver and device
when describing ring operations.

When working on the virtio-net tx path, having to invert all the documentation
in my head, because it is written from an rx point of view, was a bit tricky ;-)

I would then also convert DESC_DRIVER to DESC_VALID or so.

> Driver and device are expected to maintain (internally) a wrap-around
> bit, starting at 0 and changing value each time they start writing out
> descriptors at the beginning of the ring. This bit is passed as
> DESC_WRAP bit in the flags field.

So, the flag effectively doubles the namespace of the id from 16 bit to
17 bit? Instead, how about using a larger identifier, such as 32 bit?

This also future proofs the design for cases where the ring may grow
to exceed 65536 entries. Doing so is not a short term change, but it
would avoid the need for indirect descriptors and give greater room
for out of order acknowledgement.

> Flags are always set/cleared last.
>
> Note that driver can write descriptors out in any order, but device
> will not execute descriptor X+1 until descriptor X has been
> read as valid.

Why this constraint on the ring?

> Driver operation:
>
> Driver makes descriptors available to device by writing out descriptors
> in the descriptor ring. Once ring is full, driver waits for device to
> use some descriptors before making more available.
>
> Descriptors can be used by device in any order, but must be read from
> ring in-order, and must be read completely before starting use.  Thus,
> once a descriptor is used, driver can over-write both this descriptor
> and any descriptors which preceded it in the ring.

Does this mean that completing a descriptor by the consumer implicitly
completes all descriptors that precede it in the ring?

> Driver can detect use of descriptor either by device specific means
> (e.g. detect a buffer data change by device) or in a generic way
> by detecting that a used buffer has been written out by device.

I don't quite follow this.

> Driver writes out available scatter/gather descriptor entries in guest
> descriptor format:
>
>
> #define DESC_WRAP 0x0040
> #define DESC_DRIVER 0x0080
>
> struct desc {
>         __le64 addr;
>         __le32 len;
>         __le16 index;
>         __le16 flags;
> };
>
> Fields:
>
> addr - physical address of a s/g entry
> len - length of an entry

Is this ever larger than 16-bit? If not, then reducing to 16-bit allows
growing index to 32-bit.
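
To illustrate, a hypothetical layout with those widths still fits in
16 bytes. This is only a sketch: the name desc_alt and the field order
are my assumptions, not part of the proposal, and plain uintN_t types
stand in for the __le types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant: 32-bit index, 16-bit len (not in the proposal). */
struct desc_alt {
        uint64_t addr;   /* physical address of a s/g entry */
        uint32_t index;  /* widened from 16 to 32 bits */
        uint16_t len;    /* narrowed from 32 to 16 bits */
        uint16_t flags;  /* kept last, at the same offset as before */
};

/* The descriptor is still 16 bytes and flags stays at byte offset 14. */
static_assert(sizeof(struct desc_alt) == 16, "descriptor grew");
static_assert(offsetof(struct desc_alt, flags) == 14, "flags moved");
```

Keeping flags in the last two bytes means the "flags are written last"
rule would not need to change.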

> index - unique index.  The low $\lceil log(N) \rceil - 1$
>       bits of the index is a driver-supplied value which can have any value
>       (under driver control).  The high bits are reserved and should be
>       set to 0.
>
> flags - descriptor flags.
>
> Descriptors written by driver have DESC_DRIVER set.
>
> Writing out this field makes the descriptor available for the device to use,
> so all other fields must be valid when it is written.
>
> DESC_WRAP - device uses this field to detect descriptor change by driver.
>
> Driver can use 1 bit to set direction
> /* This marks a descriptor as write-only (otherwise read-only). */
> #define VRING_DESC_F_WRITE      2

This is a per-ring flag, as opposed to the per-descriptor flags DESC_*.
Please make that explicit and state in which structure it is set.
>
>
> Device operation (using descriptors):
>
> Device is looking for descriptors in ring order. After detecting that
> the flags value has changed with DESC_DRIVER set and DESC_WRAP matching
> the wrap-around counter, it can start using the descriptors.
> Descriptors can be used in any order, but must be read from ring
> in-order.  In other words device must not read descriptor X after it
> started using descriptor X+1.

Why? This is the same question as above, really. This seems like a
device constraint, not necessarily a constraint to impose on the ring
format.
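
For reference, this is how I read the availability check in the quoted
paragraph. A minimal sketch only: desc_is_avail and device_wrap are
names I made up, uintN_t stands in for the __le types (so a
little-endian host is assumed), and the memory ordering a real device
would need before reading the other fields is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define DESC_WRAP   0x0040
#define DESC_DRIVER 0x0080

struct desc {
        uint64_t addr;
        uint32_t len;
        uint16_t index;
        uint16_t flags;
};

/*
 * A descriptor is usable by the device when DESC_DRIVER is set and
 * DESC_WRAP matches the device's own wrap-around bit.
 */
static bool desc_is_avail(const struct desc *d, bool device_wrap)
{
        uint16_t flags = d->flags;
        bool wrap = (flags & DESC_WRAP) != 0;

        return (flags & DESC_DRIVER) != 0 && wrap == device_wrap;
}
```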

> Further, all buffer descriptors must be
> read completely before device starts using the buffer.
>
> This because once a descriptor is used, driver can over-write both this
> descriptor and any preceeding descriptors in ring.

This does explain the above constraint. I guess that I just do not
understand the reason for this behavior.

> To help driver detect use of descriptors and to pass extra meta-data
> to driver, device writes out used entries in device descriptor format:
>
>
> #define DESC_WRAP 0x0040
> #define DESC_DRIVER 0x0080
>
> struct desc {
>         __le64 reserved;
>         __le32 len;
>         __le16 index;
>         __le16 flags;
> };
>
> Fields:
>
> reserved - can be any value, ignored by driver
> len - length written by device. only valid if VRING_DESC_F_WRITE is set
>       len bytes of data from beginning of buffer are assumed to have been updated
> index - unique index copied from the driver descriptor that has been used.
> flags - descriptor flags.
>
> Descriptors written by device have DESC_DRIVER clear.
>
> Writing out this field notifies the driver that it can re-use the
> descriptor id. It is also a signal that driver can over-write the
> relevant descriptor (with the supplied id), and any other
>
> DESC_WRAP - driver uses this field to detect descriptor change by device.
>
> If device has used a buffer containing a write descriptor, it sets this bit:
> #define VRING_DESC_F_WRITE      2
>
> * Driver scatter/gather support
>
> Driver can use 1 bit to chain s/g entries in a request, similar to virtio 1.0:
>
> /* This marks a buffer as continuing in the next ring entry. */
> #define VRING_DESC_F_NEXT       1

Isn't this a descriptor flag, so DESC_NEXT?

>
> When driver descriptors are chained in this way, multiple
> descriptors are treated as a part of a single transaction
> containing an optional write buffer followed by an optional read buffer.

Can you elaborate on the last part about optional write and read buffer?

> All descriptors in the chain must have the same ID.

If so, then the explicit flag is not needed?

> Unlike virtio 1.0, use of this flag will be an optional feature
> so both devices and drivers can opt out of it.
> If they do, they can either negotiate indirect descriptors or use
> single-descriptor entries exclusively for buffers.
>
> Device might detect a partial descriptor chain (VRING_DESC_F_NEXT
> set but next descriptor not valid).

This can be forbidden, by requiring the producer to set the
DESC_DRIVER bit on the first descriptor only after the entire
chain has been written.
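
Roughly what I have in mind, as a sketch and not a definitive
implementation. Assumptions: struct sg and publish_chain are made-up
names, the chain occupies consecutive slots, DESC_WRAP handling is
omitted for brevity, and C11 atomics stand in for whatever barriers a
real driver would use.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define VRING_DESC_F_NEXT 1
#define DESC_DRIVER       0x0080

struct desc {
        uint64_t addr;
        uint32_t len;
        uint16_t index;
        _Atomic uint16_t flags;
};

/* Hypothetical s/g element handed to the producer. */
struct sg {
        uint64_t addr;
        uint32_t len;
};

/* Publish a chain of n >= 1 entries starting at slot head. */
static void publish_chain(struct desc *ring, uint16_t size, uint16_t head,
                          const struct sg *sg, size_t n, uint16_t id)
{
        struct desc *d;

        /* Write every entry except the first, last one down to second. */
        for (size_t i = n - 1; i >= 1; i--) {
                uint16_t flags = DESC_DRIVER;

                if (i + 1 < n)
                        flags |= VRING_DESC_F_NEXT;
                d = &ring[(head + i) % size];
                d->addr = sg[i].addr;
                d->len = sg[i].len;
                d->index = id;
                atomic_store_explicit(&d->flags, flags, memory_order_relaxed);
        }

        /* The first descriptor's flags go out last, with release order. */
        d = &ring[head];
        d->addr = sg[0].addr;
        d->len = sg[0].len;
        d->index = id;
        atomic_store_explicit(&d->flags,
                              DESC_DRIVER | (n > 1 ? VRING_DESC_F_NEXT : 0),
                              memory_order_release);
}
```

Since the consumer reads the ring in order and stops at the first
descriptor that is not valid, it would never observe a partial chain.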

Do chains have to consist of consecutive descriptors?

> In that case it must not
> use any parts of the chain - it will later be completed by driver,
> but device is allowed to store the valid parts of the chain as
> driver is not allowed to change them anymore.
>
> Two options are available:
>
> Device can write out the same number of descriptors for the chain,
> setting VRING_DESC_F_NEXT for all but the last descriptor.
> Driver will ignore all used descriptors with VRING_DESC_F_NEXT bit set.
>
> Device only writes out a single descriptor for the whole chain.
> However, to keep device and driver in sync, it then skips a number of
> descriptors corresponding to the length of the chain before writing out
> the next used descriptor.
> After detecting a used descriptor driver must find out the length of the
> chain that it built in order to know where to look for the next
> device descriptor.
>
> * Indirect buffers
>
> Indirect buffer descriptors is an optional feature.
> These are always written by driver, not the device.
> Indirect buffers have a special flag bit set - like in virtio 1.0:
>
> /* This means the buffer contains a table of buffer descriptors. */
> #define VRING_DESC_F_INDIRECT   4
>
> VRING_DESC_F_WRITE and VRING_DESC_F_NEXT are always clear.
>
> len specifies the length of the indirect descriptor buffer in bytes
> and must be a multiple of 16.

Multiple of sizeof(struct indirect_desc).

Also, struct indirect_desc is identical to struct desc. No need for a
separate struct definition?
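
Something like the following, reusing struct desc for the table
entries. A sketch under my own assumptions: indirect_table_count is a
made-up helper, and uintN_t stands in for the __le types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct desc {
        uint64_t addr;
        uint32_t len;
        uint16_t index;  /* reserved (0) inside an indirect table */
        uint16_t flags;
};

/* Spell out the size once instead of hard-coding 16 everywhere. */
static_assert(sizeof(struct desc) == 16, "descriptor must be 16 bytes");

/*
 * Validate the len of an indirect descriptor and return the number of
 * table entries through *count. Returns false if len is not a multiple
 * of the descriptor size.
 */
static bool indirect_table_count(uint32_t len, uint32_t *count)
{
        if (len % sizeof(struct desc) != 0)
                return false;
        *count = len / sizeof(struct desc);
        return true;
}
```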

>
> Unlike virtio 1.0, the buffer pointed to is a table, not a list:

This is just a linear array, right?

> struct indirect_descriptor_table {
>         /* The actual descriptors (16 bytes each) */
>         struct indirect_desc desc[len / 16];
> };
>
> The first descriptor is located at start of the indirect descriptor
> table, additional indirect descriptors come immediately afterwards.
>
> struct indirect_desc {
>         __le64 addr;
>         __le32 len;
>         __le16 reserved;
>         __le16 flags;
> };
>
>
> DESC_F_WRITE is the only valid flag for descriptors in the indirect
> table. Others should be set to 0 and are ignored.  reserved field is
> also set to 0 and should be ignored.
>
> TODO (blocker): virtio 1.0 allows a s/g entry followed by
>       an indirect descriptor. Is this useful?

Sounds like unnecessary complexity.

> This support would be an optional feature, same as in virtio 1.0
>
> * Batching descriptors:
>
> virtio 1.0 allows passing a batch of descriptors in both directions, by
> incrementing the used/avail index by values > 1.
> At the moment only batching of used descriptors is used.
>
> We can support this by chaining a list of device descriptors through
> VRING_DESC_F_MORE flag. Device sets this bit to signal
> driver that this is part of a batch of used descriptors
> which are all part of a single transaction.
>
> Driver might detect a partial descriptor chain (VRING_DESC_F_MORE
> set but next descriptor not valid). In that case it must not
> use any parts of the chain - it will later be completed by device,
> but driver is allowed to store the valid parts of the chain as
> device is not allowed to change them anymore.
>
> Descriptor should not have both VRING_DESC_F_MORE and
> VRING_DESC_F_NEXT set.
>
> * Using descriptors in order
>
> Some devices can guarantee that descriptors are used in
> the order in which they were made available.
> This allows driver optimizations and can be negotiated through
> a feature bit.
>
> * Per ring flags
>
> It is useful to support features for some rings but not others.
> E.g. it's reasonable to use single buffers for RX rings but
> sg or indirect for TX rings of the network device.
> Generic configuration space will be extended so features can
> be negotiated per ring.
>
> * Selective use of descriptors
>
> As described above, descriptors with NEXT bit set are part of a
> scatter/gather chain and so do not have to cause device to write a used
> descriptor out.
>
> Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to
> signal to device that it does not have to write out the used descriptor
> as it is part of a batch of descriptors. Device has two options (similar
> to VRING_DESC_F_NEXT):
>
> Device can write out the same number of descriptors for the batch,
> setting VRING_DESC_F_MORE for all but the last descriptor.
> Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set.
>
> Device only writes out a single descriptor for the whole batch.
> However, to keep device and driver in sync, it then skips a number of
> descriptors corresponding to the length of the batch before writing out
> the next used descriptor.
> After detecting a used descriptor driver must find out the length of the
> batch that it built in order to know where to look for the next
> device descriptor.
>
>
> TODO (blocker): skipping descriptors for selective and
> scatter/gather seem to be only requested with in-order right now. Let's
> require in-order for this skipping?  This will simplify the accounting
> by driver.
>
>
> * Interrupt/event suppression
>
> virtio 1.0 has two mechanisms for suppression but only
> one can be used at a time. we pack them together
> in a structure - one for interrupts, one for notifications:
>
> struct event {
>         __le16 idx;
>         __le16 flags;
> }
>
> Both fields would be optional, with a feature bit:
> VIRTIO_F_EVENT_IDX
> VIRTIO_F_EVENT_FLAGS
>
> Flags can be used like in virtio 1.0, by storing a special
> value there:
>
> #define VRING_F_EVENT_ENABLE  0x0
>
> #define VRING_F_EVENT_DISABLE 0x1
>
> Event index includes the index of the descriptor
> which should trigger the event, and the wrap counter
> in the high bit.
>
> In that case, interrupt triggers when descriptor is written at a given
> location in the ring (or skipped in case of NEXT/MORE).
>
> If both features are negotiated, a special flags value
> can be used to switch to event idx:
>
> #define VRING_F_EVENT_IDX     0x2
>
> * Available notification
>
> Driver currently writes out the queue number to device to
> kick off ring processing.
>
> As queue number is currently 16 bit, we can extend that
> to additionally include the offset within ring of the descriptor
> which triggered the kick event in bits 16 to 30,
> and the wrap counter in the high bit (31).
>
> Device is allowed to pre-fetch descriptors beyond the specified
> offset but is not required to do so.
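
If I read this right, the value written on a kick would be packed as
below. A sketch only: kick_value is a made-up helper name, and the
field positions are taken from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Bits 0-15: queue number, bits 16-30: offset of the descriptor within
 * the ring, bit 31: wrap counter.
 */
static uint32_t kick_value(uint16_t queue, uint16_t offset, bool wrap)
{
        return (uint32_t)queue |
               ((uint32_t)(offset & 0x7fff) << 16) |
               ((uint32_t)wrap << 31);
}
```

Note that a 15-bit offset caps the ring at 32768 entries.
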
>
>
>
> * TODO: interrupt coalescing
>
> Does it make sense just for networking or for all devices?
> If later should we make it a per ring or a global feature?
>
>
> * TODO: event index/flags in device memory?
>
> Should we move the event index/flags to device memory?
> Might be helpful for hardware configuration so they do not
> need to do DMA reads to check whether interrupt is needed.

Agreed. This also resembles physical devices more closely.

> OTOH maybe interrupt coalescing is sufficient for this.
>
>
> * TODO: Device specific descriptor flags
>
> We have a lot of unused space in the descriptor.  This can be put to
> good use by reserving some flag bits for device use.
> For example, network device can set a bit to request
> that header in the descriptor is suppressed
> (in case it's all 0s anyway). This reduces cache utilization.
>
> Note: this feature can be supported in virtio 1.0 as well,
> as we have unused bits in both descriptor and used ring there.
>
> * TODO: Descriptor length in device descriptors
>
> Some devices use identically-sized buffers in all descriptors.
> Ignoring length for driver descriptors there could be an option too.
>
> * TODO: Writing at an offset
>
> Some devices might want to write into some descriptors
> at an offset, the offset would be in reserved field in the descriptor,
> possibly a descriptor flag could indicate this:
>
> #define VRING_DESC_F_OFFSET 0x0020
>
> How exactly to use the offset would be device specific,
> for example it can be used to align beginning of packet
> in the 1st buffer for mergeable buffers to cache line boundary
> while also aligning rest of buffers.
>
> * TODO: Non power-of-2 ring sizes
>
> As the ring simply wraps around, there's no reason to
> require ring size to be power of two.
> It can be made a separate feature though.
>
>
> TODO: limits on buffer alignment/size
>
> Seems to be useful for RX for networking.
> Is there a need to negotiate above in a generic way
> or is this a networking specific optimization?
>
> TODO: expose wrap counters to device for debugging
>
> TODO: expose last avail/used offsets to device/driver for debugging
>
> TODO: ability to reset individual rings
>
> ---
>
> Note: should this proposal be accepted and approved, one or more
>       claims disclosed to the TC admin and listed on the Virtio TC
>       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
>       might become Essential Claims.
> Note: the page above is unfortunately out of date and out of
>       my hands. I'm in the process of updating ipr disclosures
>       in github instead.  Will make sure all is in place before
>       this proposal is put to vote. As usual this TC operates under the
>       Non-Assertion Mode of the OASIS IPR Policy, which protects
>       anyone implementing the virtio spec.
>
> --
> MST
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] Re: packed ring layout proposal v3
  2017-09-11  7:47 ` [virtio-dev] Re: packed ring layout proposal v3 Jason Wang
  2017-09-12 16:23   ` Willem de Bruijn
@ 2017-09-12 16:23   ` Willem de Bruijn
  1 sibling, 0 replies; 92+ messages in thread
From: Willem de Bruijn @ 2017-09-12 16:23 UTC (permalink / raw)
  To: Jason Wang; +Cc: virtio-dev, virtualization, Michael S. Tsirkin

On Mon, Sep 11, 2017 at 3:47 AM, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2017年09月10日 13:06, Michael S. Tsirkin wrote:
>>
>> This is an update from v2 version.
>> Changes:
>> - update event suppression mechanism
>> - add wrap counter: DESC_WRAP flag in addition to
>>    DESC_DRIVER flag used for validity so device does not have to
>>    write out all used descriptors.
>
>
> Do we have benchmark result to show the advantage of DESC_DRIVER over e.g
> avail/used index?

The KVM forum presentation has some numbers.

I'm not sure that synthetic benchmarks will provide much value, as we
understand the trade-off quite well.

The benefit of this model is improved best case performance, by having
a single cacheline read instead of two for the indirect used/avail ring model.

The drawback is worse worst case, as scanning the ring of descriptors
introduces more cacheline misses than scanning the compressed
used/avail ring.

This model is easier to implement in hardware and the common case is
likely close to the best case, so I think it makes sense.
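
As a back-of-the-envelope illustration of that trade-off, assuming
64-byte cache lines and cache-line-aligned rings (both assumptions on
my side). Entry sizes are the 16-byte descriptor from this proposal and
the 2-byte avail / 8-byte used entries of the virtio 1.0 split ring;
lines_touched is a made-up helper.

```c
#include <stdint.h>

#define CACHELINE 64u

struct packed_desc {
        uint64_t addr;
        uint32_t len;
        uint16_t index;
        uint16_t flags;
};

struct split_used_elem {
        uint32_t id;
        uint32_t len;
};

enum {
        PACKED_DESC_PER_LINE = CACHELINE / sizeof(struct packed_desc),
        SPLIT_AVAIL_PER_LINE = CACHELINE / sizeof(uint16_t),
        SPLIT_USED_PER_LINE  = CACHELINE / sizeof(struct split_used_elem),
};

/* Cache lines covered when scanning n consecutive entries. */
static unsigned lines_touched(unsigned n_entries, unsigned entry_size)
{
        return (n_entries * entry_size + CACHELINE - 1) / CACHELINE;
}
```

So scanning 32 entries covers 8 lines of packed descriptors, versus 1
line of avail entries or 4 lines of used entries.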
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] Re: packed ring layout proposal v3
  2017-09-12 16:23   ` Willem de Bruijn
@ 2017-09-13  1:26     ` Jason Wang
  2017-09-13  1:26     ` Jason Wang
  1 sibling, 0 replies; 92+ messages in thread
From: Jason Wang @ 2017-09-13  1:26 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: virtio-dev, virtualization, Michael S. Tsirkin



On 2017年09月13日 00:23, Willem de Bruijn wrote:
> On Mon, Sep 11, 2017 at 3:47 AM, Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2017年09月10日 13:06, Michael S. Tsirkin wrote:
>>> This is an update from v2 version.
>>> Changes:
>>> - update event suppression mechanism
>>> - add wrap counter: DESC_WRAP flag in addition to
>>>     DESC_DRIVER flag used for validity so device does not have to
>>>     write out all used descriptors.
>>
>> Do we have benchmark result to show the advantage of DESC_DRIVER over e.g
>> avail/used index?
> The KVM forum presentation has some numbers.

Yes. My question may not have been accurate. I meant that maybe we should
benchmark a packed ring layout without DESC_DRIVER, using something like
queue tail/head or producer/consumer (or whatever it is called) instead.
It looks like most modern NICs do not use a flag inside the descriptor to
examine descriptor ownership.

>
> I'm not sure that synthetic benchmarks will provide much value, as we
> understand the trade-off quite well.
>
> The benefit of this model is improved best case performance, by having
> a single cacheline read instead of two for the indirect used/avail ring model.
>
> The drawback is worse worst case, as scanning the ring of descriptors
> introduces more cacheline misses than scanning the compressed
> used/avail ring.

As I replied earlier, it looks like the scanning is not friendly to
batching or prefetching and can cause extra overhead.

>
> This model is easier to implement in hardware and the common case is
> likely close to the best case, so I think it makes sense.

Maybe, but we probably need input from hardware vendors.

Thanks

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: packed ring layout proposal v3
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
                   ` (14 preceding siblings ...)
  2017-09-12 16:20 ` Willem de Bruijn
@ 2017-09-14  8:23 ` Ilya Lesokhin
  2017-09-20  9:11   ` Liang, Cunming
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 92+ messages in thread
From: Ilya Lesokhin @ 2017-09-14  8:23 UTC (permalink / raw)
  To: Michael S. Tsirkin, virtio-dev; +Cc: virtualization

> -----Original Message-----
> From: virtualization-bounces@lists.linux-foundation.org
> [mailto:virtualization-bounces@lists.linux-foundation.org] On Behalf Of
> Michael S. Tsirkin
> Sent: Sunday, September 10, 2017 8:06 AM
> To: virtio-dev@lists.oasis-open.org
> Cc: virtualization@lists.linux-foundation.org
> Subject: packed ring layout proposal v3
> 
> This is an update from v2 version.
...
> When driver descriptors are chained in this way, multiple descriptors are
> treated as a part of a single transaction containing an optional write buffer
> followed by an optional read buffer.
> All descriptors in the chain must have the same ID.
> 
...

I think you should consider removing the "same ID" requirement.

Assuming out of order execution, how is the driver supposed to re-assign unique IDs to the previously
chained descriptors?
Do you expect the driver to copy the original IDs somewhere else before the chaining and then restore them after the chain is
executed?

Thanks,
Ilya

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [virtio-dev] packed ring layout proposal v3
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
@ 2017-09-20  9:11   ` Liang, Cunming
  2017-02-08 17:41 ` [virtio-dev] " Paolo Bonzini
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 92+ messages in thread
From: Liang, Cunming @ 2017-09-20  9:11 UTC (permalink / raw)
  To: Michael S. Tsirkin, virtio-dev; +Cc: virtualization

Hi Michael,

> -----Original Message-----
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> On Behalf Of Michael S. Tsirkin
> Sent: Sunday, September 10, 2017 1:06 PM
> To: virtio-dev@lists.oasis-open.org
> Cc: virtualization@lists.linux-foundation.org
> Subject: [virtio-dev] packed ring layout proposal v3
> 
[...]
> * Descriptor ring:
> 
> Driver writes descriptors with unique index values and DESC_DRIVER set in
> flags.
> Descriptors are written in a ring order: from start to end of ring, wrapping
> around to the beginning.
> Device writes used descriptors with correct len, index, and DESC_HW clear.
> Again descriptors are written in ring order. This might not be the same order
> of driver descriptors, and not all descriptors have to be written out.
> 
> Driver and device are expected to maintain (internally) a wrap-around bit,
> starting at 0 and changing value each time they start writing out descriptors
> at the beginning of the ring. This bit is passed as DESC_WRAP bit in the flags
> field.

One simple question there, trying to understand the usage of DESC_WRAP flag.

The DESC_WRAP bit is a new flag since v2. Is it used to address the 'non power-of-2 ring sizes' item mentioned in v2?

I am confused by the description of the wrap-around bit here: is it an internal wrap-around counter represented by a single bit (0/1)?
DESC_WRAP can appear on any descriptor entry in the ring, so why does the text highlight changing its value at the beginning of the ring?
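
To make the question concrete, this is how I currently read the text.
It is only a sketch of my understanding: ring_pos and the helper names
are mine, and I assume the first pass over the ring uses bit value 0.

```c
#include <stdbool.h>
#include <stdint.h>

#define DESC_WRAP   0x0040
#define DESC_DRIVER 0x0080

/* Hypothetical writer-side state, kept internally by the driver. */
struct ring_pos {
        uint16_t next;  /* next slot to write */
        uint16_t size;  /* number of slots, need not be a power of 2 */
        bool wrap;      /* wrap-around bit, starts at 0 */
};

/* Flags the driver puts in the slot it is about to write. */
static uint16_t driver_flags(const struct ring_pos *p)
{
        return DESC_DRIVER | (p->wrap ? DESC_WRAP : 0);
}

/* Move to the next slot; toggle the bit when starting over at slot 0. */
static void ring_pos_advance(struct ring_pos *p)
{
        if (++p->next == p->size) {
                p->next = 0;
                p->wrap = !p->wrap;
        }
}
```

With this reading, every descriptor written during one pass carries the
same DESC_WRAP value, and the value only changes at slot 0.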

> 
> Flags are always set/cleared last.
> 
> Note that driver can write descriptors out in any order, but device will not
> execute descriptor X+1 until descriptor X has been read as valid.
> 
> Driver operation:
> 
[...]
> 
> DESC_WRAP - device uses this field to detect descriptor change by driver.

Does the device use this field to detect a change of the wrap-around boundary by the driver?

[...]
> 
> Device operation (using descriptors):
> 
[...]
> 
> DESC_WRAP - driver uses this field to detect descriptor change by device.

Does the driver use this field to detect a change of the wrap-around boundary by the device?

If so, the driver doesn't need to maintain any internal wrap-around count; it can be aware of wrap-around through the DESC_WRAP flag alone.


Thanks,
Steve

> 
[...]
> 
> ---
> 
> Note: should this proposal be accepted and approved, one or more
>       claims disclosed to the TC admin and listed on the Virtio TC
>       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
>       might become Essential Claims.
> Note: the page above is unfortunately out of date and out of
>       my hands. I'm in the process of updating ipr disclosures
>       in github instead.  Will make sure all is in place before
>       this proposal is put to vote. As usual this TC operates under the
>       Non-Assertion Mode of the OASIS IPR Policy, which protects
>       anyone implementing the virtio spec.
> 
> --
> MST
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [virtio-dev] packed ring layout proposal v3
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
@ 2017-09-21 13:36   ` Liang, Cunming
  2017-02-08 17:41 ` [virtio-dev] " Paolo Bonzini
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 92+ messages in thread
From: Liang, Cunming @ 2017-09-21 13:36 UTC (permalink / raw)
  To: Michael S. Tsirkin, virtio-dev; +Cc: virtualization

Hi,

> -----Original Message-----
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org] On
> Behalf Of Michael S. Tsirkin
> Sent: Sunday, September 10, 2017 1:06 PM
> To: virtio-dev@lists.oasis-open.org
> Cc: virtualization@lists.linux-foundation.org
> Subject: [virtio-dev] packed ring layout proposal v3
> 
[...]
> * Batching descriptors:
> 
> virtio 1.0 allows passing a batch of descriptors in both directions, by incrementing
> the used/avail index by values > 1.
> At the moment only batching of used descriptors is used.
> 
> We can support this by chaining a list of device descriptors through
> VRING_DESC_F_MORE flag. Device sets this bit to signal driver that this is part of
> a batch of used descriptors which are all part of a single transaction.

This assumes that each s/g chain represents one packet, while each descriptor in a batching chain represents one packet. A few thoughts on the batching chain (VRING_DESC_F_MORE) and the s/g chain (VRING_DESC_F_NEXT):

- batching chain: It's up to the device to coalesce the write-out of batched used descriptors. As the batch size is variable, the driver has to check the validity of each descriptor unless the number of incoming valid descriptors is predictable, so I'm curious what the driver gains from VRING_DESC_F_MORE marking batched descriptors as a single transaction. From the device's perspective, it's great to write out one descriptor for the whole chain. However, that assumes no other useful field in each descriptor of the chain needs to be written. TX buffer reclaiming can be a candidate, while the RX side has to update 'len' at least. Even for this purpose, instead of writing VRING_DESC_F_MORE on several descriptors to suppress device write-back, it's cheaper to set a flag (e.g. VRING_DESC_F_WB) on a single descriptor of the chain to hint where the device is expected to write back.

- s/g chain: It makes sense to indicate the packet boundary. Considering an in-order descriptor ring without VRING_DESC_F_INDIRECT, the next descriptor always belongs to the same s/g chain until an end-of-packet indicator occurs. So one alternative approach is to set a flag (e.g. VRING_DESC_F_EOP) only on the last descriptor of the chain.

> 
> Driver might detect a partial descriptor chain (VRING_DESC_F_MORE set but next
> descriptor not valid). In that case it must not use any parts of the chain - it will
> later be completed by device, but driver is allowed to store the valid parts of the
> chain as device is not allowed to change them anymore.
As each descriptor represents a whole packet (otherwise it would be an s/g chain), I wonder why the driver must not use any part of the chain.

> 
> Descriptor should not have both VRING_DESC_F_MORE and
> VRING_DESC_F_NEXT set.
> 
[...]
> 
> * Selective use of descriptors
> 
> As described above, descriptors with NEXT bit set are part of a scatter/gather
> chain and so do not have to cause device to write a used descriptor out.
> 
> Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to signal to
> device that it does not have to write out the used descriptor as it is part of a batch
> of descriptors. Device has two options (similar to VRING_DESC_F_NEXT):
> 
> Device can write out the same number of descriptors for the batch, setting
> VRING_DESC_F_MORE for all but the last descriptor.
> Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set.
The device writes out the last descriptor without VRING_DESC_F_MORE anyway, so the following statement does not seem to be a separate option.

> 
> Device only writes out a single descriptor for the whole batch.
> However, to keep device and driver in sync, it then skips a number of descriptors
> corresponding to the length of the batch before writing out the next used
> descriptor.
> After detecting a used descriptor driver must find out the length of the batch that
> it built in order to know where to look for the next device descriptor.
It would be good to keep it simple on device side, and to have the driver control the expectation.
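
To make the driver-side accounting concrete, here is a minimal sketch in C of how a driver might advance its "next used descriptor" position past a batch it built itself. The names used_cursor and skip_batch are hypothetical, and tracking a wrap bit next to the position is an assumption borrowed from the "Descriptor ring" section rather than something the proposal spells out for this case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical driver-side bookkeeping, not part of the proposal text. */
struct used_cursor {
    uint16_t pos;   /* ring slot where the next used descriptor is expected */
    bool wrap;      /* flips each time pos passes the end of the ring */
};

/*
 * Skip a whole batch after the device wrote a single used descriptor for it.
 * Assumes 1 <= batch_len <= ring_size; ring_size need not be a power of 2.
 */
static void skip_batch(struct used_cursor *c, uint16_t batch_len,
                       uint16_t ring_size)
{
    uint32_t next = (uint32_t)c->pos + batch_len;

    if (next >= ring_size) {
        next -= ring_size;      /* wrapped around to the ring start */
        c->wrap = !c->wrap;
    }
    c->pos = (uint16_t)next;
}
```

The driver has to remember the length of each batch it submitted, which is the extra accounting the TODO below proposes to simplify by requiring in-order use.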

> 
> 
> TODO (blocker): skipping descriptors for selective and scatter/gather seem to be
> only requested with in-order right now. Let's require in-order for this skipping?
> This will simplify the accounting by driver.
> 
> 

Thanks,
Steve

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v3
  2017-09-20  9:11   ` Liang, Cunming
  (?)
@ 2017-09-25 22:24   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-09-25 22:24 UTC (permalink / raw)
  To: Liang, Cunming; +Cc: virtio-dev, virtualization

On Wed, Sep 20, 2017 at 09:11:57AM +0000, Liang, Cunming wrote:
> Hi Michael,
> 
> > -----Original Message-----
> > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> > On Behalf Of Michael S. Tsirkin
> > Sent: Sunday, September 10, 2017 1:06 PM
> > To: virtio-dev@lists.oasis-open.org
> > Cc: virtualization@lists.linux-foundation.org
> > Subject: [virtio-dev] packed ring layout proposal v3
> > 
> [...]
> > * Descriptor ring:
> > 
> > Driver writes descriptors with unique index values and DESC_DRIVER set in
> > flags.
> > Descriptors are written in a ring order: from start to end of ring, wrapping
> > around to the beginning.
> > Device writes used descriptors with correct len, index, and DESC_HW clear.
> > Again descriptors are written in ring order. This might not be the same order
> > of driver descriptors, and not all descriptors have to be written out.
> > 
> > Driver and device are expected to maintain (internally) a wrap-around bit,
> > starting at 0 and changing value each time they start writing out descriptors
> > at the beginning of the ring. This bit is passed as DESC_WRAP bit in the flags
> > field.
> 
> One simple question there, trying to understand the usage of DESC_WRAP flag.
> 
> DESC_WRAP bit is a new flag since v2. It's used to address 'non power-of-2 ring sizes' mentioned in v2?
> 
> Being confused by the statement of wrap-around bit here, it's an internal wrap-around counter represented by single bit (0/1)?
> DESC_WRAP can appear on any descriptor entry in the ring, why it highlights changing value at the beginning of the ring?


No, this is necessary if not all descriptors are overwritten by device
after they are used.

Each time driver overwrites a descriptor, the value in DESC_WRAP changes
which makes it possible for device to detect that there's a new
descriptor.


> > 
> > Flags are always set/cleared last.
> > 
> > Note that driver can write descriptors out in any order, but device will not
> > execute descriptor X+1 until descriptor X has been read as valid.
> > 
> > Driver operation:
> > 
> [...]
> > 
> > DESC_WRAP - device uses this field to detect descriptor change by driver.
> 
> Device uses this field to detect change of wrap-around boundary by driver? 
> 
> [...]
> > 
> > Device operation (using descriptors):
> > 
> [...]
> > 
> > DESC_WRAP - driver uses this field to detect descriptor change by device.
> 
> Driver uses this field to detect change of wrap-around boundary by device?
>
> By using this, driver doesn't need to maintain any internal wrap-around count, but being aware of wrap-around by DESC_WRAP flag.
> 
> 
> Thanks,
> Steve

So v2 simply said descriptor has a single bit: driver writes 1 there,
device writes 0.

This requires device to overwrite each descriptor and people asked 
for a way to communicate where some descriptors are not overwritten.

This new bit helps device distinguish new and old descriptors written by driver.
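
As an illustration only, the check on the device side could look like the C sketch below. The bit values and the helper name desc_is_new are hypothetical, and the rule itself (DESC_DRIVER set and DESC_WRAP equal to the value the device expects for the current pass over the ring) is one reading of the description above, not normative text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define DESC_DRIVER 0x1u
#define DESC_WRAP   0x2u

/*
 * Device-side test: treat a descriptor as new only if the driver bit is set
 * and its wrap bit matches what the device expects for the current pass.
 * A descriptor left over from the previous pass carries the old wrap value.
 */
static bool desc_is_new(uint16_t flags, bool expected_wrap)
{
    bool wrap = (flags & DESC_WRAP) != 0;

    return (flags & DESC_DRIVER) != 0 && wrap == expected_wrap;
}
```

With this rule, a descriptor the device never overwrote still has DESC_DRIVER set from the previous pass, but its DESC_WRAP value no longer matches, so the device does not mistake it for a new one.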



> > 
> [...]
> > 
> > ---
> > 
> > Note: should this proposal be accepted and approved, one or more
> >       claims disclosed to the TC admin and listed on the Virtio TC
> >       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
> >       might become Essential Claims.
> > Note: the page above is unfortunately out of date and out of
> >       my hands. I'm in the process of updating ipr disclosures
> >       in github instead.  Will make sure all is in place before
> >       this proposal is put to vote. As usual this TC operates under the
> >       Non-Assertion Mode of the OASIS IPR Policy, which protects
> >       anyone implementing the virtio spec.
> > 
> > --
> > MST
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v3
  2017-09-25 22:24   ` Michael S. Tsirkin
  2017-09-26 23:38     ` Steven Luong (sluong)
@ 2017-09-26 23:38     ` Steven Luong (sluong)
  1 sibling, 0 replies; 92+ messages in thread
From: Steven Luong (sluong) @ 2017-09-26 23:38 UTC (permalink / raw)
  To: Michael S. Tsirkin, Liang, Cunming; +Cc: virtio-dev, virtualization

Michael,

Would you please give an example or two how these two flags DESC_DRIVER and DESC_WRAP are used together? Like others, I am confused by the description and still don’t quite grok it.

Steven

On 9/25/17, 3:24 PM, "virtio-dev@lists.oasis-open.org on behalf of Michael S. Tsirkin" <virtio-dev@lists.oasis-open.org on behalf of mst@redhat.com> wrote:

    On Wed, Sep 20, 2017 at 09:11:57AM +0000, Liang, Cunming wrote:
    > Hi Michael,
    > 
    > > -----Original Message-----
    > > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
    > > On Behalf Of Michael S. Tsirkin
    > > Sent: Sunday, September 10, 2017 1:06 PM
    > > To: virtio-dev@lists.oasis-open.org
    > > Cc: virtualization@lists.linux-foundation.org
    > > Subject: [virtio-dev] packed ring layout proposal v3
    > > 
    > [...]
    > > * Descriptor ring:
    > > 
    > > Driver writes descriptors with unique index values and DESC_DRIVER set in
    > > flags.
    > > Descriptors are written in a ring order: from start to end of ring, wrapping
    > > around to the beginning.
    > > Device writes used descriptors with correct len, index, and DESC_HW clear.
    > > Again descriptors are written in ring order. This might not be the same order
    > > of driver descriptors, and not all descriptors have to be written out.
    > > 
    > > Driver and device are expected to maintain (internally) a wrap-around bit,
    > > starting at 0 and changing value each time they start writing out descriptors
    > > at the beginning of the ring. This bit is passed as DESC_WRAP bit in the flags
    > > field.
    > 
    > One simple question there, trying to understand the usage of DESC_WRAP flag.
    > 
    > DESC_WRAP bit is a new flag since v2. It's used to address 'non power-of-2 ring sizes' mentioned in v2?
    > 
    > Being confused by the statement of wrap-around bit here, it's an internal wrap-around counter represented by single bit (0/1)?
    > DESC_WRAP can appear on any descriptor entry in the ring, why it highlights changing value at the beginning of the ring?
    
    
    No, this is necessary if not all descriptors are overwritten by device
    after they are used.
    
    Each time driver overwrites a descriptor, the value in DESC_WRAP changes
    which makes it possible for device to detect that there's a new
    descriptor.
    
    
    > > 
    > > Flags are always set/cleared last.
    > > 
    > > Note that driver can write descriptors out in any order, but device will not
    > > execute descriptor X+1 until descriptor X has been read as valid.
    > > 
    > > Driver operation:
    > > 
    > [...]
    > > 
    > > DESC_WRAP - device uses this field to detect descriptor change by driver.
    > 
    > Device uses this field to detect change of wrap-around boundary by driver? 
    > 
    > [...]
    > > 
    > > Device operation (using descriptors):
    > > 
    > [...]
    > > 
    > > DESC_WRAP - driver uses this field to detect descriptor change by device.
    > 
    > Driver uses this field to detect change of wrap-around boundary by device?
    >
    > By using this, driver doesn't need to maintain any internal wrap-around count, but being aware of wrap-around by DESC_WRAP flag.
    > 
    > 
    > Thanks,
    > Steve
    
    So v2 simply said descriptor has a single bit: driver writes 1 there,
    device writes 0.
    
    This requires device to overwrite each descriptor and people asked 
    for a way to communicate where some descriptors are not overwritten.
    
    This new bit helps device distinguish new and old descriptors written by driver.
    
    
    
    > > 
    > [...]
    > > 
    > > ---
    > > 
    > > Note: should this proposal be accepted and approved, one or more
    > >       claims disclosed to the TC admin and listed on the Virtio TC
    > >       IPR page https://www.oasis-open.org/committees/virtio/ipr.php
    > >       might become Essential Claims.
    > > Note: the page above is unfortunately out of date and out of
    > >       my hands. I'm in the process of updating ipr disclosures
    > >       in github instead.  Will make sure all is in place before
    > >       this proposal is put to vote. As usual this TC operates under the
    > >       Non-Assertion Mode of the OASIS IPR Policy, which protects
    > >       anyone implementing the virtio spec.
    > > 
    > > --
    > > MST
    > > 
    > > ---------------------------------------------------------------------
    > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
    > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
    

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v3
  2017-09-26 23:38     ` Steven Luong (sluong)
@ 2017-09-27 23:49         ` Michael S. Tsirkin
  0 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-09-27 23:49 UTC (permalink / raw)
  To: Steven Luong (sluong); +Cc: virtio-dev, virtualization

On Tue, Sep 26, 2017 at 11:38:18PM +0000, Steven Luong (sluong) wrote:
> Michael,
> 
> Would you please give an example or two how these two flags DESC_DRIVER and DESC_WRAP are used together? Like others, I am confused by the description and still don’t quite grok it.
> 
> Steven

My bad, I will need to work on it. Here is an example:

Let's assume device promised to consume packets in order

ring size = 2

Ring is 0 initialized.

Device initially polls DESC[0].flags for WRAP bit to change.

driver adds:

DESC[0].addr = 1234
DESC[0].id = 0
DESC[0].flags = DESC_DRIVER | DESC_NEXT | DESC_WRAP

and

DESC[1].addr = 5678
DESC[1].id = 1
DESC[1].flags = DESC_DRIVER | DESC_WRAP


it now starts polling DESC[0] flags.


Device reads 1234, executes it, does not use it.

Device reads 5678, executes it, and uses it:

DESC[0].id = 1
DESC[0].flags = 0

Device now polls DESC[0].flags for WRAP bit to change.

Now driver sees that DRIVER bit has been cleared, so it knows that id is
valid. It sees id 1, therefore ids 0 and 1 have been read and are safe to
overwrite.

So it writes it out. It wrapped around to beginning of ring,
so it flips the WRAP bit to 0 on all descriptors now:

DESC[0].addr = 9ABC
DESC[0].id = 0
DESC[0].flags = DESC_DRIVER | DESC_NEXT


DESC[1].addr = DEF0
DESC[1].id = 1
DESC[1].flags = DESC_DRIVER


Next round wrap will be 1 again.


To summarise:

DRIVER bit is used by driver to detect device has used one or more
descriptors.  WRAP is used by device to detect driver has made a
new descriptor available.
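
For readers who want to trace the example step by step, here is a minimal C
sketch of it. The flag bit positions, the descriptor layout and all helper
names are assumptions made for illustration (the thread does not define
them), and memory barriers are omitted. The device-side check treats a slot
as new when DESC_DRIVER is set and DESC_WRAP matches the value expected for
the current pass; that is one possible reading of the example, not specified
behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions are assumptions for this sketch; the thread does not fix them. */
#define DESC_MORE   (1u << 0) /* written as DESC_NEXT in the example; a follow-up corrects the name */
#define DESC_WRAP   (1u << 1)
#define DESC_DRIVER (1u << 2)

#define RING_SIZE 2

/* Hypothetical descriptor layout with only the fields the example uses. */
struct desc {
    uint64_t addr;
    uint16_t id;
    uint16_t flags;
};

/* Driver makes a descriptor available; flags are stored last.
 * A real implementation needs a write barrier before the flags store. */
static inline void driver_add(struct desc *ring, unsigned slot, uint64_t addr,
                              uint16_t id, bool more, bool wrap)
{
    ring[slot].addr = addr;
    ring[slot].id = id;
    ring[slot].flags = (uint16_t)(DESC_DRIVER | (more ? DESC_MORE : 0u) |
                                  (wrap ? DESC_WRAP : 0u));
}

/* Device check (assumed): new when DRIVER is set and WRAP matches this pass. */
static inline bool device_sees_new(const struct desc *ring, unsigned slot,
                                   bool expected_wrap)
{
    bool driver = (ring[slot].flags & DESC_DRIVER) != 0;
    bool wrap = (ring[slot].flags & DESC_WRAP) != 0;
    return driver && wrap == expected_wrap;
}

/* Device uses a descriptor: writes the id, then clears flags as in the example. */
static inline void device_use(struct desc *ring, unsigned slot, uint16_t id)
{
    ring[slot].id = id;
    ring[slot].flags = 0;
}

/* Driver check: a cleared DRIVER bit means the id in this slot is valid. */
static inline bool driver_sees_used(const struct desc *ring, unsigned slot)
{
    return (ring[slot].flags & DESC_DRIVER) == 0;
}
```

Replaying the example with these helpers: after two driver_add() calls with
wrap = true the device sees both slots as new. After device_use(ring, 0, 1)
the driver sees slot 0 as used with id 1. The second slot still carries the
old WRAP value, so the device does not mistake it for a new descriptor until
the driver rewrites it with wrap = false.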


-- 
MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread


* RE: [virtio-dev] packed ring layout proposal v3
  2017-09-27 23:49         ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-09-28  9:44         ` Liang, Cunming
  -1 siblings, 0 replies; 92+ messages in thread
From: Liang, Cunming @ 2017-09-28  9:44 UTC (permalink / raw)
  To: Michael S. Tsirkin, Steven Luong (sluong); +Cc: virtio-dev, virtualization


I get it now. Please correct me if I am missing something.


Flag status hints:

- DESC_DRIVER only: driver owns the descriptor, with no available info ready for device to use

- DESC_DRIVER | DESC_WRAP: driver has prepared an available descriptor, device hasn't used it yet

- None: device has used the descriptor and written it out

- DESC_WRAP only: shall not happen, device makes sure to clear it


Polling behavior:

- Device monitors whether the DESC_WRAP bit is set; if set, it uses the descriptor and clears the DESC_DRIVER bit at the end (note: it always needs to clear DESC_WRAP)

- Driver monitors whether the DESC_DRIVER bit is cleared; if cleared, it reclaims the descriptor (sets DESC_DRIVER) and sets DESC_WRAP once a new available descriptor is ready to go


--
Steve

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@redhat.com]
> Sent: Thursday, September 28, 2017 7:49 AM
> To: Steven Luong (sluong)
> Cc: Liang, Cunming; virtio-dev@lists.oasis-open.org;
> virtualization@lists.linux-foundation.org
> Subject: Re: [virtio-dev] packed ring layout proposal v3
> 
> On Tue, Sep 26, 2017 at 11:38:18PM +0000, Steven Luong (sluong) wrote:
> > Michael,
> >
> > Would you please give an example or two how these two flags
> DESC_DRIVER and DESC_WRAP are used together? Like others, I am
> confused by the description and still don’t quite grok it.
> >
> > Steven
> 
> My bad, I will need to work on it. Here is an example:
> 
> Let's assume device promised to consume packets in order
> 
> ring size = 2
> 
> Ring is 0 initialized.
> 
> Device initially polls DESC[0].flags for WRAP bit to change.
> 
> driver adds:
> 
> DESC[0].addr = 1234
> DESC[0].id = 0
> DESC[0].flags = DESC_DRIVER | DESC_NEXT | DESC_WRAP
> 
> and
> 
> DESC[1].addr = 5678
> DESC[1].id = 1
> DESC[1].flags = DESC_DRIVER | DESC_WRAP
> 
> 
> it now starts polling DESC[0] flags.
> 
> 
> Device reads 1234, executes it, does not use it.
> 
> Device reads 5678, executes it, and uses it:
> 
> DESC[0].id = 1
> DESC[0].flags = 0
> 
> Device now polls DESC[0].flags for WRAP bit to change.
> 
> Now driver sees that DRIVER bit has been cleared, so it knows that id is valid. It
> sees id 1, therefore ids 0 and 1 have been read and are safe to overwrite.
> 
> So it writes it out. It wrapped around to beginning of ring, so it flips the
> WRAP bit to 0 on all descriptors now:
> 
> DESC[0].addr = 9ABC
> DESC[0].id = 0
> DESC[0].flags = DESC_DRIVER | DESC_NEXT
> 
> 
> DESC[1].addr = DEF0
> DESC[1].id = 1
> DESC[1].flags = DESC_DRIVER
> 
> 
> Next round wrap will be 1 again.
> 
> 
> To summarise:
> 
> DRIVER bit is used by driver to detect device has used one or more
> descriptors.  WRAP is used by device to detect driver has made a new
> descriptor available.
> 
> 
> --
> MST

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [virtio-dev] packed ring layout proposal v3
  2017-09-27 23:49         ` Michael S. Tsirkin
                           ` (2 preceding siblings ...)
  (?)
@ 2017-09-28 21:13         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-09-28 21:13 UTC (permalink / raw)
  To: Steven Luong (sluong); +Cc: virtio-dev, virtualization

On Thu, Sep 28, 2017 at 02:49:15AM +0300, Michael S. Tsirkin wrote:
> On Tue, Sep 26, 2017 at 11:38:18PM +0000, Steven Luong (sluong) wrote:
> > Michael,
> > 
> > Would you please give an example or two how these two flags DESC_DRIVER and DESC_WRAP are used together? Like others, I am confused by the description and still don’t quite grok it.
> > 
> > Steven

Note: I made a mistake in the email. Instead of DESC_NEXT it should read
DESC_MORE everywhere. I corrected the quoted text below for simplicity.



> My bad, I will need to work on it. Here is an example:
> 
> Let's assume device promised to consume packets in order
> 
> ring size = 2
> 
> Ring is 0 initialized.
> 
> Device initially polls DESC[0].flags for WRAP bit to change.
> 
> driver adds:
> 
> DESC[0].addr = 1234
> DESC[0].id = 0
> DESC[0].flags = DESC_DRIVER | DESC_MORE | DESC_WRAP
> 
> and
> 
> DESC[1].addr = 5678
> DESC[1].id = 1
> DESC[1].flags = DESC_DRIVER | DESC_WRAP
> 
> 
> it now starts polling DESC[0] flags.
> 
> 
> Device reads 1234, executes it, does not use it.
> 
> Device reads 5678, executes it, and uses it:
> 
> DESC[0].id = 1
> DESC[0].flags = 0
> 
> Device now polls DESC[0].flags for WRAP bit to change.
> 
> Now driver sees that DRIVER bit has been cleared, so it knows that id is
> valid. It sees id 1, therefore ids 0 and 1 have been read and are safe to
> overwrite.
> 
> So it writes it out. It wrapped around to beginning of ring,
> so it flips the WRAP bit to 0 on all descriptors now:
> 
> DESC[0].addr = 9ABC
> DESC[0].id = 0
> DESC[0].flags = DESC_DRIVER | DESC_MORE
> 
> 
> DESC[1].addr = DEF0
> DESC[1].id = 1
> DESC[1].flags = DESC_DRIVER
> 
> 
> Next round wrap will be 1 again.
> 
> 
> To summarise:
> 
> DRIVER bit is used by driver to detect device has used one or more
> descriptors.  WRAP is used by device to detect driver has made a
> new descriptor available.
> 
> 
> -- 
> MST

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [virtio-dev] packed ring layout proposal v3
  2017-09-21 13:36   ` Liang, Cunming
  (?)
@ 2017-09-28 21:27   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-09-28 21:27 UTC (permalink / raw)
  To: Liang, Cunming; +Cc: virtio-dev, virtualization

On Thu, Sep 21, 2017 at 01:36:37PM +0000, Liang, Cunming wrote:
> Hi,
> 
> > -----Original Message-----
> > From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org] On
> > Behalf Of Michael S. Tsirkin
> > Sent: Sunday, September 10, 2017 1:06 PM
> > To: virtio-dev@lists.oasis-open.org
> > Cc: virtualization@lists.linux-foundation.org
> > Subject: [virtio-dev] packed ring layout proposal v3
> > 
> [...]
> > * Batching descriptors:
> > 
> > virtio 1.0 allows passing a batch of descriptors in both directions, by incrementing
> > the used/avail index by values > 1.
> > At the moment only batching of used descriptors is used.
> > 
> > We can support this by chaining a list of device descriptors through
> > VRING_DESC_F_MORE flag. Device sets this bit to signal driver that this is part of
> > a batch of used descriptors which are all part of a single transaction.
> 
> It supposes each s/g chain represents for a packet, while each descriptor among batching chain represents for a packet. There're a few thoughts of batching chain(by VRING_DESC_F_MORE) and s/g chain(by VRING_DESC_F_NEXT).
> 
> - batching chain: It's up to device to coalesce the write-out of a batched used descriptors. As the batching size is variable, driver has to detect validity of each descriptor unless the number of incoming valid descriptor is predictable, being curious on the benefits of driver from VRING_DESC_F_MORE to take  batching descriptors as a single transaction. On device perspective, it's great to write out one descriptor for the whole chain. However, it assumes no other useful fields in each descriptor of chain needs to write. TX buffer reclaiming can be the candidate, while RX side has to update 'len' at least. Even for this purpose, instead of writing out VRING_DESC_F_MORE on a few descriptors to suppress device writing back, it's cheaper to set flag (e.g. VRING_DESC_F_WB) on single descriptor of chain to hint the expected position for device to write back.

But driver does not really benefit from batching and does not know how
many to batch; this depends on device. E.g. a software device does not
need batching at all, while a PCI Express device would want batches to be
multiples of what fits in a PCI Express transaction, etc.  We would have
to provide that info from device to driver.

> - s/g chain: It makes sense to indicate the packet boundary. Considering in-order descriptor ring without VRING_DESC_F_INDIRECT, the next descriptor always belongs to the same s/g chain until end of packet indicators occur. So one alternative approach is only to set a flag (e.g. VRING_DESC_F_EOP) on the last descriptor of the chain. 

EOP would be the reverse of NEXT then? I think it does not matter much,
but NEXT matches what is there in virtio 1.0 right now. It also means that
simple implementations with short buffers can have flags set to 0 which
seems cleaner.


> > 
> > Driver might detect a partial descriptor chain (VRING_DESC_F_MORE set but next
> > descriptor not valid). In that case it must not use any parts of the chain - it will
> > later be completed by device, but driver is allowed to store the valid parts of the
> > chain as device is not allowed to change them anymore.
> As each descriptor represent for a whole packet(otherwise it's s/g chain),

For RX mergeable buffers, a packet is composed of multiple s/g chains.

> wondering why it must not use any parts of the chain. 

This is to match what is there in virtio 1.0 right now: driver
does not touch any used descriptors until the used index is updated.




> > 
> > Descriptor should not have both VRING_DESC_F_MORE and
> > VRING_DESC_F_NEXT set.
> > 
> [...]
> > 
> > * Selective use of descriptors
> > 
> > As described above, descriptors with NEXT bit set are part of a scatter/gather
> > chain and so do not have to cause device to write a used descriptor out.
> > 
> > Similarly, driver can set a flag VRING_DESC_F_MORE in the descriptor to signal to
> > device that it does not have to write out the used descriptor as it is part of a batch
> > of descriptors. Device has two options (similar to VRING_DESC_F_NEXT):
> > 
> > Device can write out the same number of descriptors for the batch, setting
> > VRING_DESC_F_MORE for all but the last descriptor.
> > Driver will ignore all used descriptors with VRING_DESC_F_MORE bit set.
> It will write out last descriptor without VRING_DESC_F_MORE anyway, so the following statement seems not like another option.

I don't understand this statement. All I said is that it's up to device
whether to write out the descriptors with VRING_DESC_F_MORE, or to skip
the write out.

> > 
> > Device only writes out a single descriptor for the whole batch.
> > However, to keep device and driver in sync, it then skips a number of descriptors
> > corresponding to the length of the batch before writing out the next used
> > descriptor.
> > After detecting a used descriptor driver must find out the length of the batch that
> > it built in order to know where to look for the next device descriptor.
> It would be good to keep it simple on device side, and to have the driver control the expectation.

I'm not sure what the above means either.
That is exactly what the above proposal says: device simply writes out a single
descriptor. Driver has to keep track and know where the next one will be
written.

Example

Driver writes two pairs chained with MORE: 0 + 1, 2 + 3
Device writes: 0 and 3
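
The driver-side bookkeeping this implies might look like the C sketch below.
It is only a sketch under stated assumptions: the batch_tracker structure and
its helpers are hypothetical names, the ring size is arbitrary, and it
assumes the single used descriptor for a batch lands in the first slot of
that batch. The proposal text does not pin down the slot, and the example
above may intend a different one.

```c
#include <stdint.h>

#define RING_SIZE 4 /* hypothetical ring size for the sketch */

/* Hypothetical driver-side state: batch lengths and the next slot to poll. */
struct batch_tracker {
    uint16_t len_at[RING_SIZE]; /* batch length, recorded at the batch's first slot */
    uint16_t next_used;         /* slot where the next used descriptor is expected */
};

/* Called when the driver builds a batch of `len` descriptors at `first_slot`. */
static inline void tracker_record(struct batch_tracker *t, uint16_t first_slot,
                                  uint16_t len)
{
    t->len_at[first_slot % RING_SIZE] = len;
}

/* Called after the driver has seen a used descriptor at t->next_used:
 * skip the rest of that batch and return the next slot to poll. */
static inline uint16_t tracker_advance(struct batch_tracker *t)
{
    uint16_t len = t->len_at[t->next_used];
    t->next_used = (uint16_t)((t->next_used + len) % RING_SIZE);
    return t->next_used;
}
```

With two batches of two descriptors each (slots 0 + 1 and 2 + 3), the driver
polls slot 0 first, then slot 2, then wraps back to slot 0.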






> > 
> > 
> > TODO (blocker): skipping descriptors for selective and scatter/gather seem to be
> > only requested with in-order right now. Let's require in-order for this skipping?
> > This will simplify the accounting by driver.
> > 
> > 
> 
> Thanks,
> Steve

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [virtio-dev] packed ring layout proposal v3
  2017-09-28  9:44         ` Liang, Cunming
@ 2017-10-01  4:08             ` Michael S. Tsirkin
  0 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-10-01  4:08 UTC (permalink / raw)
  To: Liang, Cunming; +Cc: virtio-dev, Steven Luong (sluong), virtualization

On Thu, Sep 28, 2017 at 09:44:35AM +0000, Liang, Cunming wrote:
> 
> I get it now. Please correct me if I am missing something.
> 
> 
> Flag status hints:
> 
> - DESC_DRIVER only: driver owns the descriptor, with no available info ready for device to use
> 
> - DESC_DRIVER | DESC_WRAP: driver has prepared an available descriptor, device hasn't used it yet
> 
> - None: device has used the descriptor and written it out
> 
> - DESC_WRAP only: shall not happen, device makes sure to clear it
> 
> 
> Polling behavior:
> 
> - Device monitors whether the DESC_WRAP bit is set; if set, it uses the descriptor and clears the DESC_DRIVER bit at the end (note: it always needs to clear DESC_WRAP)
> 
> - Driver monitors whether the DESC_DRIVER bit is cleared; if cleared, it reclaims the descriptor (sets DESC_DRIVER) and sets DESC_WRAP once a new available descriptor is ready to go
> 
> 
> --
> Steve


Hmm no, not what I had in mind.

DESC_DRIVER: used by driver to poll. Driver sets it when writing a
descriptor.  Device clears it when overwriting a descriptor.
Thus driver uses DESC_DRIVER to detect that device data in
descriptor is valid.



DESC_WRAP: used by device to poll. Driver sets it to a *different*
value every time it overwrites a descriptor. How to achieve it?
Since descriptors are written out in ring order,
simply maintain the current value internally (start value 1) and flip it
every time you overwrite the first descriptor.
Device leaves it intact when overwriting a descriptor.
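
The internal wrap value described above could be kept roughly as follows.
This is a minimal C sketch; the driver_state structure and its helper names
are hypothetical and do not come from the proposal.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical driver-side state for choosing the DESC_WRAP value. */
struct driver_state {
    uint16_t next_slot; /* next ring slot the driver will write */
    uint16_t ring_size;
    bool wrap;          /* DESC_WRAP value for the current pass; starts at 1 */
};

static inline void driver_state_init(struct driver_state *s, uint16_t ring_size)
{
    s->next_slot = 0;
    s->ring_size = ring_size;
    s->wrap = true;
}

/* Returns the DESC_WRAP value to store in the slot about to be written,
 * reports that slot through *slot, and advances. The value flips each time
 * the driver wraps around to the first descriptor. */
static inline bool driver_next_wrap(struct driver_state *s, uint16_t *slot)
{
    bool wrap = s->wrap;

    *slot = s->next_slot;
    if (++s->next_slot == s->ring_size) {
        s->next_slot = 0;
        s->wrap = !s->wrap;
    }
    return wrap;
}
```

For a ring of size 2 this yields WRAP = 1 for slots 0 and 1 on the first
pass, WRAP = 0 for both on the second pass, and WRAP = 1 again on the third.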


After writing down this explanation, I think the names aren't
great.



Let me try an alternative explanation.

---------------
A two-bit field, DRIVER_OWNER, signals the buffer ownership.
It has 4 possible values (written in binary):
values 0b01, 0b11 are written by driver
values 0b00, 0b10 are written by device

each time driver writes out a descriptor, it must make sure
that the high bit in OWNER changes.

each time device writes out a descriptor, it must make sure
that the high bit in OWNER does not change.



this is exactly the same functionally: DRIVER is the low bit and
WRAP is the high bit.  Does this make things clearer?
---------------



Maybe the difference between device and driver
is confusing. We can fix that by changing the values.
Here is an alternative. Let me know if you like it better -
I need to think a bit more to make sure it works,
but to give you an idea:


---------------
A two-bit field, DRIVER_OWNER, signals the buffer ownership.
It has 4 possible values (written in binary):
values 0b01, 0b10 are written by driver
values 0b00, 0b11 are written by device

each time driver writes out a descriptor, it must make sure
that the high bit in OWNER changes. Thus first time
it writes 0b10, next time 0b01, then 0b10 again.

each time device writes out a descriptor, it must make sure
that the low bit in OWNER changes.  Thus first time
it writes 0b11, next time 0b00, then 0b11 again.

---------------
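
Reading the four values above as two-bit binary numbers, this alternative
can be sketched in C as below. The macro and function names are hypothetical,
and the ownership test (bits differ after a driver write, bits match after a
device write) is an inference from the listed values that assumes a
zero-initialised ring; it has not been checked against a full specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two-bit OWNER field: driver flips the high bit, device flips the low bit. */
#define OWNER_HIGH 0x2u
#define OWNER_LOW  0x1u
#define OWNER_MASK 0x3u

/* Value the driver stores when it writes out a descriptor. */
static inline uint8_t owner_driver_write(uint8_t owner)
{
    return (uint8_t)((owner ^ OWNER_HIGH) & OWNER_MASK);
}

/* Value the device stores when it writes out a descriptor. */
static inline uint8_t owner_device_write(uint8_t owner)
{
    return (uint8_t)((owner ^ OWNER_LOW) & OWNER_MASK);
}

/* Starting from 0, the two bits differ exactly when the driver wrote last. */
static inline bool owner_written_by_driver(uint8_t owner)
{
    return ((owner >> 1) & 1u) != (owner & 1u);
}
```

Starting from 0, alternating driver and device writes walk through binary
10, 11, 01, 00, which matches the sequences listed above for each side.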



















> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Thursday, September 28, 2017 7:49 AM
> > To: Steven Luong (sluong)
> > Cc: Liang, Cunming; virtio-dev@lists.oasis-open.org;
> > virtualization@lists.linux-foundation.org
> > Subject: Re: [virtio-dev] packed ring layout proposal v3
> > 
> > On Tue, Sep 26, 2017 at 11:38:18PM +0000, Steven Luong (sluong) wrote:
> > > Michael,
> > >
> > > Would you please give an example or two how these two flags
> > DESC_DRIVER and DESC_WRAP are used together? Like others, I am
> > confused by the description and still don’t quite grok it.
> > >
> > > Steven
> > 
> > My bad, I will need to work on it. Here is an example:
> > 
> > Let's assume device promised to consume packets in order
> > 
> > ring size = 2
> > 
> > Ring is 0 initialized.
> > 
> > Device initially polls DESC[0].flags for WRAP bit to change.
> > 
> > driver adds:
> > 
> > DESC[0].addr = 1234
> > DESC[0].id = 0
> > DESC[0].flags = DESC_DRIVER | DESC_NEXT | DESC_WRAP
> > 
> > and
> > 
> > DESC[1].addr = 5678
> > DESC[1].id = 1
> > DESC[1].flags = DESC_DRIVER | DESC_WRAP
> > 
> > 
> > it now starts polling DESC[0] flags.
> > 
> > 
> > Device reads 1234, executes it, does not use it.
> > 
> > Device reads 5678, executes it, and uses it:
> > 
> > DESC[0].id = 1
> > DESC[0].flags = 0
> > 
> > Device now polls DESC[0].flags for WRAP bit to change.
> > 
> > Now driver sees that DRIVER bit has been cleared, so it knows that id is valid. It
> > sees id 1, therefore id 0 and 1 have been read and are safe to overwrite.
> > 
> > So it writes it out. It wrapped around to beginning of ring, so it flips the
> > WRAP bit to 0 on all descriptors now:
> > 
> > DESC[0].addr = 9ABC
> > DESC[0].id = 0
> > DESC[0].flags = DESC_DRIVER | DESC_NEXT
> > 
> > 
> > DESC[1].addr = DEF0
> > DESC[1].id = 1
> > DESC[1].flags = DESC_DRIVER
> > 
> > 
> > Next round wrap will be 1 again.
> > 
> > 
> > To summarise:
> > 
> > DRIVER bit is used by driver to detect device has used one or more
> > descriptors.  WRAP is used by device to detect driver has made a new
> > descriptor available.
> > 
> > 
> > --
> > MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v3
@ 2017-10-01  4:08             ` Michael S. Tsirkin
  0 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-10-01  4:08 UTC (permalink / raw)
  To: Liang, Cunming; +Cc: Steven Luong (sluong), virtio-dev, virtualization

On Thu, Sep 28, 2017 at 09:44:35AM +0000, Liang, Cunming wrote:
> 
> Get it now. Please correct me if I'm missing something.
> 
> 
> Flags status hints,
> 
> - DESC_DRIVER only: driver owns the descriptor w/o available info ready for device to use
> 
> - DESC_DRIVER | DESC_WRAP: driver has prepared an available descriptor, device hasn't used it yet
> 
> - None: device has used the descriptor, and write descriptor out
> 
> - DESC_WRAP only: shall not happen, device make sure to clear it
> 
> 
> Polling behavior is,
> 
> - Device monitor DESC_WRAP bit set or not; If set, go to use descriptor and clear DESC_DRIVER bit in the end (note: always need to clear DESC_WRAP)
> 
> - Driver monitor DESC_DRIVER bit cleared or not; If cleared, reclaim descriptor(set DESC_DRIVER) and set DESC_WRAP once new available descriptor get ready to go
> 
> 
> --
> Steve


Hmm no, not what I had in mind.

DESC_DRIVER: used by driver to poll. Driver sets it when writing a
descriptor.  Device clears it when overwriting a descriptor.
Thus driver uses DESC_DRIVER to detect that device data in
descriptor is valid.



DESC_WRAP: used by device to poll. Driver sets it to a *different*
value every time it overwrites a descriptor. How to achieve it?
since descriptors are written out in ring order,
simply maintain the current value internally (start value 1) and flip it
every time you overwrite the first descriptor.
Device leaves it intact when overwriting a descriptor.
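
A minimal sketch of the two flags as described above, in Python for
brevity. It models only the flags word of each descriptor. The class and
method names, the bit values, and the device-side expected-wrap counter
are assumptions made for illustration and are not part of the proposal.
Unlike the quoted example, the device here writes back every descriptor
rather than one per batch.

```python
# Bit values are assumptions for illustration only.
DESC_WRAP = 0x1
DESC_DRIVER = 0x2


class Ring:
    """Flags-only model of the ring; addr, id and len are omitted."""

    def __init__(self, size):
        self.flags = [0] * size   # ring is zero-initialized
        self.drv_next = 0         # next slot the driver writes
        self.drv_wrap = 1         # driver's internal wrap value, starts at 1
        self.dev_next = 0         # next slot the device polls
        self.dev_wrap = 1         # WRAP value the device expects (assumption)

    def driver_add(self):
        """Driver makes one descriptor available."""
        wrap_bit = DESC_WRAP if self.drv_wrap else 0
        self.flags[self.drv_next] = DESC_DRIVER | wrap_bit
        self.drv_next += 1
        if self.drv_next == len(self.flags):
            # The next write overwrites the first descriptor: flip the value.
            self.drv_next = 0
            self.drv_wrap ^= 1

    def device_poll(self):
        """Consume one descriptor if the driver made a new one available."""
        seen = 1 if self.flags[self.dev_next] & DESC_WRAP else 0
        if seen != self.dev_wrap:
            return False
        # Overwrite the descriptor: clear DRIVER, leave WRAP intact.
        self.flags[self.dev_next] &= ~DESC_DRIVER
        self.dev_next += 1
        if self.dev_next == len(self.flags):
            self.dev_next = 0
            self.dev_wrap ^= 1
        return True

    def driver_sees_used(self, slot):
        """DRIVER cleared means the device wrote the slot.

        Only meaningful for a slot the driver has written before.
        """
        return not (self.flags[slot] & DESC_DRIVER)
```

With a ring of size 2, two calls to driver_add() leave both flags words at
DESC_DRIVER | DESC_WRAP, and the third call (after the device has used both)
writes DESC_DRIVER with WRAP cleared, which is what the device then detects.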


After writing down this explanation, I think the names aren't
great.



Let me try an alternative explanation.

---------------
A two-bit field, DRIVER_OWNER, signals the buffer ownership.
It has 4 possible values:
values 0x1, 0x11 are written by driver
values 0x0, 0x10 are written by device

each time driver writes out a descriptor, it must make sure
that the high bit in OWNER changes.

each time device writes out a descriptor, it must make sure
that the high bit in OWNER does not change.



this is exactly the same functionally, DRIVER is high bit and
WRAP is the low bit.  Does this make things clearer?
---------------



Maybe the difference between device and driver
is confusing. We can fix that by changing the values.
Here is an alternative. Let me know if you like it better -
I need to think a bit more to make sure it works,
but to give you an idea:


---------------
A two-bit field, DRIVER_OWNER, signals the buffer ownership.
It has 4 possible values:
values 0x1, 0x10 are written by driver
values 0x0, 0x11 are written by device

each time driver writes out a descriptor, it must make sure
that the high bit in OWNER changes. Thus first time
it writes 0x10, next time 0x1, then 0x10 again.

each time device writes out a descriptor, it must make sure
that the low bit in OWNER changes.  Thus first time
it writes 0x11, next time 0x0, then 0x11 again.

---------------
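
To give an idea of what the second alternative looks like for a single
slot, here is a small Python trace. It reads the values above as binary,
and it assumes each side flips exactly one bit per write (driver the high
bit, device the low bit). The function names are made up for the sketch,
and it says nothing about whether the scheme works for a whole ring.

```python
HIGH = 0b10
LOW = 0b01


def driver_write(owner):
    """Driver flips the high bit and leaves the low bit alone."""
    return owner ^ HIGH


def device_write(owner):
    """Device flips the low bit and leaves the high bit alone."""
    return owner ^ LOW


def device_owns(owner):
    """Under this reading the two bits differ exactly when the driver wrote last."""
    return bool(owner & HIGH) != bool(owner & LOW)


def trace(rounds):
    """Return the OWNER values one slot takes, starting from a zeroed ring."""
    owner = 0b00
    seen = []
    for _ in range(rounds):
        owner = driver_write(owner)
        seen.append(owner)
        owner = device_write(owner)
        seen.append(owner)
    return seen


if __name__ == "__main__":
    print([format(v, "02b") for v in trace(2)])
```

Under these assumptions the driver writes 10 then 01, and the device writes
11 then 00, matching the sequences listed above.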

> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Thursday, September 28, 2017 7:49 AM
> > To: Steven Luong (sluong)
> > Cc: Liang, Cunming; virtio-dev@lists.oasis-open.org;
> > virtualization@lists.linux-foundation.org
> > Subject: Re: [virtio-dev] packed ring layout proposal v3
> > 
> > On Tue, Sep 26, 2017 at 11:38:18PM +0000, Steven Luong (sluong) wrote:
> > > Michael,
> > >
> > > Would you please give an example or two how these two flags
> > DESC_DRIVER and DESC_WRAP are used together? Like others, I am
> > confused by the description and still don’t quite grok it.
> > >
> > > Steven
> > 
> > My bad, I will need to work on it. Here is an example:
> > 
> > Let's assume device promised to consume packets in order
> > 
> > ring size = 2
> > 
> > Ring is 0 initialized.
> > 
> > Device initially polls DESC[0].flags for WRAP bit to change.
> > 
> > driver adds:
> > 
> > DESC[0].addr = 1234
> > DESC[0].id = 0
> > DESC[0].flags = DESC_DRIVER | DESC_NEXT | DESC_WRAP
> > 
> > and
> > 
> > DESC[1].addr = 5678
> > DESC[1].id = 1
> > DESC[1].flags = DESC_DRIVER | DESC_WRAP
> > 
> > 
> > it now starts polling DESC[0] flags.
> > 
> > 
> > Device reads 1234, executes it, does not use it.
> > 
> > Device reads 5678, executes it, and uses it:
> > 
> > DESC[0].id = 1
> > DESC[0].flags = 0
> > 
> > Device now polls DESC[0].flags for WRAP bit to change.
> > 
> > Now driver sees that DRIVER bit has been cleared, so it knows that id is valid. It
> > sees id 1, therefore id 0 and 1 have been read and are safe to overwrite.
> > 
> > So it writes it out. It wrapped around to beginning of ring, so it flips the
> > WRAP bit to 0 on all descriptors now:
> > 
> > DESC[0].addr = 9ABC
> > DESC[0].id = 0
> > DESC[0].flags = DESC_DRIVER | DESC_NEXT
> > 
> > 
> > DESC[1].addr = DEF0
> > DESC[1].id = 1
> > DESC[1].flags = DESC_DRIVER
> > 
> > 
> > Next round wrap will be 1 again.
> > 
> > 
> > To summarise:
> > 
> > DRIVER bit is used by driver to detect device has used one or more
> > descriptors.  WRAP is used by device to detect driver has made a new
> > descriptor available.
> > 
> > 
> > --
> > MST
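
The reclaim step in the quoted example (device uses id 1, so ids 0 and 1
are both free) could be sketched as follows. This assumes in-order
consumption as stated in the example. The helper name and the deque of
outstanding ids are hypothetical driver-side bookkeeping, not something
the proposal defines.

```python
from collections import deque


def reclaim_in_order(outstanding, used_id):
    """Pop and return every id up to and including used_id.

    outstanding holds the ids the driver made available, oldest first.
    Assumes the device consumes descriptors in order.
    """
    if used_id not in outstanding:
        raise ValueError("used id was never made available")
    done = []
    while outstanding:
        done.append(outstanding.popleft())
        if done[-1] == used_id:
            break
    return done
```

For the ring of size 2 in the example, calling this with outstanding ids
0 and 1 and a used id of 1 returns both ids and leaves nothing outstanding.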

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v3
  2017-10-01  4:08             ` Michael S. Tsirkin
  (?)
@ 2017-10-04 12:39             ` Jens Freimann
  2017-10-04 12:58               ` Michael S. Tsirkin
  2017-10-04 12:58               ` Michael S. Tsirkin
  -1 siblings, 2 replies; 92+ messages in thread
From: Jens Freimann @ 2017-10-04 12:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, Steven Luong (sluong), virtualization

On Sun, Oct 01, 2017 at 04:08:29AM +0000, Michael S. Tsirkin wrote:
>On Thu, Sep 28, 2017 at 09:44:35AM +0000, Liang, Cunming wrote:
>>
>> Get it now. Please correct me if I'm missing something.
>>
>>
>> Flags status hints,
>>
>> - DESC_DRIVER only: driver owns the descriptor w/o available info ready for device to use
>>
>> - DESC_DRIVER | DESC_WRAP: driver has prepared an available descriptor, device hasn't used it yet
>>
>> - None: device has used the descriptor, and write descriptor out
>>
>> - DESC_WRAP only: shall not happen, device make sure to clear it
>>
>>
>> Polling behavior is,
>>
>> - Device monitor DESC_WRAP bit set or not; If set, go to use descriptor and clear DESC_DRIVER bit in the end (note: always need to clear DESC_WRAP)
>>
>> - Driver monitor DESC_DRIVER bit cleared or not; If cleared, reclaim descriptor(set DESC_DRIVER) and set DESC_WRAP once new available descriptor get ready to go
>>
>>
>> --
>> Steve
>
>
>Hmm no, not what I had in mind.
>
>DESC_DRIVER: used by driver to poll. Driver sets it when writing a
>descriptor.  Device clears it when overwriting a descriptor.
>Thus driver uses DESC_DRIVER to detect that device data in
>descriptor is valid.

Basically DESC_HW from v2 split in two?

>
>DESC_WRAP: used by device to poll. Driver sets it to a *different*
>value every time it overwrites a descriptor. 
>How to achieve it?
>since descriptors are written out in ring order,
>simply maintain the current value internally (start value 1) and flip it
>every time you overwrite the first descriptor.
>Device leaves it intact when overwriting a descriptor.

This is confusing me a bit.

My understanding is: 
1. the internally kept wrap value is only flipped when the first
descriptor is overwritten

2. the moment the first descriptor is written the internal wrap value
is flipped 0->1 or 1->0 and this value is written to every descriptor
DESC_WRAP until we reach the first descriptor again

>
>After writing down this explanation, I think the names aren't
>great.
>
>Let me try an alternative explanation.
>
>---------------
>A two-bit field, DRIVER_OWNER, signals the buffer ownership.
>It has 4 possible values:
>values 0x1, 0x11 are written by driver
>values 0x0, 0x10 are written by device

The 0x prefix might add to the confusion here. It is really just two
bits, no?

>each time driver writes out a descriptor, it must make sure
>that the high bit in OWNER changes.
>
>each time device writes out a descriptor, it must make sure
>that the high bit in OWNER does not change.
>
>this is exactly the same functionally, DRIVER is high bit and
>WRAP is the low bit.  Does this make things clearer?

So far it makes sense to me.
>---------------
>
>
>
>Maybe the difference between device and driver
>is confusing. We can fix that by changing the values.
>Here is an alternative. Let me know if you like it better -
>I need to think a bit more to make sure it works,
>but to give you an idea:
>
>
>---------------
>A two-bit field, DRIVER_OWNER, signals the buffer ownership.
>It has 4 possible values:
>values 0x1, 0x10 are written by driver
>values 0x0, 0x11 are written by device
>
>each time driver writes out a descriptor, it must make sure
>that the high bit in OWNER changes. Thus first time
>it writes 0x10, next time 0x1, then 0x10 again.
>
>each time device writes out a descriptor, it must make sure
>that the low bit in OWNER changes.  Thus first time
>it writes 0x11, next time 0x0, then 0x11 again.

DESC_WRAP is changed by the device now, so this would work differently
than in the scenario from above. 
This would mean we don't need the internally kept wrap value, right?


regards,
Jens 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v3
  2017-10-04 12:39             ` Jens Freimann
@ 2017-10-04 12:58               ` Michael S. Tsirkin
  2017-10-04 12:58               ` Michael S. Tsirkin
  1 sibling, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-10-04 12:58 UTC (permalink / raw)
  To: Jens Freimann; +Cc: virtio-dev, Steven Luong (sluong), virtualization

On Wed, Oct 04, 2017 at 02:39:01PM +0200, Jens Freimann wrote:
> On Sun, Oct 01, 2017 at 04:08:29AM +0000, Michael S. Tsirkin wrote:
> > On Thu, Sep 28, 2017 at 09:44:35AM +0000, Liang, Cunming wrote:
> > > 
> > > Get it now. Please correct me if I'm missing something.
> > > 
> > > 
> > > Flags status hints,
> > > 
> > > - DESC_DRIVER only: driver owns the descriptor w/o available info ready for device to use
> > > 
> > > - DESC_DRIVER | DESC_WRAP: driver has prepared an available descriptor, device hasn't used it yet
> > > 
> > > - None: device has used the descriptor, and write descriptor out
> > > 
> > > - DESC_WRAP only: shall not happen, device make sure to clear it
> > > 
> > > 
> > > Polling behavior is,
> > > 
> > > - Device monitor DESC_WRAP bit set or not; If set, go to use descriptor and clear DESC_DRIVER bit in the end (note: always need to clear DESC_WRAP)
> > > 
> > > - Driver monitor DESC_DRIVER bit cleared or not; If cleared, reclaim descriptor(set DESC_DRIVER) and set DESC_WRAP once new available descriptor get ready to go
> > > 
> > > 
> > > --
> > > Steve
> > 
> > 
> > Hmm no, not what I had in mind.
> > 
> > DESC_DRIVER: used by driver to poll. Driver sets it when writing a
> > descriptor.  Device clears it when overwriting a descriptor.
> > Thus driver uses DESC_DRIVER to detect that device data in
> > descriptor is valid.
> 
> Basically DESC_HW from v2 split in two?

Yes in order to avoid overwriting all descriptors.

> > 
> > DESC_WRAP: used by device to poll. Driver sets it to a *different*
> > value every time it overwrites a descriptor. How to achieve it?
> > since descriptors are written out in ring order,
> > simply maintain the current value internally (start value 1) and flip it
> > every time you overwrite the first descriptor.
> > Device leaves it intact when overwriting a descriptor.
> 
> This is confusing me a bit.
> 
> My understanding is:
> 1. the internally kept wrap value is only flipped when the first
> descriptor is overwritten
> 
> 2. the moment the first descriptor is written the internal wrap value
> is flipped 0->1 or 1->0 and this value is written to every descriptor
> DESC_WRAP until we reach the first descriptor again

Yes this is what I tried to say. Can you suggest a better wording then?

> > 
> > After writing down this explanation, I think the names aren't
> > great.
> > 
> > Let me try an alternative explanation.
> > 
> > ---------------
> > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
> > It has 4 possible values:
> > values 0x1, 0x11 are written by driver
> > values 0x0, 0x10 are written by device
> 
> The 0x prefix might add to the confusion here. It is really just two
> bits, no?

Ouch. Yes I meant 0b. Thanks!

> > each time driver writes out a descriptor, it must make sure
> > that the high bit in OWNER changes.
> > 
> > each time device writes out a descriptor, it must make sure
> > that the high bit in OWNER does not change.
> > 
> > this is exactly the same functionally, DRIVER is high bit and
> > WRAP is the low bit.  Does this make things clearer?
> 
> So far it makes sense to me.
> > ---------------
> > 
> > 
> > 
> > Maybe the difference between device and driver
> > is confusing. We can fix that by changing the values.
> > Here is an alternative. Let me know if you like it better -
> > I need to think a bit more to make sure it works,
> > but to give you an idea:
> > 
> > 
> > ---------------
> > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
> > It has 4 possible values:
> > values 0x1, 0x10 are written by driver
> > values 0x0, 0x11 are written by device
> > 
> > each time driver writes out a descriptor, it must make sure
> > that the high bit in OWNER changes. Thus first time
> > it writes 0x10, next time 0x1, then 0x10 again.
> > 
> > each time device writes out a descriptor, it must make sure
> > that the low bit in OWNER changes.  Thus first time
> > it writes 0x11, next time 0x0, then 0x11 again.
> 
> DESC_WRAP is changed by the device now, so this would work differently
> than in the scenario from above. This would mean we don't need the
> internally kept wrap value, right?

I think no but I did not complete simulating this yet so I don't
yet know if it even works.

> 
> regards,
> Jens

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v3
  2017-10-04 12:39             ` Jens Freimann
  2017-10-04 12:58               ` Michael S. Tsirkin
@ 2017-10-04 12:58               ` Michael S. Tsirkin
  2017-10-10  9:56                 ` Liang, Cunming
  2017-10-10  9:56                 ` Liang, Cunming
  1 sibling, 2 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-10-04 12:58 UTC (permalink / raw)
  To: Jens Freimann
  Cc: Liang, Cunming, Steven Luong (sluong), virtio-dev, virtualization

On Wed, Oct 04, 2017 at 02:39:01PM +0200, Jens Freimann wrote:
> On Sun, Oct 01, 2017 at 04:08:29AM +0000, Michael S. Tsirkin wrote:
> > On Thu, Sep 28, 2017 at 09:44:35AM +0000, Liang, Cunming wrote:
> > > 
> > > Get it now. Please correct me if I'm missing something.
> > > 
> > > 
> > > Flags status hints,
> > > 
> > > - DESC_DRIVER only: driver owns the descriptor w/o available info ready for device to use
> > > 
> > > - DESC_DRIVER | DESC_WRAP: driver has prepared an available descriptor, device hasn't used it yet
> > > 
> > > - None: device has used the descriptor, and write descriptor out
> > > 
> > > - DESC_WRAP only: shall not happen, device make sure to clear it
> > > 
> > > 
> > > Polling behavior is,
> > > 
> > > - Device monitor DESC_WRAP bit set or not; If set, go to use descriptor and clear DESC_DRIVER bit in the end (note: always need to clear DESC_WRAP)
> > > 
> > > - Driver monitor DESC_DRIVER bit cleared or not; If cleared, reclaim descriptor(set DESC_DRIVER) and set DESC_WRAP once new available descriptor get ready to go
> > > 
> > > 
> > > --
> > > Steve
> > 
> > 
> > Hmm no, not what I had in mind.
> > 
> > DESC_DRIVER: used by driver to poll. Driver sets it when writing a
> > descriptor.  Device clears it when overwriting a descriptor.
> > Thus driver uses DESC_DRIVER to detect that device data in
> > descriptor is valid.
> 
> Basically DESC_HW from v2 split in two?

Yes in order to avoid overwriting all descriptors.

> > 
> > DESC_WRAP: used by device to poll. Driver sets it to a *different*
> > value every time it overwrites a descriptor. How to achieve it?
> > since descriptors are written out in ring order,
> > simply maintain the current value internally (start value 1) and flip it
> > every time you overwrite the first descriptor.
> > Device leaves it intact when overwriting a descriptor.
> 
> This is confusing me a bit.
> 
> My understanding is:
> 1. the internally kept wrap value is only flipped when the first
> descriptor is overwritten
> 
> 2. the moment the first descriptor is written the internal wrap value
> is flipped 0->1 or 1->0 and this value is written to every descriptor
> DESC_WRAP until we reach the first descriptor again

Yes this is what I tried to say. Can you suggest a better wording then?

> > 
> > After writing down this explanation, I think the names aren't
> > great.
> > 
> > Let me try an alternative explanation.
> > 
> > ---------------
> > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
> > It has 4 possible values:
> > values 0x1, 0x11 are written by driver
> > values 0x0, 0x10 are written by device
> 
> The 0x prefix might add to the confusion here. It is really just two
> bits, no?

Ouch. Yes I meant 0b. Thanks!

> > each time driver writes out a descriptor, it must make sure
> > that the high bit in OWNER changes.
> > 
> > each time device writes out a descriptor, it must make sure
> > that the high bit in OWNER does not change.
> > 
> > this is exactly the same functionally, DRIVER is high bit and
> > WRAP is the low bit.  Does this make things clearer?
> 
> So far it makes sense to me.
> > ---------------
> > 
> > 
> > 
> > Maybe the difference between device and driver
> > is confusing. We can fix that by changing the values.
> > Here is an alternative. Let me know if you like it better -
> > I need to think a bit more to make sure it works,
> > but to give you an idea:
> > 
> > 
> > ---------------
> > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
> > It has 4 possible values:
> > values 0x1, 0x10 are written by driver
> > values 0x0, 0x11 are written by device
> > 
> > each time driver writes out a descriptor, it must make sure
> > that the high bit in OWNER changes. Thus first time
> > it writes 0x10, next time 0x1, then 0x10 again.
> > 
> > each time device writes out a descriptor, it must make sure
> > that the low bit in OWNER changes.  Thus first time
> > it writes 0x11, next time 0x0, then 0x11 again.
> 
> DESC_WRAP is changed by the device now, so this would work differently
> than in the scenario from above. This would mean we don't need the
> internally kept wrap value, right?

I think no but I did not complete simulating this yet so I don't
yet know if it even works.

> 
> regards,
> Jens


^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: packed ring layout proposal v3
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
                   ` (17 preceding siblings ...)
  2017-09-21 13:36   ` Liang, Cunming
@ 2017-10-08  6:16 ` Ilya Lesokhin
  2017-10-08  6:16 ` [virtio-dev] " Ilya Lesokhin
  19 siblings, 0 replies; 92+ messages in thread
From: Ilya Lesokhin @ 2017-10-08  6:16 UTC (permalink / raw)
  To: Michael S. Tsirkin, virtio-dev; +Cc: virtualization

> > -----Original Message-----
> > From: virtualization-bounces@lists.linux-foundation.org
> > [mailto:virtualization-bounces@lists.linux-foundation.org] On Behalf
> > Of Michael S. Tsirkin
> >
> > This is an update from v2 version.
>> ...
> > When driver descriptors are chained in this way, multiple descriptors
> > are treated as a part of a single transaction containing an optional
> > write buffer followed by an optional read buffer.
> > All descriptors in the chain must have the same ID.
> >

I apologize for the repost; I didn't realize I had to be a member of the
virtio-dev mailing list.

I'm concerned about the "same ID" requirement in chained descriptors.
 
Assuming out of order execution, how is the driver supposed to re-assign
unique IDs to the previously chained descriptor?
Is the driver expected to copy original IDs somewhere else before the
chaining and then restore the IDs after the chain is executed?
 
Thanks,
Ilya

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [virtio-dev] RE: packed ring layout proposal v3
  2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
                   ` (18 preceding siblings ...)
  2017-10-08  6:16 ` Ilya Lesokhin
@ 2017-10-08  6:16 ` Ilya Lesokhin
  2017-10-25 16:20     ` [virtio-dev] " Michael S. Tsirkin
  19 siblings, 1 reply; 92+ messages in thread
From: Ilya Lesokhin @ 2017-10-08  6:16 UTC (permalink / raw)
  To: Michael S. Tsirkin, virtio-dev; +Cc: virtualization

> > -----Original Message-----
> > From: virtualization-bounces@lists.linux-foundation.org
> > [mailto:virtualization-bounces@lists.linux-foundation.org] On Behalf
> > Of Michael S. Tsirkin
> >
> > This is an update from v2 version.
>> ...
> > When driver descriptors are chained in this way, multiple descriptors
> > are treated as a part of a single transaction containing an optional
> > write buffer followed by an optional read buffer.
> > All descriptors in the chain must have the same ID.
> >

I apologize for the repost; I didn't realize I had to be a member of the
virtio-dev mailing list.

I'm concerned about the "same ID" requirement in chained descriptors.
 
Assuming out of order execution, how is the driver supposed to re-assign
unique IDs to the previously chained descriptor?
Is the driver expected to copy original IDs somewhere else before the
chaining and then restore the IDs after the chain is executed?
 
Thanks,
Ilya



^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [virtio-dev] packed ring layout proposal v3
  2017-10-04 12:58               ` Michael S. Tsirkin
@ 2017-10-10  9:56                 ` Liang, Cunming
  2017-10-10  9:56                 ` Liang, Cunming
  1 sibling, 0 replies; 92+ messages in thread
From: Liang, Cunming @ 2017-10-10  9:56 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jens Freimann
  Cc: virtio-dev, Steven Luong (sluong), virtualization



> -----Original Message-----
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> On Behalf Of Michael S. Tsirkin
> Sent: Wednesday, October 4, 2017 8:58 PM
> To: Jens Freimann <jfreimann@redhat.com>
> Cc: Liang, Cunming <cunming.liang@intel.com>; Steven Luong (sluong)
> <sluong@cisco.com>; virtio-dev@lists.oasis-open.org;
> virtualization@lists.linux-foundation.org
> Subject: Re: [virtio-dev] packed ring layout proposal v3
> 
> On Wed, Oct 04, 2017 at 02:39:01PM +0200, Jens Freimann wrote:
> > On Sun, Oct 01, 2017 at 04:08:29AM +0000, Michael S. Tsirkin wrote:
> > > On Thu, Sep 28, 2017 at 09:44:35AM +0000, Liang, Cunming wrote:
> > > >
> > > > Get it now. Please correct me if I'm missing something.
> > > >
> > > >
> > > > Flags status hints,
> > > >
> > > > - DESC_DRIVER only: driver owns the descriptor w/o available info
> > > > ready for device to use
> > > >
> > > > - DESC_DRIVER | DESC_WRAP: driver has prepared an available
> > > > descriptor, device hasn't used it yet
> > > >
> > > > - None: device has used the descriptor, and write descriptor out
> > > >
> > > > - DESC_WRAP only: shall not happen, device make sure to clear it
> > > >
> > > >
> > > > Polling behavior is,
> > > >
> > > > - Device monitor DESC_WRAP bit set or not; If set, go to use
> > > > descriptor and clear DESC_DRIVER bit in the end (note: always need
> > > > to clear DESC_WRAP)
> > > >
> > > > - Driver monitor DESC_DRIVER bit cleared or not; If cleared,
> > > > reclaim descriptor(set DESC_DRIVER) and set DESC_WRAP once new
> > > > available descriptor get ready to go
> > > >
> > > >
> > > > --
> > > > Steve
> > >
> > >
> > > Hmm no, not what I had in mind.
> > >
> > > DESC_DRIVER: used by driver to poll. Driver sets it when writing a
> > > descriptor.  Device clears it when overwriting a descriptor.
> > > Thus driver uses DESC_DRIVER to detect that device data in
> > > descriptor is valid.
> >
> > Basically DESC_HW from v2 split in two?
> 
> Yes in order to avoid overwriting all descriptors.
> 
> > >
> > > DESC_WRAP: used by device to poll. Driver sets it to a *different*
> > > value every time it overwrites a descriptor. How to achieve it?
> > > since descriptors are written out in ring order, simply maintain the
> > > current value internally (start value 1) and flip it every time you
> > > overwrite the first descriptor.
> > > Device leaves it intact when overwriting a descriptor.
Ok, get it now.

> >
> > This is confusing me a bit.
> >
> > My understanding is: 1. the internally kept wrap value is only flipped
> > when the first descriptor is overwritten
> >
> > 2. the moment the first descriptor is written the internal wrap value
> > is flipped 0->1 or 1->0 and this value is written to every descriptor
> > DESC_WRAP until we reach the first descriptor again
That's right, it's also my take.
DESC_WRAP is only used by driver, device does nothing with that flag.

> 
> Yes this is what I tried to say. Can you suggest a better wording then?
The term DESC_WRAP is fine with me.

> 
> > >
> > > After writing down this explanation, I think the names aren't great.
> > >
> > > Let me try an alternative explanation.
> > >
> > > ---------------
> > > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
> > > It has 4 possible values:
> > > values 0x1, 0x11 are written by driver values 0x0, 0x10 are written
> > > by device
> >
> > The 0x prefix might add to the confusion here. It is really just two
> > bits, no?
> 
> Ouch. Yes I meant 0b. Thanks!
0b00, 0b10 are written by device?
I suppose the device can only clear the high bit and must leave the low bit unchanged.
Then the value written by the device can be either 0b01 or 0b00, but 0b10 would mean setting the high bit, no?
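
To make the point concrete, a tiny Python check, assuming DRIVER is the
high bit and WRAP is the low bit as stated in the quoted text. The names
are only for illustration.

```python
DRIVER = 0b10   # high bit (assumed position)
WRAP = 0b01     # low bit (assumed position)


def device_write(owner):
    """Clear the high bit; keep the low bit as the driver left it."""
    return owner & WRAP


def device_values():
    """All values the device can produce from driver-written values."""
    driver_written = (DRIVER, DRIVER | WRAP)   # 0b10 and 0b11
    return sorted(device_write(v) for v in driver_written)
```

Under that assumption the device can only ever write 0b00 or 0b01, never 0b10.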

> 
> > > each time driver writes out a descriptor, it must make sure that the
> > > high bit in OWNER changes.
> > >
> > > each time device writes out a descriptor, it must make sure that the
> > > high bit in OWNER does not change.
Typo here? Should it be "..., it must make sure that the low bit in OWNER does not change."?
As for the high bit in OWNER, each time the device writes out a descriptor, it must make sure to clear it.
 
> > >
> > > this is exactly the same functionally, DRIVER is high bit and WRAP
> > > is the low bit.  Does this make things clearer?
> >
> > So far it makes sense to me.
It sounds good.

> > > ---------------
> > >
> > >
> > >
> > > Maybe the difference between device and driver is confusing. We can
> > > fix that by changing the values.
> > > Here is an alternative. Let me know if you like it better - I need
> > > to think a bit more to make sure it works, but to give you an idea:
> > >
> > >
> > > ---------------
> > > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
> > > It has 4 possible values:
> > > values 0x1, 0x10 are written by driver values 0x0, 0x11 are written
> > > by device
> > >
> > > each time driver writes out a descriptor, it must make sure that the
> > > high bit in OWNER changes. Thus first time it writes 0x10, next time
> > > 0x1, then 0x10 again.
> > >
> > > each time device writes out a descriptor, it must make sure that the
> > > low bit in OWNER changes.  Thus first time it writes 0x11, next time
> > > 0x0, then 0x11 again.
> >
> > DESC_WRAP is changed by the device now, so this would work differently
> > than in the scenario from above. This would mean we don't need the
> > internally kept wrap value, right?
> 
> I think no but I did not complete simulating this yet so I don't yet know if it
> even works.
> 
> >
> > regards,
> > Jens
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [virtio-dev] packed ring layout proposal v3
  2017-10-04 12:58               ` Michael S. Tsirkin
  2017-10-10  9:56                 ` Liang, Cunming
@ 2017-10-10  9:56                 ` Liang, Cunming
  2017-10-11 12:22                   ` Jens Freimann
  1 sibling, 1 reply; 92+ messages in thread
From: Liang, Cunming @ 2017-10-10  9:56 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jens Freimann
  Cc: Steven Luong (sluong), virtio-dev, virtualization



> -----Original Message-----
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> On Behalf Of Michael S. Tsirkin
> Sent: Wednesday, October 4, 2017 8:58 PM
> To: Jens Freimann <jfreimann@redhat.com>
> Cc: Liang, Cunming <cunming.liang@intel.com>; Steven Luong (sluong)
> <sluong@cisco.com>; virtio-dev@lists.oasis-open.org;
> virtualization@lists.linux-foundation.org
> Subject: Re: [virtio-dev] packed ring layout proposal v3
> 
> On Wed, Oct 04, 2017 at 02:39:01PM +0200, Jens Freimann wrote:
> > On Sun, Oct 01, 2017 at 04:08:29AM +0000, Michael S. Tsirkin wrote:
> > > On Thu, Sep 28, 2017 at 09:44:35AM +0000, Liang, Cunming wrote:
> > > >
> > > > Get it now. Please correct me if I missing something.
> > > >
> > > >
> > > > Flags status hints,
> > > >
> > > > - DESC_DRIVER only: driver owns the descriptor w/o available info
> > > > ready for device to use
> > > >
> > > > - DESC_DRIVER | DESC_WRAP: driver has prepared an available
> > > > descriptor, device hasn't used it yet
> > > >
> > > > - None: device has used the descriptor, and write descriptor out
> > > >
> > > > - DESC_WRAP only: shall not happen, device make sure to clear it
> > > >
> > > >
> > > > Polling behavior is,
> > > >
> > > > - Device monitor DESC_WRAP bit set or not; If set, go to use
> > > > descriptor and clear DESC_DRIVER bit in the end (note: always need
> > > > to clear DESC_WRAP)
> > > >
> > > > - Driver monitor DESC_DRIVER bit cleared or not; If cleared,
> > > > reclaim descriptor(set DESC_DRIVER) and set DESC_WRAP once new
> > > > available descriptor get ready to go
> > > >
> > > >
> > > > --
> > > > Steve
> > >
> > >
> > > Hmm no, not what I had in mind.
> > >
> > > DESC_DRIVER: used by driver to poll. Driver sets it when writing a
> > > descriptor.  Device clears it when overwriting a descriptor.
> > > Thus driver uses DESC_DRIVER to detect that device data in
> > > descriptor is valid.
> >
> > Basically DESC_HW from v2 split in two?
> 
> Yes in order to avoid overwriting all descriptors.
> 
> > >
> > > DESC_WRAP: used by device to poll. Driver sets it to a *different*
> > > value every time it overwrites a descriptor. How to achieve it?
> > > since descriptors are written out in ring order, simply maintain the
> > > current value internally (start value 1) and flip it every time you
> > > overwrite the first descriptor.
> > > Device leaves it intact when overwriting a descriptor.
OK, I get it now.

> >
> > This is confusing me a bit.
> >
> > My understanding is: 1. the internally kept wrap value only flipped
> > when the first descriptor is overwritten
> >
> > 2. the moment the first descriptor is written the internal wrap value
> > is flipped 0->1 or 1->0 and this value is written to every descriptor
> > DESC_WRAP until we reach the first descriptor again
That's right, that is also my understanding.
DESC_WRAP is only written by the driver; the device never changes that flag.

> 
> Yes this is what I tried to say. Can you suggest a better wording then?
The term DESC_WRAP is fine with me.

> 
> > >
> > > After writing down this explanation, I think the names aren't great.
> > >
> > > Let me try an alternative explanation.
> > >
> > > ---------------
> > > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
> > > It has 4 possible values:
> > > values 0x1, 0x11 are written by driver values 0x0, 0x10 are written
> > > by device
> >
> > The 0x prefix might add to the confusion here. It is really just two
> > bits, no?
> 
> Ouch. Yes I meant 0b. Thanks!
0b00 and 0b10 are written by the device?
I suppose the device can only clear the high bit and must leave the low bit unchanged.
Then the value written by the device can be either 0b01 or 0b00, but 0b10 would mean setting the high bit, no?

> 
> > > each time driver writes out a descriptor, it must make sure that the
> > > high bit in OWNER changes.
> > >
> > > each time device writes out a descriptor, it must make sure that the
> > > high bit in OWNER does not change.
Typo here? Shouldn't it be "..., it must make sure that the low bit in OWNER does not change."?
As for the high bit in OWNER, each time the device writes out a descriptor, it must make sure to clear it.
 
> > >
> > > this is exactly the same functionally, DRIVER is high bit and WRAP
> > > is the low bit.  Does this make things clearer?
> >
> > So far it makes sense to me.
It sounds good.
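
To spell out that reading, here is a tiny C sketch. It assumes, as the quoted text says, that DRIVER is the high bit and WRAP is the low bit of the two-bit OWNER field. The macro and function names are invented for illustration and are not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>

#define OWNER_DRIVER 0x2u   /* high bit (assumed from the quoted text) */
#define OWNER_WRAP   0x1u   /* low bit (assumed from the quoted text) */

/* Driver write: always sets DRIVER and stores its current wrap value. */
uint8_t owner_driver_write(int wrap)
{
    return (uint8_t)(OWNER_DRIVER | (wrap ? OWNER_WRAP : 0u));
}

/* Device write: clears DRIVER and leaves WRAP as the driver wrote it. */
uint8_t owner_device_write(uint8_t owner)
{
    assert(owner <= 0x3u);  /* OWNER is only two bits wide */
    return (uint8_t)(owner & OWNER_WRAP);
}
```

Under this assumption the driver can only ever write 0b10 or 0b11 and the device can only ever write 0b00 or 0b01, which is why the value lists quoted above look off to me.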

> > > ---------------
> > >
> > >
> > >
> > > Maybe the difference between device and driver is confusing. We can
> > > fix that by changing the values.
> > > Here is an alternative. Let me know if you like it better - I need
> > > to think a bit more to make sure it works, but to give you an idea:
> > >
> > >
> > > ---------------
> > > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
> > > It has 4 possible values:
> > > values 0x1, 0x10 are written by driver values 0x0, 0x11 are written
> > > by device
> > >
> > > each time driver writes out a descriptor, it must make sure that the
> > > high bit in OWNER changes. Thus first time it writes 0x10, next time
> > > 0x1, then 0x10 again.
> > >
> > > each time device writes out a descriptor, it must make sure that the
> > > low bit in OWNER changes.  Thus first time it writes 0x11, next time
> > > 0x0, then 0x11 again.
> >
> > DESC_WRAP is changed by the device now, so this would work differently
> > than in the scenario from above. This would mean we don't need the
> > internally kept wrap value, right?
> 
> I think no but I did not complete simulating this yet so I don't yet know if it
> even works.
> 
> >
> > regards,
> > Jens
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v3
  2017-10-10  9:56                 ` Liang, Cunming
@ 2017-10-11 12:22                   ` Jens Freimann
  0 siblings, 0 replies; 92+ messages in thread
From: Jens Freimann @ 2017-10-11 12:22 UTC (permalink / raw)
  To: Liang, Cunming
  Cc: virtio-dev, virtualization, Steven Luong (sluong), Michael S. Tsirkin

On Tue, Oct 10, 2017 at 09:56:44AM +0000, Liang, Cunming wrote:
>> > > DESC_WRAP: used by device to poll. Driver sets it to a *different*
>> > > value every time it overwrites a descriptor. How to achieve it?
>> > > since descriptors are written out in ring order, simply maintain the
>> > > current value internally (start value 1) and flip it every time you
>> > > overwrite the first descriptor.
>> > > Device leaves it intact when overwriting a descriptor.
>Ok, get it now.
>
>> >
>> > This is confusing me a bit.
>> >
>> > My understanding is: 1. the internally kept wrap value only flipped
>> > when the first descriptor is overwritten
>> >
>> > 2. the moment the first descriptor is written the internal wrap value
>> > is flipped 0->1 or 1->0 and this value is written to every descriptor
>> > DESC_WRAP until we reach the first descriptor again
>That's right, it's also my take.
>DESC_WRAP is only used by driver, device does nothing with that flag.
>
>>
>> Yes this is what I tried to say. Can you suggest a better wording then?

I'll give it a try.

>The term of DESC_WRAP is fine to me.

Couldn't think of a better name either. 

>> > > After writing down this explanation, I think the names aren't great.
>> > >
>> > > Let me try an alternative explanation.
>> > >
>> > > ---------------
>> > > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
>> > > It has 4 possible values:
>> > > values 0x1, 0x11 are written by driver values 0x0, 0x10 are written
>> > > by device
>> >
>> > The 0x prefix might add to the confusion here. It is really just two
>> > bits, no?
>>
>> Ouch. Yes I meant 0b. Thanks!
>0b00, 0b10 are written by device?
>I suppose device can only clear high bit, can keep low bit no change.
>Then the value written by device can be either 0b01 or 0b00, but 0b10 means to set high bit, no?
>
>>
>> > > each time driver writes out a descriptor, it must make sure that the
>> > > high bit in OWNER changes.
>> > >
>> > > each time device writes out a descriptor, it must make sure that the
>> > > high bit in OWNER does not change.
>Typo here? It should be "..., it must make sure that the low bit in OWNER does not change."?
>For high bit in OWNER, each time devices writes out a descriptor, it must make sure to clear high bit in OWNER.
>
>> > >
>> > > this is exactly the same functionally, DRIVER is high bit and WRAP
>> > > is the low bit.  Does this make things clearer?
>> >
>> > So far it makes sense to me.
>It sounds good.

So I implemented two ideas in the DPDK prototype code. The code is
very rough and simple. I'll describe again how I understood the ideas.

1. The one discussed in this thread: Adding two flags DESC_DRIVER and DESC_WRAP. 

Driver code: keeps an internal wrap value. Every time we cross the ring
boundary at descriptor 0 the wrap value is flipped. For all descriptors
written from this point on we write the new wrap value into DESC_WRAP,
i.e. we set the flag if it wasn't set on the previous pass, or clear it
if it was. Driver code looks at the DESC_DRIVER flag to check if the
descriptor is available for it to use.

Device code: when dequeuing descriptors from the ring it keeps going
until it sees a descriptor whose DESC_WRAP bit differs from the value it
expects for the current pass. Device only checks this bit but doesn't
change it. Device clears DESC_DRIVER when done with a descriptor to
signal to the driver that the descriptor can be reused.

https://github.com/jensfr/dpdk/tree/add-driver-and-wrap-flag
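
To make the first idea concrete, here is a minimal single-threaded C sketch of the flag handling as I understand it. It is not taken from the branch above: the bit positions, the ring size and all names (wrap_ring, driver_add, device_use) are made up for illustration, and a real implementation would need memory barriers around the flag accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE   4u     /* illustrative ring size */
#define DESC_DRIVER 0x1u   /* bit position is an assumption */
#define DESC_WRAP   0x2u   /* bit position is an assumption */

struct wrap_desc {
    uint16_t flags;        /* address, length and ID fields omitted */
};

struct wrap_ring {
    struct wrap_desc desc[RING_SIZE];
    unsigned drv_next;     /* next slot the driver writes */
    bool drv_wrap;         /* driver's internal wrap value */
    unsigned dev_next;     /* next slot the device polls */
    bool dev_wrap;         /* DESC_WRAP value the device expects */
};

void ring_init(struct wrap_ring *r)
{
    memset(r, 0, sizeof(*r));
    r->drv_wrap = true;    /* start value 1, as suggested in the thread */
    r->dev_wrap = true;
}

/* Driver: make one descriptor available; false if the slot is still in use. */
bool driver_add(struct wrap_ring *r)
{
    assert(r->drv_next < RING_SIZE);
    struct wrap_desc *d = &r->desc[r->drv_next];

    if (d->flags & DESC_DRIVER)
        return false;      /* device has not cleared DESC_DRIVER yet */
    d->flags = (uint16_t)(DESC_DRIVER | (r->drv_wrap ? DESC_WRAP : 0u));
    if (++r->drv_next == RING_SIZE) {
        r->drv_next = 0;
        r->drv_wrap = !r->drv_wrap;   /* flip when crossing the ring boundary */
    }
    return true;
}

/* Device: use one descriptor; false if the next slot holds a stale DESC_WRAP. */
bool device_use(struct wrap_ring *r)
{
    assert(r->dev_next < RING_SIZE);
    struct wrap_desc *d = &r->desc[r->dev_next];

    if (((d->flags & DESC_WRAP) != 0) != r->dev_wrap)
        return false;      /* not written by the driver on this pass */
    d->flags = (uint16_t)(d->flags & ~DESC_DRIVER);  /* DESC_WRAP left intact */
    if (++r->dev_next == RING_SIZE) {
        r->dev_next = 0;
        r->dev_wrap = !r->dev_wrap;
    }
    return true;
}
```

In this sketch the device never has to write DESC_WRAP: after a full pass the old descriptors still carry the previous wrap value, so the device stops at the first slot the driver has not rewritten.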

2. Driver writes ID of last written descriptor into index field of first
descriptor and turns on a flag DESC_SKIP in the descriptor. (This idea is from Michael)

Driver code: Let's say the driver adds 32 descriptors to the ring at once.
It fills in ring positions 0 to 31 and writes 31 to the index field of
descriptor 0.

Device code: When dequeuing descriptors from the ring it looks into
the first descriptor. For example, it looks at desc[0] and finds that the
index field is 31 instead of 0 and that the flag DESC_SKIP is set. The
device can then expect all descriptors in the ring from 0 to 31 to be ready
for it to dequeue and doesn't have to check the DESC_HW flag on each of them.

https://github.com/jensfr/dpdk/tree/desc-hw-index-hint
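
The second idea can be sketched in the same way. Again this is only an illustration under my own assumptions and not the code from the branch: the descriptor layout, the bit positions and the function names are hypothetical, wrap-around is ignored, and the write barrier a real driver needs before publishing the first descriptor is only mentioned in a comment.

```c
#include <assert.h>
#include <stdint.h>

#define SKIP_RING_SIZE 64u   /* illustrative ring size */
#define DESC_HW        0x1u  /* bit position is an assumption */
#define DESC_SKIP      0x2u  /* bit position is an assumption */

struct skip_desc {
    uint16_t index;
    uint16_t flags;
};

/* Driver: publish n descriptors starting at slot `first` (no wrap-around). */
void driver_add_batch(struct skip_desc *ring, uint16_t first, uint16_t n)
{
    uint16_t last = (uint16_t)(first + n - 1);
    uint16_t i;

    assert(n >= 1 && (unsigned)first + n <= SKIP_RING_SIZE);
    for (i = (uint16_t)(first + 1); i <= last; i++) {
        ring[i].index = i;
        ring[i].flags = DESC_HW;
    }
    /*
     * The first descriptor is written last because it publishes the whole
     * batch; a real driver would need a write barrier before these stores.
     */
    ring[first].index = last;
    ring[first].flags = DESC_HW | DESC_SKIP;
}

/* Device: how many descriptors starting at `pos` are ready to dequeue. */
uint16_t device_ready_count(const struct skip_desc *ring, uint16_t pos)
{
    if (!(ring[pos].flags & DESC_HW))
        return 0;
    if (ring[pos].flags & DESC_SKIP)
        return (uint16_t)(ring[pos].index - pos + 1);
    return 1;
}
```

With a batch of 32 starting at slot 0, the device reads desc[0] once and learns that slots 0 to 31 are ready, without touching the flags of the other 31 descriptors.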


I tried implementing both ideas on top of the DPDK prototype code
(without the DESC_WB code) and ran a quick test with two testpmd
instances, one with a vhost-user interface and the other one with a
virtio-user device. 

From a performance point of view I saw no difference between the two
implementations.

I understand that DESC_DRIVER/DESC_WRAP would be better for virtio
hardware implementations?

regards,
Jens 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: packed ring layout proposal v3
  2017-10-08  6:16 ` [virtio-dev] " Ilya Lesokhin
@ 2017-10-25 16:20     ` Michael S. Tsirkin
  0 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-10-25 16:20 UTC (permalink / raw)
  To: Ilya Lesokhin; +Cc: virtio-dev, virtualization

On Sun, Oct 08, 2017 at 06:16:44AM +0000, Ilya Lesokhin wrote:
> > > -----Original Message-----
> > > From: virtualization-bounces@lists.linux-foundation.org
> > > [mailto:virtualization-bounces@lists.linux-foundation.org] On Behalf
> > > Of Michael S. Tsirkin
> > >
> > > This is an update from v2 version.
> >> ...
> > > When driver descriptors are chained in this way, multiple descriptors
> > > are treated as a part of a single transaction containing an optional
> > > write buffer followed by an optional read buffer.
> > > All descriptors in the chain must have the same ID.
> > >
> 
> I apologize for the repost, I didn't realize I have to be a member of the 
> virtio-dev mailing list.
> 
> I'm concerned about the "same ID" requirement in chained descriptors.

It's there really just so we can remove the doubt about which
descriptor's ID should be used. My testing does not show
a performance win from this, so I'm fine with removing this
requirement, though I'd be curious to know why it is a problem.

> Assuming out of order execution, how is the driver supposed to re-assign
> unique IDs to the previously chained descriptor?

For example, driver can have a simple allocator for the IDs.


> Is the driver expected to copy original IDs somewhere else before the
> chaining and then restore the IDs after the chain is executed?
>  
> Thanks,
> Ilya

As device overwrites the ID, driver will have to write it out
each time, that's true. It's going to be a requirement even if
descriptors on the chain do not need to have the same ID.
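
As a rough idea of what a simple allocator for the IDs could look like, here is a free-list sketch in C. It is only an illustration: the pool size and the names (id_pool, id_alloc, id_free) are hypothetical and nothing in the proposal mandates this structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_IDS 8u   /* illustrative; would normally match the ring size */

struct id_pool {
    uint16_t free_ids[NUM_IDS];  /* stack of currently unused IDs */
    unsigned top;                /* number of entries on the stack */
};

void id_pool_init(struct id_pool *p)
{
    unsigned i;

    for (i = 0; i < NUM_IDS; i++)
        p->free_ids[i] = (uint16_t)i;
    p->top = NUM_IDS;
}

/* Take an unused ID for a descriptor (or chain) about to be written. */
bool id_alloc(struct id_pool *p, uint16_t *id)
{
    if (p->top == 0)
        return false;            /* every ID is in flight */
    *id = p->free_ids[--p->top];
    return true;
}

/* Return an ID once the device has reported it as used, in any order. */
void id_free(struct id_pool *p, uint16_t id)
{
    assert(p->top < NUM_IDS);
    p->free_ids[p->top++] = id;
}
```

Both operations are constant time, and because IDs are returned in completion order it does not matter whether the device uses buffers in order or out of order.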

-- 
MST

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: packed ring layout proposal v3
  2017-10-25 16:20     ` [virtio-dev] " Michael S. Tsirkin
  (?)
@ 2017-10-29  9:05     ` Ilya Lesokhin
  -1 siblings, 0 replies; 92+ messages in thread
From: Ilya Lesokhin @ 2017-10-29  9:05 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

My point was that if the driver is not required to change the IDs,
it can initialize the IDs in all the descriptors when the ring is created
and never write the ID field again.

A simple allocator for the IDs can solve the problem I presented, but it is still more
expensive than not doing ID allocation at all.


> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@redhat.com]
> Sent: Wednesday, October 25, 2017 7:20 PM
> To: Ilya Lesokhin <ilyal@mellanox.com>
> Cc: virtio-dev@lists.oasis-open.org; virtualization@lists.linux-foundation.org
> Subject: Re: packed ring layout proposal v3
> 
> On Sun, Oct 08, 2017 at 06:16:44AM +0000, Ilya Lesokhin wrote:
> > > > -----Original Message-----
> > > > From: virtualization-bounces@lists.linux-foundation.org
> > > > [mailto:virtualization-bounces@lists.linux-foundation.org] On
> > > > Behalf Of Michael S. Tsirkin
> > > >
> > > > This is an update from v2 version.
> > >> ...
> > > > When driver descriptors are chained in this way, multiple
> > > > descriptors are treated as a part of a single transaction
> > > > containing an optional write buffer followed by an optional read buffer.
> > > > All descriptors in the chain must have the same ID.
> > > >
> >
> > I apologize for the repost, I didn't realize I have to be a member of
> > the virtio-dev mailing list.
> >
> > I'm concerned about the "same ID" requirement in chained descriptors.
> 
> It's there really just so we can remove the doubt about which descriptor's ID
> should be used. My testing does not show a performance win from this, so I'm
> fine with removing this requirement though I'd be curious to know why is it a
> problem.
> 
> > Assuming out of order execution, how is the driver supposed to
> > re-assign unique IDs to the previously chained descriptor?
> 
> For example, driver can have a simple allocator for the IDs.
> 
> 
> > Is the driver expected to copy original IDs somewhere else before the
> > chaining and then restore the IDs after the chain is executed?
> >
> > Thanks,
> > Ilya
> 
> As device overwrites the ID, driver will have to write it out each time, that's true.
> It's going to be a requirement even if descriptors on the chain do not need to
> have the same ID.
> 
> --
> MST

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: packed ring layout proposal v3
  2017-10-29  9:05     ` [virtio-dev] " Ilya Lesokhin
@ 2017-10-29 14:21         ` Michael S. Tsirkin
  0 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-10-29 14:21 UTC (permalink / raw)
  To: Ilya Lesokhin; +Cc: virtio-dev, virtualization

If you do this, what's the point of the ID? Just use the descriptor offset
like virtio 0 did.

On Sun, Oct 29, 2017 at 09:05:01AM +0000, Ilya Lesokhin wrote:
> My point was that if the driver is not required to change the IDs,
> it can initialized the ID's in all the descriptors when the ring is created
> and never write the ID field again.
> 
> A simple allocator for the IDs can solve the problem I presented but it is still more
> expensive then not doing ID allocation at all.
> 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Wednesday, October 25, 2017 7:20 PM
> > To: Ilya Lesokhin <ilyal@mellanox.com>
> > Cc: virtio-dev@lists.oasis-open.org; virtualization@lists.linux-foundation.org
> > Subject: Re: packed ring layout proposal v3
> > 
> > On Sun, Oct 08, 2017 at 06:16:44AM +0000, Ilya Lesokhin wrote:
> > > > > -----Original Message-----
> > > > > From: virtualization-bounces@lists.linux-foundation.org
> > > > > [mailto:virtualization-bounces@lists.linux-foundation.org] On
> > > > > Behalf Of Michael S. Tsirkin
> > > > >
> > > > > This is an update from v2 version.
> > > >> ...
> > > > > When driver descriptors are chained in this way, multiple
> > > > > descriptors are treated as a part of a single transaction
> > > > > containing an optional write buffer followed by an optional read buffer.
> > > > > All descriptors in the chain must have the same ID.
> > > > >
> > >
> > > I apologize for the repost, I didn't realize I have to be a member of
> > > the virtio-dev mailing list.
> > >
> > > I'm concerned about the "same ID" requirement in chained descriptors.
> > 
> > It's there really just so we can remove the doubt about which descriptor's ID
> > should be used. My testing does not show a performance win from this, so I'm
> > fine with removing this requirement though I'd be curious to know why is it a
> > problem.
> > 
> > > Assuming out of order execution, how is the driver supposed to
> > > re-assign unique IDs to the previously chained descriptor?
> > 
> > For example, driver can have a simple allocator for the IDs.
> > 
> > 
> > > Is the driver expected to copy original IDs somewhere else before the
> > > chaining and then restore the IDs after the chain is executed?
> > >
> > > Thanks,
> > > Ilya
> > 
> > As device overwrites the ID, driver will have to write it out each time, that's true.
> > It's going to be a requirement even if descriptors on the chain do not need to
> > have the same ID.
> > 
> > --
> > MST

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: packed ring layout proposal v3
  2017-10-29 14:21         ` [virtio-dev] " Michael S. Tsirkin
  (?)
@ 2017-10-29 14:34         ` Ilya Lesokhin
  -1 siblings, 0 replies; 92+ messages in thread
From: Ilya Lesokhin @ 2017-10-29 14:34 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@redhat.com]
> Sent: Sunday, October 29, 2017 4:22 PM
> To: Ilya Lesokhin <ilyal@mellanox.com>
> Cc: virtio-dev@lists.oasis-open.org; virtualization@lists.linux-foundation.org
> Subject: Re: packed ring layout proposal v3
> 
> If you do this whats the point of the id? Just use descriptor offset like virtio 0 did.
> 

I agree that ID is pointless when requests are completed in order.

But I'm not sure what you mean by descriptor offset?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: packed ring layout proposal v3
  2017-10-29 14:34         ` [virtio-dev] " Ilya Lesokhin
@ 2017-10-30  2:08             ` Michael S. Tsirkin
  0 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-10-30  2:08 UTC (permalink / raw)
  To: Ilya Lesokhin; +Cc: virtio-dev, virtualization

On Sun, Oct 29, 2017 at 02:34:56PM +0000, Ilya Lesokhin wrote:
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Sunday, October 29, 2017 4:22 PM
> > To: Ilya Lesokhin <ilyal@mellanox.com>
> > Cc: virtio-dev@lists.oasis-open.org; virtualization@lists.linux-foundation.org
> > Subject: Re: packed ring layout proposal v3
> > 
> > If you do this whats the point of the id? Just use descriptor offset like virtio 0 did.
> > 
> 
> I agree that ID is pointless when requests are completed in order.
> 
> But I'm not sure what you mean by descriptor offset?

Where the descriptor is within the ring.

-- 
MST

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: packed ring layout proposal v3
  2017-10-30  2:08             ` [virtio-dev] " Michael S. Tsirkin
  (?)
  (?)
@ 2017-10-30  6:30             ` Ilya Lesokhin
  -1 siblings, 0 replies; 92+ messages in thread
From: Ilya Lesokhin @ 2017-10-30  6:30 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@redhat.com]
> Sent: Monday, October 30, 2017 4:09 AM
> To: Ilya Lesokhin <ilyal@mellanox.com>
> Cc: virtio-dev@lists.oasis-open.org; virtualization@lists.linux-foundation.org
> Subject: Re: packed ring layout proposal v3
> 
> On Sun, Oct 29, 2017 at 02:34:56PM +0000, Ilya Lesokhin wrote:
> > > -----Original Message-----
> > > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > > Sent: Sunday, October 29, 2017 4:22 PM
> > > To: Ilya Lesokhin <ilyal@mellanox.com>
> > > Cc: virtio-dev@lists.oasis-open.org; virtualization@lists.linux-foundation.org
> > > Subject: Re: packed ring layout proposal v3
> > >
> > > If you do this whats the point of the id? Just use descriptor offset like virtio 0
> did.
> > >
> >
> > I agree that ID is pointless when requests are completed in order.
> >
> > But I'm not sure what you mean by descriptor offset?
> 
> Where the descriptor is within the ring.
> 

Using the descriptor offset as in virtio 0 won't work.
In virtio 0, there was no reordering in the descriptor ring, so the offset was always unique.
In the new spec, if the descriptor at offset 2 completes before the descriptor at offset 1,
it can be written back at offset 1, but reusing offset 1 is not yet safe.
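
To make the problem concrete, here is a minimal sketch. It is a toy Python model, not the proposed layout: the helper names `writeback_slots` and `offsets_identify_buffers` and the completion orders are made up for illustration. It assumes the device writes used descriptors back at consecutive ring offsets in completion order, so once completions are reordered the offset of a used entry no longer says which buffer it was.

```python
def writeback_slots(completion_order):
    """Map each write-back offset to the offset the buffer was made available at.

    Toy model (an assumption of this sketch): the driver made buffers
    available at offsets 0..n-1, and the device writes used entries at
    consecutive offsets in the order the buffers complete.
    """
    return dict(enumerate(completion_order))


def offsets_identify_buffers(completion_order):
    """True if every used entry lands at the offset its buffer was posted at."""
    slots = writeback_slots(completion_order)
    return all(slot == orig for slot, orig in slots.items())
```

With completion order `[0, 2, 1]`, the entry written back at offset 1 belongs to the buffer posted at offset 2, while the buffer posted at offset 1 is still in flight. An explicit id in the used entry is what lets the driver tell the two apart.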

Also, please ignore my earlier comment about in-order completion;
it invalidates the entire discussion.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: packed ring layout proposal v3
  2017-10-30  6:30             ` [virtio-dev] " Ilya Lesokhin
@ 2017-10-30 16:30                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 92+ messages in thread
From: Michael S. Tsirkin @ 2017-10-30 16:30 UTC (permalink / raw)
  To: Ilya Lesokhin; +Cc: virtio-dev, virtualization

On Mon, Oct 30, 2017 at 06:30:56AM +0000, Ilya Lesokhin wrote:
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Monday, October 30, 2017 4:09 AM
> > To: Ilya Lesokhin <ilyal@mellanox.com>
> > Cc: virtio-dev@lists.oasis-open.org; virtualization@lists.linux-foundation.org
> > Subject: Re: packed ring layout proposal v3
> > 
> > On Sun, Oct 29, 2017 at 02:34:56PM +0000, Ilya Lesokhin wrote:
> > > > -----Original Message-----
> > > > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > > > Sent: Sunday, October 29, 2017 4:22 PM
> > > > To: Ilya Lesokhin <ilyal@mellanox.com>
> > > > Cc: virtio-dev@lists.oasis-open.org; virtualization@lists.linux-foundation.org
> > > > Subject: Re: packed ring layout proposal v3
> > > >
> > > > If you do this whats the point of the id? Just use descriptor offset like virtio 0 did.
> > > >
> > >
> > > I agree that ID is pointless when requests are completed in order.
> > >
> > > But I'm not sure what you mean by descriptor offset?
> > 
> > Where the descriptor is within the ring.
> > 
> 
> Using descriptor offset like virtio 0, won't work.
> In virtio 0, there was no reordering in the descriptor ring, so the offset was always unique.
> In the new spec, if descriptor in offset 2 completes before the descriptor in offset 1,
> It can be put in offset 1, but reusing offset 1 is not yet safe.
> 
> Also, please ignore my earlier comment about in-order completion,
> It invalidates the entire discussion.
> 
> 

Yes, using offsets only works if all descriptors are used and written back in
order. If they are, descriptor ID isn't necessary. If they aren't,
it's necessary.

-- 
MST

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [virtio-dev] packed ring layout proposal v3
@ 2017-10-20  9:50 Lars Ganrot
  0 siblings, 0 replies; 92+ messages in thread
From: Lars Ganrot @ 2017-10-20  9:50 UTC (permalink / raw)
  To: virtio-dev

Hi Michael,

I'm trying to understand your Sep 28 13:32 example (text copied below). I've in-lined my questions [#lga#].

>
> Let's assume device promised to consume packets in order
>
> ring size = 2
>
> Ring is 0 initialized.
>
> Device initially polls DESC[0].flags for WRAP bit to change.
>
> driver adds:
>
> DESC[0].addr = 1234
> DESC[0].id = 0
> DESC[0].flags = DESC_DRIVER | DESC_MORE | DESC_WRAP
>
> and
>
> DESC[0].addr = 5678

[#lga#] Is index 0 above a typo (makes more sense if it is 1)?

> DESC[1].id = 1
> DESC[1].flags = DESC_DRIVER | DESC_WRAP
>
> it now starts polling DESC[0] flags.
>
> Device reads 1234, executes it, does not use it.
>
> Device reads 5678, executes it, and uses it:
>
> DESC[0].id = 1
> DESC[0].flags = 0

[#lga#] Should above line be: "DESC[0].flags =  DESC[1].flags  & DESC_WRAP"?

>
> Device now polls DESC[0].flags for WRAP bit to change.
>
> Now driver sees that DRIVER bit has been cleared, so it knows that id
> is valid. It sees id 1, therefore id 0 and 1 have been read and are safe to overwrite.

[#lga#] At this point, has the device returned both buffers *(1234) and *(5678) to the driver?
[#lga#] If so, how would out-of-order completion avoid head of line blocking?
[#lga#] E.g. the device is done with id=1 and *(5678), but not id=0 and *(1234).

>
> So it writes it out. It wrapped around to beginning of ring, so it
> flips the WRAP bit to 0 on all descriptors now:
>
> DESC[0].addr = 9ABC
> DESC[0].id = 0
> DESC[0].flags = DESC_DRIVER | DESC_MORE
>
> DESC[0].addr = DEF0
> DESC[0].id = 1
> DESC[0].flags = DESC_DRIVER

[#lga#] Index typo in all 3 lines above? DESC[1] makes more sense.

>
> Next round wrap will be 1 again.
>
> To summarise:
>
> DRIVER bit is used by driver to detect device has used one or more
> descriptors.  WRAP is used by device to detect driver has made a
> new descriptor available.
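
The two checks in the summary above can be sketched as follows. This is a hedged Python model only: the bit values are hypothetical (the example names the flags but not their encoding), and having the device test DRIVER as well as WRAP is an assumption of this sketch, one possible reading of the example rather than something settled in the thread.

```python
# Hypothetical bit values; only the flag names come from the example.
DESC_DRIVER = 1 << 0  # set by the driver when it hands a descriptor to the device
DESC_MORE = 1 << 1    # more descriptors of the same batch follow
DESC_WRAP = 1 << 2    # driver's wrap value, flipped on every pass over the ring


def driver_flags(wrap, more=False):
    """Flags the driver writes when it makes a descriptor available."""
    flags = DESC_DRIVER
    if more:
        flags |= DESC_MORE
    if wrap:
        flags |= DESC_WRAP
    return flags


def device_sees_available(flags, expected_wrap):
    """Device side: the descriptor is new if DRIVER is set and WRAP matches this pass.

    Checking DRIVER in addition to WRAP is an assumption of this sketch.
    """
    return bool(flags & DESC_DRIVER) and bool(flags & DESC_WRAP) == bool(expected_wrap)


def driver_sees_used(flags):
    """Driver side: the device has written the descriptor back once DRIVER is clear."""
    return not (flags & DESC_DRIVER)
```

Under these assumptions a zero-initialized descriptor is not seen as available on the first pass (expected WRAP of 1), and a used descriptor written back with flags = 0 is not mistaken for a new one on the second pass (expected WRAP of 0), because DRIVER is clear in both cases.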

Best Regards,

-Lars

^ permalink raw reply	[flat|nested] 92+ messages in thread

end of thread, other threads:[~2017-10-30 16:31 UTC | newest]

Thread overview: 92+ messages
2017-09-10  5:06 [virtio-dev] packed ring layout proposal v3 Michael S. Tsirkin
2017-02-08 13:37 ` packed ring layout proposal v2 Christian Borntraeger
2017-02-09 17:43   ` Michael S. Tsirkin
     [not found]   ` <20170209181955-mutt-send-email-mst@kernel.org>
2017-02-09 18:27     ` Christian Borntraeger
2017-02-08 17:41 ` [virtio-dev] " Paolo Bonzini
2017-02-08 19:59   ` Michael S. Tsirkin
     [not found]   ` <20170208214435-mutt-send-email-mst@kernel.org>
2017-02-09 15:48     ` Paolo Bonzini
2017-02-09 16:11       ` Cornelia Huck
2017-02-09 18:24       ` Michael S. Tsirkin
     [not found]       ` <20170209202203-mutt-send-email-mst@kernel.org>
2017-02-10 11:32         ` Paolo Bonzini
     [not found]         ` <c229269b-1702-ffec-62e8-002c7c142904@redhat.com>
2017-02-10 15:20           ` Michael S. Tsirkin
2017-02-10 16:17             ` Paolo Bonzini
     [not found]       ` <20170209171105.075a9d9c.cornelia.huck@de.ibm.com>
2017-02-22 16:43         ` Michael S. Tsirkin
     [not found]         ` <20170222181333-mutt-send-email-mst@kernel.org>
2017-03-07 15:53           ` Cornelia Huck
2017-03-07 20:33             ` Michael S. Tsirkin
     [not found]             ` <20170307223057-mutt-send-email-mst@kernel.org>
2017-07-10 16:27               ` Amnon Ilan
2017-07-10 16:27               ` Amnon Ilan
2017-02-22  4:27 ` packed ring layout proposal - todo list Michael S. Tsirkin
     [not found] ` <20170222054336-mutt-send-email-mst@kernel.org>
2017-02-22  9:19   ` [virtio-dev] " Gray, Mark D
     [not found]   ` <738D45BC1F695740A983F43CFE1B7EA94E93CA7E@IRSMSX108.ger.corp.intel.com>
2017-02-22 15:13     ` Michael S. Tsirkin
2017-02-28  4:29   ` Yuanhan Liu
     [not found]   ` <20170228042943.GH18844@yliu-dev.sh.intel.com>
2017-03-01  1:07     ` Michael S. Tsirkin
2017-03-08  7:09       ` Yuanhan Liu
     [not found]       ` <20170308070948.GC18844@yliu-dev.sh.intel.com>
2017-03-08  7:56         ` Yuanhan Liu
     [not found]         ` <20170308075624.GF18844@yliu-dev.sh.intel.com>
2017-03-29 12:39           ` Michael S. Tsirkin
2017-04-01  7:30             ` Yuanhan Liu
2017-02-22 14:46 ` [virtio-dev] packed ring layout proposal v2 Chien, Roger S
2017-02-28  5:02 ` Yuanhan Liu
2017-02-28  5:47 ` [RFC] packed (virtio-net) headers Yuanhan Liu
     [not found] ` <20170228050218.GI18844@yliu-dev.sh.intel.com>
2017-03-01  1:02   ` [virtio-dev] packed ring layout proposal v2 Michael S. Tsirkin
     [not found]   ` <20170301024951-mutt-send-email-mst@kernel.org>
2017-03-01  3:57     ` Yuanhan Liu
     [not found]     ` <20170301035715.GP18844@yliu-dev.sh.intel.com>
2017-03-01  4:14       ` Michael S. Tsirkin
2017-03-01  4:57         ` Yuanhan Liu
     [not found] ` <20170228054719.GJ18844@yliu-dev.sh.intel.com>
2017-03-01  1:28   ` [RFC] packed (virtio-net) headers Michael S. Tsirkin
2017-07-16  6:00 ` [virtio-dev] packed ring layout proposal v2 Lior Narkis
2017-07-18 16:23   ` Michael S. Tsirkin
2017-07-18 16:23     ` Michael S. Tsirkin
2017-07-19  7:41     ` Lior Narkis
2017-07-20 13:06       ` Michael S. Tsirkin
2017-07-20 13:06         ` Michael S. Tsirkin
2017-07-19  7:41     ` Lior Narkis
2017-07-16  6:00 ` Lior Narkis
2017-09-11  7:47 ` [virtio-dev] Re: packed ring layout proposal v3 Jason Wang
2017-09-12 16:23   ` Willem de Bruijn
2017-09-13  1:26     ` Jason Wang
2017-09-13  1:26     ` Jason Wang
2017-09-12 16:23   ` Willem de Bruijn
2017-09-11  7:47 ` Jason Wang
2017-09-12 16:20 ` [virtio-dev] " Willem de Bruijn
2017-09-12 16:20 ` Willem de Bruijn
2017-09-14  8:23 ` Ilya Lesokhin
2017-09-20  9:11 ` [virtio-dev] " Liang, Cunming
2017-09-20  9:11   ` Liang, Cunming
2017-09-25 22:24   ` Michael S. Tsirkin
2017-09-25 22:24   ` Michael S. Tsirkin
2017-09-26 23:38     ` Steven Luong (sluong)
2017-09-27 23:49       ` Michael S. Tsirkin
2017-09-27 23:49         ` Michael S. Tsirkin
2017-09-28  9:44         ` Liang, Cunming
2017-10-01  4:08           ` Michael S. Tsirkin
2017-10-01  4:08             ` Michael S. Tsirkin
2017-10-04 12:39             ` Jens Freimann
2017-10-04 12:58               ` Michael S. Tsirkin
2017-10-04 12:58               ` Michael S. Tsirkin
2017-10-10  9:56                 ` Liang, Cunming
2017-10-10  9:56                 ` Liang, Cunming
2017-10-11 12:22                   ` Jens Freimann
2017-09-28  9:44         ` Liang, Cunming
2017-09-28 21:13         ` Michael S. Tsirkin
2017-09-28 21:13         ` Michael S. Tsirkin
2017-09-26 23:38     ` Steven Luong (sluong)
2017-09-21 13:36 ` Liang, Cunming
2017-09-21 13:36   ` Liang, Cunming
2017-09-28 21:27   ` Michael S. Tsirkin
2017-09-28 21:27   ` Michael S. Tsirkin
2017-10-08  6:16 ` Ilya Lesokhin
2017-10-08  6:16 ` [virtio-dev] " Ilya Lesokhin
2017-10-25 16:20   ` Michael S. Tsirkin
2017-10-25 16:20     ` [virtio-dev] " Michael S. Tsirkin
2017-10-29  9:05     ` Ilya Lesokhin
2017-10-29  9:05     ` [virtio-dev] " Ilya Lesokhin
2017-10-29 14:21       ` Michael S. Tsirkin
2017-10-29 14:21         ` [virtio-dev] " Michael S. Tsirkin
2017-10-29 14:34         ` Ilya Lesokhin
2017-10-29 14:34         ` [virtio-dev] " Ilya Lesokhin
2017-10-30  2:08           ` Michael S. Tsirkin
2017-10-30  2:08             ` [virtio-dev] " Michael S. Tsirkin
2017-10-30  6:30             ` [virtio-dev] " Ilya Lesokhin
2017-10-30 16:30               ` Michael S. Tsirkin
2017-10-30 16:30                 ` [virtio-dev] " Michael S. Tsirkin
2017-10-30  6:30             ` Ilya Lesokhin
2017-10-20  9:50 [virtio-dev] " Lars Ganrot
