* [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
@ 2019-02-01 14:23 David Riddoch
  2019-02-11  7:33 ` Michael S. Tsirkin
  0 siblings, 1 reply; 37+ messages in thread
From: David Riddoch @ 2019-02-01 14:23 UTC (permalink / raw)
  To: Virtio-Dev

All,

I'd like to propose a small extension to the packed virtqueue mode.  My 
proposal is to add an offset/wrap field, written by the driver, 
indicating how many available descriptors have been added to the ring.

The reason for wanting this is to improve performance of hardware 
devices.  Because of high read latency over a PCIe bus, it is important 
for hardware devices to read multiple ring entries in parallel.  It is 
desirable to know how many descriptors are available prior to issuing 
these reads, else you risk fetching descriptors that are not yet 
available.  As well as wasting bus bandwidth this adds complexity.

I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this 
problem, but we still have a problem.  If you rely on doorbells to tell 
you how many descriptors are available, then you have to keep doorbells 
enabled at all times.  This can result in a very high rate of doorbells 
with some drivers, which can become a severe bottleneck (because x86 
CPUs can't emit MMIOs at very high rates).
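
As a rough illustration, this is the sort of value a driver pushes in the
doorbell write when VIRTIO_F_NOTIFICATION_DATA is negotiated on a packed
queue (my reading of the v1.1 draft encoding, so treat the exact bit layout
as illustrative).  The point is that the fill level only reaches the device
if the MMIO write is actually issued:

#include <stdint.h>

/* Doorbell payload for a packed queue with VIRTIO_F_NOTIFICATION_DATA:
 * vq number in bits 0-15, next available offset in bits 16-30, wrap
 * counter in bit 31. */
static inline uint32_t packed_notify_data(uint16_t vqn, uint16_t next_off,
                                          uint16_t next_wrap)
{
    return (uint32_t)vqn |
           ((uint32_t)(next_off & 0x7fff) << 16) |
           ((uint32_t)(next_wrap & 1) << 31);
}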

The proposed offset/wrap field allows devices to disable doorbells when 
appropriate, and determine the latest fill level via a PCIe read.

I suggest the best place to put this would be in the driver area, 
immediately after the event suppression structure.
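
A rough sketch of the layout I have in mind (the field name and exact
encoding below are placeholders, not a final proposal):

#include <stdint.h>

typedef uint16_t le16;   /* stand-in for the spec's little-endian type */

/* Existing packed-ring driver event suppression structure (virtio 1.1). */
struct pvirtq_event_suppress {
    le16 desc;    /* descriptor event offset (bits 0-14) and wrap (bit 15) */
    le16 flags;   /* descriptor event flags */
};

/* Driver area with the new feature negotiated: the offset/wrap word
 * follows immediately, so a device can fetch suppression state and
 * fill level in one read. */
struct pvirtq_driver_area {
    struct pvirtq_event_suppress evt;
    le16 avail_idx;   /* placeholder name: offset in bits 0-14, wrap in bit 15 */
};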

Presumably we would like this to be an optional feature, as 
implementations of packed mode already exist in the wild.  How about 
VIRTIO_F_RING_PACKED_AVAIL_IDX?

If I prepare a patch to the spec is there still time to get this into v1.1?

Best,
David

-- 
David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare



* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-01 14:23 [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload David Riddoch
@ 2019-02-11  7:33 ` Michael S. Tsirkin
  2019-02-11  8:52   ` David Riddoch
  0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-11  7:33 UTC (permalink / raw)
  To: David Riddoch; +Cc: Virtio-Dev

On Fri, Feb 01, 2019 at 02:23:55PM +0000, David Riddoch wrote:
> All,
> 
> I'd like to propose a small extension to the packed virtqueue mode.  My
> proposal is to add an offset/wrap field, written by the driver, indicating
> how many available descriptors have been added to the ring.
> 
> The reason for wanting this is to improve performance of hardware devices. 
> Because of high read latency over a PCIe bus, it is important for hardware
> devices to read multiple ring entries in parallel.  It is desirable to know
> how many descriptors are available prior to issuing these reads, else you
> risk fetching descriptors that are not yet available.  As well as wasting
> bus bandwidth this adds complexity.
> 
> I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this
> problem,

Right. And this seems like the ideal solution latency-wise since
this pushes data out to device without need for round-trips
over PCIe.

> but we still have a problem.  If you rely on doorbells to tell you
> how many descriptors are available, then you have to keep doorbells enabled
> at all times.

I would say right now there are two modes and device can transition
between them at will:

1. read each descriptor twice - once speculatively, once
   to get the actual data

   optimal for driver suboptimal for device

2. enable notification for each descriptor and rely on
   these notifications

   optimal for device suboptimal for driver



> This can result in a very high rate of doorbells with some
> drivers, which can become a severe bottleneck (because x86 CPUs can't emit
> MMIOs at very high rates).

Interesting. Is there any data you could share to help guide the design?
E.g. what's the highest rate of MMIO writes supported etc?

> The proposed offset/wrap field allows devices to disable doorbells when
> appropriate, and determine the latest fill level via a PCIe read.

This kind of looks like a split ring though, does it not?  The issue is
we will again need to bounce more cache lines to communicate.

So I wonder: what if we made a change to spec that would allow prefetch
of ring entries?  E.g. you would be able to read at random and if you
guessed right then you can just use what you have read, no need to
re-fetch?






> I suggest the best place to put this would be in the driver area,
> immediately after the event suppression structure.

Could you comment on why that is a good place though?

> Presumably we would like this to be an optional feature, as implementations
> of packed mode already exist in the wild.  How about
> VIRTIO_F_RING_PACKED_AVAIL_IDX?
> If I prepare a patch to the spec is there still time to get this into v1.1?

Any new feature would require another round of public review.
I personally think it's better to instead try to do 1.2 soon after,
e.g. we could try to target quarterly releases.

But ultimately that would be up to the TC vote.


> Best,
> David
> 
> -- 
> David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
> 
> 

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-11  7:33 ` Michael S. Tsirkin
@ 2019-02-11  8:52   ` David Riddoch
  2019-02-11 14:24     ` Michael S. Tsirkin
  2019-04-30 22:41     ` Michael S. Tsirkin
  0 siblings, 2 replies; 37+ messages in thread
From: David Riddoch @ 2019-02-11  8:52 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Virtio-Dev

On 11/02/2019 07:33, Michael S. Tsirkin wrote:
> On Fri, Feb 01, 2019 at 02:23:55PM +0000, David Riddoch wrote:
>> All,
>>
>> I'd like to propose a small extension to the packed virtqueue mode.  My
>> proposal is to add an offset/wrap field, written by the driver, indicating
>> how many available descriptors have been added to the ring.
>>
>> The reason for wanting this is to improve performance of hardware devices.
>> Because of high read latency over a PCIe bus, it is important for hardware
>> devices to read multiple ring entries in parallel.  It is desirable to know
>> how many descriptors are available prior to issuing these reads, else you
>> risk fetching descriptors that are not yet available.  As well as wasting
>> bus bandwidth this adds complexity.
>>
>> I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this
>> problem,
> Right. And this seems like the ideal solution latency-wise since
> this pushes data out to device without need for round-trips
> over PCIe.

Yes, and I'm not proposing getting rid of it.  I'd expect a PCIe device 
to use both features together.

>> but we still have a problem.  If you rely on doorbells to tell you
>> how many descriptors are available, then you have to keep doorbells enabled
>> at all times.
> I would say right now there are two modes and device can transition
> between them at will:
>
> 1. read each descriptor twice - once speculatively, once
>     to get the actual data
>
>     optimal for driver suboptimal for device

You might read each descriptor multiple times in some scenarios. Reading 
descriptors in batches is hugely important given the latency and 
overheads of PCIe (and lack of adjacent data fetching that caching gives 
you).

> 2. enable notification for each descriptor and rely on
>     these notifications
>
>     optimal for device suboptimal for driver
>
>> This can result in a very high rate of doorbells with some
>> drivers, which can become a severe bottleneck (because x86 CPUs can't emit
>> MMIOs at very high rates).
> Interesting. Is there any data you could share to help guide the design?
> E.g. what's the highest rate of MMIO writes supported etc?

On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.

On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On 
PCIe-remote socket ~8M/s.

This doesn't just impose a limit on aggregate packet rate: If you hit 
this bottleneck then the CPU core is back-pressured by the MMIO writes, 
and so instr/cycle takes a huge hit.

>> The proposed offset/wrap field allows devices to disable doorbells when
>> appropriate, and determine the latest fill level via a PCIe read.
> This kind of looks like a split ring though, does it not?

I don't agree, because the ring isn't split.  Split-ring is very painful 
for hardware because of the potentially random accesses to the 
descriptor table (including pointer chasing) which inhibit batching of 
descriptor fetches, as well as causing complexity in implementation.

Packed ring is a huge improvement on both dimensions, but introduces a 
specific pain point for hardware offload.

> The issue is
> we will again need to bounce more cache lines to communicate.

You'll only bounce cache lines if the device chooses to read the idx.  A 
PV device need not offer this feature.  A PCIe device will, but the cost 
to the driver of writing an in-memory idx is much smaller than the cost 
of an MMIO, which is always a strongly ordered barrier on x86/64.

With vDPA you ideally would have this feature enabled, and the device 
would sometimes be PV and sometimes PCIe.  The PV device would ignore 
the new idx field and so cache lines would not bounce then either.

Ie. The only time cache lines are shared is when sharing with a PCIe 
device, which is the scenario when this is a win.

> So I wonder: what if we made a change to spec that would allow prefetch
> of ring entries?  E.g. you would be able to read at random and if you
> guessed right then you can just use what you have read, no need to
> re-fetch?

Unless I've misunderstood I think this would imply that the driver would 
have to ensure strict ordering for each descriptor it wrote, which would 
impose a cost to the driver.  At the moment a write barrier is only 
needed before writing the flags of the first descriptor in a batch.

>> I suggest the best place to put this would be in the driver area,
>> immediately after the event suppression structure.
> Could you comment on why is that a good place though?

The new field is written by the driver, as are the other fields in the 
driver area.  Also I expect that devices might be able to read this new 
idx together with the interrupt suppression fields in a single read, 
reducing PCIe overheads.

Placing the new field immediately after the descriptor ring would also 
work, but lose the benefit of combining reads, and potentially cause 
drivers to allocate a substantially bigger buffer (as I expect the 
descriptor ring is typically a power-of-2 size and drivers allocate 
multiples of page size).

Placing the new field in a separate data structure is undesirable, as it 
would require devices to store a further 64 bits per virt-queue.

Thanks,
David


>> Presumably we would like this to be an optional feature, as implementations
>> of packed mode already exist in the wild.  How about
>> VIRTIO_F_RING_PACKED_AVAIL_IDX?
>> If I prepare a patch to the spec is there still time to get this into v1.1?
> Any new feature would require another round of public review.
> I personally think it's better to instead try to do 1.2 soon after,
> e.g. we could try to target quarterly releases.
>
> But ultimately that would be up to the TC vote.


-- 
David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare



* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-11  8:52   ` David Riddoch
@ 2019-02-11 14:24     ` Michael S. Tsirkin
  2019-02-11 14:58       ` David Riddoch
  2019-04-30 22:41     ` Michael S. Tsirkin
  1 sibling, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-11 14:24 UTC (permalink / raw)
  To: David Riddoch; +Cc: Virtio-Dev

On Mon, Feb 11, 2019 at 08:52:01AM +0000, David Riddoch wrote:
> On 11/02/2019 07:33, Michael S. Tsirkin wrote:
> > On Fri, Feb 01, 2019 at 02:23:55PM +0000, David Riddoch wrote:
> > > All,
> > > 
> > > I'd like to propose a small extension to the packed virtqueue mode.  My
> > > proposal is to add an offset/wrap field, written by the driver, indicating
> > > how many available descriptors have been added to the ring.
> > > 
> > > The reason for wanting this is to improve performance of hardware devices.
> > > Because of high read latency over a PCIe bus, it is important for hardware
> > > devices to read multiple ring entries in parallel.  It is desirable to know
> > > how many descriptors are available prior to issuing these reads, else you
> > > risk fetching descriptors that are not yet available.  As well as wasting
> > > bus bandwidth this adds complexity.
> > > 
> > > I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this
> > > problem,
> > Right. And this seems like the ideal solution latency-wise since
> > this pushes data out to device without need for round-trips
> > over PCIe.
> 
> Yes, and I'm not proposing getting rid of it.  I'd expect a PCIe device to
> use both features together.
> 
> > > but we still have a problem.  If you rely on doorbells to tell you
> > > how many descriptors are available, then you have to keep doorbells enabled
> > > at all times.
> > I would say right now there are two modes and device can transition
> > between them at will:
> > 
> > 1. read each descriptor twice - once speculatively, once
> >     to get the actual data
> > 
> >     optimal for driver suboptimal for device
> 
> You might read each descriptor multiple times in some scenarios. Reading
> descriptors in batches is hugely important given the latency and overheads
> of PCIe (and lack of adjacent data fetching that caching gives you).
> 
> > 2. enable notification for each descriptor and rely on
> >     these notifications
> > 
> >     optimal for device suboptimal for driver
> > 
> > > This can result in a very high rate of doorbells with some
> > > drivers, which can become a severe bottleneck (because x86 CPUs can't emit
> > > MMIOs at very high rates).
> > Interesting. Is there any data you could share to help guide the design?
> > E.g. what's the highest rate of MMIO writes supported etc?
> 
> On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
> 
> On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
> PCIe-remote socket ~8M/s.

Is that mega bytes? Or million writes?
With 32 bit writes are we limited to 2M then?

> This doesn't just impose a limit on aggregate packet rate: If you hit this
> bottleneck then the CPU core is back-pressured by the MMIO writes, and so
> instr/cycle takes a huge hit.
>
> > > The proposed offset/wrap field allows devices to disable doorbells when
> > > appropriate, and determine the latest fill level via a PCIe read.
> > This kind of looks like a split ring though, does it not?
> 
> I don't agree, because the ring isn't split.  Split-ring is very painful for
> hardware because of the potentially random accesses to the descriptor table
> (including pointer chasing) which inhibit batching of descriptor fetches, as
> well as causing complexity in implementation.

That's different with IN_ORDER, right?


> Packed ring is a huge improvement on both dimensions, but introduces a
> specific pain point for hardware offload.
> 
> > The issue is
> > we will again need to bounce more cache lines to communicate.
> 
> You'll only bounce cache lines if the device chooses to read the idx.  A PV
> device need not offer this feature.  A PCIe device will, but the cost to the
> driver of writing an in-memory idx is much smaller than the cost of an MMIO,
> which is always a strongly ordered barrier on x86/64.
> 
> With vDPA you ideally would have this feature enabled, and the device would
> sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
> field and so cache lines would not bounce then either.
> 
> Ie. The only time cache lines are shared is when sharing with a PCIe device,
> which is the scenario when this is a win.

True. OTOH on non-x86 you will need more write barriers :( It would be
natural to say the driver can reuse the barrier before the flags update, but
note that that would mean hardware still needs to support re-fetching if the
flags value is wrong. Such a re-fetch is probably rare, so that's fine by me,
but it does add a bit more complexity.


> > So I wonder: what if we made a change to spec that would allow prefetch
> > of ring entries?  E.g. you would be able to read at random and if you
> > guessed right then you can just use what you have read, no need to
> > re-fetch?
> 
> Unless I've misunderstood I think this would imply that the driver would
> have to ensure strict ordering for each descriptor it wrote, which would
> impose a cost to the driver.  At the moment a write barrier is only needed
> before writing the flags of the first descriptor in a batch.

On non-x86 right?  OTOH the extra data structure also adds more ordering
requirements.


> > > I suggest the best place to put this would be in the driver area,
> > > immediately after the event suppression structure.
> > Could you comment on why is that a good place though?
> 
> The new field is written by the driver, as are the other fields in the
> driver area.  Also I expect that devices might be able to read this new idx
> together with the interrupt suppression fields in a single read, reducing
> PCIe overheads.

OK, then you need it within the same 256-byte aligned block.


> Placing the new field immediately after the descriptor ring would also work,
> but lose the benefit of combining reads, and potentially cause drivers to
> allocate a substantially bigger buffer (as I expect the descriptor ring is
> typically a power-of-2 size and drivers allocate multiples of page size).
> 
> Placing the new field in a separate data structure is undesirable, as it
> would require devices to store a further 64bits per virt-queue.
> 
> Thanks,
> David
> 
> 
> > > Presumably we would like this to be an optional feature, as implementations
> > > of packed mode already exist in the wild.  How about
> > > VIRTIO_F_RING_PACKED_AVAIL_IDX?
> > > If I prepare a patch to the spec is there still time to get this into v1.1?
> > Any new feature would require another round of public review.
> > I personally think it's better to instead try to do 1.2 soon after,
> > e.g. we could try to target quarterly releases.
> > 
> > But ultimately that would be up to the TC vote.
> 
> 
> -- 
> David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
> 
> 

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-11 14:24     ` Michael S. Tsirkin
@ 2019-02-11 14:58       ` David Riddoch
  2019-02-12  5:08         ` Michael S. Tsirkin
  0 siblings, 1 reply; 37+ messages in thread
From: David Riddoch @ 2019-02-11 14:58 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Virtio-Dev

>>>> This can result in a very high rate of doorbells with some
>>>> drivers, which can become a severe bottleneck (because x86 CPUs can't emit
>>>> MMIOs at very high rates).
>>> Interesting. Is there any data you could share to help guide the design?
>>> E.g. what's the highest rate of MMIO writes supported etc?
>> On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
>>
>> On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
>> PCIe-remote socket ~8M/s.
> Is that mega bytes? Or million writes?
> With 32 bit writes are we limited to 2M then?

Sorry, this is million-writes-per-second.  The size of the write doesn't 
matter.

>> This doesn't just impose a limit on aggregate packet rate: If you hit this
>> bottleneck then the CPU core is back-pressured by the MMIO writes, and so
>> instr/cycle takes a huge hit.
>>
>>>> The proposed offset/wrap field allows devices to disable doorbells when
>>>> appropriate, and determine the latest fill level via a PCIe read.
>>> This kind of looks like a split ring though, does it not?
>> I don't agree, because the ring isn't split.  Split-ring is very painful for
>> hardware because of the potentially random accesses to the descriptor table
>> (including pointer chasing) which inhibit batching of descriptor fetches, as
>> well as causing complexity in implementation.
> That's different with IN_ORDER, right?

Yes, IN_ORDER will give some of the benefit of packed ring, and is also 
a win because you can merge 'used' writes.  (But packed-ring still wins 
due to not having to fetch ring and descriptor table separately).

>> Packed ring is a huge improvement on both dimensions, but introduces a
>> specific pain point for hardware offload.
>>
>>> The issue is
>>> we will again need to bounce more cache lines to communicate.
>> You'll only bounce cache lines if the device chooses to read the idx.  A PV
>> device need not offer this feature.  A PCIe device will, but the cost to the
>> driver of writing an in-memory idx is much smaller than the cost of an MMIO,
>> which is always a strongly ordered barrier on x86/64.
>>
>> With vDPA you ideally would have this feature enabled, and the device would
>> sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
>> field and so cache lines would not bounce then either.
>>
>> Ie. The only time cache lines are shared is when sharing with a PCIe device,
>> which is the scenario when this is a win.
> True. OTOH on non-x86 you will need more write barriers :( It would be
> natural to say driver can reuse the barrier before flags update, but note
> that that would mean hardware still needs to support re-fetching if
> flags value is wrong. Such re-fetch is probably rare so fine by me, but it
> does add a bit more complexity.

I would prefer to have the write barrier before writing the idx.

Note that drivers could take advantage of this feature to avoid read 
barriers when consuming descriptors: At the moment there is a 
virtio_rmb() per descriptor read.  With the proposed feature the driver 
can do just one barrier after reading idx.  I expect that on platforms 
where write barriers have a cost read barriers likely have a significant 
cost too, so this might be a win with PV devices too.

>>> So I wonder: what if we made a change to spec that would allow prefetch
>>> of ring entries?  E.g. you would be able to read at random and if you
>>> guessed right then you can just use what you have read, no need to
>>> re-fetch?
>> Unless I've misunderstood I think this would imply that the driver would
>> have to ensure strict ordering for each descriptor it wrote, which would
>> impose a cost to the driver.  At the moment a write barrier is only needed
>> before writing the flags of the first descriptor in a batch.
> On non-x86 right?  OTOH the extra data structure also adds more ordering
> requirements.

Yes on non-x86.  The extra data structure only adds an ordering once per 
request (when enabled) whereas allowing prefetch would add an ordering 
per descriptor.  The number of descriptors is always >= the number of 
requests, and can be much larger (esp. for a net rx virt-queue).

>>>> I suggest the best place to put this would be in the driver area,
>>>> immediately after the event suppression structure.
>>> Could you comment on why is that a good place though?
>> The new field is written by the driver, as are the other fields in the
>> driver area.  Also I expect that devices might be able to read this new idx
>> together with the interrupt suppression fields in a single read, reducing
>> PCIe overheads.
> OK then you need it in the same aligned 256 byte boundary.

The event suppression structs currently require 4-byte alignment.  Are you 
suggesting we increase the alignment required when 
VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated?  Sounds good to me, but 
8 bytes would presumably suffice.

Thanks for the feedback.

David

-- 
David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare



* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-11 14:58       ` David Riddoch
@ 2019-02-12  5:08         ` Michael S. Tsirkin
  2019-02-12  7:28           ` Jason Wang
  2019-02-12 11:40           ` David Riddoch
  0 siblings, 2 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-12  5:08 UTC (permalink / raw)
  To: David Riddoch; +Cc: Virtio-Dev, Jason Wang

On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:
> > > > > This can result in a very high rate of doorbells with some
> > > > > drivers, which can become a severe bottleneck (because x86 CPUs can't emit
> > > > > MMIOs at very high rates).
> > > > Interesting. Is there any data you could share to help guide the design?
> > > > E.g. what's the highest rate of MMIO writes supported etc?
> > > On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
> > > 
> > > On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
> > > PCIe-remote socket ~8M/s.
> > Is that mega bytes? Or million writes?
> > With 32 bit writes are we limited to 2M then?
> 
> Sorry, this is million-writes-per-second.  The size of the write doesn't
> matter.

So it's not too bad, you only need one per batch after all.


Also, I thought some more about this. On x86/Intel specifically, PCIe
descriptor reads are atomic at cache-line granularity and writes are
ordered. So in fact you can safely prefetch descriptors and, if you see
them valid, go ahead and use them.

This makes the descriptor index merely a hint for performance, which the
device can play with at will.
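
Roughly the check a device-side prefetcher would apply to a speculatively
fetched descriptor -- just a sketch, the flag bits are the ones from the
packed-ring layout, everything else is illustrative:

#include <stdint.h>
#include <stdbool.h>

#define VIRTQ_DESC_F_AVAIL  (1u << 7)
#define VIRTQ_DESC_F_USED   (1u << 15)

/* A prefetched packed descriptor is available iff its AVAIL bit matches
 * the driver wrap counter the device expects and its USED bit is the
 * inverse.  On x86, with cache-line-atomic reads and ordered writes, a
 * positive result can be trusted; elsewhere a re-fetch (or the proposed
 * idx) is still needed. */
static bool desc_is_avail(uint16_t flags, bool expected_wrap)
{
    bool avail = flags & VIRTQ_DESC_F_AVAIL;
    bool used  = flags & VIRTQ_DESC_F_USED;

    return avail == expected_wrap && used != expected_wrap;
}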


Other platforms are not like this, so there you would need the data.
But do they have the same problem?





> > > This doesn't just impose a limit on aggregate packet rate: If you hit this
> > > bottleneck then the CPU core is back-pressured by the MMIO writes, and so
> > > instr/cycle takes a huge hit.
> > > 
> > > > > The proposed offset/wrap field allows devices to disable doorbells when
> > > > > appropriate, and determine the latest fill level via a PCIe read.
> > > > This kind of looks like a split ring though, does it not?
> > > I don't agree, because the ring isn't split.  Split-ring is very painful for
> > > hardware because of the potentially random accesses to the descriptor table
> > > (including pointer chasing) which inhibit batching of descriptor fetches, as
> > > well as causing complexity in implementation.
> > That's different with IN_ORDER, right?
> 
> Yes, IN_ORDER will give some of the benefit of packed ring, and is also a
> win because you can merge 'used' writes.  (But packed-ring still wins due to
> not having to fetch ring and descriptor table separately).
>
> > > Packed ring is a huge improvement on both dimensions, but introduces a
> > > specific pain point for hardware offload.
> > > 
> > > > The issue is
> > > > we will again need to bounce more cache lines to communicate.
> > > You'll only bounce cache lines if the device chooses to read the idx.  A PV
> > > device need not offer this feature.  A PCIe device will, but the cost to the
> > > driver of writing an in-memory idx is much smaller than the cost of an MMIO,
> > > which is always a strongly ordered barrier on x86/64.
> > > 
> > > With vDPA you ideally would have this feature enabled, and the device would
> > > sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
> > > field and so cache lines would not bounce then either.
> > > 
> > > Ie. The only time cache lines are shared is when sharing with a PCIe device,
> > > which is the scenario when this is a win.
> > True. OTOH on non-x86 you will need more write barriers :( It would be
> > natural to say driver can reuse the barrier before flags update, but note
> > that that would mean hardware still needs to support re-fetching if
> > flags value is wrong. Such re-fetch is probably rare so fine by me, but it
> > does add a bit more complexity.
> 
> I would prefer to have the write barrier before writing the idx.

Well, that's driver overhead for something the device might never utilise
in a given workload. If we are optimizing, let's optimize for speed.

> Note that drivers could take advantage of this feature to avoid read
> barriers when consuming descriptors: At the moment there is a virtio_rmb()
> per descriptor read.  With the proposed feature the driver can do just one
> barrier after reading idx.

Oh, you want the device to write a used index too? Hmm. Isn't this
contrary to your efforts to conserve PCIe BW?

> I expect that on platforms where write barriers
> have a cost read barriers likely have a significant cost too, so this might
> be a win with PV devices too.

I'm not sure.  It is pretty easy to replace an rmb with a dependency
which is generally quite cheap in my testing.

But since it's supposed to benefit PV, at this point we already have
implementations so rather than speculate (pun intended), people can
write patches and experiment.

For the proposed avail idx I think Jason (CC'd) was interested in adding
something like this for vhost.


> > > > So I wonder: what if we made a change to spec that would allow prefetch
> > > > of ring entries?  E.g. you would be able to read at random and if you
> > > > guessed right then you can just use what you have read, no need to
> > > > re-fetch?
> > > Unless I've misunderstood I think this would imply that the driver would
> > > have to ensure strict ordering for each descriptor it wrote, which would
> > > impose a cost to the driver.  At the moment a write barrier is only needed
> > > before writing the flags of the first descriptor in a batch.
> > On non-x86 right?  OTOH the extra data structure also adds more ordering
> > requirements.
> 
> Yes on non-x86.  The extra data structure only adds an ordering once per
> request (when enabled) whereas allowing prefetch would add an ordering per
> descriptor.  The number of descriptors is always >= the number of requests,
> and can be much larger (esp. for a net rx virt-queue).

Question is, is MMIO also slow on these non-x86 platforms?


Prefetch if successful just drastically lowers latency. You can't beat
not needing a read at all.

> > > > > I suggest the best place to put this would be in the driver area,
> > > > > immediately after the event suppression structure.
> > > > Could you comment on why is that a good place though?
> > > The new field is written by the driver, as are the other fields in the
> > > driver area.  Also I expect that devices might be able to read this new idx
> > > together with the interrupt suppression fields in a single read, reducing
> > > PCIe overheads.
> > OK then you need it in the same aligned 256 byte boundary.
> 
> The event suppression structs currently require 4byte align.  Are you
> suggesting we increase the align required when
> VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated?  Sounds good to me, but 8bytes
> would presumably suffice.

OK.

I am also wondering: what would the analog of this feature be for split
rings? We are burning a feature bit, might as well find a good
use for it.





> Thanks for the feedback.
> 
> David
> 
> -- 
> David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
> 
> 

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-12  5:08         ` Michael S. Tsirkin
@ 2019-02-12  7:28           ` Jason Wang
  2019-02-12 13:44             ` Michael S. Tsirkin
  2019-02-12 11:40           ` David Riddoch
  1 sibling, 1 reply; 37+ messages in thread
From: Jason Wang @ 2019-02-12  7:28 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Riddoch; +Cc: Virtio-Dev


On 2019/2/12 1:08 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:
>>>>>> This can result in a very high rate of doorbells with some
>>>>>> drivers, which can become a severe bottleneck (because x86 CPUs can't emit
>>>>>> MMIOs at very high rates).
>>>>> Interesting. Is there any data you could share to help guide the design?
>>>>> E.g. what's the highest rate of MMIO writes supported etc?
>>>> On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
>>>>
>>>> On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
>>>> PCIe-remote socket ~8M/s.
>>> Is that mega bytes? Or million writes?
>>> With 32 bit writes are we limited to 2M then?
>> Sorry, this is million-writes-per-second.  The size of the write doesn't
>> matter.
> So it's not too bad, you only need one per batch after all.
>
>
> Also I thought some more about this. In fact on x86/intel specifically
> PCI descriptor reads are atomic with cache line granularity
> and writes are ordered. So in fact you can safely prefetch
> descriptors and if you see them valid, you can go ahead
> and use them.


If I understand this correctly, unless the driver sets several descriptors 
available in order, the device could not assume that if descriptor N is 
available then [0, N) are also available.

But prefetching does indeed improve performance when I play with the vhost 
kernel implementation. It reduces the overhead of memory accesses, which is 
similar to the PCIe case.


>
> This makes the descriptor index merely a hint for performance,
> which device can play with at will.
>
>
> Other platforms are not like this so you need the data
> but do they have the same problem?
>
>
>
>
>
>>>> This doesn't just impose a limit on aggregate packet rate: If you hit this
>>>> bottleneck then the CPU core is back-pressured by the MMIO writes, and so
>>>> instr/cycle takes a huge hit.
>>>>
>>>>>> The proposed offset/wrap field allows devices to disable doorbells when
>>>>>> appropriate, and determine the latest fill level via a PCIe read.
>>>>> This kind of looks like a split ring though, does it not?
>>>> I don't agree, because the ring isn't split.  Split-ring is very painful for
>>>> hardware because of the potentially random accesses to the descriptor table
>>>> (including pointer chasing) which inhibit batching of descriptor fetches, as
>>>> well as causing complexity in implementation.
>>> That's different with IN_ORDER, right?
>> Yes, IN_ORDER will give some of the benefit of packed ring, and is also a
>> win because you can merge 'used' writes.  (But packed-ring still wins due to
>> not having to fetch ring and descriptor table separately).


Note that with IN_ORDER there's no need to read the available ring at all, 
even for split ring. And in some cases, no need to update the used ring at 
all.


>>
>>>> Packed ring is a huge improvement on both dimensions, but introduces a
>>>> specific pain point for hardware offload.
>>>>
>>>>> The issue is
>>>>> we will again need to bounce more cache lines to communicate.
>>>> You'll only bounce cache lines if the device chooses to read the idx.  A PV
>>>> device need not offer this feature.  A PCIe device will, but the cost to the
>>>> driver of writing an in-memory idx is much smaller than the cost of an MMIO,
>>>> which is always a strongly ordered barrier on x86/64.
>>>>
>>>> With vDPA you ideally would have this feature enabled, and the device would
>>>> sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
>>>> field and so cache lines would not bounce then either.
>>>>
>>>> Ie. The only time cache lines are shared is when sharing with a PCIe device,
>>>> which is the scenario when this is a win.
>>> True. OTOH on non-x86 you will need more write barriers :( It would be
>>> natural to say driver can reuse the barrier before flags update, but note
>>> that that would mean hardware still needs to support re-fetching if
>>> flags value is wrong. Such re-fetch is probably rare so fine by me, but it
>>> does add a bit more complexity.
>> I would prefer to have the write barrier before writing the idx.
> Well that's driver overhead for something device might never utilise in
> a given workload. If we are optimizing let's optimize for speed.
>
>> Note that drivers could take advantage of this feature to avoid read
>> barriers when consuming descriptors: At the moment there is a virtio_rmb()
>> per descriptor read.  With the proposed feature the driver can do just one
>> barrier after reading idx.
> Oh you want device to write a used index too? Hmm. Isn't this
> contrary to your efforts to consume PCIe BW?
>
>> I expect that on platforms where write barriers
>> have a cost read barriers likely have a significant cost too, so this might
>> be a win with PV devices too.
> I'm not sure.  It is pretty easy to replace an rmb with a dependency
> which is generally quite cheap in my testing.
>
> But since it's supposed to benefit PV, at this point we already have
> implementations so rather than speculate (pun intended), people can
> write patches and experiment.
>
> For the proposed avail idx I think Jason (CC'd) was interested in adding
> something like this for vhost.


Yes, it's kind of like IN_ORDER + split ring, I believe.

Thanks


>
>
>>>>> So I wonder: what if we made a change to spec that would allow prefetch
>>>>> of ring entries?  E.g. you would be able to read at random and if you
>>>>> guessed right then you can just use what you have read, no need to
>>>>> re-fetch?
>>>> Unless I've misunderstood I think this would imply that the driver would
>>>> have to ensure strict ordering for each descriptor it wrote, which would
>>>> impose a cost to the driver.  At the moment a write barrier is only needed
>>>> before writing the flags of the first descriptor in a batch.
>>> On non-x86 right?  OTOH the extra data structure also adds more ordering
>>> requirements.
>> Yes on non-x86.  The extra data structure only adds an ordering once per
>> request (when enabled) whereas allowing prefetch would add an ordering per
>> descriptor.  The number of descriptors is always >= the number of requests,
>> and can be much larger (esp. for a net rx virt-queue).
> Question is, is MMIO also slow on these non-x86 platforms?
>
>
> Prefetch if successful just drastically lowers latency. You can't beat
> not needing a read at all.
>
>>>>>> I suggest the best place to put this would be in the driver area,
>>>>>> immediately after the event suppression structure.
>>>>> Could you comment on why is that a good place though?
>>>> The new field is written by the driver, as are the other fields in the
>>>> driver area.  Also I expect that devices might be able to read this new idx
>>>> together with the interrupt suppression fields in a single read, reducing
>>>> PCIe overheads.
>>> OK then you need it in the same aligned 256 byte boundary.
>> The event suppression structs currently require 4byte align.  Are you
>> suggesting we increase the align required when
>> VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated?  Sounds good to me, but 8bytes
>> would presumably suffice.
> OK.
>
> I am also wondering: what would the analog of this feature be for split
> rings? We are burning a feature bit, might as well find a good
> use for it.
>
>
>
>
>
>> Thanks for the feedback.
>>
>> David
>>
>> -- 
>> David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
>>
>>

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-12  5:08         ` Michael S. Tsirkin
  2019-02-12  7:28           ` Jason Wang
@ 2019-02-12 11:40           ` David Riddoch
  2019-02-12 12:51             ` Michael S. Tsirkin
  2019-02-12 14:01             ` Michael S. Tsirkin
  1 sibling, 2 replies; 37+ messages in thread
From: David Riddoch @ 2019-02-12 11:40 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Virtio-Dev, Jason Wang

On 12/02/2019 05:08, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:
>>>>>> This can result in a very high rate of doorbells with some
>>>>>> drivers, which can become a severe bottleneck (because x86 CPUs can't emit
>>>>>> MMIOs at very high rates).
>>>>> Interesting. Is there any data you could share to help guide the design?
>>>>> E.g. what's the highest rate of MMIO writes supported etc?
>>>> On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
>>>>
>>>> On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
>>>> PCIe-remote socket ~8M/s.
>>> Is that mega bytes? Or million writes?
>>> With 32 bit writes are we limited to 2M then?
>> Sorry, this is million-writes-per-second.  The size of the write doesn't
>> matter.
> So it's not too bad, you only need one per batch after all.

If you have a driver that sends in batches, or uses TSO, yes it is 
fine.  But if using the Linux driver with non-TSO sends then you get a 
doorbell for each and every send.  In that world it only takes a 
moderate number of cores sending at a reasonably high rate to hit this 
bottleneck.  I agree that most people won't hit this today, but very 
high packet rates are increasingly important.

Can drivers avoid this by postponing doorbells?  Not easily, no. When 
you hit this bottleneck the only evidence you have that it is happening 
is that certain instructions (such as doorbells) take a very long time.  
The tx virt-queue doesn't fill because the NIC and link are not 
bottlenecked, so you can't spot it that way.

You could measure send rate over an interval and defer doorbells if 
sending at a high rate, but that adds the cost/complexity of timers (to 
ensure a deferred doorbell is sent reasonably promptly), and you could 
still hit the bottleneck transiently without triggering the avoidance 
measures.
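
The kind of heuristic I mean would look roughly like this (purely
illustrative: the names, window and threshold are made up), and note that it
still needs a timer to flush a deferred doorbell:

#include <stdint.h>
#include <stdbool.h>

struct tx_kick_state {
    uint64_t window_start_ns;
    unsigned kicks_in_window;
    bool     kick_pending;     /* a timer must flush this if set */
};

/* Decide whether to ring the doorbell now or defer it.  Window and
 * threshold are arbitrary placeholders. */
static bool should_kick_now(struct tx_kick_state *s, uint64_t now_ns)
{
    const uint64_t window_ns = 100 * 1000;   /* 100us measurement window */
    const unsigned max_kicks = 500;          /* per-queue MMIO budget    */

    if (now_ns - s->window_start_ns > window_ns) {
        s->window_start_ns = now_ns;
        s->kicks_in_window = 0;
    }
    if (++s->kicks_in_window <= max_kicks)
        return true;            /* under budget: kick immediately */

    s->kick_pending = true;     /* over budget: defer, rely on the timer */
    return false;
}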

> Also I thought some more about this. In fact on x86/intel specifically
> PCI descriptor reads are atomic with cache line granularity
> and writes are ordered. So in fact you can safely prefetch
> descriptors and if you see them valid, you can go ahead
> and use them.
>
> This makes the descriptor index merely a hint for performance,
> which device can play with at will.
>
> Other platforms are not like this so you need the data
> but do they have the same problem?

Yes this proposal is just about performance.  I have no data on other 
processor types.

>>>> This doesn't just impose a limit on aggregate packet rate: If you hit this
>>>> bottleneck then the CPU core is back-pressured by the MMIO writes, and so
>>>> instr/cycle takes a huge hit.
>>>>
>>>>>> The proposed offset/wrap field allows devices to disable doorbells when
>>>>>> appropriate, and determine the latest fill level via a PCIe read.
>>>>> This kind of looks like a split ring though, does it not?
>>>> I don't agree, because the ring isn't split.  Split-ring is very painful for
>>>> hardware because of the potentially random accesses to the descriptor table
>>>> (including pointer chasing) which inhibit batching of descriptor fetches, as
>>>> well as causing complexity in implementation.
>>> That's different with IN_ORDER, right?
>> Yes, IN_ORDER will give some of the benefit of packed ring, and is also a
>> win because you can merge 'used' writes.  (But packed-ring still wins due to
>> not having to fetch ring and descriptor table separately).
>>
>>>> Packed ring is a huge improvement on both dimensions, but introduces a
>>>> specific pain point for hardware offload.
>>>>
>>>>> The issue is
>>>>> we will again need to bounce more cache lines to communicate.
>>>> You'll only bounce cache lines if the device chooses to read the idx.  A PV
>>>> device need not offer this feature.  A PCIe device will, but the cost to the
>>>> driver of writing an in-memory idx is much smaller than the cost of an MMIO,
>>>> which is always a strongly ordered barrier on x86/64.
>>>>
>>>> With vDPA you ideally would have this feature enabled, and the device would
>>>> sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
>>>> field and so cache lines would not bounce then either.
>>>>
>>>> Ie. The only time cache lines are shared is when sharing with a PCIe device,
>>>> which is the scenario when this is a win.
>>> True. OTOH on non-x86 you will need more write barriers :( It would be
>>> natural to say driver can reuse the barrier before flags update, but note
>>> that that would mean hardware still needs to support re-fetching if
>>> flags value is wrong. Such re-fetch is probably rare so fine by me, but it
>>> does add a bit more complexity.
>> I would prefer to have the write barrier before writing the idx.
> Well that's driver overhead for something device might never utilise in
> a given workload. If we are optimizing let's optimize for speed.

I think doing the barrier before writing idx is best for speed (see 
below).  But it is really desirable for complexity: Cases like this are 
easy to handle in software, but much much harder in pipelined hardware 
implementations.

>> Note that drivers could take advantage of this feature to avoid read
>> barriers when consuming descriptors: At the moment there is a virtio_rmb()
>> per descriptor read.  With the proposed feature the driver can do just one
>> barrier after reading idx.
> Oh you want device to write a used index too? Hmm. Isn't this
> contrary to your efforts to consume PCIe BW?

Sorry, I mistyped: I meant that PV devices can take advantage of this to
avoid read barriers.  I am not suggesting that devices write an index.
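
To illustrate the PV-device case, a rough sketch of a vhost-style consumer
that reads the driver-written idx once and then does a single barrier for
the whole batch (the idx is treated as a plain free-running index here for
simplicity; the real proposal encodes offset plus wrap):

#include <stdint.h>

#define smp_rmb()  __atomic_thread_fence(__ATOMIC_ACQUIRE)

struct packed_desc { uint64_t addr; uint32_t len; uint16_t id, flags; };

struct vq_state {
    struct packed_desc *desc_ring;
    volatile uint16_t  *avail_idx;   /* the proposed driver-written field */
    uint16_t            next;        /* next slot to consume */
    uint16_t            size;        /* ring size, power of two here */
};

static void handle_desc(const struct packed_desc *d) { (void)d; /* ... */ }

/* One barrier after reading idx covers the whole batch, instead of a
 * virtio_rmb() per descriptor flags check. */
static void consume_available(struct vq_state *vq)
{
    uint16_t avail = *vq->avail_idx;

    smp_rmb();
    while (vq->next != avail) {
        handle_desc(&vq->desc_ring[vq->next & (vq->size - 1)]);
        vq->next++;
    }
}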

>> I expect that on platforms where write barriers
>> have a cost read barriers likely have a significant cost too, so this might
>> be a win with PV devices too.
> I'm not sure.  It is pretty easy to replace an rmb with a dependency
> which is generally quite cheap in my testing.
>
> But since it's supposed to benefit PV, at this point we already have
> implementations so rather than speculate (pun intended), people can
> write patches and experiment.

From my point of view the primary benefit is for hardware 
implementations.  I hope that demonstrating improved PV performance 
would not be a hard gate.

> For the proposed avail idx I think Jason (CC'd) was interested in adding
> something like this for vhost.
>
>
>>>>> So I wonder: what if we made a change to spec that would allow prefetch
>>>>> of ring entries?  E.g. you would be able to read at random and if you
>>>>> guessed right then you can just use what you have read, no need to
>>>>> re-fetch?
>>>> Unless I've misunderstood I think this would imply that the driver would
>>>> have to ensure strict ordering for each descriptor it wrote, which would
>>>> impose a cost to the driver.  At the moment a write barrier is only needed
>>>> before writing the flags of the first descriptor in a batch.
>>> On non-x86 right?  OTOH the extra data structure also adds more ordering
>>> requirements.
>> Yes on non-x86.  The extra data structure only adds an ordering once per
>> request (when enabled) whereas allowing prefetch would add an ordering per
>> descriptor.  The number of descriptors is always >= the number of requests,
>> and can be much larger (esp. for a net rx virt-queue).
> Question is, is MMIO also slow on these non-x86 platforms?
>
> Prefetch if successful just drastically lowers latency. You can't beat
> not needing a read at all.

Are you talking about net rx?  Yes, devices can prefetch descriptors 
from rx virt-queues ahead of packets arriving.  That does not require 
any changes to the spec.

I was just making the point that allowing random reads of packed-ring 
descriptors adds ordering costs when writing descriptors.  I can't see 
what the benefit would be -- have I missed something?

>>>>>> I suggest the best place to put this would be in the driver area,
>>>>>> immediately after the event suppression structure.
>>>>> Could you comment on why is that a good place though?
>>>> The new field is written by the driver, as are the other fields in the
>>>> driver area.  Also I expect that devices might be able to read this new idx
>>>> together with the interrupt suppression fields in a single read, reducing
>>>> PCIe overheads.
>>> OK then you need it in the same aligned 256 byte boundary.
>> The event suppression structs currently require 4byte align.  Are you
>> suggesting we increase the align required when
>> VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated?  Sounds good to me, but 8bytes
>> would presumably suffice.
> OK.
>
> I am also wondering: what would the analog of this feature be for split
> rings? We are burning a feature bit, might as well find a good
> use for it.

I don't have any suggestions.  I worry that associating the same bit 
with a split-ring feature would create an undesirable coupling: A device 
offering X for packed-ring would also necessarily have to implement Y 
for split-ring and vice versa (if both ring types are supported).


-- 
David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare



* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-12 11:40           ` David Riddoch
@ 2019-02-12 12:51             ` Michael S. Tsirkin
  2019-02-12 14:01             ` Michael S. Tsirkin
  1 sibling, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-12 12:51 UTC (permalink / raw)
  To: David Riddoch; +Cc: Virtio-Dev, Jason Wang

On Tue, Feb 12, 2019 at 11:40:25AM +0000, David Riddoch wrote:
> On 12/02/2019 05:08, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:
> > > > > > > This can result in a very high rate of doorbells with some
> > > > > > > drivers, which can become a severe bottleneck (because x86 CPUs can't emit
> > > > > > > MMIOs at very high rates).
> > > > > > Interesting. Is there any data you could share to help guide the design?
> > > > > > E.g. what's the highest rate of MMIO writes supported etc?
> > > > > On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
> > > > > 
> > > > > On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
> > > > > PCIe-remote socket ~8M/s.
> > > > Is that mega bytes? Or million writes?
> > > > With 32 bit writes are we limited to 2M then?
> > > Sorry, this is million-writes-per-second.  The size of the write doesn't
> > > matter.
> > So it's not too bad, you only need one per batch after all.
> 
> If you have a driver that sends in batches, or uses TSO, yes it is fine. 
> But if using the Linux driver with non-TSO sends then you get a doorbell for
> each and every send.  In that world it only takes a moderate number of cores
> sending at a reasonably high rate to hit this bottleneck.  I agree that most
> people won't hit this today, but very high packet rates are increasingly
> important.
> 
> Can drivers avoid this by postponing doorbells?  Not easily, no. When you
> hit this bottleneck the only evidence you have that it is happening is that
> certain instructions (such as doorbells) take a very long time.  The tx
> virt-queue doesn't fill because the NIC and link are not bottlenecked, so
> you can't spot it that way.
> 
> You could measure send rate over an interval and defer doorbells if sending
> at a high rate, but that adds the cost/complexity of timers (to ensure a
> deferred doorbell is sent reasonably promptly), and you could still hit the
> bottleneck transiently without triggering the avoidance measures.
> 
> > Also I thought some more about this. In fact on x86/intel specifically
> > PCI descriptor reads are atomic with cache line granularity
> > and writes are ordered. So in fact you can safely prefetch
> > descriptors and if you see them valid, you can go ahead
> > and use them.
> > 
> > This makes the descriptor index merely a hint for performance,
> > which device can play with at will.
> > 
> > Other platforms are not like this so you need the data
> > but do they have the same problem?
> 
> Yes this proposal is just about performance.  I have no data on other
> processor types.
> 
> > > > > This doesn't just impose a limit on aggregate packet rate: If you hit this
> > > > > bottleneck then the CPU core is back-pressured by the MMIO writes, and so
> > > > > instr/cycle takes a huge hit.
> > > > > 
> > > > > > > The proposed offset/wrap field allows devices to disable doorbells when
> > > > > > > appropriate, and determine the latest fill level via a PCIe read.
> > > > > > This kind of looks like a split ring though, does it not?
> > > > > I don't agree, because the ring isn't split.  Split-ring is very painful for
> > > > > hardware because of the potentially random accesses to the descriptor table
> > > > > (including pointer chasing) which inhibit batching of descriptor fetches, as
> > > > > well as causing complexity in implementation.
> > > > That's different with IN_ORDER, right?
> > > Yes, IN_ORDER will give some of the benefit of packed ring, and is also a
> > > win because you can merge 'used' writes.  (But packed-ring still wins due to
> > > not having to fetch ring and descriptor table separately).
> > > 
> > > > > Packed ring is a huge improvement on both dimensions, but introduces a
> > > > > specific pain point for hardware offload.
> > > > > 
> > > > > > The issue is
> > > > > > we will again need to bounce more cache lines to communicate.
> > > > > You'll only bounce cache lines if the device chooses to read the idx.  A PV
> > > > > device need not offer this feature.  A PCIe device will, but the cost to the
> > > > > driver of writing an in-memory idx is much smaller than the cost of an MMIO,
> > > > > which is always a strongly ordered barrier on x86/64.
> > > > > 
> > > > > With vDPA you ideally would have this feature enabled, and the device would
> > > > > sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
> > > > > field and so cache lines would not bounce then either.
> > > > > 
> > > > > Ie. The only time cache lines are shared is when sharing with a PCIe device,
> > > > > which is the scenario when this is a win.
> > > > True. OTOH on non-x86 you will need more write barriers :( It would be
> > > > natural to say driver can reuse the barrier before flags update, but note
> > > > that that would mean hardware still needs to support re-fetching if
> > > > flags value is wrong. Such re-fetch is probably rare so fine by me, but it
> > > > does add a bit more complexity.
> > > I would prefer to have the write barrier before writing the idx.
> > Well that's driver overhead for something device might never utilise in
> > a given workload. If we are optimizing let's optimize for speed.
> 
> I think doing the barrier before writing idx is best for speed (see below). 
> But it is really desirable for complexity: Cases like this are easy to
> handle in software, but much much harder in pipelined hardware
> implementations.
> 
> > > Note that drivers could take advantage of this feature to avoid read
> > > barriers when consuming descriptors: At the moment there is a virtio_rmb()
> > > per descriptor read.  With the proposed feature the driver can do just one
> > > barrier after reading idx.
> > Oh you want device to write a used index too? Hmm. Isn't this
> > contrary to your efforts to consume PCIe BW?
> 
> Sorry I mistyped: I meant that PV devices can take advantage to avoid read
> barriers.  I am not suggesting that devices write an index.
> 
> > > I expect that on platforms where write barriers
> > > have a cost read barriers likely have a significant cost too, so this might
> > > be a win with PV devices too.
> > I'm not sure.  It is pretty easy to replace an rmb with a dependency
> > which is generally quite cheap in my testing.
> > 
> > But since it's supposed to benefit PV, at this point we already have
> > implementations so rather than speculate (pun intended), people can
> > write patches and experiment.
> 
> From my point of view the primary benefit is for hardware implementations. 
> I hope that demonstrating improved PV performance would not be a hard gate.


Probably not. I am saying it might make sense to test the specific
interface to make sure it's in a form that is good for PV.


> > For the proposed avail idx I think Jason (CC'd) was interested in adding
> > something like this for vhost.
> > 
> > 
> > > > > > So I wonder: what if we made a change to spec that would allow prefetch
> > > > > > of ring entries?  E.g. you would be able to read at random and if you
> > > > > > guessed right then you can just use what you have read, no need to
> > > > > > re-fetch?
> > > > > Unless I've misunderstood I think this would imply that the driver would
> > > > > have to ensure strict ordering for each descriptor it wrote, which would
> > > > > impose a cost to the driver.  At the moment a write barrier is only needed
> > > > > before writing the flags of the first descriptor in a batch.
> > > > On non-x86 right?  OTOH the extra data structure also adds more ordering
> > > > requirements.
> > > Yes on non-x86.  The extra data structure only adds an ordering once per
> > > request (when enabled) whereas allowing prefetch would add an ordering per
> > > descriptor.  The number of descriptors is always >= the number of requests,
> > > and can be much larger (esp. for a net rx virt-queue).
> > Question is, is MMIO also slow on these non-x86 platforms?
> > 
> > Prefetch if successful just drastically lowers latency. You can't beat
> > not needing a read at all.
> 
> Are you talking about net rx?  Yes, devices can prefetch descriptors from rx
> virt-queues ahead of packets arriving.  That does not require any changes to
> the spec.

net tx too.

> I was just making the point that allowing random reads of packed-ring
> descriptors adds ordering costs when writing descriptors.  I can't see what
> the benefit would be -- have I missed something?

So on x86 there's no extra ordering cost if you just write
the flags last since write barriers are not needed.
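
To be concrete, the write pattern I have in mind looks something like this
(purely a sketch: simplified descriptor layout, flags value precomputed, and
assuming the batch does not wrap the ring):

#include <stdint.h>

/* Simplified packed-ring descriptor for the sketch. */
struct pdesc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Fill in everything except the flags first, then write the flags last.
 * On x86 stores become visible in program order, so a device that sees
 * an "available" flag also sees the rest of that descriptor (and all
 * earlier ones).  On weakly ordered platforms a wmb() is needed where
 * marked.  volatile/DMA-coherency details are omitted. */
static void publish_batch(struct pdesc *ring, uint16_t size,
                          uint16_t first, uint16_t n,
                          const struct pdesc *batch, uint16_t avail_flags)
{
    for (uint16_t i = 0; i < n; i++) {
        struct pdesc *d = &ring[(uint16_t)(first + i) % size];
        d->addr = batch[i].addr;
        d->len  = batch[i].len;
        d->id   = batch[i].id;
    }
    /* wmb() needed here on non-x86 */
    for (uint16_t i = 0; i < n; i++)
        ring[(uint16_t)(first + i) % size].flags = avail_flags;
}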


> > > > > > > I suggest the best place to put this would be in the driver area,
> > > > > > > immediately after the event suppression structure.
> > > > > > Could you comment on why is that a good place though?
> > > > > The new field is written by the driver, as are the other fields in the
> > > > > driver area.  Also I expect that devices might be able to read this new idx
> > > > > together with the interrupt suppression fields in a single read, reducing
> > > > > PCIe overheads.
> > > > OK then you need it in the same aligned 256 byte boundary.
> > > The event suppression structs currently require 4byte align.  Are you
> > > suggesting we increase the align required when
> > > VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated?  Sounds good to me, but 8bytes
> > > would presumably suffice.
> > OK.
> > 
> > I am also wondering: what would the analog of this feature be for split
> > rings? We are burning a feature bit, might as well find a good
> > use for it.
> 
> I don't have any suggestions.  I worry that associating the same bit with a
> split-ring feature would create an undesirable coupling: A device offering X
> for packed-ring would also necessarily have to implement Y for split-ring
> and vice versa (if both ring types are supported).

Well if it's a reasonably logical coupling then it makes sense.
Like e.g. IN_ORDER. If it's a random thing then sure.


> 
> -- 
> David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-12  7:28           ` Jason Wang
@ 2019-02-12 13:44             ` Michael S. Tsirkin
  2019-02-13 10:00               ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-12 13:44 UTC (permalink / raw)
  To: Jason Wang; +Cc: David Riddoch, Virtio-Dev

On Tue, Feb 12, 2019 at 03:28:26PM +0800, Jason Wang wrote:
> 
> On 2019/2/12 1:08 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:
> > > > > > > This can result in a very high rate of doorbells with some
> > > > > > > drivers, which can become a severe bottleneck (because x86 CPUs can't emit
> > > > > > > MMIOs at very high rates).
> > > > > > Interesting. Is there any data you could share to help guide the design?
> > > > > > E.g. what's the highest rate of MMIO writes supported etc?
> > > > > On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
> > > > > 
> > > > > On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
> > > > > PCIe-remote socket ~8M/s.
> > > > Is that mega bytes? Or million writes?
> > > > With 32 bit writes are we limited to 2M then?
> > > Sorry, this is million-writes-per-second.  The size of the write doesn't
> > > matter.
> > So it's not too bad, you only need one per batch after all.
> > 
> > 
> > Also I thought some more about this. In fact on x86/intel specifically
> > PCI descriptor reads are atomic with cache line granularity
> > and writes are ordered. So in fact you can safely prefetch
> > descriptors and if you see them valid, you can go ahead
> > and use them.
> 
> 
> If I understand this correctly, unless the driver sets avail in order for
> several descriptors, the device could not assume that if N is avail then
> [0, N) is also available.

Well you don't need that assumption. You check it and discard
prefetched data if not.


> But prefetching indeed improves the performance when I play with the vhost
> kernel implementation. It reduces the overhead of memory accesses, which is
> similar to PCI.

Prefetch as in prefetchnta?


> 
> > 
> > This makes the descriptor index merely a hint for performance,
> > which device can play with at will.
> > 
> > 
> > Other platforms are not like this so you need the data
> > but do they have the same problem?
> > 
> > 
> > 
> > 
> > 
> > > > > This doesn't just impose a limit on aggregate packet rate: If you hit this
> > > > > bottleneck then the CPU core is back-pressured by the MMIO writes, and so
> > > > > instr/cycle takes a huge hit.
> > > > > 
> > > > > > > The proposed offset/wrap field allows devices to disable doorbells when
> > > > > > > appropriate, and determine the latest fill level via a PCIe read.
> > > > > > This kind of looks like a split ring though, does it not?
> > > > > I don't agree, because the ring isn't split.  Split-ring is very painful for
> > > > > hardware because of the potentially random accesses to the descriptor table
> > > > > (including pointer chasing) which inhibit batching of descriptor fetches, as
> > > > > well as causing complexity in implementation.
> > > > That's different with IN_ORDER, right?
> > > Yes, IN_ORDER will give some of the benefit of packed ring, and is also a
> > > win because you can merge 'used' writes.  (But packed-ring still wins due to
> > > not having to fetch ring and descriptor table separately).
> 
> 
> Note, with IN_ORDER, there's no need to read available ring at all even for
> split ring. And in some cases, no need to update used ring at all.

Yes, you said so at some point, but it seems that the spec
is written ambiguously and others do not read it like this.
Can you clarify?


> 
> > > 
> > > > > Packed ring is a huge improvement on both dimensions, but introduces a
> > > > > specific pain point for hardware offload.
> > > > > 
> > > > > > The issue is
> > > > > > we will again need to bounce more cache lines to communicate.
> > > > > You'll only bounce cache lines if the device chooses to read the idx.  A PV
> > > > > device need not offer this feature.  A PCIe device will, but the cost to the
> > > > > driver of writing an in-memory idx is much smaller than the cost of an MMIO,
> > > > > which is always a strongly ordered barrier on x86/64.
> > > > > 
> > > > > With vDPA you ideally would have this feature enabled, and the device would
> > > > > sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
> > > > > field and so cache lines would not bounce then either.
> > > > > 
> > > > > Ie. The only time cache lines are shared is when sharing with a PCIe device,
> > > > > which is the scenario when this is a win.
> > > > True. OTOH on non-x86 you will need more write barriers :( It would be
> > > > natural to say driver can reuse the barrier before flags update, but note
> > > > that that would mean hardware still needs to support re-fetching if
> > > > flags value is wrong. Such re-fetch is probably rare so fine by me, but it
> > > > does add a bit more complexity.
> > > I would prefer to have the write barrier before writing the idx.
> > Well that's driver overhead for something device might never utilise in
> > a given workload. If we are optimizing let's optimize for speed.
> > 
> > > Note that drivers could take advantage of this feature to avoid read
> > > barriers when consuming descriptors: At the moment there is a virtio_rmb()
> > > per descriptor read.  With the proposed feature the driver can do just one
> > > barrier after reading idx.
> > Oh you want device to write a used index too? Hmm. Isn't this
> > contrary to your efforts to consume PCIe BW?
> > 
> > > I expect that on platforms where write barriers
> > > have a cost read barriers likely have a significant cost too, so this might
> > > be a win with PV devices too.
> > I'm not sure.  It is pretty easy to replace an rmb with a dependency
> > which is generally quite cheap in my testing.
> > 
> > But since it's supposed to benefit PV, at this point we already have
> > implementations so rather than speculate (pun intended), people can
> > write patches and experiment.
> > 
> > For the proposed avail idx I think Jason (CC'd) was interested in adding
> > something like this for vhost.
> 
> 
> Yes, it's kind of IN_ORDER + split ring I believe.
> 
> Thanks

Well IN_ORDER is only good for e.g. network. Storage wants out of order
so it could benefit from a new feature.  I am however worried about the
extra wmb implied.  Still thinking about ways to avoid adding that.




> 
> > 
> > 
> > > > > > So I wonder: what if we made a change to spec that would allow prefetch
> > > > > > of ring entries?  E.g. you would be able to read at random and if you
> > > > > > guessed right then you can just use what you have read, no need to
> > > > > > re-fetch?
> > > > > Unless I've misunderstood I think this would imply that the driver would
> > > > > have to ensure strict ordering for each descriptor it wrote, which would
> > > > > impose a cost to the driver.  At the moment a write barrier is only needed
> > > > > before writing the flags of the first descriptor in a batch.
> > > > On non-x86 right?  OTOH the extra data structure also adds more ordering
> > > > requirements.
> > > Yes on non-x86.  The extra data structure only adds an ordering once per
> > > request (when enabled) whereas allowing prefetch would add an ordering per
> > > descriptor.  The number of descriptors is always >= the number of requests,
> > > and can be much larger (esp. for a net rx virt-queue).
> > Question is, is MMIO also slow on these non-x86 platforms?
> > 
> > 
> > Prefetch if successful just drastically lowers latency. You can't beat
> > not needing a read at all.
> > 
> > > > > > > I suggest the best place to put this would be in the driver area,
> > > > > > > immediately after the event suppression structure.
> > > > > > Could you comment on why is that a good place though?
> > > > > The new field is written by the driver, as are the other fields in the
> > > > > driver area.  Also I expect that devices might be able to read this new idx
> > > > > together with the interrupt suppression fields in a single read, reducing
> > > > > PCIe overheads.
> > > > OK then you need it in the same aligned 256 byte boundary.
> > > The event suppression structs currently require 4byte align.  Are you
> > > suggesting we increase the align required when
> > > VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated?  Sounds good to me, but 8bytes
> > > would presumably suffice.
> > OK.
> > 
> > I am also wondering: what would the analog of this feature be for split
> > rings? We are burning a feature bit, might as well find a good
> > use for it.
> > 
> > 
> > 
> > 
> > 
> > > Thanks for the feedback.
> > > 
> > > David
> > > 
> > > -- 
> > > David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
> > > 
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-12 11:40           ` David Riddoch
  2019-02-12 12:51             ` Michael S. Tsirkin
@ 2019-02-12 14:01             ` Michael S. Tsirkin
  2019-02-12 16:47               ` David Riddoch
  1 sibling, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-12 14:01 UTC (permalink / raw)
  To: David Riddoch; +Cc: Virtio-Dev, Jason Wang

On Tue, Feb 12, 2019 at 11:40:25AM +0000, David Riddoch wrote:
> On 12/02/2019 05:08, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:
> > > > > > > This can result in a very high rate of doorbells with some
> > > > > > > drivers, which can become a severe bottleneck (because x86 CPUs can't emit
> > > > > > > MMIOs at very high rates).
> > > > > > Interesting. Is there any data you could share to help guide the design?
> > > > > > E.g. what's the highest rate of MMIO writes supported etc?
> > > > > On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
> > > > > 
> > > > > On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
> > > > > PCIe-remote socket ~8M/s.
> > > > Is that mega bytes? Or million writes?
> > > > With 32 bit writes are we limited to 2M then?
> > > Sorry, this is million-writes-per-second.  The size of the write doesn't
> > > matter.
> > So it's not too bad, you only need one per batch after all.
> 
> If you have a driver that sends in batches, or uses TSO, yes it is fine. 
> But if using the Linux driver with non-TSO sends then you get a doorbell for
> each and every send.  In that world it only takes a moderate number of cores
> sending at a reasonably high rate to hit this bottleneck.  I agree that most
> people won't hit this today, but very high packet rates are increasingly
> important.
> 
> Can drivers avoid this by postponing doorbells?  Not easily, no. When you
> hit this bottleneck the only evidence you have that it is happening is that
> certain instructions (such as doorbells) take a very long time.  The tx
> virt-queue doesn't fill because the NIC and link are not bottlenecked, so
> you can't spot it that way.
> 
> You could measure send rate over an interval and defer doorbells if sending
> at a high rate, but that adds the cost/complexity of timers (to ensure a
> deferred doorbell is sent reasonably promptly), and you could still hit the
> bottleneck transiently without triggering the avoidance measures.
> 
> > Also I thought some more about this. In fact on x86/intel specifically
> > PCI descriptor reads are atomic with cache line granularity
> > and writes are ordered. So in fact you can safely prefetch
> > descriptors and if you see them valid, you can go ahead
> > and use them.
> > 
> > This makes the descriptor index merely a hint for performance,
> > which device can play with at will.
> > 
> > Other platforms are not like this so you need the data
> > but do they have the same problem?
> 
> Yes this proposal is just about performance.  I have no data on other
> processor types.
> 
> > > > > This doesn't just impose a limit on aggregate packet rate: If you hit this
> > > > > bottleneck then the CPU core is back-pressured by the MMIO writes, and so
> > > > > instr/cycle takes a huge hit.
> > > > > 
> > > > > > > The proposed offset/wrap field allows devices to disable doorbells when
> > > > > > > appropriate, and determine the latest fill level via a PCIe read.
> > > > > > This kind of looks like a split ring though, does it not?
> > > > > I don't agree, because the ring isn't split.  Split-ring is very painful for
> > > > > hardware because of the potentially random accesses to the descriptor table
> > > > > (including pointer chasing) which inhibit batching of descriptor fetches, as
> > > > > well as causing complexity in implementation.
> > > > That's different with IN_ORDER, right?
> > > Yes, IN_ORDER will give some of the benefit of packed ring, and is also a
> > > win because you can merge 'used' writes.  (But packed-ring still wins due to
> > > not having to fetch ring and descriptor table separately).
> > > 
> > > > > Packed ring is a huge improvement on both dimensions, but introduces a
> > > > > specific pain point for hardware offload.
> > > > > 
> > > > > > The issue is
> > > > > > we will again need to bounce more cache lines to communicate.
> > > > > You'll only bounce cache lines if the device chooses to read the idx.  A PV
> > > > > device need not offer this feature.  A PCIe device will, but the cost to the
> > > > > driver of writing an in-memory idx is much smaller than the cost of an MMIO,
> > > > > which is always a strongly ordered barrier on x86/64.
> > > > > 
> > > > > With vDPA you ideally would have this feature enabled, and the device would
> > > > > sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
> > > > > field and so cache lines would not bounce then either.
> > > > > 
> > > > > Ie. The only time cache lines are shared is when sharing with a PCIe device,
> > > > > which is the scenario when this is a win.
> > > > True. OTOH on non-x86 you will need more write barriers :( It would be
> > > > natural to say driver can reuse the barrier before flags update, but note
> > > > that that would mean hardware still needs to support re-fetching if
> > > > flags value is wrong. Such re-fetch is probably rare so fine by me, but it
> > > > does add a bit more complexity.
> > > I would prefer to have the write barrier before writing the idx.
> > Well that's driver overhead for something device might never utilise in
> > a given workload. If we are optimizing let's optimize for speed.
> 
> I think doing the barrier before writing idx is best for speed (see below). 

I don't see it below :(

> But it is really desirable for complexity: Cases like this are easy to
> handle in software, but much much harder in pipelined hardware
> implementations.

Maybe we should use a flag in event suppression structure.
This way device can switch between being notification driven
and being driven by index reads.
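
For illustration only, the shape could be something like the following.
This is not spec text: the event suppression struct mirrors the existing
packed-ring layout, the placement of the extra field follows the proposal
in this thread, and the flag value is made up:

#include <stdint.h>

/* Packed-ring event suppression area (simplified). */
struct pvirtq_event_suppress {
    uint16_t desc;   /* event offset / wrap counter */
    uint16_t flags;  /* RING_EVENT_FLAGS_ENABLE / DISABLE / DESC */
};

/* Hypothetical driver area if the proposed feature were negotiated: the
 * driver-written offset/wrap field sits immediately after the event
 * suppression structure so a device can fetch both with one aligned read. */
struct driver_area {
    struct pvirtq_event_suppress evt;  /* driver writes, device reads */
    uint16_t avail_idx;                /* proposed offset/wrap of the next
                                          descriptor to be made available */
};

/* Hypothetical value for the *device* area flags: "I am polling avail_idx,
 * no doorbells needed".  The actual encoding, if any, would be up to the
 * spec. */
#define RING_EVENT_FLAGS_POLL_IDX 0x3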



> > > Note that drivers could take advantage of this feature to avoid read
> > > barriers when consuming descriptors: At the moment there is a virtio_rmb()
> > > per descriptor read.  With the proposed feature the driver can do just one
> > > barrier after reading idx.
> > Oh you want device to write a used index too? Hmm. Isn't this
> > contrary to your efforts to consume PCIe BW?
> 
> Sorry I mistyped: I meant that PV devices can take advantage to avoid read
> barriers.  I am not suggesting that devices write an index.
> 
> > > I expect that on platforms where write barriers
> > > have a cost read barriers likely have a significant cost too, so this might
> > > be a win with PV devices too.
> > I'm not sure.  It is pretty easy to replace an rmb with a dependency
> > which is generally quite cheap in my testing.
> > 
> > But since it's supposed to benefit PV, at this point we already have
> > implementations so rather than speculate (pun intended), people can
> > write patches and experiment.
> 
> From my point of view the primary benefit is for hardware implementations. 
> I hope that demonstrating improved PV performance would not be a hard gate.
> 
> > For the proposed avail idx I think Jason (CC'd) was interested in adding
> > something like this for vhost.
> > 
> > 
> > > > > > So I wonder: what if we made a change to spec that would allow prefetch
> > > > > > of ring entries?  E.g. you would be able to read at random and if you
> > > > > > guessed right then you can just use what you have read, no need to
> > > > > > re-fetch?
> > > > > Unless I've misunderstood I think this would imply that the driver would
> > > > > have to ensure strict ordering for each descriptor it wrote, which would
> > > > > impose a cost to the driver.  At the moment a write barrier is only needed
> > > > > before writing the flags of the first descriptor in a batch.
> > > > On non-x86 right?  OTOH the extra data structure also adds more ordering
> > > > requirements.
> > > Yes on non-x86.  The extra data structure only adds an ordering once per
> > > request (when enabled) whereas allowing prefetch would add an ordering per
> > > descriptor.  The number of descriptors is always >= the number of requests,
> > > and can be much larger (esp. for a net rx virt-queue).
> > Question is, is MMIO also slow on these non-x86 platforms?
> > 
> > Prefetch if successful just drastically lowers latency. You can't beat
> > not needing a read at all.
> 
> Are you talking about net rx?  Yes, devices can prefetch descriptors from rx
> virt-queues ahead of packets arriving.  That does not require any changes to
> the spec.
> 
> I was just making the point that allowing random reads of packed-ring
> descriptors adds ordering costs when writing descriptors.  I can't see what
> the benefit would be -- have I missed something?
> 
> > > > > > > I suggest the best place to put this would be in the driver area,
> > > > > > > immediately after the event suppression structure.
> > > > > > Could you comment on why is that a good place though?
> > > > > The new field is written by the driver, as are the other fields in the
> > > > > driver area.  Also I expect that devices might be able to read this new idx
> > > > > together with the interrupt suppression fields in a single read, reducing
> > > > > PCIe overheads.
> > > > OK then you need it in the same aligned 256 byte boundary.
> > > The event suppression structs currently require 4byte align.  Are you
> > > suggesting we increase the align required when
> > > VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated?  Sounds good to me, but 8bytes
> > > would presumably suffice.
> > OK.
> > 
> > I am also wondering: what would the analog of this feature be for split
> > rings? We are burning a feature bit, might as well find a good
> > use for it.
> 
> I don't have any suggestions.  I worry that associating the same bit with a
> split-ring feature would create an undesirable coupling: A device offering X
> for packed-ring would also necessarily have to implement Y for split-ring
> and vice versa (if both ring types are supported).

BTW we really need to look more at how devices can support subsets
of features. Right now devices can fail FEATURES_OK, but drivers do
not recover. Some dependencies are defined by the spec and drivers
can force them, but devices can't.



> 
> -- 
> David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-12 14:01             ` Michael S. Tsirkin
@ 2019-02-12 16:47               ` David Riddoch
  2019-02-12 17:35                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 37+ messages in thread
From: David Riddoch @ 2019-02-12 16:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Virtio-Dev, Jason Wang


>>>> I would prefer to have the write barrier before writing the idx.
>>> Well that's driver overhead for something device might never utilise in
>>> a given workload. If we are optimizing let's optimize for speed.
>> I think doing the barrier before writing idx is best for speed (see below).
> I don't see it below :(

Sorry, I'm not being clear.  If the write barrier is before the idx, 
then a PV device can read the idx, do a single rmb and read a whole 
bunch of descriptors.  As things stand today a PV device has to do an 
rmb for each descriptor that it reads.

I'm not sure, but this may be what Jason meant when he said "prefetch".
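
As a sketch of what I'm describing (simplified: the index is shown as a
free-running counter rather than an offset/wrap pair, and the macros stand
in for virtio_wmb/virtio_rmb):

#include <stdint.h>

struct pdesc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };

#define wmb() __atomic_thread_fence(__ATOMIC_RELEASE)
#define rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)

/* Driver: write the descriptors, one wmb, then publish the index. */
static void driver_publish(struct pdesc *ring, uint16_t size,
                           volatile uint16_t *avail_idx,
                           uint16_t first, uint16_t n,
                           const struct pdesc *batch)
{
    for (uint16_t i = 0; i < n; i++)
        ring[(uint16_t)(first + i) % size] = batch[i];
    wmb();                               /* descriptors before the index */
    *avail_idx = (uint16_t)(first + n);
}

/* PV device: one index read, one rmb, then the whole batch of
 * descriptors -- instead of an rmb per descriptor. */
static uint16_t device_consume(const struct pdesc *ring, uint16_t size,
                               const volatile uint16_t *avail_idx,
                               uint16_t last, struct pdesc *out)
{
    uint16_t end = *avail_idx;
    rmb();                               /* one barrier per batch */
    uint16_t n = (uint16_t)(end - last);
    for (uint16_t i = 0; i < n; i++)
        out[i] = ring[(uint16_t)(last + i) % size];
    return n;
}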

>> But it is really desirable for complexity: Cases like this are easy to
>> handle in software, but much much harder in pipelined hardware
>> implementations.
> Maybe we should use a flag in event suppression structure.
> This way device can switch between being notification driven
> and being driven by index reads.

On first thought it seems hard to avoid races: On receiving a doorbell 
the device wishes to transition from doorbells to polling, so 
driver->avail_idx currently has an old value; device writes 
device->flags; a little later device reads driver->avail_idx.  It might 
have a new value written by the driver, or an old value...the device 
can't tell.  Needs more thought.

>>> I am also wondering: what would the analog of this feature be for split
>>> rings? We are burning a feature bit, might as well find a good
>>> use for it.
>> I don't have any suggestions.  I worry that associating the same bit with a
>> split-ring feature would create an undesirable coupling: A device offering X
>> for packed-ring would also necessarily have to implement Y for split-ring
>> and vice versa (if both ring types are supported).
> BTW we really need to look more at how devices can support subsets
> of features. Right now devices can fail FEATURES_OK, but drivers do
> not recover. Some dependencies are defined by the spec and drivers
> can force them, but devices can't.

Completely agree.  In order to limit complexity in hardware 
implementations it would be nice to be able to express "I support X, and 
X+Y, but not Y by itself" and similar.  There is no way to do that today.


-- 
David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-12 16:47               ` David Riddoch
@ 2019-02-12 17:35                 ` Michael S. Tsirkin
  2019-02-13  9:49                   ` David Riddoch
  2019-02-13 10:33                   ` Jason Wang
  0 siblings, 2 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-12 17:35 UTC (permalink / raw)
  To: David Riddoch; +Cc: Virtio-Dev, Jason Wang

On Tue, Feb 12, 2019 at 04:47:08PM +0000, David Riddoch wrote:
> 
> > > > > I would prefer to have the write barrier before writing the idx.
> > > > Well that's driver overhead for something device might never utilise in
> > > > a given workload. If we are optimizing let's optimize for speed.
> > > I think doing the barrier before writing idx is best for speed (see below).
> > I don't see it below :(
> 
> Sorry, I'm not being clear.  If the write barrier is before the idx, then a
> PV device can read the idx, do a single rmb and read a whole bunch of
> descriptors.  As things stand today a PV device has to do an rmb for each
> descriptor that it reads.
> 
> I'm not sure, but this may be what Jason meant when he said "prefetch".

Oh. Right. So for PV it's easy to convert an rmb to a data dependency
instead.  It remains to be seen whether an extra cache miss per
batch is cheaper or more expensive than such a dependency
per descriptor.

I hope Jason can experiment and let us know.


> > > But it is really desirable for complexity: Cases like this are easy to
> > > handle in software, but much much harder in pipelined hardware
> > > implementations.
> > Maybe we should use a flag in event suppression structure.
> > This way device can switch between being notification driven
> > and being driven by index reads.
> 
> On first thought it seems hard to avoid races: On receiving a doorbell the
> device wishes to transition from doorbells to polling, so driver->avail_idx
> currently has an old value; device writes device->flags; a little later
> device reads driver->avail_idx.  It might have a new value written by the
> driver, or an old value...the device can't tell.  Needs more thought.


I would say device writes flags and then reads the descriptor
ring. This is exactly the same sequence that needs to happen
when enabling driver notifications.
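
In sketch form (the flag name and value are hypothetical, mb() stands in
for whatever ordering the implementation needs, and the rescan is left as
a callback):

#include <stdint.h>

#define RING_EVENT_FLAGS_POLL_IDX 0x3   /* hypothetical flag value */

struct dev_view {
    volatile uint16_t *device_flags;    /* device event suppression flags */
};

static inline void mb(void) { __atomic_thread_fence(__ATOMIC_SEQ_CST); }

/* Switch from doorbell-driven to polling.  Same pattern as re-enabling
 * driver notifications: publish the new mode first, then go back and
 * re-read the descriptor ring, so a descriptor made available just before
 * the driver observed the flag cannot be missed. */
static void switch_to_polling(struct dev_view *d,
                              void (*rescan_descriptor_ring)(void))
{
    *d->device_flags = RING_EVENT_FLAGS_POLL_IDX;  /* 1. publish the mode  */
    mb();                                          /* 2. flag before reads */
    rescan_descriptor_ring();                      /* 3. catch anything added
                                                         before the driver saw
                                                         the flag change   */
}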


> > > > I am also wondering: what would the analog of this feature be for split
> > > > rings? We are burning a feature bit, might as well find a good
> > > > use for it.
> > > I don't have any suggestions.  I worry that associating the same bit with a
> > > split-ring feature would create an undesirable coupling: A device offering X
> > > for packed-ring would also necessarily have to implement Y for split-ring
> > > and vice versa (if both ring types are supported).
> > BTW we really need to look more at how devices can support subsets
> > of features. Right now devices can fail FEATURES_OK, but drivers do
> > not recover. Some dependencies are defined by the spec and drivers
> > can force them, but devices can't.
> 
> Completely agree.  In order to limit complexity in hardware implementations
> it would be nice to be able to express "I support X, and X+Y, but not Y by
> itself" and similar.  There is no way to do that today.
> 
> 
> -- 
> David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-12 17:35                 ` Michael S. Tsirkin
@ 2019-02-13  9:49                   ` David Riddoch
  2019-02-13 10:33                   ` Jason Wang
  1 sibling, 0 replies; 37+ messages in thread
From: David Riddoch @ 2019-02-13  9:49 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Virtio-Dev, Jason Wang


>>> Maybe we should use a flag in event suppression structure.
>>> This way device can switch between being notification driven
>>> and being driven by index reads.
>> On first thought it seems hard to avoid races: On receiving a doorbell the
>> device wishes to transition from doorbells to polling, so driver->avail_idx
>> currently has an old value; device writes device->flags; a little later
>> device reads driver->avail_idx.  It might have a new value written by the
>> driver, or an old value...the device can't tell.  Needs more thought.
> I would say device writes flags and then reads the descriptor
> ring. This is exactly the same sequence that needs to happen
> when enabling driver notifications.

Yes fair enough.  I was trying to allow a device that never has to read 
descriptors speculatively, but such a device can always have the 
avail_idx update flag enabled permanently.


-- 
David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-12 13:44             ` Michael S. Tsirkin
@ 2019-02-13 10:00               ` Jason Wang
  2019-02-13 15:20                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2019-02-13 10:00 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Riddoch, Virtio-Dev


On 2019/2/12 9:44 PM, Michael S. Tsirkin wrote:
> On Tue, Feb 12, 2019 at 03:28:26PM +0800, Jason Wang wrote:
>> On 2019/2/12 1:08 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:
>>>>>>>> This can result in a very high rate of doorbells with some
>>>>>>>> drivers, which can become a severe bottleneck (because x86 CPUs can't emit
>>>>>>>> MMIOs at very high rates).
>>>>>>> Interesting. Is there any data you could share to help guide the design?
>>>>>>> E.g. what's the highest rate of MMIO writes supported etc?
>>>>>> On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
>>>>>>
>>>>>> On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
>>>>>> PCIe-remote socket ~8M/s.
>>>>> Is that mega bytes? Or million writes?
>>>>> With 32 bit writes are we limited to 2M then?
>>>> Sorry, this is million-writes-per-second.  The size of the write doesn't
>>>> matter.
>>> So it's not too bad, you only need one per batch after all.
>>>
>>>
>>> Also I thought some more about this. In fact on x86/intel specifically
>>> PCI descriptor reads are atomic with cache line granularity
>>> and writes are ordered. So in fact you can safely prefetch
>>> descriptors and if you see them valid, you can go ahead
>>> and use them.
>>
>> If I understand this correctly, unless the driver sets avail in order for
>> several descriptors, the device could not assume that if N is avail then
>> [0, N) is also available.
> Well you don't need that assumption. You check it and discard
> prefetched data if not.


OK, this is probably fine for a hardware implementation. But for a software 
implementation, it may result in a lot of re-reads.


>
>
>> But prefetching indeed improves the performance when I play with the vhost
>> kernel implementation. It reduces the overhead of memory accesses, which is
>> similar to PCI.
> Prefetch as in prefetchnta?


Sorry, "prefetch" is probably not accurate. "Speculation" is probably 
better.

What I did is just read more than one descriptor e.g 16. And check 
whether the last is available, if yes, assume all the rest are available.


>
>
>>> This makes the descriptor index merely a hint for performance,
>>> which device can play with at will.
>>>
>>>
>>> Other platforms are not like this so you need the data
>>> but do they have the same problem?
>>>
>>>
>>>
>>>
>>>
>>>>>> This doesn't just impose a limit on aggregate packet rate: If you hit this
>>>>>> bottleneck then the CPU core is back-pressured by the MMIO writes, and so
>>>>>> instr/cycle takes a huge hit.
>>>>>>
>>>>>>>> The proposed offset/wrap field allows devices to disable doorbells when
>>>>>>>> appropriate, and determine the latest fill level via a PCIe read.
>>>>>>> This kind of looks like a split ring though, does it not?
>>>>>> I don't agree, because the ring isn't split.  Split-ring is very painful for
>>>>>> hardware because of the potentially random accesses to the descriptor table
>>>>>> (including pointer chasing) which inhibit batching of descriptor fetches, as
>>>>>> well as causing complexity in implementation.
>>>>> That's different with IN_ORDER, right?
>>>> Yes, IN_ORDER will give some of the benefit of packed ring, and is also a
>>>> win because you can merge 'used' writes.  (But packed-ring still wins due to
>>>> not having to fetch ring and descriptor table separately).
>>
>> Note, with IN_ORDER, there's no need to read available ring at all even for
>> split ring. And in some cases, no need to update used ring at all.
> Yes, you said so at some point, but it seems that the spec
> is written ambiguously and others do not read it like this.
> Can you clarify?


Yes, the spec was unclear about how the available ring is dealt with in 
this case. My understanding is: since the device uses the descriptor table 
in order, if last_avail_idx is N and avail_idx is M, the device can assume 
[N, M) is available. In this case, avail_idx indexes the descriptor table 
rather than the avail ring, so there's no need for either side to read or 
write avail ring entries.

Does this make sense? I saw a 10% PPS improvement this way when playing 
with this for vhost.
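
Roughly, in device pseudocode it is (a sketch: no chained or indirect
descriptors, barriers omitted):

#include <stdint.h>

/* Split-ring descriptor (simplified). */
struct sdesc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };

/* With IN_ORDER, descriptors are consumed in ring order, so the range
 * [last_avail_idx, avail_idx) can be taken to index the descriptor table
 * directly and the available ring is never read. */
static uint16_t consume_in_order(const struct sdesc *desc_table, uint16_t size,
                                 uint16_t last_avail_idx, uint16_t avail_idx,
                                 void (*handle)(const struct sdesc *))
{
    uint16_t n = (uint16_t)(avail_idx - last_avail_idx);
    for (uint16_t i = 0; i < n; i++)
        handle(&desc_table[(uint16_t)(last_avail_idx + i) % size]);
    return avail_idx;    /* new last_avail_idx */
}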


>
>
>>>>>> Packed ring is a huge improvement on both dimensions, but introduces a
>>>>>> specific pain point for hardware offload.
>>>>>>
>>>>>>> The issue is
>>>>>>> we will again need to bounce more cache lines to communicate.
>>>>>> You'll only bounce cache lines if the device chooses to read the idx.  A PV
>>>>>> device need not offer this feature.  A PCIe device will, but the cost to the
>>>>>> driver of writing an in-memory idx is much smaller than the cost of an MMIO,
>>>>>> which is always a strongly ordered barrier on x86/64.
>>>>>>
>>>>>> With vDPA you ideally would have this feature enabled, and the device would
>>>>>> sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
>>>>>> field and so cache lines would not bounce then either.
>>>>>>
>>>>>> Ie. The only time cache lines are shared is when sharing with a PCIe device,
>>>>>> which is the scenario when this is a win.
>>>>> True. OTOH on non-x86 you will need more write barriers :( It would be
>>>>> natural to say driver can reuse the barrier before flags update, but note
>>>>> that that would mean hardware still needs to support re-fetching if
>>>>> flags value is wrong. Such re-fetch is probably rare so fine by me, but it
>>>>> does add a bit more complexity.
>>>> I would prefer to have the write barrier before writing the idx.
>>> Well that's driver overhead for something device might never utilise in
>>> a given workload. If we are optimizing let's optimize for speed.
>>>
>>>> Note that drivers could take advantage of this feature to avoid read
>>>> barriers when consuming descriptors: At the moment there is a virtio_rmb()
>>>> per descriptor read.  With the proposed feature the driver can do just one
>>>> barrier after reading idx.
>>> Oh you want device to write a used index too? Hmm. Isn't this
>>> contrary to your efforts to consume PCIe BW?
>>>
>>>> I expect that on platforms where write barriers
>>>> have a cost read barriers likely have a significant cost too, so this might
>>>> be a win with PV devices too.
>>> I'm not sure.  It is pretty easy to replace an rmb with a dependency
>>> which is generally quite cheap in my testing.
>>>
>>> But since it's supposed to benefit PV, at this point we already have
>>> implementations so rather than speculate (pun intended), people can
>>> write patches and experiment.
>>>
>>> For the proposed avail idx I think Jason (CC'd) was interested in adding
>>> something like this for vhost.
>>
>> Yes, it's kind of IN_ORDER + split ring I believe.
>>
>> Thanks
> Well IN_ORDER is only good for e.g. network. Storage wants out of order
> so it could benefit from a new feature.  I am however worried about the
> extra wmb implied.  Still thinking about ways to avoid adding that.
>

Ok.

Thanks


>
>
>>>
>>>>>>> So I wonder: what if we made a change to spec that would allow prefetch
>>>>>>> of ring entries?  E.g. you would be able to read at random and if you
>>>>>>> guessed right then you can just use what you have read, no need to
>>>>>>> re-fetch?
>>>>>> Unless I've misunderstood I think this would imply that the driver would
>>>>>> have to ensure strict ordering for each descriptor it wrote, which would
>>>>>> impose a cost to the driver.  At the moment a write barrier is only needed
>>>>>> before writing the flags of the first descriptor in a batch.
>>>>> On non-x86 right?  OTOH the extra data structure also adds more ordering
>>>>> requirements.
>>>> Yes on non-x86.  The extra data structure only adds an ordering once per
>>>> request (when enabled) whereas allowing prefetch would add an ordering per
>>>> descriptor.  The number of descriptors is always >= the number of requests,
>>>> and can be much larger (esp. for a net rx virt-queue).
>>> Question is, is MMIO also slow on these non-x86 platforms?
>>>
>>>
>>> Prefetch if successful just drastically lowers latency. You can't beat
>>> not needing a read at all.
>>>
>>>>>>>> I suggest the best place to put this would be in the driver area,
>>>>>>>> immediately after the event suppression structure.
>>>>>>> Could you comment on why is that a good place though?
>>>>>> The new field is written by the driver, as are the other fields in the
>>>>>> driver area.  Also I expect that devices might be able to read this new idx
>>>>>> together with the interrupt suppression fields in a single read, reducing
>>>>>> PCIe overheads.
>>>>> OK then you need it in the same aligned 256 byte boundary.
>>>> The event suppression structs currently require 4byte align.  Are you
>>>> suggesting we increase the align required when
>>>> VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated?  Sounds good to me, but 8bytes
>>>> would presumably suffice.
>>> OK.
>>>
>>> I am also wondering: what would the analog of this feature be for split
>>> rings? We are burning a feature bit, might as well find a good
>>> use for it.
>>>
>>>
>>>
>>>
>>>
>>>> Thanks for the feedback.
>>>>
>>>> David
>>>>
>>>> -- 
>>>> David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>>>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-12 17:35                 ` Michael S. Tsirkin
  2019-02-13  9:49                   ` David Riddoch
@ 2019-02-13 10:33                   ` Jason Wang
  2019-02-13 17:30                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 37+ messages in thread
From: Jason Wang @ 2019-02-13 10:33 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Riddoch; +Cc: Virtio-Dev


On 2019/2/13 1:35 AM, Michael S. Tsirkin wrote:
> On Tue, Feb 12, 2019 at 04:47:08PM +0000, David Riddoch wrote:
>>>>>> I would prefer to have the write barrier before writing the idx.
>>>>> Well that's driver overhead for something device might never utilise in
>>>>> a given workload. If we are optimizing let's optimize for speed.
>>>> I think doing the barrier before writing idx is best for speed (see below).
>>> I don't see it below :(
>> Sorry, I'm not being clear.  If the write barrier is before the idx, then a
>> PV device can read the idx, do a single rmb and read a whole bunch of
>> descriptors.  As things stand today a PV device has to do an rmb for each
>> descriptor that it reads.
>>
>> I'm not sure, but this may be what Jason meant when he said "prefetch".


Yes.


> Oh. Right. So for PV it's easy to convert an rmb to a data dependency
> instead.  It remains to be seen whether an extra cache miss per
> batch is cheaper or more expensive than such a dependency
> per descriptor.
>
> I hope Jason can experiment and let us know.
>
>

To clarify, I actually did play with something like:

if (desc_is_avail((last_avail + 32) % size))
     [last_avail, last_avail + 32] is avail [1]
else if (desc_is_avail((last_avail + 1) % size))
     last_avail is avail
else
     no new descriptors

And it looks like [1] depends on the driver doing:

for all descriptors
     update desc (but do not make it available via the flags yet)
wmb()
for all descriptors
     update desc.flags

This means that if last_avail + 32 is avail, there is no need to check the 
flags of [last_avail, last_avail + 31], since the descriptor contents 
(except for the flags) are guaranteed to be correct.

I get ~10% PPS improvement.
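
In rough C the check is something like this (a sketch only: wrap-counter
handling at the ring boundary is ignored, the availability test uses the
packed avail/used flag bits, and the fallback is shown as a plain
per-descriptor scan):

#include <stdbool.h>
#include <stdint.h>

#define BATCH 32
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

struct pdesc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };

static bool desc_is_avail(const struct pdesc *d, bool wrap)
{
    bool avail = d->flags & DESC_F_AVAIL;
    bool used  = d->flags & DESC_F_USED;
    return avail != used && avail == wrap;
}

/* If the descriptor one whole batch ahead is already available, the
 * driver's "contents, wmb(), then flags" ordering guarantees the whole
 * window can be used without testing each flag; otherwise fall back to
 * checking one descriptor at a time.  (A single read barrier would follow
 * on weakly ordered CPUs before the contents are used.) */
static uint16_t poll_avail(const struct pdesc *ring, uint16_t size,
                           uint16_t last_avail, bool wrap)
{
    if (desc_is_avail(&ring[(uint16_t)(last_avail + BATCH) % size], wrap))
        return BATCH + 1;              /* [last_avail, last_avail + BATCH] */
    uint16_t n = 0;
    while (n < BATCH &&
           desc_is_avail(&ring[(uint16_t)(last_avail + n) % size], wrap))
        n++;
    return n;
}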

Thanks


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-13 10:00               ` Jason Wang
@ 2019-02-13 15:20                 ` Michael S. Tsirkin
  2019-02-14  3:21                   ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-13 15:20 UTC (permalink / raw)
  To: Jason Wang; +Cc: David Riddoch, Virtio-Dev

On Wed, Feb 13, 2019 at 06:00:41PM +0800, Jason Wang wrote:
> 
> On 2019/2/12 9:44 PM, Michael S. Tsirkin wrote:
> > On Tue, Feb 12, 2019 at 03:28:26PM +0800, Jason Wang wrote:
> > On 2019/2/12 1:08 PM, Michael S. Tsirkin wrote:
> > > > On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:
> > > > > > > > > This can result in a very high rate of doorbells with some
> > > > > > > > > drivers, which can become a severe bottleneck (because x86 CPUs can't emit
> > > > > > > > > MMIOs at very high rates).
> > > > > > > > Interesting. Is there any data you could share to help guide the design?
> > > > > > > > E.g. what's the highest rate of MMIO writes supported etc?
> > > > > > > On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
> > > > > > > 
> > > > > > > On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
> > > > > > > PCIe-remote socket ~8M/s.
> > > > > > Is that mega bytes? Or million writes?
> > > > > > With 32 bit writes are we limited to 2M then?
> > > > > Sorry, this is million-writes-per-second.  The size of the write doesn't
> > > > > matter.
> > > > So it's not too bad, you only need one per batch after all.
> > > > 
> > > > 
> > > > Also I thought some more about this. In fact on x86/intel specifically
> > > > PCI descriptor reads are atomic with cache line granularity
> > > > and writes are ordered. So in fact you can safely prefetch
> > > > descriptors and if you see them valid, you can go ahead
> > > > and use them.
> > > 
> > > If I understand this correctly, unless the driver sets avail in order for
> > > several descriptors, the device could not assume that if N is avail then
> > > [0, N) is also available.
> > Well you don't need that assumption. You check it and discard
> > prefetched data if not.
> 
> 
> OK, this is probably fine for a hardware implementation. But for a software
> implementation, it may result in a lot of re-reads.

That will be on a slow path where you got unlucky.


> 
> > 
> > 
> > > But prefetching indeed improves the performance when I play with the vhost
> > > kernel implementation. It reduces the overhead of memory accesses, which is
> > > similar to PCI.
> > Prefetch as in prefetchnta?
> 
> 
> Sorry, "prefetch" is probably not accurate. "Speculation" is probably
> better.
> 
> What I did is just read more than one descriptor e.g 16. And check whether
> the last is available, if yes, assume all the rest are available.

So how problematic is it to check them all?  In fact you can optimize
out the rmb (or dependency barrier), just issue one barrier after
checking all.
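
Something like this is what I mean (a sketch; simplified descriptor and
availability test, rmb() standing in for virtio_rmb or a dependency):

#include <stdbool.h>
#include <stdint.h>

#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

struct pdesc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };

#define rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)

static bool desc_is_avail(const struct pdesc *d, bool wrap)
{
    bool avail = d->flags & DESC_F_AVAIL;
    bool used  = d->flags & DESC_F_USED;
    return avail != used && avail == wrap;
}

/* Check the flags of up to n descriptors first, then issue a single
 * barrier, then copy out however many were available -- one barrier per
 * batch instead of one per descriptor.  (Wrap-counter flip at the ring
 * boundary is ignored for brevity.) */
static uint16_t check_all_then_read(const struct pdesc *ring, uint16_t size,
                                    uint16_t last, bool wrap, uint16_t n,
                                    struct pdesc *out)
{
    uint16_t avail = 0;
    while (avail < n &&
           desc_is_avail(&ring[(uint16_t)(last + avail) % size], wrap))
        avail++;
    rmb();                            /* one barrier for the whole batch */
    for (uint16_t i = 0; i < avail; i++)
        out[i] = ring[(uint16_t)(last + i) % size];
    return avail;
}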


> 
> > 
> > 
> > > > This makes the descriptor index merely a hint for performance,
> > > > which device can play with at will.
> > > > 
> > > > 
> > > > Other platforms are not like this so you need the data
> > > > but do they have the same problem?
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > > > > This doesn't just impose a limit on aggregate packet rate: If you hit this
> > > > > > > bottleneck then the CPU core is back-pressured by the MMIO writes, and so
> > > > > > > instr/cycle takes a huge hit.
> > > > > > > 
> > > > > > > > > The proposed offset/wrap field allows devices to disable doorbells when
> > > > > > > > > appropriate, and determine the latest fill level via a PCIe read.
> > > > > > > > This kind of looks like a split ring though, does it not?
> > > > > > > I don't agree, because the ring isn't split.  Split-ring is very painful for
> > > > > > > hardware because of the potentially random accesses to the descriptor table
> > > > > > > (including pointer chasing) which inhibit batching of descriptor fetches, as
> > > > > > > well as causing complexity in implementation.
> > > > > > That's different with IN_ORDER, right?
> > > > > Yes, IN_ORDER will give some of the benefit of packed ring, and is also a
> > > > > win because you can merge 'used' writes.  (But packed-ring still wins due to
> > > > > not having to fetch ring and descriptor table separately).
> > > 
> > > Note, with IN_ORDER, there's no need to read available ring at all even for
> > > split ring. And in some cases, no need to update used ring at all.
> > Yes, you said so at some point, but it seems that the spec
> > is written ambiguously and others do not read it like this.
> > Can you clarify?
> 
> 
> Yes, the spec was unclear about how the available ring is dealt with in this
> case. My understanding is: since the device uses the descriptor table in
> order, if last_avail_idx is N and avail_idx is M, the device can assume
> [N, M) is available. In this case, avail_idx indexes the descriptor table
> rather than the avail ring, so there's no need for either side to read or
> write avail ring entries.
> 
> Does this make sense? I saw a 10% PPS improvement this way when playing with
> this for vhost.

Problem is, if you add a chain of descriptors, index is incremented by 1
while the descriptor buffer advances by X. So how can you find out
without reading avail ring?

Maybe you can start with fetching based on index (so 1 above), then
fetch an extra descriptor each time you see a NEXT bit.
And given PV tends to use INDIRECT aggressively, that won't happen
too often.
Is that the idea?
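
In other words, a fetch loop along these lines (a sketch: fetch_desc()
stands in for a DMA read of one descriptor-table entry, and chains are
assumed to occupy consecutive slots as an in-order driver would lay them
out):

#include <stdint.h>

#define VIRTQ_DESC_F_NEXT 1

struct sdesc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };

/* Set up elsewhere; purely illustrative. */
static const struct sdesc *desc_table;
static uint16_t ring_size;

static struct sdesc fetch_desc(uint16_t idx)
{
    return desc_table[idx % ring_size];   /* stands in for a DMA read */
}

/* The index tells us how many *buffers* are available; each buffer costs
 * one fetch, plus one extra fetch for every NEXT bit seen.  With an
 * in-order driver a chain occupies consecutive table slots, so the next
 * buffer starts right after the last descriptor of the previous chain. */
static uint16_t fetch_buffers(uint16_t slot, uint16_t nbuffers,
                              void (*handle)(const struct sdesc *))
{
    for (uint16_t b = 0; b < nbuffers; b++) {
        struct sdesc d;
        do {
            d = fetch_desc(slot);
            slot = (uint16_t)(slot + 1);
            handle(&d);
        } while (d.flags & VIRTQ_DESC_F_NEXT);
    }
    return slot;    /* first unfetched descriptor-table slot */
}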


> 
> > 
> > 
> > > > > > > Packed ring is a huge improvement on both dimensions, but introduces a
> > > > > > > specific pain point for hardware offload.
> > > > > > > 
> > > > > > > > The issue is
> > > > > > > > we will again need to bounce more cache lines to communicate.
> > > > > > > You'll only bounce cache lines if the device chooses to read the idx.  A PV
> > > > > > > device need not offer this feature.  A PCIe device will, but the cost to the
> > > > > > > driver of writing an in-memory idx is much smaller than the cost of an MMIO,
> > > > > > > which is always a strongly ordered barrier on x86/64.
> > > > > > > 
> > > > > > > With vDPA you ideally would have this feature enabled, and the device would
> > > > > > > sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
> > > > > > > field and so cache lines would not bounce then either.
> > > > > > > 
> > > > > > > Ie. The only time cache lines are shared is when sharing with a PCIe device,
> > > > > > > which is the scenario when this is a win.
> > > > > > True. OTOH on non-x86 you will need more write barriers :( It would be
> > > > > > natural to say driver can reuse the barrier before flags update, but note
> > > > > > that that would mean hardware still needs to support re-fetching if
> > > > > > flags value is wrong. Such re-fetch is probably rare so fine by me, but it
> > > > > > does add a bit more complexity.
> > > > > I would prefer to have the write barrier before writing the idx.
> > > > Well that's driver overhead for something device might never utilise in
> > > > a given workload. If we are optimizing let's optimize for speed.
> > > > 
> > > > > Note that drivers could take advantage of this feature to avoid read
> > > > > barriers when consuming descriptors: At the moment there is a virtio_rmb()
> > > > > per descriptor read.  With the proposed feature the driver can do just one
> > > > > barrier after reading idx.
> > > > Oh you want device to write a used index too? Hmm. Isn't this
> > > > contrary to your efforts to consume PCIe BW?
> > > > 
> > > > > I expect that on platforms where write barriers
> > > > > have a cost read barriers likely have a significant cost too, so this might
> > > > > be a win with PV devices too.
> > > > I'm not sure.  It is pretty easy to replace an rmb with a dependency
> > > > which is generally quite cheap in my testing.
> > > > 
> > > > But since it's supposed to benefit PV, at this point we already have
> > > > implementations so rather than speculate (pun intended), people can
> > > > write patches and experiment.
> > > > 
> > > > For the proposed avail idx I think Jason (CC'd) was interested in adding
> > > > something like this for vhost.
> > > 
> > > Yes, it's kind of IN_ORDER + split ring I believe.
> > > 
> > > Thanks
> > Well IN_ORDER is only good for e.g. network. Storage wants out of order
> > so it could benefit from a new feature.  I am however worried about the
> > extra wmb implied.  Still thinking about ways to avoid adding that.
> > 
> 
> Ok.
> 
> Thanks
> 
> 
> > 
> > 
> > > > 
> > > > > > > > So I wonder: what if we made a change to spec that would allow prefetch
> > > > > > > > of ring entries?  E.g. you would be able to read at random and if you
> > > > > > > > guessed right then you can just use what you have read, no need to
> > > > > > > > re-fetch?
> > > > > > > Unless I've misunderstood I think this would imply that the driver would
> > > > > > > have to ensure strict ordering for each descriptor it wrote, which would
> > > > > > > impose a cost to the driver.  At the moment a write barrier is only needed
> > > > > > > before writing the flags of the first descriptor in a batch.
> > > > > > On non-x86 right?  OTOH the extra data structure also adds more ordering
> > > > > > requirements.
> > > > > Yes on non-x86.  The extra data structure only adds an ordering once per
> > > > > request (when enabled) whereas allowing prefetch would add an ordering per
> > > > > descriptor.  The number of descriptors is always >= the number of requests,
> > > > > and can be much larger (esp. for a net rx virt-queue).
> > > > Question is, is MMIO also slow on these non-x86 platforms?
> > > > 
> > > > 
> > > > Prefetch if successful just drastically lowers latency. You can't beat
> > > > not needing a read at all.
> > > > 
> > > > > > > > > I suggest the best place to put this would be in the driver area,
> > > > > > > > > immediately after the event suppression structure.
> > > > > > > > Could you comment on why is that a good place though?
> > > > > > > The new field is written by the driver, as are the other fields in the
> > > > > > > driver area.  Also I expect that devices might be able to read this new idx
> > > > > > > together with the interrupt suppression fields in a single read, reducing
> > > > > > > PCIe overheads.
> > > > > > OK then you need it in the same aligned 256 byte boundary.
> > > > > The event suppression structs currently require 4byte align.  Are you
> > > > > suggesting we increase the align required when
> > > > > VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated?  Sounds good to me, but 8bytes
> > > > > would presumably suffice.
> > > > OK.
> > > > 
> > > > I am also wondering: what would the analog of this feature be for split
> > > > rings? We are burning a feature bit, might as well find a good
> > > > use for it.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > > Thanks for the feedback.
> > > > > 
> > > > > David
> > > > > 
> > > > > -- 
> > > > > David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
> > > > > 
> > > > > 

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-13 10:33                   ` Jason Wang
@ 2019-02-13 17:30                     ` Michael S. Tsirkin
  2019-02-14  3:34                       ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-13 17:30 UTC (permalink / raw)
  To: Jason Wang; +Cc: David Riddoch, Virtio-Dev

On Wed, Feb 13, 2019 at 06:33:50PM +0800, Jason Wang wrote:
> 
> On 2019/2/13 上午1:35, Michael S. Tsirkin wrote:
> > On Tue, Feb 12, 2019 at 04:47:08PM +0000, David Riddoch wrote:
> > > > > > > I would prefer to have the write barrier before writing the idx.
> > > > > > Well that's driver overhead for something device might never utilise in
> > > > > > a given workload. If we are optimizing let's optimize for speed.
> > > > > I think doing the barrier before writing idx is best for speed (see below).
> > > > I don't see it below:(
> > > Sorry, I'm not being clear.  If the write barrier is before the idx, then a
> > > PV device can read the idx, do a single rmb and read a whole bunch of
> > > descriptors.  As things stand today a PV device has to do an rmb for each
> > > descriptor that it reads.
> > > 
> > > I'm not sure, but this may be what Jason meant when he said "prefetch".
> 
> 
> Yes.
> 
> 
> > Oh. Right. So for PV it's easy to convert an rmb to a data dependency
> > instead.  It remains to be seen whether an extra cache miss per
> > batch is cheaper or more expensive than such a dependency
> > per descriptor.
> > 
> > I hope Jason can experiment and let us know.
> > 
> > 
> 
> To clarify, actually I did play someting like:
> 
> 
> if (desc_is_avail((last_avail + 32) % size))
> 
>     [last_avail, last_avail + 32] is avail [1]
> 
> else if (desc_is_avail(last_avail + 1) % size))
> 
>     last_avail is avail
> 
> else
> 
>    no new
> 
> 
> And looks like [1] depends on the driver to do:
> 
> for all descriptors
> 
>     update desc (but not make it available for the flag)
> 
> wmb()
> 
> for all descriptors
> 
>     update desc.flag
> 
> 
> This means if last_avail + 32 is avail, no need to check flags of
> [last_avail, last_avail + 31] since the descriptor content (except for the
> flag) is guaranteed to be correct.

We can try looking into that, sure. It's not just the wmb though, which is
anyway free on x86. There are cases of e.g. chained descriptors
to consider, where we do not want the device to see a partial chain.


> I get ~10% PPS improvement.
> 
> Thanks

I suspect most of the gain isn't in the check at all though.
It's probably reading descriptors and maybe loop unrolling.
Something like:

__attribute__((optimize("unroll-loops")))
bool fast_get_descriptors(vq, desc)
{

	/* copy X descriptors from the ring into desc[] */
	u16 aflags = 0xffff;
	u16 oflags = 0x0;
	int i;
	for (i = 0; i < X; ++i) {
		aflags &= desc[i].flags;
		oflags |= desc[i].flags;
	}
	/* all flags must be same */
	if (aflags != oflags)
		return false;

	return is_avail(vq->wrapcount, aflags);
}


I also think 32 is very aggressive; I would start with 8.


With above, you could try:

	if (fast_get_descriptors(vq, desc))
	     [last_avail, last_avail + 32] is avail [1]
 
	else if (desc_is_avail((last_avail + 1) % size))
	     last_avail is avail
 
	 else
 	 	no new
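
For concreteness, a minimal sketch of how the two paths might be glued
together on the device side; struct vq, BATCH, fast_get_descriptors() and
desc_is_avail() are placeholders carried over from the snippets above, not
an existing vhost/virtio API:

#include <stdbool.h>
#include <stdint.h>

struct desc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };
struct vq   { struct desc *ring; uint32_t size; uint32_t last_avail; bool wrapcount; };

/* Placeholders for the helpers sketched above -- not an existing API. */
bool fast_get_descriptors(struct vq *vq, struct desc *out);
bool desc_is_avail(struct vq *vq, uint32_t idx);

#define BATCH 8

/* Returns how many descriptors may be consumed starting at last_avail. */
static int poll_avail(struct vq *vq, struct desc *out)
{
	if (fast_get_descriptors(vq, out))
		return BATCH;			/* whole batch is usable */
	if (desc_is_avail(vq, (vq->last_avail + 1) % vq->size))
		return 1;			/* single-descriptor fallback */
	return 0;				/* nothing new yet */
}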


-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-13 15:20                 ` Michael S. Tsirkin
@ 2019-02-14  3:21                   ` Jason Wang
  2019-02-14  3:41                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2019-02-14  3:21 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Riddoch, Virtio-Dev


On 2019/2/13 11:20 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 13, 2019 at 06:00:41PM +0800, Jason Wang wrote:
>> On 2019/2/12 下午9:44, Michael S. Tsirkin wrote:
>>> On Tue, Feb 12, 2019 at 03:28:26PM +0800, Jason Wang wrote:
>>>> On 2019/2/12 下午1:08, Michael S. Tsirkin wrote:
>>>>> On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:
>>>>>>>>>> This can result in a very high rate of doorbells with some
>>>>>>>>>> drivers, which can become a severe bottleneck (because x86 CPUs can't emit
>>>>>>>>>> MMIOs at very high rates).
>>>>>>>>> Interesting. Is there any data you could share to help guide the design?
>>>>>>>>> E.g. what's the highest rate of MMIO writes supported etc?
>>>>>>>> On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
>>>>>>>>
>>>>>>>> On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
>>>>>>>> PCIe-remote socket ~8M/s.
>>>>>>> Is that mega bytes? Or million writes?
>>>>>>> With 32 bit writes are we limited to 2M then?
>>>>>> Sorry, this is million-writes-per-second.  The size of the write doesn't
>>>>>> matter.
>>>>> So it's not too bad, you only need one per batch after all.
>>>>>
>>>>>
>>>>> Also I thought some more about this. In fact on x86/intel specifically
>>>>> PCI descriptor reads are atomic with cache line granularity
>>>>> and writes are ordered. So in fact you can safely prefetch
>>>>> descriptors and if you see them valid, you can go ahead
>>>>> and use them.
>>>> If I understand this correctly, unless driver set avail in order for several
>>>> descriptors. Device could not assume that if N is avail then [0, N) is also
>>>> available.
>>> Well you don't need that assumption. You check it and discard
>>> prefetched data if not.
>>
>> Ok, this is probably ok for hardware implementation. But for software
>> implementation, it may result lots of re-read.
> That will be on a slow path where you got unlucky.


Yes.


>
>
>>>
>>>> But prefetching indeed improve the performance when I play with vhost kernel
>>>> implementation. It reduces the overhead of memory accessing, which is
>>>> similar to PCI.
>>> Prefetch as in prefetchnta?
>>
>> Sorry, "prefetch" is probably not accurate. "Speculation" is probably
>> better.
>>
>> What I did is just read more than one descriptor e.g 16. And check whether
>> the last is available, if yes, assume all the rest are available.
> So how problematic is it to check them all?  In fact you can optimize
> out the rmb (or dependency barrier), just issue one barrier after
> checking all.


Let me reply in another thread.


>
>
>>>
>>>>> This makes the descriptor index merely a hint for performance,
>>>>> which device can play with at will.
>>>>>
>>>>>
>>>>> Other platforms are not like this so you need the data
>>>>> but do they have the same problem?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>>> This doesn't just impose a limit on aggregate packet rate: If you hit this
>>>>>>>> bottleneck then the CPU core is back-pressured by the MMIO writes, and so
>>>>>>>> instr/cycle takes a huge hit.
>>>>>>>>
>>>>>>>>>> The proposed offset/wrap field allows devices to disable doorbells when
>>>>>>>>>> appropriate, and determine the latest fill level via a PCIe read.
>>>>>>>>> This kind of looks like a split ring though, does it not?
>>>>>>>> I don't agree, because the ring isn't split.  Split-ring is very painful for
>>>>>>>> hardware because of the potentially random accesses to the descriptor table
>>>>>>>> (including pointer chasing) which inhibit batching of descriptor fetches, as
>>>>>>>> well as causing complexity in implementation.
>>>>>>> That's different with IN_ORDER, right?
>>>>>> Yes, IN_ORDER will give some of the benefit of packed ring, and is also a
>>>>>> win because you can merge 'used' writes.  (But packed-ring still wins due to
>>>>>> not having to fetch ring and descriptor table separately).
>>>> Note, with IN_ORDER, there's no need to read available ring at all even for
>>>> split ring. And in some cases, no need to update used ring at all.
>>> Yes you said so at some point but it seems that the spec
>>> is written ambigously and others do not read it like this.
>>> Can you clarify?
>>
>> Yes, it was unclear about how available ring was dealt with in this case. My
>> understanding is, since device use the descriptor table in order, if
>> last_avail_idx is N and avail_idx is M, driver could assume [N, M) is
>> available. In this case, avail_idx points to idx of descriptor ring instead
>> of avail ring. So there's no need for both side to read or write avail ring
>> entries.
>>
>> Does this make sense? 10% PPS was seen through this way when I playing this
>> for vhost.
> Problem is, if you add a chain of descriptors, index is incremented by 1
> while the descriptor buffer advances by X. So how can you find out
> without reading avail ring?


I think it's as simple as increasing the avail idx by X? Since descriptors 
are used in order, the device can just read the next X-1 descriptors in 
this case.
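
A rough sketch of what that device-side read could look like, assuming
(hypothetically) that avail_idx is redefined to count descriptor-table
slots so the avail ring itself is never touched; the field names are
illustrative, not the Linux vhost ones:

#include <stdint.h>

struct vring_desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };

struct split_vq {
	struct vring_desc *desc;	/* descriptor table */
	uint16_t num;			/* ring size, a power of two */
	uint16_t last_avail;		/* device-private free-running index */
	volatile uint16_t *avail_idx;	/* driver-written, counts descriptor slots */
};

/* Copy up to max in-order descriptors; a real device also needs an
 * acquire/rmb between reading avail_idx and reading the table. */
static int consume_in_order(struct split_vq *vq, struct vring_desc *out, int max)
{
	uint16_t avail = *vq->avail_idx;
	int n = 0;

	while (vq->last_avail != avail && n < max) {
		out[n++] = vq->desc[vq->last_avail % vq->num];
		vq->last_avail++;
	}
	return n;
}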

Thanks


>
> Maybe you can start with fetching based on index (so 1 above), then
> fetch an extra descriptor each time you see a NEXT bit.
> And given PV tends to use INDIRECT aggressively, that won't happen
> too often.
> Is that the idea?
>
>
>>>
>>>>>>>> Packed ring is a huge improvement on both dimensions, but introduces a
>>>>>>>> specific pain point for hardware offload.
>>>>>>>>
>>>>>>>>> The issue is
>>>>>>>>> we will again need to bounce more cache lines to communicate.
>>>>>>>> You'll only bounce cache lines if the device chooses to read the idx.  A PV
>>>>>>>> device need not offer this feature.  A PCIe device will, but the cost to the
>>>>>>>> driver of writing an in-memory idx is much smaller than the cost of an MMIO,
>>>>>>>> which is always a strongly ordered barrier on x86/64.
>>>>>>>>
>>>>>>>> With vDPA you ideally would have this feature enabled, and the device would
>>>>>>>> sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
>>>>>>>> field and so cache lines would not bounce then either.
>>>>>>>>
>>>>>>>> Ie. The only time cache lines are shared is when sharing with a PCIe device,
>>>>>>>> which is the scenario when this is a win.
>>>>>>> True. OTOH on non-x86 you will need more write barriers :( It would be
>>>>>>> natural to say driver can reuse the barrier before flags update, but note
>>>>>>> that that would mean hardware still needs to support re-fetching if
>>>>>>> flags value is wrong. Such re-fetch is probably rare so fine by me, but it
>>>>>>> does add a bit more complexity.
>>>>>> I would prefer to have the write barrier before writing the idx.
>>>>> Well that's driver overhead for something device might never utilise in
>>>>> a given workload. If we are optimizing let's optimize for speed.
>>>>>
>>>>>> Note that drivers could take advantage of this feature to avoid read
>>>>>> barriers when consuming descriptors: At the moment there is a virtio_rmb()
>>>>>> per descriptor read.  With the proposed feature the driver can do just one
>>>>>> barrier after reading idx.
>>>>> Oh you want device to write a used index too? Hmm. Isn't this
>>>>> contrary to your efforts to consume PCIe BW?
>>>>>
>>>>>> I expect that on platforms where write barriers
>>>>>> have a cost read barriers likely have a significant cost too, so this might
>>>>>> be a win with PV devices too.
>>>>> I'm not sure.  It is pretty easy to replace an rmb with a dependency
>>>>> which is generally quite cheap in my testing.
>>>>>
>>>>> But since it's supposed to benefit PV, at this point we already have
>>>>> implementations so rather than speculate (pun intended), people can
>>>>> write patches and experiment.
>>>>>
>>>>> For the proposed avail idx I think Jason (CC'd) was interested in adding
>>>>> something like this for vhost.
>>>> Yes, it's kind of IN_ORDER + split ring I believe.
>>>>
>>>> Thanks
>>> Well IN_ORDER is only good for e.g. network. Storage wants out of order
>>> so it could benefit from a new feature.  I am however worried about the
>>> extra wmb implied.  Still thinking about ways to avoid adding that.
>>>
>> Ok.
>>
>> Thanks
>>
>>
>>>
>>>>>>>>> So I wonder: what if we made a change to spec that would allow prefetch
>>>>>>>>> of ring entries?  E.g. you would be able to read at random and if you
>>>>>>>>> guessed right then you can just use what you have read, no need to
>>>>>>>>> re-fetch?
>>>>>>>> Unless I've misunderstood I think this would imply that the driver would
>>>>>>>> have to ensure strict ordering for each descriptor it wrote, which would
>>>>>>>> impose a cost to the driver.  At the moment a write barrier is only needed
>>>>>>>> before writing the flags of the first descriptor in a batch.
>>>>>>> On non-x86 right?  OTOH the extra data structure also adds more ordering
>>>>>>> requirements.
>>>>>> Yes on non-x86.  The extra data structure only adds an ordering once per
>>>>>> request (when enabled) whereas allowing prefetch would add an ordering per
>>>>>> descriptor.  The number of descriptors is always >= the number of requests,
>>>>>> and can be much larger (esp. for a net rx virt-queue).
>>>>> Question is, is MMIO also slow on these non-x86 platforms?
>>>>>
>>>>>
>>>>> Prefetch if successful just drastically lowers latency. You can't beat
>>>>> not needing a read at all.
>>>>>
>>>>>>>>>> I suggest the best place to put this would be in the driver area,
>>>>>>>>>> immediately after the event suppression structure.
>>>>>>>>> Could you comment on why is that a good place though?
>>>>>>>> The new field is written by the driver, as are the other fields in the
>>>>>>>> driver area.  Also I expect that devices might be able to read this new idx
>>>>>>>> together with the interrupt suppression fields in a single read, reducing
>>>>>>>> PCIe overheads.
>>>>>>> OK then you need it in the same aligned 256 byte boundary.
>>>>>> The event suppression structs currently require 4byte align.  Are you
>>>>>> suggesting we increase the align required when
>>>>>> VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated?  Sounds good to me, but 8bytes
>>>>>> would presumably suffice.
>>>>> OK.
>>>>>
>>>>> I am also wondering: what would the analog of this feature be for split
>>>>> rings? We are burning a feature bit, might as well find a good
>>>>> use for it.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Thanks for the feedback.
>>>>>>
>>>>>> David
>>>>>>
>>>>>> -- 
>>>>>> David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
>>>>>>
>>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-13 17:30                     ` Michael S. Tsirkin
@ 2019-02-14  3:34                       ` Jason Wang
  2019-02-14  4:04                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2019-02-14  3:34 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Riddoch, Virtio-Dev


On 2019/2/14 1:30 AM, Michael S. Tsirkin wrote:
> On Wed, Feb 13, 2019 at 06:33:50PM +0800, Jason Wang wrote:
>> On 2019/2/13 上午1:35, Michael S. Tsirkin wrote:
>>> On Tue, Feb 12, 2019 at 04:47:08PM +0000, David Riddoch wrote:
>>>>>>>> I would prefer to have the write barrier before writing the idx.
>>>>>>> Well that's driver overhead for something device might never utilise in
>>>>>>> a given workload. If we are optimizing let's optimize for speed.
>>>>>> I think doing the barrier before writing idx is best for speed (see below).
>>>>> I don't see it below:(
>>>> Sorry, I'm not being clear.  If the write barrier is before the idx, then a
>>>> PV device can read the idx, do a single rmb and read a whole bunch of
>>>> descriptors.  As things stand today a PV device has to do an rmb for each
>>>> descriptor that it reads.
>>>>
>>>> I'm not sure, but this may be what Jason meant when he said "prefetch".
>>
>> Yes.
>>
>>
>>> Oh. Right. So for PV it's easy to convert an rmb to a data dependency
>>> instead.  It remains to be seen whether an extra cache miss per
>>> batch is cheaper or more expensive than such a dependency
>>> per descriptor.
>>>
>>> I hope Jason can experiment and let us know.
>>>
>>>
>> To clarify, actually I did play someting like:
>>
>>
>> if (desc_is_avail((last_avail + 32) % size))
>>
>>      [last_avail, last_avail + 32] is avail [1]
>>
>> else if (desc_is_avail(last_avail + 1) % size))
>>
>>      last_avail is avail
>>
>> else
>>
>>     no new
>>
>>
>> And looks like [1] depends on the driver to do:
>>
>> for all descriptors
>>
>>      update desc (but not make it available for the flag)
>>
>> wmb()
>>
>> for all descriptors
>>
>>      update desc.flag
>>
>>
>> This means if last_avail + 32 is avail, no need to check flags of
>> [last_avail, last_avail + 31] since the descriptor content (except for the
>> flag) is guaranteed to be correct.
> We can try looking into that, sure. It's not just wmb though which is
> anyway free on x86. There are cases of e.g. chained descriptors
> to consider where we do not want device to see a partial chain.


If we see that the last descriptor is in a chain, it still does no harm; 
we just read more?


>
>
>> I get ~10% PPS improvement.
>>
>> Thanks
> I suspect most of the gain isn't in the check at all though.


The gains were:

1) batched reading of descriptors, saving the number of userspace accesses

2) fewer rmb()s; otherwise you need an rmb per descriptor


> It's probably reading descriptors and maybe loop unrolling.


I don't quite understand why unrolling helps in this case.


> Something like:
>
> __attribute__((optimize("unroll-loops")))
> bool fast_get_descriptors(vq, desc)
> {
>
> 	copy X descriptors
> 	u16 aflags = 0xffff;
> 	u16 oflags = 0x0;
> 	for (i = 0; i < X; ++i) {
> 		aflags &= desc[i].flags;
> 		oflags |= desc[i].flags;
> 	}
> 	/* all flags must be same */
> 	if (aflags != oflags)
> 		return false;
>
> 	return is_avail(vq->wrapcount, aflags);
> }


This may work. It should be a little bit less efficient, but it makes no 
assumptions about the driver, if I understand correctly.


>
>
> I also think 32 is very aggressive, I would start with 8.


Maybe, or add some heuristic to predict it.

Thanks


>
>
> With above, you could try:
>
> 	if (fast_get_descriptors(vq, desc))
> 	     [last_avail, last_avail + 32] is avail [1]
>   
> 	else if (desc_is_avail(last_avail + 1) % size))
> 	     last_avail is avail
>   
> 	 else
>   	 	no new
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-14  3:21                   ` Jason Wang
@ 2019-02-14  3:41                     ` Michael S. Tsirkin
  2019-02-15  3:59                       ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-14  3:41 UTC (permalink / raw)
  To: Jason Wang; +Cc: David Riddoch, Virtio-Dev

On Thu, Feb 14, 2019 at 11:21:54AM +0800, Jason Wang wrote:
> 
> On 2019/2/13 下午11:20, Michael S. Tsirkin wrote:
> > On Wed, Feb 13, 2019 at 06:00:41PM +0800, Jason Wang wrote:
> > > On 2019/2/12 下午9:44, Michael S. Tsirkin wrote:
> > > > On Tue, Feb 12, 2019 at 03:28:26PM +0800, Jason Wang wrote:
> > > > > On 2019/2/12 下午1:08, Michael S. Tsirkin wrote:
> > > > > > On Mon, Feb 11, 2019 at 02:58:30PM +0000, David Riddoch wrote:
> > > > > > > > > > > This can result in a very high rate of doorbells with some
> > > > > > > > > > > drivers, which can become a severe bottleneck (because x86 CPUs can't emit
> > > > > > > > > > > MMIOs at very high rates).
> > > > > > > > > > Interesting. Is there any data you could share to help guide the design?
> > > > > > > > > > E.g. what's the highest rate of MMIO writes supported etc?
> > > > > > > > > On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
> > > > > > > > > 
> > > > > > > > > On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
> > > > > > > > > PCIe-remote socket ~8M/s.
> > > > > > > > Is that mega bytes? Or million writes?
> > > > > > > > With 32 bit writes are we limited to 2M then?
> > > > > > > Sorry, this is million-writes-per-second.  The size of the write doesn't
> > > > > > > matter.
> > > > > > So it's not too bad, you only need one per batch after all.
> > > > > > 
> > > > > > 
> > > > > > Also I thought some more about this. In fact on x86/intel specifically
> > > > > > PCI descriptor reads are atomic with cache line granularity
> > > > > > and writes are ordered. So in fact you can safely prefetch
> > > > > > descriptors and if you see them valid, you can go ahead
> > > > > > and use them.
> > > > > If I understand this correctly, unless driver set avail in order for several
> > > > > descriptors. Device could not assume that if N is avail then [0, N) is also
> > > > > available.
> > > > Well you don't need that assumption. You check it and discard
> > > > prefetched data if not.
> > > 
> > > Ok, this is probably ok for hardware implementation. But for software
> > > implementation, it may result lots of re-read.
> > That will be on a slow path where you got unlucky.
> 
> 
> Yes.
> 
> 
> > 
> > 
> > > > 
> > > > > But prefetching indeed improve the performance when I play with vhost kernel
> > > > > implementation. It reduces the overhead of memory accessing, which is
> > > > > similar to PCI.
> > > > Prefetch as in prefetchnta?
> > > 
> > > Sorry, "prefetch" is probably not accurate. "Speculation" is probably
> > > better.
> > > 
> > > What I did is just read more than one descriptor e.g 16. And check whether
> > > the last is available, if yes, assume all the rest are available.
> > So how problematic is it to check them all?  In fact you can optimize
> > out the rmb (or dependency barrier), just issue one barrier after
> > checking all.
> 
> 
> Let me reply in another thread.
> 
> 
> > 
> > 
> > > > 
> > > > > > This makes the descriptor index merely a hint for performance,
> > > > > > which device can play with at will.
> > > > > > 
> > > > > > 
> > > > > > Other platforms are not like this so you need the data
> > > > > > but do they have the same problem?
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > > > This doesn't just impose a limit on aggregate packet rate: If you hit this
> > > > > > > > > bottleneck then the CPU core is back-pressured by the MMIO writes, and so
> > > > > > > > > instr/cycle takes a huge hit.
> > > > > > > > > 
> > > > > > > > > > > The proposed offset/wrap field allows devices to disable doorbells when
> > > > > > > > > > > appropriate, and determine the latest fill level via a PCIe read.
> > > > > > > > > > This kind of looks like a split ring though, does it not?
> > > > > > > > > I don't agree, because the ring isn't split.  Split-ring is very painful for
> > > > > > > > > hardware because of the potentially random accesses to the descriptor table
> > > > > > > > > (including pointer chasing) which inhibit batching of descriptor fetches, as
> > > > > > > > > well as causing complexity in implementation.
> > > > > > > > That's different with IN_ORDER, right?
> > > > > > > Yes, IN_ORDER will give some of the benefit of packed ring, and is also a
> > > > > > > win because you can merge 'used' writes.  (But packed-ring still wins due to
> > > > > > > not having to fetch ring and descriptor table separately).
> > > > > Note, with IN_ORDER, there's no need to read available ring at all even for
> > > > > split ring. And in some cases, no need to update used ring at all.
> > > > Yes you said so at some point but it seems that the spec
> > > > is written ambigously and others do not read it like this.
> > > > Can you clarify?
> > > 
> > > Yes, it was unclear about how available ring was dealt with in this case. My
> > > understanding is, since device use the descriptor table in order, if
> > > last_avail_idx is N and avail_idx is M, driver could assume [N, M) is
> > > available. In this case, avail_idx points to idx of descriptor ring instead
> > > of avail ring. So there's no need for both side to read or write avail ring
> > > entries.
> > > 
> > > Does this make sense? 10% PPS was seen through this way when I playing this
> > > for vhost.
> > Problem is, if you add a chain of descriptors, index is incremented by 1
> > while the descriptor buffer advances by X. So how can you find out
> > without reading avail ring?
> 
> 
> I think it's as simple as increase the avail idx by X? Since descriptor were
> used in order, device can just read the next X-1 descriptors in this case.
> 
> Thanks

Right, so a spec change would be needed; it's not transparent to the guest.



> 
> > 
> > Maybe you can start with fetching based on index (so 1 above), then
> > fetch an extra descriptor each time you see a NEXT bit.
> > And given PV tends to use INDIRECT aggressively, that won't happen
> > too often.
> > Is that the idea?
> > 
> > 
> > > > 
> > > > > > > > > Packed ring is a huge improvement on both dimensions, but introduces a
> > > > > > > > > specific pain point for hardware offload.
> > > > > > > > > 
> > > > > > > > > > The issue is
> > > > > > > > > > we will again need to bounce more cache lines to communicate.
> > > > > > > > > You'll only bounce cache lines if the device chooses to read the idx.  A PV
> > > > > > > > > device need not offer this feature.  A PCIe device will, but the cost to the
> > > > > > > > > driver of writing an in-memory idx is much smaller than the cost of an MMIO,
> > > > > > > > > which is always a strongly ordered barrier on x86/64.
> > > > > > > > > 
> > > > > > > > > With vDPA you ideally would have this feature enabled, and the device would
> > > > > > > > > sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
> > > > > > > > > field and so cache lines would not bounce then either.
> > > > > > > > > 
> > > > > > > > > Ie. The only time cache lines are shared is when sharing with a PCIe device,
> > > > > > > > > which is the scenario when this is a win.
> > > > > > > > True. OTOH on non-x86 you will need more write barriers :( It would be
> > > > > > > > natural to say driver can reuse the barrier before flags update, but note
> > > > > > > > that that would mean hardware still needs to support re-fetching if
> > > > > > > > flags value is wrong. Such re-fetch is probably rare so fine by me, but it
> > > > > > > > does add a bit more complexity.
> > > > > > > I would prefer to have the write barrier before writing the idx.
> > > > > > Well that's driver overhead for something device might never utilise in
> > > > > > a given workload. If we are optimizing let's optimize for speed.
> > > > > > 
> > > > > > > Note that drivers could take advantage of this feature to avoid read
> > > > > > > barriers when consuming descriptors: At the moment there is a virtio_rmb()
> > > > > > > per descriptor read.  With the proposed feature the driver can do just one
> > > > > > > barrier after reading idx.
> > > > > > Oh you want device to write a used index too? Hmm. Isn't this
> > > > > > contrary to your efforts to consume PCIe BW?
> > > > > > 
> > > > > > > I expect that on platforms where write barriers
> > > > > > > have a cost read barriers likely have a significant cost too, so this might
> > > > > > > be a win with PV devices too.
> > > > > > I'm not sure.  It is pretty easy to replace an rmb with a dependency
> > > > > > which is generally quite cheap in my testing.
> > > > > > 
> > > > > > But since it's supposed to benefit PV, at this point we already have
> > > > > > implementations so rather than speculate (pun intended), people can
> > > > > > write patches and experiment.
> > > > > > 
> > > > > > For the proposed avail idx I think Jason (CC'd) was interested in adding
> > > > > > something like this for vhost.
> > > > > Yes, it's kind of IN_ORDER + split ring I believe.
> > > > > 
> > > > > Thanks
> > > > Well IN_ORDER is only good for e.g. network. Storage wants out of order
> > > > so it could benefit from a new feature.  I am however worried about the
> > > > extra wmb implied.  Still thinking about ways to avoid adding that.
> > > > 
> > > Ok.
> > > 
> > > Thanks
> > > 
> > > 
> > > > 
> > > > > > > > > > So I wonder: what if we made a change to spec that would allow prefetch
> > > > > > > > > > of ring entries?  E.g. you would be able to read at random and if you
> > > > > > > > > > guessed right then you can just use what you have read, no need to
> > > > > > > > > > re-fetch?
> > > > > > > > > Unless I've misunderstood I think this would imply that the driver would
> > > > > > > > > have to ensure strict ordering for each descriptor it wrote, which would
> > > > > > > > > impose a cost to the driver.  At the moment a write barrier is only needed
> > > > > > > > > before writing the flags of the first descriptor in a batch.
> > > > > > > > On non-x86 right?  OTOH the extra data structure also adds more ordering
> > > > > > > > requirements.
> > > > > > > Yes on non-x86.  The extra data structure only adds an ordering once per
> > > > > > > request (when enabled) whereas allowing prefetch would add an ordering per
> > > > > > > descriptor.  The number of descriptors is always >= the number of requests,
> > > > > > > and can be much larger (esp. for a net rx virt-queue).
> > > > > > Question is, is MMIO also slow on these non-x86 platforms?
> > > > > > 
> > > > > > 
> > > > > > Prefetch if successful just drastically lowers latency. You can't beat
> > > > > > not needing a read at all.
> > > > > > 
> > > > > > > > > > > I suggest the best place to put this would be in the driver area,
> > > > > > > > > > > immediately after the event suppression structure.
> > > > > > > > > > Could you comment on why is that a good place though?
> > > > > > > > > The new field is written by the driver, as are the other fields in the
> > > > > > > > > driver area.  Also I expect that devices might be able to read this new idx
> > > > > > > > > together with the interrupt suppression fields in a single read, reducing
> > > > > > > > > PCIe overheads.
> > > > > > > > OK then you need it in the same aligned 256 byte boundary.
> > > > > > > The event suppression structs currently require 4byte align.  Are you
> > > > > > > suggesting we increase the align required when
> > > > > > > VIRTIO_F_RING_PACKED_AVAIL_IDX is negotiated?  Sounds good to me, but 8bytes
> > > > > > > would presumably suffice.
> > > > > > OK.
> > > > > > 
> > > > > > I am also wondering: what would the analog of this feature be for split
> > > > > > rings? We are burning a feature bit, might as well find a good
> > > > > > use for it.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > Thanks for the feedback.
> > > > > > > 
> > > > > > > David
> > > > > > > 
> > > > > > > -- 
> > > > > > > David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
> > > > > > > 
> > > > > > > 

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-14  3:34                       ` Jason Wang
@ 2019-02-14  4:04                         ` Michael S. Tsirkin
  2019-02-19  6:33                           ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-14  4:04 UTC (permalink / raw)
  To: Jason Wang; +Cc: David Riddoch, Virtio-Dev

On Thu, Feb 14, 2019 at 11:34:22AM +0800, Jason Wang wrote:
> 
> On 2019/2/14 上午1:30, Michael S. Tsirkin wrote:
> > On Wed, Feb 13, 2019 at 06:33:50PM +0800, Jason Wang wrote:
> > > On 2019/2/13 上午1:35, Michael S. Tsirkin wrote:
> > > > On Tue, Feb 12, 2019 at 04:47:08PM +0000, David Riddoch wrote:
> > > > > > > > > I would prefer to have the write barrier before writing the idx.
> > > > > > > > Well that's driver overhead for something device might never utilise in
> > > > > > > > a given workload. If we are optimizing let's optimize for speed.
> > > > > > > I think doing the barrier before writing idx is best for speed (see below).
> > > > > > I don't see it below:(
> > > > > Sorry, I'm not being clear.  If the write barrier is before the idx, then a
> > > > > PV device can read the idx, do a single rmb and read a whole bunch of
> > > > > descriptors.  As things stand today a PV device has to do an rmb for each
> > > > > descriptor that it reads.
> > > > > 
> > > > > I'm not sure, but this may be what Jason meant when he said "prefetch".
> > > 
> > > Yes.
> > > 
> > > 
> > > > Oh. Right. So for PV it's easy to convert an rmb to a data dependency
> > > > instead.  It remains to be seen whether an extra cache miss per
> > > > batch is cheaper or more expensive than such a dependency
> > > > per descriptor.
> > > > 
> > > > I hope Jason can experiment and let us know.
> > > > 
> > > > 
> > > To clarify, actually I did play someting like:
> > > 
> > > 
> > > if (desc_is_avail((last_avail + 32) % size))
> > > 
> > >      [last_avail, last_avail + 32] is avail [1]
> > > 
> > > else if (desc_is_avail(last_avail + 1) % size))
> > > 
> > >      last_avail is avail
> > > 
> > > else
> > > 
> > >     no new
> > > 
> > > 
> > > And looks like [1] depends on the driver to do:
> > > 
> > > for all descriptors
> > > 
> > >      update desc (but not make it available for the flag)
> > > 
> > > wmb()
> > > 
> > > for all descriptors
> > > 
> > >      update desc.flag
> > > 
> > > 
> > > This means if last_avail + 32 is avail, no need to check flags of
> > > [last_avail, last_avail + 31] since the descriptor content (except for the
> > > flag) is guaranteed to be correct.
> > We can try looking into that, sure. It's not just wmb though which is
> > anyway free on x86. There are cases of e.g. chained descriptors
> > to consider where we do not want device to see a partial chain.
> 
> 
> If we saw the last descriptor is in a chain, it's still doesn't harm, just
> read more?

So it's possible, for sure, to add a variant that sticks a wmb after each flag
update. However, if you are writing out batches, that's an
upfront cost, and what you gain is prefetch on the other side, which is
speculative.
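
For illustration, a minimal sketch of the two driver-side publication orders
being weighed here; fill_descriptor() and virtio_wmb() are placeholders (the
latter standing in for the ring write barrier), and the device side still
needs its own rmb or data dependency before reading the bodies:

#include <stdint.h>

struct desc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };

void fill_descriptor(struct desc *d);	/* placeholder: writes addr/len/id only */
void virtio_wmb(void);			/* placeholder for the write barrier */

/* Variant A: one barrier per batch.  Once the device observes any flag from
 * this batch, all bodies written before the barrier are valid (the scheme
 * described earlier in the thread). */
static void publish_batch(struct desc *d, int n, uint16_t avail_flags)
{
	int i;

	for (i = 0; i < n; i++)
		fill_descriptor(&d[i]);
	virtio_wmb();
	for (i = 0; i < n; i++)
		d[i].flags = avail_flags;
}

/* Variant B: a barrier per descriptor.  More upfront cost for the driver,
 * but every flag the device happens to read is individually trustworthy. */
static void publish_each(struct desc *d, int n, uint16_t avail_flags)
{
	int i;

	for (i = 0; i < n; i++) {
		fill_descriptor(&d[i]);
		virtio_wmb();
		d[i].flags = avail_flags;
	}
}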


> 
> > 
> > 
> > > I get ~10% PPS improvement.
> > > 
> > > Thanks
> > I suspect most of the gain isn't in the check at all though.
> 
> 
> The gain was:
> 
> 1) batching reading descriptors, saving the times of userspace access

Yes, that needs to be fixed. The way the Linux vhost driver does userspace
accesses right now, with stac/clac and memory barriers within a loop, is
plain silly.  Designing a host/guest interface to work around that kind
of inefficiency makes no sense; it's purely an implementation issue.

> 2) reduce the rmb(), otherwise you need rmb per descriptor

Well, not really. You can:
- validate a batch and then do a single rmb (sketched below)
- replace the rmb with a data dependency
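
A minimal sketch of the first option, reusing the placeholder descriptor
layout from earlier in the thread (AVAIL/USED bit positions as in the
packed-ring layout; virtio_rmb() stands in for the device-side read barrier):

#include <stdbool.h>
#include <stdint.h>

struct desc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };

#define DESC_F_AVAIL	(1 << 7)	/* packed-ring AVAIL bit */
#define DESC_F_USED	(1 << 15)	/* packed-ring USED bit */

void virtio_rmb(void);			/* placeholder for the read barrier */

/* Check the flags of the whole batch first, then issue a single read
 * barrier before trusting any of the descriptor bodies.  For driver wrap
 * counter 1 the expected pattern is DESC_F_AVAIL, for 0 it is DESC_F_USED. */
static bool batch_ready(const struct desc *d, int n, uint16_t avail_pattern)
{
	int i;

	for (i = 0; i < n; i++)
		if ((d[i].flags & (DESC_F_AVAIL | DESC_F_USED)) != avail_pattern)
			return false;	/* not all published for this wrap yet */
	virtio_rmb();			/* one barrier for the whole batch */
	return true;			/* d[0..n-1] bodies may now be read */
}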

> 
> > It's probably reading descriptors and maybe loop unrolling.
> 
> 
> I'm not quite understand why unrolling help in this case.

The amount of work in this loop is tiny, but not tiny
enough for gcc to decide it's free. Take a look:


#include <stdio.h>

struct desc {
        unsigned long long a;
        unsigned len;
        unsigned short type;
        unsigned short flags;
};
#ifdef UNROLL
__attribute__((optimize("unroll-loops")))
#endif
int foo(struct desc *desc)
{
     unsigned short aflags = 0xffff;
     unsigned short oflags = 0x0;

     int i;

        for (i=0; i < 4; ++i) {
             aflags &= desc[i].flags;
             oflags |= desc[i].flags;
        }

        return aflags == oflags;
}

$ gcc -Wall -DUNROLL -O2 -c l.c

0000000000000000 <foo>:
foo():
   0:   48 8d 47 0e             lea    0xe(%rdi),%rax
   4:   31 c9                   xor    %ecx,%ecx
   6:   48 83 c7 4e             add    $0x4e,%rdi
   a:   be ff ff ff ff          mov    $0xffffffff,%esi
   f:   0f b7 10                movzwl (%rax),%edx
  12:   48 83 c0 10             add    $0x10,%rax
  16:   21 d6                   and    %edx,%esi
  18:   09 d1                   or     %edx,%ecx
  1a:   48 39 f8                cmp    %rdi,%rax
  1d:   75 f0                   jne    f <foo+0xf>
  1f:   31 c0                   xor    %eax,%eax
  21:   66 39 ce                cmp    %cx,%si
  24:   0f 94 c0                sete   %al
  27:   c3                      retq   



$ gcc -Wall -DUNROLL -O2 -c l.c

Disassembly of section .text:

0000000000000000 <foo>:
foo():
   0:   0f b7 47 1e             movzwl 0x1e(%rdi),%eax
   4:   0f b7 77 2e             movzwl 0x2e(%rdi),%esi
   8:   0f b7 4f 0e             movzwl 0xe(%rdi),%ecx
   c:   89 c2                   mov    %eax,%edx
   e:   09 f0                   or     %esi,%eax
  10:   21 f2                   and    %esi,%edx
  12:   09 c8                   or     %ecx,%eax
  14:   21 ca                   and    %ecx,%edx
  16:   0f b7 4f 3e             movzwl 0x3e(%rdi),%ecx
  1a:   09 c8                   or     %ecx,%eax
  1c:   21 ca                   and    %ecx,%edx
  1e:   66 39 c2                cmp    %ax,%dx
  21:   0f 94 c0                sete   %al
  24:   0f b6 c0                movzbl %al,%eax
  27:   c3                      retq   


See? gcc just does not unroll aggressively enough.
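
As an aside, gcc 8 and newer can also be asked to unroll a single loop with
a pragma, which avoids the function-wide attribute; an untested variant of
the same function, reusing struct desc from the example above:

int foo_pragma(struct desc *desc)
{
     unsigned short aflags = 0xffff;
     unsigned short oflags = 0x0;

     int i;

#pragma GCC unroll 4
        for (i=0; i < 4; ++i) {
             aflags &= desc[i].flags;
             oflags |= desc[i].flags;
        }

        return aflags == oflags;
}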

> 
> > Something like:
> > 
> > __attribute__((optimize("unroll-loops")))
> > bool fast_get_descriptors(vq, desc)
> > {
> > 
> > 	copy X descriptors
> > 	u16 aflags = 0xffff;
> > 	u16 oflags = 0x0;
> > 	for (i = 0; i < X; ++i) {
> > 		aflags &= desc[i].flags;
> > 		oflags |= desc[i].flags;
> > 	}
> > 	/* all flags must be same */
> > 	if (aflags != oflags)
> > 		return false;
> > 
> > 	return is_avail(vq->wrapcount, aflags);
> > }
> 
> 
> This may work. It should be a little bit less efficient but without any
> assumption of driver if I understand correctly.
> 
> 
> > 
> > 
> > I also think 32 is very aggressive, I would start with 8.
> 
> 
> Maybe, or adding some heuristic to predict it.
> 
> Thanks

The reason I say 8 is that it's two cache lines, which
is typically what the CPU pulls in anyway (1 + linear prefetch).
Anything more might cost more, as it might
put more pressure on the cache. Further,
statically sized short loops can be unrolled by the compiler if
you ask it nicely, which saves a bunch of branching,
speculation etc. Dynamically sized ones can't;
you end up with extra math and branches in the loop.
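
For reference, the arithmetic behind "two cache lines", using the packed-ring
descriptor layout (16 bytes per descriptor):

#include <stdint.h>

/* Packed-ring descriptor: 8 + 4 + 2 + 2 = 16 bytes. */
struct vring_packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

/* 8 descriptors * 16 bytes = 128 bytes = two 64-byte cache lines,
 * i.e. one line plus the adjacent line a linear prefetcher pulls in. */
_Static_assert(8 * sizeof(struct vring_packed_desc) == 2 * 64,
	       "8 descriptors span two cache lines");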


> 
> > 
> > 
> > With above, you could try:
> > 
> > 	if (fast_get_descriptors(vq, desc))
> > 	     [last_avail, last_avail + 32] is avail [1]
> > 	else if (desc_is_avail(last_avail + 1) % size))
> > 	     last_avail is avail
> > 	 else
> >   	 	no new
> > 
> > 

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-14  3:41                     ` Michael S. Tsirkin
@ 2019-02-15  3:59                       ` Jason Wang
  2019-02-15  4:23                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2019-02-15  3:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Riddoch, Virtio-Dev


On 2019/2/14 11:41 AM, Michael S. Tsirkin wrote:
>> I think it's as simple as increase the avail idx by X? Since descriptor were
>> used in order, device can just read the next X-1 descriptors in this case.
>>
>> Thanks
> Right so a spec change would be needed, it's not transparent to guest.


With the change, IN_ORDER + split_ring becomes something like a submission 
queue (descriptor ring) plus a completion queue (used ring). And used ring 
access could sometimes be eliminated entirely, e.g. for net TX we don't care 
about the used len.

What's more interesting is that this avail idx optimization is not required 
for out of order completion, which means it could be used for e.g. a block 
or SCSI device.

Thanks


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-15  3:59                       ` Jason Wang
@ 2019-02-15  4:23                         ` Michael S. Tsirkin
  2019-02-19  6:21                           ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-15  4:23 UTC (permalink / raw)
  To: Jason Wang; +Cc: David Riddoch, Virtio-Dev

On Fri, Feb 15, 2019 at 11:59:55AM +0800, Jason Wang wrote:
> 
> On 2019/2/14 上午11:41, Michael S. Tsirkin wrote:
> > > I think it's as simple as increase the avail idx by X? Since descriptor were
> > > used in order, device can just read the next X-1 descriptors in this case.
> > > 
> > > Thanks
> > Right so a spec change would be needed, it's not transparent to guest.
> 
> 
> With the change, IN_ORDER + split_ring becomes something like submission
> queue (descriptor ring) + completion queue (used ring). And used ring access
> could be eliminated sometime, e.g for net for TX, we don't care about used
> len.

Oh yes, but fundamentally this becomes very close to a packed ring. So I'm
not sure yet another option is justified by a small gain in PPS,
especially considering that it depends on in-order and so doesn't
support zero copy.

> What's more interesting is, this avail idx optimization is not required for
> out of order completion which means it could be used for e.g block or SCSI
> device.

I don't get the last sentence. This only works for in-order, right?

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-15  4:23                         ` Michael S. Tsirkin
@ 2019-02-19  6:21                           ` Jason Wang
  2019-02-19 14:18                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2019-02-19  6:21 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Riddoch, Virtio-Dev


On 2019/2/15 12:23 PM, Michael S. Tsirkin wrote:
> On Fri, Feb 15, 2019 at 11:59:55AM +0800, Jason Wang wrote:
>> On 2019/2/14 上午11:41, Michael S. Tsirkin wrote:
>>>> I think it's as simple as increase the avail idx by X? Since descriptor were
>>>> used in order, device can just read the next X-1 descriptors in this case.
>>>>
>>>> Thanks
>>> Right so a spec change would be needed, it's not transparent to guest.
>>
>> With the change, IN_ORDER + split_ring becomes something like submission
>> queue (descriptor ring) + completion queue (used ring). And used ring access
>> could be eliminated sometime, e.g for net for TX, we don't care about used
>> len.
> Oh yes but fundamentally this becomes very close to packed ring.


Looks not; it's two rings vs. one ring.


> So I'm
> not sure yet another option is justified by a small gain in PPS,


10% was achieved with less than 10 lines of code; I suspect dpdk can maybe 
see more.


> especially considering that it depends on in order and so doesn't
> support zero copy.


It looks to me that in-order is not the blocker; even if we allow out of 
order, we may still suffer from HOL (head-of-line) blocking.


>
>> What's more interesting is, this avail idx optimization is not required for
>> out of order completion which means it could be used for e.g block or SCSI
>> device.
> I don't get the last sentence. This only works for in order right?


Yes, you are right.

Thanks.


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-14  4:04                         ` Michael S. Tsirkin
@ 2019-02-19  6:33                           ` Jason Wang
  2019-02-19 14:27                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2019-02-19  6:33 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Riddoch, Virtio-Dev


On 2019/2/14 12:04 PM, Michael S. Tsirkin wrote:
> On Thu, Feb 14, 2019 at 11:34:22AM +0800, Jason Wang wrote:
>> On 2019/2/14 上午1:30, Michael S. Tsirkin wrote:
>>> On Wed, Feb 13, 2019 at 06:33:50PM +0800, Jason Wang wrote:
>>>> On 2019/2/13 上午1:35, Michael S. Tsirkin wrote:
>>>>> On Tue, Feb 12, 2019 at 04:47:08PM +0000, David Riddoch wrote:
>>>>>>>>>> I would prefer to have the write barrier before writing the idx.
>>>>>>>>> Well that's driver overhead for something device might never utilise in
>>>>>>>>> a given workload. If we are optimizing let's optimize for speed.
>>>>>>>> I think doing the barrier before writing idx is best for speed (see below).
>>>>>>> I don't see it below:(
>>>>>> Sorry, I'm not being clear.  If the write barrier is before the idx, then a
>>>>>> PV device can read the idx, do a single rmb and read a whole bunch of
>>>>>> descriptors.  As things stand today a PV device has to do an rmb for each
>>>>>> descriptor that it reads.
>>>>>>
>>>>>> I'm not sure, but this may be what Jason meant when he said "prefetch".
>>>> Yes.
>>>>
>>>>
>>>>> Oh. Right. So for PV it's easy to convert an rmb to a data dependency
>>>>> instead.  It remains to be seen whether an extra cache miss per
>>>>> batch is cheaper or more expensive than such a dependency
>>>>> per descriptor.
>>>>>
>>>>> I hope Jason can experiment and let us know.
>>>>>
>>>>>
>>>> To clarify, actually I did play someting like:
>>>>
>>>>
>>>> if (desc_is_avail((last_avail + 32) % size))
>>>>
>>>>       [last_avail, last_avail + 32] is avail [1]
>>>>
>>>> else if (desc_is_avail(last_avail + 1) % size))
>>>>
>>>>       last_avail is avail
>>>>
>>>> else
>>>>
>>>>      no new
>>>>
>>>>
>>>> And looks like [1] depends on the driver to do:
>>>>
>>>> for all descriptors
>>>>
>>>>       update desc (but not make it available for the flag)
>>>>
>>>> wmb()
>>>>
>>>> for all descriptors
>>>>
>>>>       update desc.flag
>>>>
>>>>
>>>> This means if last_avail + 32 is avail, no need to check flags of
>>>> [last_avail, last_avail + 31] since the descriptor content (except for the
>>>> flag) is guaranteed to be correct.
>>> We can try looking into that, sure. It's not just wmb though which is
>>> anyway free on x86. There are cases of e.g. chained descriptors
>>> to consider where we do not want device to see a partial chain.
>>
>> If we saw the last descriptor is in a chain, it's still doesn't harm, just
>> read more?
> So it's possible to add a variant that sticks a wmb after each flag
> update for sure. However if you are writing out batches that's an
> upfront cost and what you gain is prefetch on the other side which is
> speculative.


Yes.


>
>
>>>
>>>> I get ~10% PPS improvement.
>>>>
>>>> Thanks
>>> I suspect most of the gain isn't in the check at all though.
>>
>> The gain was:
>>
>> 1) batching reading descriptors, saving the times of userspace access
> Yes that needs to be fixed. The way linux vhost driver does userspace
> accesses right now with stac/clac and memory barriers within a loop is
> plain silly.  Designing a host/guest interface to work around that kind
> of inefficiency makes no sense, it's purely an implementation issue.


Actually, it has other advantages. Consider the case when the driver is 
faster: batch reading may reduce the cache contention on the descriptor 
ring (when the ring is almost full).


>
>> 2) reduce the rmb(), otherwise you need rmb per descriptor
> Well not really, you can
> - validate a batch then do a single rmb
> - replace rmb with a dependency
>
>>> It's probably reading descriptors and maybe loop unrolling.
>>
>> I'm not quite understand why unrolling help in this case.
> The amount of operations in this loop is tiny but not tiny
> enough for gcc to decide it's free. Take a look
>
>
> #include <stdio.h>
>
> struct desc {
>          unsigned long long a;
>          unsigned len;
>          unsigned short type;
>          unsigned short flags;
> };
> #ifdef UNROLL
> __attribute__((optimize("unroll-loops")))
> #endif
> int foo(struct desc *desc)
> {
>       unsigned short aflags = 0xffff;
>       unsigned short oflags = 0x0;
>
>       int i;
>
>          for (i=0; i < 4; ++i) {
>               aflags &= desc[i].flags;
>               oflags |= desc[i].flags;
>          }
>
>          return aflags == oflags;
> }
>
> $ gcc -Wall -DUNROLL -O2 -c l.c


I think you mean without -DUNROLL.


>
> 0000000000000000 <foo>:
> foo():
>     0:   48 8d 47 0e             lea    0xe(%rdi),%rax
>     4:   31 c9                   xor    %ecx,%ecx
>     6:   48 83 c7 4e             add    $0x4e,%rdi
>     a:   be ff ff ff ff          mov    $0xffffffff,%esi
>     f:   0f b7 10                movzwl (%rax),%edx
>    12:   48 83 c0 10             add    $0x10,%rax
>    16:   21 d6                   and    %edx,%esi
>    18:   09 d1                   or     %edx,%ecx
>    1a:   48 39 f8                cmp    %rdi,%rax
>    1d:   75 f0                   jne    f <foo+0xf>
>    1f:   31 c0                   xor    %eax,%eax
>    21:   66 39 ce                cmp    %cx,%si
>    24:   0f 94 c0                sete   %al
>    27:   c3                      retq
>
>
>
> $ gcc -Wall -DUNROLL -O2 -c l.c
>
> Disassembly of section .text:
>
> 0000000000000000 <foo>:
> foo():
>     0:   0f b7 47 1e             movzwl 0x1e(%rdi),%eax
>     4:   0f b7 77 2e             movzwl 0x2e(%rdi),%esi
>     8:   0f b7 4f 0e             movzwl 0xe(%rdi),%ecx
>     c:   89 c2                   mov    %eax,%edx
>     e:   09 f0                   or     %esi,%eax
>    10:   21 f2                   and    %esi,%edx
>    12:   09 c8                   or     %ecx,%eax
>    14:   21 ca                   and    %ecx,%edx
>    16:   0f b7 4f 3e             movzwl 0x3e(%rdi),%ecx
>    1a:   09 c8                   or     %ecx,%eax
>    1c:   21 ca                   and    %ecx,%edx
>    1e:   66 39 c2                cmp    %ax,%dx
>    21:   0f 94 c0                sete   %al
>    24:   0f b6 c0                movzbl %al,%eax
>    27:   c3                      retq
>
>
> See? gcc just does not unroll agressively enough.


Ok, so this is just for avoiding branches.


>
>>> Something like:
>>>
>>> __attribute__((optimize("unroll-loops")))
>>> bool fast_get_descriptors(vq, desc)
>>> {
>>>
>>> 	copy X descriptors
>>> 	u16 aflags = 0xffff;
>>> 	u16 oflags = 0x0;
>>> 	for (i = 0; i < X; ++i) {
>>> 		aflags &= desc[i].flags;
>>> 		oflags |= desc[i].flags;
>>> 	}
>>> 	/* all flags must be same */
>>> 	if (aflags != oflags)
>>> 		return false;
>>>
>>> 	return is_avail(vq->wrapcount, aflags);
>>> }
>>
>> This may work. It should be a little bit less efficient but without any
>> assumption of driver if I understand correctly.
>>
>>
>>>
>>> I also think 32 is very aggressive, I would start with 8.
>>
>> Maybe, or adding some heuristic to predict it.
>>
>> Thanks
> The reason I say 8 is because it's two cache lines which
> is typically what cpu pulls in anyway (1+linear pfetch).
> Anything more might have more cost as it might
> put more pressure on the cache. Further
> static sized short loops can be unrolled by compiler if
> you ask it nicely, which saves a bunch of branching,
> speculation etc. Dynamic sized ones can't,
> you end up with extra math and branches in the loop.


You're probably right here, but it needs some benchmarking to check.

Thanks


>
>
>>>
>>> With above, you could try:
>>>
>>> 	if (fast_get_descriptors(vq, desc))
>>> 	     [last_avail, last_avail + 32] is avail [1]
>>> 	else if (desc_is_avail(last_avail + 1) % size))
>>> 	     last_avail is avail
>>> 	 else
>>>    	 	no new
>>>
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-19  6:21                           ` Jason Wang
@ 2019-02-19 14:18                             ` Michael S. Tsirkin
  0 siblings, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-19 14:18 UTC (permalink / raw)
  To: Jason Wang; +Cc: David Riddoch, Virtio-Dev

On Tue, Feb 19, 2019 at 02:21:01PM +0800, Jason Wang wrote:
> 
> On 2019/2/15 下午12:23, Michael S. Tsirkin wrote:
> > On Fri, Feb 15, 2019 at 11:59:55AM +0800, Jason Wang wrote:
> > > On 2019/2/14 上午11:41, Michael S. Tsirkin wrote:
> > > > > I think it's as simple as increase the avail idx by X? Since descriptor were
> > > > > used in order, device can just read the next X-1 descriptors in this case.
> > > > > 
> > > > > Thanks
> > > > Right so a spec change would be needed, it's not transparent to guest.
> > > 
> > > With the change, IN_ORDER + split_ring becomes something like submission
> > > queue (descriptor ring) + completion queue (used ring). And used ring access
> > > could be eliminated sometime, e.g for net for TX, we don't care about used
> > > len.
> > Oh yes but fundamentally this becomes very close to packed ring.
> 
> 
> Looks not, it's a two rings vs one ring.
> 
> 
> > So I'm
> > not sure yet another option is justified by a small gain in PPS,
> 
> 
> 10% is done with less than 10 lines of code, I suspect maybe dpdk can see
> more.
> 
> 
> > especially considering that it depends on in order and so doesn't
> > support zero copy.
> 
> 
> It looks to me in order is not the blocker, even if we allow out of order,
> we may still suffer from HOL.

HOL isn't an issue with e.g. a dedicated VF.


> 
> > 
> > > What's more interesting is, this avail idx optimization is not required for
> > > out of order completion which means it could be used for e.g block or SCSI
> > > device.
> > I don't get the last sentence. This only works for in order right?
> 
> 
> Yes, you are right.
> 
> Thanks.

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-19  6:33                           ` Jason Wang
@ 2019-02-19 14:27                             ` Michael S. Tsirkin
  0 siblings, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-19 14:27 UTC (permalink / raw)
  To: Jason Wang; +Cc: David Riddoch, Virtio-Dev

On Tue, Feb 19, 2019 at 02:33:04PM +0800, Jason Wang wrote:
> 
> On 2019/2/14 下午12:04, Michael S. Tsirkin wrote:
> > On Thu, Feb 14, 2019 at 11:34:22AM +0800, Jason Wang wrote:
> > > On 2019/2/14 上午1:30, Michael S. Tsirkin wrote:
> > > > On Wed, Feb 13, 2019 at 06:33:50PM +0800, Jason Wang wrote:
> > > > > On 2019/2/13 上午1:35, Michael S. Tsirkin wrote:
> > > > > > On Tue, Feb 12, 2019 at 04:47:08PM +0000, David Riddoch wrote:
> > > > > > > > > > > I would prefer to have the write barrier before writing the idx.
> > > > > > > > > > Well that's driver overhead for something device might never utilise in
> > > > > > > > > > a given workload. If we are optimizing let's optimize for speed.
> > > > > > > > > I think doing the barrier before writing idx is best for speed (see below).
> > > > > > > > I don't see it below:(
> > > > > > > Sorry, I'm not being clear.  If the write barrier is before the idx, then a
> > > > > > > PV device can read the idx, do a single rmb and read a whole bunch of
> > > > > > > descriptors.  As things stand today a PV device has to do an rmb for each
> > > > > > > descriptor that it reads.
> > > > > > > 
> > > > > > > I'm not sure, but this may be what Jason meant when he said "prefetch".
> > > > > Yes.
> > > > > 
> > > > > 
> > > > > > Oh. Right. So for PV it's easy to convert an rmb to a data dependency
> > > > > > instead.  It remains to be seen whether an extra cache miss per
> > > > > > batch is cheaper or more expensive than such a dependency
> > > > > > per descriptor.
> > > > > > 
> > > > > > I hope Jason can experiment and let us know.
> > > > > > 
> > > > > > 
> > > > > To clarify, actually I did play with something like:
> > > > > 
> > > > > 
> > > > > if (desc_is_avail((last_avail + 32) % size))
> > > > > 
> > > > >       [last_avail, last_avail + 32] is avail [1]
> > > > > 
> > > > > else if (desc_is_avail((last_avail + 1) % size))
> > > > > 
> > > > >       last_avail is avail
> > > > > 
> > > > > else
> > > > > 
> > > > >      no new
> > > > > 
> > > > > 
> > > > > And it looks like [1] depends on the driver doing:
> > > > > 
> > > > > for all descriptors
> > > > > 
> > > > >       update desc (but not make it available for the flag)
> > > > > 
> > > > > wmb()
> > > > > 
> > > > > for all descriptors
> > > > > 
> > > > >       update desc.flag
> > > > > 
> > > > > 
> > > > > This means that if last_avail + 32 is avail, there is no need to check the
> > > > > flags of [last_avail, last_avail + 31], since the descriptor content (except
> > > > > for the flag) is guaranteed to be correct.
> > > > We can try looking into that, sure. It's not just the wmb though, which is
> > > > anyway free on x86. There are cases of e.g. chained descriptors
> > > > to consider, where we do not want the device to see a partial chain.
> > > 
> > > If we see that the last descriptor is in a chain, it still does no harm, we just
> > > read more?
> > So it's possible to add a variant that sticks a wmb after each flag
> > update for sure. However if you are writing out batches that's an
> > upfront cost and what you gain is prefetch on the other side which is
> > speculative.
> 
> 
> Yes.
> 
> 
> > 
> > 
> > > > 
> > > > > I get ~10% PPS improvement.
> > > > > 
> > > > > Thanks
> > > > I suspect most of the gain isn't in the check at all though.
> > > 
> > > The gain was:
> > > 
> > > 1) batched reading of descriptors, saving the number of userspace accesses
> > Yes, that needs to be fixed. The way the linux vhost driver does userspace
> > accesses right now, with stac/clac and memory barriers within a loop, is
> > plain silly.  Designing a host/guest interface to work around that kind
> > of inefficiency makes no sense; it's purely an implementation issue.
> 
> 
> Actually, it has other advantages: consider the case when the driver is faster;
> batch reading may reduce the cache contention on the descriptor ring (when
> the ring is almost full).

A shared index implies always doing a cache miss though.
By comparison, speculatively reading a cache line ahead
will sometimes get you a cache hit.

> 
> > 
> > 2) reducing the rmb()s; otherwise you need an rmb per descriptor
> > Well not really, you can
> > - validate a batch then do a single rmb
> > - replace rmb with a dependency
> > 
> > > > It's probably reading descriptors and maybe loop unrolling.
> > > 
> > > I don't quite understand why unrolling helps in this case.
> > The amount of work in this loop is tiny, but not tiny
> > enough for gcc to decide it's free. Take a look:
> > 
> > 
> > #include <stdio.h>
> > 
> > struct desc {
> >          unsigned long long a;
> >          unsigned len;
> >          unsigned short type;
> >          unsigned short flags;
> > };
> > #ifdef UNROLL
> > __attribute__((optimize("unroll-loops")))
> > #endif
> > int foo(struct desc *desc)
> > {
> >       unsigned short aflags = 0xffff;
> >       unsigned short oflags = 0x0;
> > 
> >       int i;
> > 
> >          for (i=0; i < 4; ++i) {
> >               aflags &= desc[i].flags;
> >               oflags |= desc[i].flags;
> >          }
> > 
> >          return aflags == oflags;
> > }
> > 
> > $ gcc -Wall -DUNROLL -O2 -c l.c
> 
> 
> I think you mean without -DUNROLL.


Since there's a jump I think you are right.

> 
> > 
> > 0000000000000000 <foo>:
> > foo():
> >     0:   48 8d 47 0e             lea    0xe(%rdi),%rax
> >     4:   31 c9                   xor    %ecx,%ecx
> >     6:   48 83 c7 4e             add    $0x4e,%rdi
> >     a:   be ff ff ff ff          mov    $0xffffffff,%esi
> >     f:   0f b7 10                movzwl (%rax),%edx
> >    12:   48 83 c0 10             add    $0x10,%rax
> >    16:   21 d6                   and    %edx,%esi
> >    18:   09 d1                   or     %edx,%ecx
> >    1a:   48 39 f8                cmp    %rdi,%rax
> >    1d:   75 f0                   jne    f <foo+0xf>
> >    1f:   31 c0                   xor    %eax,%eax
> >    21:   66 39 ce                cmp    %cx,%si
> >    24:   0f 94 c0                sete   %al
> >    27:   c3                      retq
> > 
> > 
> > 
> > $ gcc -Wall -DUNROLL -O2 -c l.c
> > 
> > Disassembly of section .text:
> > 
> > 0000000000000000 <foo>:
> > foo():
> >     0:   0f b7 47 1e             movzwl 0x1e(%rdi),%eax
> >     4:   0f b7 77 2e             movzwl 0x2e(%rdi),%esi
> >     8:   0f b7 4f 0e             movzwl 0xe(%rdi),%ecx
> >     c:   89 c2                   mov    %eax,%edx
> >     e:   09 f0                   or     %esi,%eax
> >    10:   21 f2                   and    %esi,%edx
> >    12:   09 c8                   or     %ecx,%eax
> >    14:   21 ca                   and    %ecx,%edx
> >    16:   0f b7 4f 3e             movzwl 0x3e(%rdi),%ecx
> >    1a:   09 c8                   or     %ecx,%eax
> >    1c:   21 ca                   and    %ecx,%edx
> >    1e:   66 39 c2                cmp    %ax,%dx
> >    21:   0f 94 c0                sete   %al
> >    24:   0f b6 c0                movzbl %al,%eax
> >    27:   c3                      retq
> > 
> > 
> > See? gcc just does not unroll aggressively enough.
> 
> 
> Ok, so this is just for avoiding branches.

Which is important for a variety of reasons, e.g. Linux adds tricks
to block speculation, so branches might not be predicted even on Intel.
A pipeline stall on each read isn't good for speed.

> 
> > 
> > > > Something like:
> > > > 
> > > > __attribute__((optimize("unroll-loops")))
> > > > bool fast_get_descriptors(vq, desc)
> > > > {
> > > > 
> > > > 	copy X descriptors
> > > > 	u16 aflags = 0xffff;
> > > > 	u16 oflags = 0x0;
> > > > 	for (i = 0; i < X; ++i) {
> > > > 		aflags &= desc[i].flags;
> > > > 		oflags |= desc[i].flags;
> > > > 	}
> > > > 	/* all flags must be same */
> > > > 	if (aflags != oflags)
> > > > 		return false;
> > > > 
> > > > 	return is_avail(vq->wrapcount, aflags);
> > > > }
> > > 
> > > This may work. It should be a little bit less efficient, but without any
> > > assumptions about the driver, if I understand correctly.
> > > 
> > > 
> > > > 
> > > > I also think 32 is very aggressive, I would start with 8.
> > > 
> > > Maybe, or adding some heuristic to predict it.
> > > 
> > > Thanks
> > The reason I say 8 is that it's two cache lines, which
> > is typically what the cpu pulls in anyway (1+linear pfetch).
> > Anything more might have more cost, as it might
> > put more pressure on the cache. Further,
> > statically sized short loops can be unrolled by the compiler if
> > you ask it nicely, which saves a bunch of branching,
> > speculation etc. Dynamically sized ones can't;
> > you end up with extra math and branches in the loop.
> 
> 
> You're probably right here, but we need some benchmarks to check.
> 
> Thanks
> 
> 
> > 
> > 
> > > > 
> > > > With above, you could try:
> > > > 
> > > > 	if (fast_get_descriptors(vq, desc))
> > > > 	     [last_avail, last_avail + 32] is avail [1]
> > > > 	else if (desc_is_avail((last_avail + 1) % size))
> > > > 	     last_avail is avail
> > > > 	 else
> > > >    	 	no new
> > > > 
> > > > 

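Pulling the pieces of this mail together, a compilable sketch of the batched check: copy a fixed-size batch of descriptors, fold their flags with AND/OR, and accept the batch only if every flag word is identical and indicates "available" for the current wrap counter. The batch size is a compile-time constant so gcc can unroll both loops. Structure names and layout are illustrative rather than an existing implementation, and the barrier handling debated above is reduced to comments:

#include <stdbool.h>
#include <stdint.h>

#define BATCH 8
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

struct desc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };

struct vq {
    struct desc *ring;
    uint16_t size;        /* ring size, a power of two */
    uint16_t last_avail;  /* next descriptor to consume */
    bool wrap;            /* device ring wrap counter */
};

static bool is_avail(bool wrap, uint16_t flags)
{
    bool avail = flags & DESC_F_AVAIL;
    bool used  = flags & DESC_F_USED;

    return avail == wrap && used != wrap;
}

__attribute__((optimize("unroll-loops")))
static bool fast_get_descriptors(struct vq *vq, struct desc *out)
{
    uint16_t aflags = 0xffff, oflags = 0;

    /* Copy the batch first; on a real device/vhost a barrier is needed
     * between reading the flags and trusting the rest of the copy. */
    for (int i = 0; i < BATCH; i++)
        out[i] = vq->ring[(uint16_t)(vq->last_avail + i) & (vq->size - 1)];

    for (int i = 0; i < BATCH; i++) {
        aflags &= out[i].flags;
        oflags |= out[i].flags;
    }

    /* All BATCH flag words must be identical and must read as "available".
     * If the batch spans the ring wrap the encodings differ, the test
     * fails, and the caller falls back to the one-descriptor path. */
    return aflags == oflags && is_avail(vq->wrap, aflags);
}
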
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-11  8:52   ` David Riddoch
  2019-02-11 14:24     ` Michael S. Tsirkin
@ 2019-04-30 22:41     ` Michael S. Tsirkin
  2019-06-06 12:34       ` Michael S. Tsirkin
  1 sibling, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-04-30 22:41 UTC (permalink / raw)
  To: David Riddoch; +Cc: Virtio-Dev

On Mon, Feb 11, 2019 at 08:52:01AM +0000, David Riddoch wrote:
> > > Presumably we would like this to be an optional feature, as implementations
> > > of packed mode already exist in the wild.  How about
> > > VIRTIO_F_RING_PACKED_AVAIL_IDX?
> > > If I prepare a patch to the spec is there still time to get this into v1.1?
> > Any new feature would require another round of public review.
> > I personally think it's better to instead try to do 1.2 soon after,
> > e.g. we could try to target quarterly releases.
> > 
> > But ultimately that would be up to the TC vote.

Hi David!
Now that 1.1 is out, how about proposing this for 1.2?
It's probably a very good time for such a proposal.
Do you have the time to spec it out?

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-04-30 22:41     ` Michael S. Tsirkin
@ 2019-06-06 12:34       ` Michael S. Tsirkin
  2019-06-17 14:41         ` Michael S. Tsirkin
  0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-06-06 12:34 UTC (permalink / raw)
  To: David Riddoch; +Cc: Virtio-Dev

On Tue, Apr 30, 2019 at 06:41:14PM -0400, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 08:52:01AM +0000, David Riddoch wrote:
> > > > Presumably we would like this to be an optional feature, as implementations
> > > > of packed mode already exist in the wild.  How about
> > > > VIRTIO_F_RING_PACKED_AVAIL_IDX?
> > > > If I prepare a patch to the spec is there still time to get this into v1.1?
> > > Any new feature would require another round of public review.
> > > I personally think it's better to instead try to do 1.2 soon after,
> > > e.g. we could try to target quarterly releases.
> > > 
> > > But ultimately that would be up to the TC vote.
> 
> Hi David!
> Now that 1.1 is out, how about proposing this for 1.2?
> It's probably a very good time for such a proposal.
> Do you have the time to spec it out?

Any interest at all? Or are we good for 1.2?

> -- 
> MST
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-06-06 12:34       ` Michael S. Tsirkin
@ 2019-06-17 14:41         ` Michael S. Tsirkin
  0 siblings, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-06-17 14:41 UTC (permalink / raw)
  To: David Riddoch; +Cc: Virtio-Dev

On Thu, Jun 06, 2019 at 08:34:17AM -0400, Michael S. Tsirkin wrote:
> On Tue, Apr 30, 2019 at 06:41:14PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 08:52:01AM +0000, David Riddoch wrote:
> > > > > Presumably we would like this to be an optional feature, as implementations
> > > > > of packed mode already exist in the wild.  How about
> > > > > VIRTIO_F_RING_PACKED_AVAIL_IDX?
> > > > > If I prepare a patch to the spec is there still time to get this into v1.1?
> > > > Any new feature would require another round of public review.
> > > > I personally think it's better to instead try to do 1.2 soon after,
> > > > e.g. we could try to target quarterly releases.
> > > > 
> > > > But ultimately that would be up to the TC vote.
> > 
> > Hi David!
> > Now that 1.1 is out, how about proposing this for 1.2?
> > It's probably a very good time for such a proposal.
> > Do you have the time to spec it out?
> 
> Any interest at all? Or are we good for 1.2?

ping

> > -- 
> > MST
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-12 20:03     ` Rob Miller
@ 2019-02-13 17:38       ` Michael S. Tsirkin
  0 siblings, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-13 17:38 UTC (permalink / raw)
  To: Rob Miller; +Cc: David Riddoch, Virtio-Dev

On Tue, Feb 12, 2019 at 03:03:06PM -0500, Rob Miller wrote:
> 
> 
> On Tue, Feb 12, 2019 at 1:55 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> 
>     On Fri, Feb 01, 2019 at 09:43:02AM -0800, Rob Miller wrote:
>     > Agreed that this is needed.
>     >
>     > I would also like to suggest splitting F_IN_ORDER into
>     > F_RX_IN_ORDER and F_TX_IN_ORDER to support hw LRO implementations,
>     > where rx can be more of a scatter/gather operation than tx. This would allow
>     > batch mode for tx at least in packed rings.
> 
>     I'm not sure what this buys us. Are you interested in
>     out of order tx with in order rx then?
> 
> Other way around. It is easier to guarantee that TX pkts are processed in
> order, but rx is quite a bit more difficult. The way some rx aggregators work is
> that as pkts are received from the wire, a "timer" and flow detector are set up.
> This allows other incoming packets with the same TCP header (flow) to be
> aggregated into one large packet with multiple segments. When a timeout fires
> (no other segments arrive in time) or the TCP header changes, the whole list of
> RX buffers is then sent up to the driver for processing. However, during this
> gathering period multiple flows, say from other TCP sources, could be arriving.
> The hardware allocates rx buffers (descriptors) as the fragments hit the
> receiver, hence they will be used out-of-order and batch mode on RX isn't
> possible.

OK so that's for in-order. Sounds reasonable enough, would you like to
propose a spec enhancement on virtio-comment?  From what you describe it
would seem that it is not up to driver, so it should be up to the
device, where some VQs describe themselves as out of order on a device
which is otherwise in order.

> 
>     > Finally, I would suggest a means to specify a given ring's mode:
>     > packed leans more towards TX, whilst split can be either, really,
>     > depending upon LRO, jumbo, rx buff size, etc. Just like F_IN_ORDER,
>     > we can have RX & TX split out.
>     >
>     > Sent from my iPhone
> 
>     Before we jump there a bit more justification would be nice.
> 
> See the above description of rx aggregation, where buffers are allocated for each
> rx pkt segment as it hits the receiver. If there were 2 ingressing flows from
> different sources, say nicely interleaved, then I would expect to see rx desc
> usage as follows:
> 
> buffer 0 - flow 0
> buffer 1 - flow 1
> buffer 2 - flow 0
> buffer 3 - flow 1
> 
> I would think this lends itself better to a split virtqueue mode of operation
> than to a packed ring.

I'm not sure I follow why. Split virtqueue uses less cache
when only a small number of descriptors are outstanding.
It pays for this with indirection and more cache line bounces.
Its main advantage is thus the PV case, where guest and host
share a cache, so there is generally ~1 descriptor in flight.

> 
> 
>     E.g. doing this change in software isn't a lot of work. How about
>     a software patch with some performance gains measured?
>     Failing that, some back of the napkin calculations showing
>     the potential gains and costs?
> 
> Yea I could do that. I wanted to bounce the idea around a bit first to
> understand more if there are holes in my logic. 
> 
> 
> 
>     > > On Feb 1, 2019, at 9:23 AM, David Riddoch <driddoch@solarflare.com>
>     wrote:
>     > >
>     > > All,
>     > >
>     > > I'd like to propose a small extension to the packed virtqueue mode.  My
>     > > proposal is to add an offset/wrap field, written by the driver,
>     > > indicating how many available descriptors have been added to the ring.
>     > >
>     > > The reason for wanting this is to improve performance of hardware
>     > > devices.  Because of high read latency over a PCIe bus, it is important
>     > > for hardware devices to read multiple ring entries in parallel.  It is
>     > > desirable to know how many descriptors are available prior to issuing
>     > > these reads, else you risk fetching descriptors that are not yet
>     > > available.  As well as wasting bus bandwidth this adds complexity.
>     > >
>     > > I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this
>     > > problem, but we still have a problem.  If you rely on doorbells to tell
>     > > you how many descriptors are available, then you have to keep doorbells
>     > > enabled at all times.  This can result in a very high rate of doorbells
>     > > with some drivers, which can become a severe bottleneck (because x86
>     > > CPUs can't emit MMIOs at very high rates).
>     > >
>     > > The proposed offset/wrap field allows devices to disable doorbells when
>     > > appropriate, and determine the latest fill level via a PCIe read.
>     > >
>     > > I suggest the best place to put this would be in the driver area,
>     > > immediately after the event suppression structure.
>     > >
>     > > Presumably we would like this to be an optional feature, as
>     > > implementations of packed mode already exist in the wild.  How about
>     > > VIRTIO_F_RING_PACKED_AVAIL_IDX?
>     > >
>     > > If I prepare a patch to the spec is there still time to get this into
>     v1.1?
>     > >
>     > > Best,
>     > > David
>     > >
>     > > --
>     > > David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
>     > >
>     > >
>     > > ---------------------------------------------------------------------
>     > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>     > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>     > >
>     >
>     > ---------------------------------------------------------------------
>     > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>     > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 

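As a toy illustration of why aggregation forces out-of-order completion (not a model of any real LRO engine; all names are invented): buffers are pulled off the ring in arrival order, but each one is parked on its flow until that flow's timer fires or its header changes, so flows hand back their buffers in bursts that interleave with one another.

#include <stdint.h>

#define MAX_SEGS 32

struct flow_ctx {
    uint64_t key;                  /* hash of the packet's TCP/IP headers */
    uint16_t desc_ids[MAX_SEGS];   /* ring descriptors absorbed so far */
    uint16_t nsegs;
    uint64_t deadline;             /* aggregation timer expiry */
};

/* On arrival: the next free ring descriptor (taken in order) is parked on
 * whichever flow the packet hashes to. */
static void on_rx_segment(struct flow_ctx *flow, uint16_t desc_id, uint64_t now)
{
    if (flow->nsegs == 0)
        flow->deadline = now + 100;        /* arbitrary aggregation window */
    if (flow->nsegs < MAX_SEGS)
        flow->desc_ids[flow->nsegs++] = desc_id;
}

/* On timeout or header change: complete every descriptor this flow holds.
 * Descriptors taken for other flows in the meantime stay outstanding, which
 * is why RX completion cannot be guaranteed to be in order. */
static void flush_flow(struct flow_ctx *flow,
                       void (*complete)(uint16_t desc_id, void *opaque),
                       void *opaque)
{
    for (uint16_t i = 0; i < flow->nsegs; i++)
        complete(flow->desc_ids[i], opaque);
    flow->nsegs = 0;
}
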
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-12 18:55   ` Michael S. Tsirkin
@ 2019-02-12 20:03     ` Rob Miller
  2019-02-13 17:38       ` Michael S. Tsirkin
  0 siblings, 1 reply; 37+ messages in thread
From: Rob Miller @ 2019-02-12 20:03 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Riddoch, Virtio-Dev

[-- Attachment #1: Type: text/plain, Size: 4806 bytes --]

On Tue, Feb 12, 2019 at 1:55 PM Michael S. Tsirkin <mst@redhat.com> wrote:

> On Fri, Feb 01, 2019 at 09:43:02AM -0800, Rob Miller wrote:
> > Agreed that this is needed.
> >
> > I would also like to suggest splitting F_IN_ORDER into
> > F_RX_IN_ORDER and F_TX_IN_ORDER to support hw LRO implementations,
> > where rx can be more of a scatter/gather operation than tx. This would allow
> > batch mode for tx at least in packed rings.
>
> I'm not sure what this buys us. Are you interested in
> out of order tx with in order rx then?
>
Other way around. It is easier to guarantee that TX pkts are processed in
order, but rx is quite a bit more difficult. The way some rx aggregators work
is that as pkts are received from the wire, a "timer" and flow detector are
set up. This allows other incoming packets with the same TCP header (flow) to
be aggregated into one large packet with multiple segments. When a timeout
fires (no other segments arrive in time) or the TCP header changes, the whole
list of RX buffers is then sent up to the driver for processing. However,
during this gathering period multiple flows, say from other TCP sources,
could be arriving. The hardware allocates rx buffers (descriptors) as the
fragments hit the receiver, hence they will be used out-of-order and batch
mode on RX isn't possible.

>
> > Finally, I would suggest a means to specify a given ring's mode:
> > packed leans more towards TX, whilst split can be either, really,
> > depending upon LRO, jumbo, rx buff size, etc. Just like F_IN_ORDER,
> > we can have RX & TX split out.
> >
> > Sent from my iPhone
>
> Before we jump there a bit more justification would be nice.
>
See the above description of rx aggregation, where buffers are allocated for
each rx pkt segment as it hits the receiver. If there were 2 ingressing
flows from different sources, say nicely interleaved, then I would expect
to see rx desc usage as follows:

buffer 0 - flow 0
buffer 1 - flow 1
buffer 2 - flow 0
buffer 3 - flow 1

I would think this lends itself better to a split virtqueue mode of operation
than to a packed ring.

>
>
> E.g. doing this change in software isn't a lot of work. How about
> a software patch with some performance gains measured?
> Failing that, some back of the napkin calculations showing
> the potential gains and costs?
>
Yea I could do that. I wanted to bounce the idea around a bit first to
understand more if there are holes in my logic.

>
>
> > > On Feb 1, 2019, at 9:23 AM, David Riddoch <driddoch@solarflare.com>
> wrote:
> > >
> > > All,
> > >
> > > I'd like to propose a small extension to the packed virtqueue mode.  My
> > > proposal is to add an offset/wrap field, written by the driver,
> > > indicating how many available descriptors have been added to the ring.
> > >
> > > The reason for wanting this is to improve performance of hardware
> > > devices.  Because of high read latency over a PCIe bus, it is important
> > > for hardware devices to read multiple ring entries in parallel.  It is
> > > desirable to know how many descriptors are available prior to issuing
> > > these reads, else you risk fetching descriptors that are not yet
> > > available.  As well as wasting bus bandwidth this adds complexity.
> > >
> > > I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this
> > > problem, but we still have a problem.  If you rely on doorbells to tell
> > > you how many descriptors are available, then you have to keep doorbells
> > > enabled at all times.  This can result in a very high rate of doorbells
> > > with some drivers, which can become a severe bottleneck (because x86
> > > CPUs can't emit MMIOs at very high rates).
> > >
> > > The proposed offset/wrap field allows devices to disable doorbells when
> > > appropriate, and determine the latest fill level via a PCIe read.
> > >
> > > I suggest the best place to put this would be in the driver area,
> > > immediately after the event suppression structure.
> > >
> > > Presumably we would like this to be an optional feature, as
> > > implementations of packed mode already exist in the wild.  How about
> > > VIRTIO_F_RING_PACKED_AVAIL_IDX?
> > >
> > > If I prepare a patch to the spec is there still time to get this into
> v1.1?
> > >
> > > Best,
> > > David
> > >
> > > --
> > > David Riddoch  <driddoch@solarflare.com> -- Chief Architect,
> Solarflare
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>

[-- Attachment #2: Type: text/html, Size: 6654 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-04  5:36   ` Stefan Hajnoczi
@ 2019-02-12 18:58     ` Michael S. Tsirkin
  0 siblings, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-12 18:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Rob Miller, David Riddoch, Virtio-Dev

On Mon, Feb 04, 2019 at 01:36:28PM +0800, Stefan Hajnoczi wrote:
> On Fri, Feb 01, 2019 at 09:43:02AM -0800, Rob Miller wrote:
> > Agreed that this is needed.
> > 
> > I would also like to suggest splitting F_IN_ORDER into
> > F_RX_IN_ORDER and F_TX_IN_ORDER to support hw LRO implementations,
> > where rx can be more of a scatter/gather operation than tx. This would allow
> > batch mode for tx at least in packed rings.
> > 
> > Finally, I would suggest a means to specify a given ring's mode:
> > packed leans more towards TX, whilst split can be either, really,
> > depending upon LRO, jumbo, rx buff size, etc. Just like F_IN_ORDER,
> > we can have RX & TX split out.
> 
> Device types besides virtio-net might also want per-virtqueue F_IN_ORDER
> and other features,

Do you know that they will? Now that we already have implementations,
how about a prototype patch showing the performance gains?

> so let's find a way to make it independent of
> virtio-net concepts like rx/tx.

Add another field along with VQ PA. It's not rocket science.

> The per-virtqueue in-order flag could live in struct
> pvirtq_event_suppress (VIRTIO 1.1 2.7.14 Event Suppression Structure
> Format).

That's for dynamic things though.

> Or a new structure could be used for per-virtqueue configuration.  Doing
> this is a little tricky if you want to select split vs packed ring
> layout on a per-virtqueue basis, since this structure is part of the
> split/packed ring layout.  It would probably be necessary to add vring
> mode selection to the transport (PCI, MMIO, CCW) instead.
> 
> Stefan

Yes.

-- 
MST

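Purely as an illustration of the "new structure" option (nothing like this exists in the spec today, and every field name below is invented): a per-virtqueue configuration record owned by the transport could carry both the ring-layout selector and per-queue feature bits such as in-order, keeping them out of the ring memory itself.

#include <stdint.h>

/* Hypothetical per-virtqueue configuration, transport-resident. */
struct hypothetical_vq_cfg {
    uint16_t queue_index;
    uint16_t queue_size;
    uint16_t ring_layout;   /* 0 = split, 1 = packed (invented encoding) */
    uint16_t flags;         /* bit 0: in-order completion for this queue */
    uint64_t desc_addr;     /* descriptor table / descriptor ring */
    uint64_t driver_addr;   /* avail ring or driver event suppression */
    uint64_t device_addr;   /* used ring or device event suppression */
};
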
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-01 17:43 ` Rob Miller
  2019-02-04  5:36   ` Stefan Hajnoczi
@ 2019-02-12 18:55   ` Michael S. Tsirkin
  2019-02-12 20:03     ` Rob Miller
  1 sibling, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2019-02-12 18:55 UTC (permalink / raw)
  To: Rob Miller; +Cc: David Riddoch, Virtio-Dev

On Fri, Feb 01, 2019 at 09:43:02AM -0800, Rob Miller wrote:
> Agreed that this is needed.
> 
> I would also like to suggest splitting F_IN_ORDER into
> F_RX_IN_ORDER and F_TX_IN_ORDER to support hw LRO implementations,
> where rx can be more of a scatter/gather operation than tx. This would allow
> batch mode for tx at least in packed rings.

I'm not sure what this buys us. Are you interested in
out of order tx with in order rx then?

> Finally, I would suggest a means to specify a given ring's mode:
> packed leans more towards TX, whilst split can be either, really,
> depending upon LRO, jumbo, rx buff size, etc. Just like F_IN_ORDER,
> we can have RX & TX split out.
> 
> Sent from my iPhone

Before we jump there a bit more justification would be nice.


E.g. doing this change in software isn't a lot of work. How about
a software patch with some performance gains measured?
Failing that, some back of the napkin calculations showing
the potential gains and costs?


> > On Feb 1, 2019, at 9:23 AM, David Riddoch <driddoch@solarflare.com> wrote:
> >
> > All,
> >
> > I'd like to propose a small extension to the packed virtqueue mode.  My
> > proposal is to add an offset/wrap field, written by the driver,
> > indicating how many available descriptors have been added to the ring.
> >
> > The reason for wanting this is to improve performance of hardware
> > devices.  Because of high read latency over a PCIe bus, it is important
> > for hardware devices to read multiple ring entries in parallel.  It is
> > desirable to know how many descriptors are available prior to issuing
> > these reads, else you risk fetching descriptors that are not yet
> > available.  As well as wasting bus bandwidth this adds complexity.
> >
> > I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this
> > problem, but we still have a problem.  If you rely on doorbells to tell
> > you how many descriptors are available, then you have to keep doorbells
> > enabled at all times.  This can result in a very high rate of doorbells
> > with some drivers, which can become a severe bottleneck (because x86
> > CPUs can't emit MMIOs at very high rates).
> >
> > The proposed offset/wrap field allows devices to disable doorbells when
> > appropriate, and determine the latest fill level via a PCIe read.
> >
> > I suggest the best place to put this would be in the driver area,
> > immediately after the event suppression structure.
> >
> > Presumably we would like this to be an optional feature, as
> > implementations of packed mode already exist in the wild.  How about
> > VIRTIO_F_RING_PACKED_AVAIL_IDX?
> >
> > If I prepare a patch to the spec is there still time to get this into v1.1?
> >
> > Best,
> > David
> >
> > --
> > David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
  2019-02-01 17:43 ` Rob Miller
@ 2019-02-04  5:36   ` Stefan Hajnoczi
  2019-02-12 18:58     ` Michael S. Tsirkin
  2019-02-12 18:55   ` Michael S. Tsirkin
  1 sibling, 1 reply; 37+ messages in thread
From: Stefan Hajnoczi @ 2019-02-04  5:36 UTC (permalink / raw)
  To: Rob Miller; +Cc: David Riddoch, Virtio-Dev

[-- Attachment #1: Type: text/plain, Size: 1260 bytes --]

On Fri, Feb 01, 2019 at 09:43:02AM -0800, Rob Miller wrote:
> Agreed that this is needed.
> 
> I would also like to suggest splitting F_IN_ORDER into
> F_RX_IN_ORDER and F_TX_IN_ORDER to support hw LRO implementations,
> where rx can be more of a scatter/gather operation than tx. This would allow
> batch mode for tx at least in packed rings.
> 
> Finally, I would suggest a means to specify a given ring's mode:
> packed leans more towards TX, whilst split can be either, really,
> depending upon LRO, jumbo, rx buff size, etc. Just like F_IN_ORDER,
> we can have RX & TX split out.

Device types besides virtio-net might also want per-virtqueue F_IN_ORDER
and other features, so let's find a way to make it independent of
virtio-net concepts like rx/tx.

The per-virtqueue in-order flag could live in struct
pvirtq_event_suppress (VIRTIO 1.1 2.7.14 Event Suppression Structure
Format).

Or a new structure could be used for per-virtqueue configuration.  Doing
this is a little tricky if you want to select split vs packed ring
layout on a per-virtqueue basis, since this structure is part of the
split/packed ring layout.  It would probably be necessary to add vring
mode selection to the transport (PCI, MMIO, CCW) instead.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
       [not found] <501110004.11631.1549031044295@oodm23.prod.google.com>
@ 2019-02-01 17:43 ` Rob Miller
  2019-02-04  5:36   ` Stefan Hajnoczi
  2019-02-12 18:55   ` Michael S. Tsirkin
  0 siblings, 2 replies; 37+ messages in thread
From: Rob Miller @ 2019-02-01 17:43 UTC (permalink / raw)
  To: David Riddoch; +Cc: Virtio-Dev

Agreed that this is needed.

I would also like to suggest splitting F_IN_ORDER into
F_RX_IN_ORDER and F_TX_IN_ORDER to support hw LRO implementations,
where rx can be more of a scatter/gather operation than tx. This would allow
batch mode for tx at least in packed rings.

Finally, I would suggest a means to specify a given ring's mode:
packed leans more towards TX, whilst split can be either, really,
depending upon LRO, jumbo, rx buff size, etc. Just like F_IN_ORDER,
we can have RX & TX split out.

Sent from my iPhone

> On Feb 1, 2019, at 9:23 AM, David Riddoch <driddoch@solarflare.com> wrote:
>
> All,
>
> I'd like to propose a small extension to the packed virtqueue mode.  My
> proposal is to add an offset/wrap field, written by the driver,
> indicating how many available descriptors have been added to the ring.
>
> The reason for wanting this is to improve performance of hardware
> devices.  Because of high read latency over a PCIe bus, it is important
> for hardware devices to read multiple ring entries in parallel.  It is
> desirable to know how many descriptors are available prior to issuing
> these reads, else you risk fetching descriptors that are not yet
> available.  As well as wasting bus bandwidth this adds complexity.
>
> I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this
> problem, but we still have a problem.  If you rely on doorbells to tell
> you how many descriptors are available, then you have to keep doorbells
> enabled at all times.  This can result in a very high rate of doorbells
> with some drivers, which can become a severe bottleneck (because x86
> CPUs can't emit MMIOs at very high rates).
>
> The proposed offset/wrap field allows devices to disable doorbells when
> appropriate, and determine the latest fill level via a PCIe read.
>
> I suggest the best place to put this would be in the driver area,
> immediately after the event suppression structure.
>
> Presumably we would like this to be an optional feature, as
> implementations of packed mode already exist in the wild.  How about
> VIRTIO_F_RING_PACKED_AVAIL_IDX?
>
> If I prepare a patch to the spec is there still time to get this into v1.1?
>
> Best,
> David
>
> --
> David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>

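To make the quoted proposal concrete, a sketch of what the driver-written field could look like, modelled loosely on the off_wrap encoding the packed-ring event suppression structure already uses. This only illustrates the idea in this thread, not an agreed layout; the structure name and helper are invented:

#include <stdint.h>

/* Proposed to live in the driver area, immediately after the driver event
 * suppression structure, and to be written only by the driver. */
struct pvirtq_avail_idx {
    uint16_t off_wrap;   /* bits 0..14: offset of the next descriptor the
                          * driver will make available;
                          * bit 15: the driver ring wrap counter */
};

/* Driver side: call this after the write barrier that publishes the flags
 * of newly available descriptors, so a device that has doorbells disabled
 * and polls this field over PCIe never sees an index ahead of valid
 * descriptors. */
static inline void publish_avail(volatile struct pvirtq_avail_idx *a,
                                 uint16_t next_avail, uint16_t wrap)
{
    a->off_wrap = (uint16_t)((next_avail & 0x7fff) | ((uint16_t)wrap << 15));
}
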
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2019-06-17 14:41 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-01 14:23 [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload David Riddoch
2019-02-11  7:33 ` Michael S. Tsirkin
2019-02-11  8:52   ` David Riddoch
2019-02-11 14:24     ` Michael S. Tsirkin
2019-02-11 14:58       ` David Riddoch
2019-02-12  5:08         ` Michael S. Tsirkin
2019-02-12  7:28           ` Jason Wang
2019-02-12 13:44             ` Michael S. Tsirkin
2019-02-13 10:00               ` Jason Wang
2019-02-13 15:20                 ` Michael S. Tsirkin
2019-02-14  3:21                   ` Jason Wang
2019-02-14  3:41                     ` Michael S. Tsirkin
2019-02-15  3:59                       ` Jason Wang
2019-02-15  4:23                         ` Michael S. Tsirkin
2019-02-19  6:21                           ` Jason Wang
2019-02-19 14:18                             ` Michael S. Tsirkin
2019-02-12 11:40           ` David Riddoch
2019-02-12 12:51             ` Michael S. Tsirkin
2019-02-12 14:01             ` Michael S. Tsirkin
2019-02-12 16:47               ` David Riddoch
2019-02-12 17:35                 ` Michael S. Tsirkin
2019-02-13  9:49                   ` David Riddoch
2019-02-13 10:33                   ` Jason Wang
2019-02-13 17:30                     ` Michael S. Tsirkin
2019-02-14  3:34                       ` Jason Wang
2019-02-14  4:04                         ` Michael S. Tsirkin
2019-02-19  6:33                           ` Jason Wang
2019-02-19 14:27                             ` Michael S. Tsirkin
2019-04-30 22:41     ` Michael S. Tsirkin
2019-06-06 12:34       ` Michael S. Tsirkin
2019-06-17 14:41         ` Michael S. Tsirkin
     [not found] <501110004.11631.1549031044295@oodm23.prod.google.com>
2019-02-01 17:43 ` Rob Miller
2019-02-04  5:36   ` Stefan Hajnoczi
2019-02-12 18:58     ` Michael S. Tsirkin
2019-02-12 18:55   ` Michael S. Tsirkin
2019-02-12 20:03     ` Rob Miller
2019-02-13 17:38       ` Michael S. Tsirkin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.