* Virtio BoF minutes from KVM Forum 2017
@ 2017-10-29 12:52 ` Jens Freimann
From: Jens Freimann @ 2017-10-29 12:52 UTC (permalink / raw)
  To: virtio-dev, virtualization, virtio-comment


Virtio BoF minutes KVM Forum 2017

Attendees: Amnon Ilan, Maxime Coquelin, Vlad Yasevich, Malcolm Crossley,
	   David Vrabel, Ilya Lesokhin, Cunming Liang, Jens Freimann

Topics: packed ring layout with respect to hardware implementations

References:
https://lists.oasis-open.org/archives/virtio-dev/201702/msg00010.html
https://lists.oasis-open.org/archives/virtio-dev/201709/msg00013.html

Malcolm Crossley, David Vrabel:
- keep in mind not to optimize only for networking with small frame sizes;
  storage has much larger transfer sizes
- is there really no cache line ping-pong when we overwrite the same cache
  line? With 4 descriptors per line, accessing two at the same time will
  cause cache coherency traffic, no?
- interesting quirk: we flip a bit, but Intel hardware doesn't write single
  bytes, it always writes a full dword. Will that be a problem?
- interesting to look at the NVMe protocol, which seems to solve some of the
  same problems on the hardware side
- VMware vmxnet3 has a separate data ring for larger amounts of data. Not
  something to copy, but still interesting

Steve:
- is the _MORE flag from the packed ring layout proposal still in use? What is
  its meaning?

Ilya:
- you might have more completions than descriptors available
- partial descriptor chains are a problem for hardware because you might have
  to read a bunch of descriptors twice
- how would you deal with a big buffer that contains a large number of small
  packets with respect to completions?
- is one bit for completion enough? Right now it means the descriptor was
  actually used. How do we signal when it was completed?
- concerned about not being able to do scatter/gather with the ring layout.
  Network drivers make heavy use of indirect buffers.
- for a hardware implementation a completion ring is a very convenient format
  for some use cases, so we want an efficient implementation for them. If we
  had an inline descriptor, then a completion ring would just be a normal ring
  and we wouldn't need another ring type.
- doesn't like the fact that we need a linear scan to find the length of a
  descriptor chain. It would be nice to have the length of the chain in the
  first descriptor (i.e. the number of chained descriptors, not the number of
  posted descriptors, which can be deduced from the id field). See the
  descriptor sketch below.
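
For reference, a minimal sketch of the 16-byte descriptor from the packed ring
layout proposal (field names and the exact layout are still under discussion):

#include <stdint.h>

/* Sketch of one packed-ring descriptor as discussed above. */
struct pring_desc {
        uint64_t addr;   /* buffer guest-physical address */
        uint32_t len;    /* buffer length in bytes */
        uint16_t id;     /* buffer id, echoed back by the device */
        uint16_t flags;  /* avail/used/write/next-style flag bits */
};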


Vlad:
- there were discussions about having a bigger descriptor. Then we would have
  more space to put things like a vnet header into the descriptor. It would
  also mean fewer conflicts from accessing the same cache line. (Descriptors
  already grew to 16 bytes; do we need more?) See the illustrative sketch
  below.
- was playing with the idea of different ring types for different devices,
  e.g. scsi, net: start with generic information, then protocol-specific data
  follows. Ilya agrees. The descriptor length could be made flexible by adding
  a descriptor length field.
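
As an illustration only (nothing here is agreed on), a bigger 32-byte net
descriptor with an inline vnet header could look roughly like this:

#include <stdint.h>

/* Hypothetical 32-byte descriptor with the vnet header inlined; purely
 * illustrative, sizes and field order were not discussed in detail. */
struct pring_net_desc {
        uint64_t addr;     /* buffer address */
        uint32_t len;      /* buffer length */
        uint16_t id;       /* buffer id */
        uint16_t flags;    /* avail/used/write flags */
        uint8_t  hdr[12];  /* inline virtio-net header (mergeable-buffer size) */
        uint32_t pad;      /* pad to 32 bytes */
};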

How to continue / TODOs:
 - do benchmarking with bigger frame sizes on fast enough NICs
 - turn the prototype code into an RFC series (work in progress)
 - get more interested people to join the monthly meetings

Open questions:
- Do we need an (optional) completion ring?
- Is there a situation where 4 descriptors in a cache line is a problem
  because we access the same cache line, causing cache ping-pong?
- Does interrupt suppression require the device to do a memory read after
  writing out descriptors? Will that be too costly? Should the driver write
  out an index instead? (See the event-index check below for reference.)
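
For reference, the split-ring event index check that a device has to evaluate
after writing out used entries looks roughly like this (as in the virtio 1.0
spec and the Linux vring code); the open question is whether something similar
is acceptable for the new layout:

#include <stdbool.h>
#include <stdint.h>

/* Returns true if the device should send a notification, given the event
 * index published by the driver and the old/new used index. */
static inline bool vring_need_event(uint16_t event_idx, uint16_t new_idx,
                                    uint16_t old_idx)
{
        return (uint16_t)(new_idx - event_idx - 1) <
               (uint16_t)(new_idx - old_idx);
}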


regards
Jens


* Re: Virtio BoF minutes from KVM Forum 2017
  2017-10-29 12:52 ` [virtio-dev] " Jens Freimann
@ 2017-11-01 14:59 ` Michael S. Tsirkin
From: Michael S. Tsirkin @ 2017-11-01 14:59 UTC (permalink / raw)
  To: Jens Freimann; +Cc: virtio-dev, virtio-comment, virtualization

On Sun, Oct 29, 2017 at 01:52:25PM +0100, Jens Freimann wrote:
> Ilya: - you might have more completions than descriptors available
> - partial descriptor chains are a problem for hardware because you might have
>  to read a bunch of conscriptors twice - how would you do deal with a big
> buffer that cointains a large number of
>  small packets with respect to completions?
> - is one bit for completion enough? right now it means descriptor was actually
>  used. how to we signal when it was completed?

I am not sure I understand the difference. Under virtio, the driver makes a
descriptor available, then the device reads/writes memory depending on the
descriptor type, then marks it as used.

What does completed mean?

> - concerned about not being able to do scatter/gatter with the ring layout.
>  Network drivers heavily using indirect buffers.  - for a hardware
> implementation a completion ring is a very convenient form for
>  some use cases, so we want an efficient implementation for them. If we had an
>  inline descriptor then a completion ring is just a normal ring and we won't
>  need another ring type.
> - doesn't like the fact that we need to do a linear scan to find the length of
>  a descriptor chain. It would be nice if we could have the length of the chain
>  in the first descriptor (i.e. the number of chained descriptors, not the number
>  of posted descriptors which can be deduced from the id field)

Not responding to the rest of the points since I don't understand the basic
assumption above yet.

-- 
MST


* RE: Virtio BoF minutes from KVM Forum 2017
  2017-11-01 14:59 ` [virtio-dev] " Michael S. Tsirkin
@ 2017-11-01 15:52     ` Ilya Lesokhin
From: Ilya Lesokhin @ 2017-11-01 15:52 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization, virtio-comment

On Wednesday, November 01, 2017 4:59 PM, Michael S. Tsirkin wrote:

> On Sun, Oct 29, 2017 at 01:52:25PM +0100, Jens Freimann wrote:
> > Ilya: - you might have more completions than descriptors available
> > - partial descriptor chains are a problem for hardware because you
> > might have  to read a bunch of conscriptors twice - how would you do
> > deal with a big buffer that cointains a large number of  small packets
> > with respect to completions?
> > - is one bit for completion enough? right now it means descriptor was
> > actually  used. how to we signal when it was completed?
> 
> I am not sure I understand the difference. Under virtio, driver makes a
> descriptor available, then device reads/writes memory depending on descriptor
> type, then marks it as used.
> 
> What does completed mean?
> 

During the BoF, someone raised the point that there is no indication that the HW has
read the descriptor. I think after some discussion we agreed that it's not a useful indication.

My issues with the current completion/used notifications are as follows:
1. There is no room for extra metadata such as a checksum or flow tag.
You could put that in the descriptor payload, but it's somewhat inconvenient:
you have to either use an additional descriptor for metadata per chain,
or put it in one of the buffers and force the lifetime of the metadata and data to be the same.

2. The current format assumes a 1:1 correspondence between descriptors and completions.
You did offer a skipping optimization for many descriptors -> 1 completion,
but it is somewhat inefficient.
And you didn't offer a solution for 1 descriptor -> multiple completions.
Mellanox has a feature called striding RQ where you post a large buffer and
the NIC fills it with multiple back-to-back packets with padding.
Each packet generates its own completion.

3. There is a usage model where you have multiple producer rings
and a single completion ring.
You could implement the completion ring using an additional virtio ring, but
the current model requires an extra indirection, as it forces you to write into
the buffers that the descriptors in the completion ring point to, rather than
writing the completion into the ring itself.
Additionally, the device is still required to write to the original producer ring
in addition to the completion ring.

I think the best and most flexible design is to have variable-size descriptors that
start with a dword header.
The dword header will include an ownership bit, an opcode and the descriptor length.
The opcode and the "length" dwords following the header will be device specific.

The meaning of the owner bit changes on each ring wrap-around, so the device doesn't
need to update it.

Each device (or device class) can choose whether completions are reported directly inside
the descriptors in that ring or in a separate completion ring.

Completion rings can be implemented in an efficient manner with this design:
the driver initializes a dedicated completion ring with empty completion-sized descriptors,
and the device writes the completions directly into the ring.
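
As a rough sketch of what I mean (the field widths here are just an example,
not a proposal):

#include <stdint.h>

/* Variable-size descriptor header: one generic dword, followed by
 * 'length' device-specific dwords. Widths are illustrative only. */
struct var_desc_hdr {
        uint32_t owner  : 1;   /* meaning flips on each ring wrap-around */
        uint32_t opcode : 7;   /* device-specific operation code */
        uint32_t length : 8;   /* number of dwords following the header */
        uint32_t rsvd   : 16;
};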


* Re: Virtio BoF minutes from KVM Forum 2017
  2017-11-01 15:52     ` [virtio-dev] " Ilya Lesokhin
@ 2017-11-01 17:35     ` Michael S. Tsirkin
From: Michael S. Tsirkin @ 2017-11-01 17:35 UTC (permalink / raw)
  To: Ilya Lesokhin; +Cc: virtio-dev, virtualization, virtio-comment

On Wed, Nov 01, 2017 at 03:52:12PM +0000, Ilya Lesokhin wrote:
> On Wednesday, November 01, 2017 4:59 PM, Michael S. Tsirkin wrote:
> 
> > On Sun, Oct 29, 2017 at 01:52:25PM +0100, Jens Freimann wrote:
> > > Ilya: - you might have more completions than descriptors available
> > > - partial descriptor chains are a problem for hardware because you
> > > might have  to read a bunch of conscriptors twice - how would you do
> > > deal with a big buffer that cointains a large number of  small packets
> > > with respect to completions?
> > > - is one bit for completion enough? right now it means descriptor was
> > > actually  used. how to we signal when it was completed?
> > 
> > I am not sure I understand the difference. Under virtio, driver makes a
> > descriptor available, then device reads/writes memory depending on descriptor
> > type, then marks it as used.
> > 
> > What does completed mean?
> > 
> 
> During the BOF, someone raised the point that there is no indication that the HW has
> Read the descriptor. I think after some discussion we've agreed that it's not a useful indication.
> 
> My issues with the current completion or used notifications are as follows:
> 1. There is no room for extra metadata such as checksum or flow tag.
> You could put that in the descriptor payload but it's somewhat inconvenient.
> You have to either use and additional descriptor for metadata per chain.
> Or putting it in one of the buffers and forcing the lifetime of the metadata and data to be the same.

That's true. It would be easy to make descriptors e.g. 32 bytes each, so
you can add extra data in there. Or, if we can live with wasting some
bytes per descriptor, we could add a descriptor flag that marks the
address field as metadata. You could then chain it with a regular
descriptor for data. However, all in all, the simplest option is probably
the virtio header, which can be linear with the packet.

> 2. Current format assumes 1-1 corresponds between descriptors and completions.
> You did offer a skipping optimization for many descriptors -> 1 completion.

Yes, that's because the claim was that when using descriptors in order,
the device does many unnecessary writes.

> But it is somewhat inefficient.

It seems to outperform other options on a micro-benchmark but sure,
what do you propose?

> And you didn't offer a solution for 1 descriptor -> multiple completions.
> Mellanox has a feature called striding RQ where you post a large buffer and
> The NIC fills it with multiple back to back packets with padding.
> Each packet generates its own completion.

I suspect a good way to do this would be to just pass offsets within the
buffer back and forth. I agree sticking such small messages in a
separate buffer is not ideal. How about an option of replacing PA
with this data?



> 3. There is a usage model where you have multiple produce rings
> And a single completion ring.

What good is it though? It seems to perform worse than combining
producer and consumer in my testing.


> You could implement the completion ring using an additional virtio ring,  but 
> The current model will require an extra indirection as it force you to write into 
> The buffers the descriptor in the completion ring point to. Rather than writing the
> Completion into the ring itself.
> Additionally the device is still required to write to the original producer ring 
> in addition to the completion ring.
> 
> I think the best and most flexible design is to have variable size descriptor that
> start with a dword header.
> The dword header will include - an ownership bit, an opcode and descriptor length.
> The opcode and the "length" dwords following the header will be device specific.

This means that the device needs to do two reads just to decode the
descriptor fully. This conflicts with the feedback Intel has been giving on
the list, which is to try to reduce the number of reads. With the header
linear with the packet, you need two reads to start transmitting the packet.


> The owner bit meaning changes on each ring wrap around so the device doesn't
> Need to update.

This looks like the avail bit in the KVM Forum presentation.

> Each device (or device class) can choose whether completions are reported directly inside 
> the descriptors in that ring or in a separate completion ring. 
> 
> completions rings can be implemented in an efficient manner with this design.
> The driver will initialize a dedicated completion ring with empty completion sized descriptors.
> And the device will write the completions directly into the ring.

I assume when you say completion you mean used entries; if I'm wrong,
please correct me.  In fact, with the proposal in the KVM Forum
presentation it is possible to write used entries at a separate address,
as opposed to overlapping the available entries.  If you are going to
support skipping the write-back of some used descriptors, then the
accounting would have to change slightly, since the driver won't be able
to reset the used flags.  But in all tests I've written in the past, this
separate ring underperforms a shared ring.



* RE: Virtio BoF minutes from KVM Forum 2017
  2017-11-01 17:35     ` [virtio-dev] " Michael S. Tsirkin
@ 2017-11-01 18:12       ` Ilya Lesokhin
  2017-11-01 18:12       ` [virtio-dev] " Ilya Lesokhin
From: Ilya Lesokhin @ 2017-11-01 18:12 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

On Wednesday, November 1, 2017 7:35 PM, Michael S. Tsirkin wrote:
> > You have to either use and additional descriptor for metadata per chain.
> > Or putting it in one of the buffers and forcing the lifetime of the metadata
> and data to be the same.
> 
> That's true. It would be easy to make descriptors e.g. 32 bytes each, so
> you can add extra data in there. Or if we can live with wasting some
> bytes per descriptor, we could add a descriptor flag that marks the
> address field as meta-data. You could then chain it with a regular
> descriptor for data. However all in all the simplest option is probably
> in the virtio header which can be linear with the packet.
[I.L] In the current proposal, descriptor size == SGE (scatter-gather entry) size.
I'm not sure that's a good idea.
For example, we are considering having an RX ring where you just post a list
of PFNs, so an SGE is only 8 bytes (see the sketch below).
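
Roughly, the ring I have in mind is nothing more than this (a sketch; the
ring size is just an example):

#include <stdint.h>

#define PFN_RX_RING_SIZE 256   /* example size */

/* RX ring that is just a list of page frame numbers: 8 bytes per entry,
 * no length or flags; the device fills each page with packets back to
 * back. Purely illustrative. */
struct pfn_rx_ring {
        uint64_t pfn[PFN_RX_RING_SIZE];
};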

I might be wrong here, so please correct me if that's not the case,
but I've gotten the impression that due to DPDK limitations you've focused
on the use case where you have 1 SGE.
I'm not sure that's a representative workload for network devices,
as LSO is an important offload.

And the storage guys also complained about this issue.


> I suspect a good way to do this would be to just pass offsets within the
> buffer back and forth. I agree sticking such small messages in a
> separate buffer is not ideal. How about an option of replacing PA
> with this data?
[I.L] PA?

> 
> 
> 
> > 3. There is a usage model where you have multiple produce rings
> > And a single completion ring.
> 
> What good is it though? It seems to perform worse than combining
> producer and consumer in my testing.
> 
[I.L] It might be that for virtio-net a single ring is better,
but are you really going to argue that it's better in all possible use cases?

> 
> > You could implement the completion ring using an additional virtio ring,
> but
> > The current model will require an extra indirection as it force you to write
> into
> > The buffers the descriptor in the completion ring point to. Rather than
> writing the
> > Completion into the ring itself.
> > Additionally the device is still required to write to the original producer ring
> > in addition to the completion ring.
> >
> > I think the best and most flexible design is to have variable size descriptor
> that
> > start with a dword header.
> > The dword header will include - an ownership bit, an opcode and
> descriptor length.
> > The opcode and the "length" dwords following the header will be device
> specific.
> 
> This means that device needs to do two reads just to decode the
> descriptor fully. This conflicts with feedback Intel has been giving on
> list which is to try and reduce number of reads. With header linear with
> the packet, you need two reads to start transmitting the packet.
> 
[I.L] The device can do a single large read and do the parsing afterwards.
You could also use the doorbell to tell the device how much to read.


> Seems to look like the avail bit in the kvm forum presentation.
> 	
[I.L] I don't want to argue over the name. The main difference in my
proposal is that the device doesn't need to write to the descriptor.
If it wants to, you can define a separate bit for that.

> > Each device (or device class) can choose whether completions are reported
> directly inside
> > the descriptors in that ring or in a separate completion ring.
> >
> > completions rings can be implemented in an efficient manner with this
> design.
> > The driver will initialize a dedicated completion ring with empty completion
> sized descriptors.
> > And the device will write the completions directly into the ring.
> 
> I assume when you say completion you mean used entries, if I'm wrong
> please correct me.  In fact with the proposal in the kvm forum
> presentation it is possible to write used entries in a separate address
> as opposed to overlapping the available entries.  If you are going to
> support skipping writing back some used descriptors then accounting
> would have to change slightly since driver won't be able to reset used
> flags then.  But in the past in all tests I've written this separate
> ring underperforms a shared ring.
> 
[I.L] A completion is a used entry + device-specific metadata.
I don't remember seeing an option to write used entries at a separate address;
I'd appreciate it if you could point me in the right direction.

Regarding the shared ring vs. separate ring, I can't really argue with you
as I haven't done the relevant measurements.
I'm just saying it might not be optimal in all use cases,
so you should consider leaving both options open.

It's entirely possible that for virtio-net you want a single ring,
whereas for PV-RDMA you want separate rings.


* Re: Virtio BoF minutes from KVM Forum 2017
  2017-11-01 18:12       ` [virtio-dev] " Ilya Lesokhin
@ 2017-11-02  3:40           ` Michael S. Tsirkin
From: Michael S. Tsirkin @ 2017-11-02  3:40 UTC (permalink / raw)
  To: Ilya Lesokhin; +Cc: virtio-dev, virtualization

On Wed, Nov 01, 2017 at 06:12:09PM +0000, Ilya Lesokhin wrote:
> On Wednesday, November 1, 2017 7:35 PM, Michael S. Tsirkin wrote:
> > > You have to either use and additional descriptor for metadata per chain.
> > > Or putting it in one of the buffers and forcing the lifetime of the metadata
> > and data to be the same.
> > 
> > That's true. It would be easy to make descriptors e.g. 32 bytes each, so
> > you can add extra data in there. Or if we can live with wasting some
> > bytes per descriptor, we could add a descriptor flag that marks the
> > address field as meta-data. You could then chain it with a regular
> > descriptor for data. However all in all the simplest option is probably
> > in the virtio header which can be linear with the packet.
> [I.L] In the current proposal descriptor size == SGE (scatter gather entry) size.
> I'm not sure that's a good idea.
> For example we are considering having an RX ring were you just post a list
> of PFN's so a sge is only 8 bytes.

You mean without length, flags etc.?  When you are concerned about
memory usage because you have many users for buffers (as e.g. with
Linux networking), then sizing buffers dynamically helps a lot.
Single-user cases like DPDK or, more recently, XDP are different: they
can afford making all buffers the same size.

For sure, 8-byte entries would reduce cache pressure.  The question is
how do we handle so much variety in the ring layouts. Thoughts?

> I might be wrong here, so please correct me if that's the not the case.
> but I've gotten the impression that due to DPDK limitations you've focused 
> on the use case where you have 1 SGE.
> I'm not sure that a representative workload for network devices,
> As LSO is an important offload.

I don't think we focused on DPDK limitations.  For sure, lots of people
use LSO or other buffers with many s/g entries. But that also works pretty
well more or less whatever you do, since you are able to pass a single large
packet, so the per-packet overhead is generally amortized.

So if you are doing LSO, I'd say just use indirect descriptors.

> And the storage guys also complained about this issue.

Interesting. What was the complaint exactly?

> 
> > I suspect a good way to do this would be to just pass offsets within the
> > buffer back and forth. I agree sticking such small messages in a
> > separate buffer is not ideal. How about an option of replacing PA
> > with this data?
> [I.L] PA?

Sorry, I really meant the addr field in the descriptor.

> > 
> > 
> > 
> > > 3. There is a usage model where you have multiple produce rings
> > > And a single completion ring.
> > 
> > What good is it though? It seems to perform worse than combining
> > producer and consumer in my testing.
> > 
> [I.L] It might be that for virtio-net a single ring is better
> But are you really going to argue that its better in all possible use cases?
> 
> > 
> > > You could implement the completion ring using an additional virtio ring,
> > but
> > > The current model will require an extra indirection as it force you to write
> > into
> > > The buffers the descriptor in the completion ring point to. Rather than
> > writing the
> > > Completion into the ring itself.
> > > Additionally the device is still required to write to the original producer ring
> > > in addition to the completion ring.
> > >
> > > I think the best and most flexible design is to have variable size descriptor
> > that
> > > start with a dword header.
> > > The dword header will include - an ownership bit, an opcode and
> > descriptor length.
> > > The opcode and the "length" dwords following the header will be device
> > specific.
> > 
> > This means that device needs to do two reads just to decode the
> > descriptor fully. This conflicts with feedback Intel has been giving on
> > list which is to try and reduce number of reads. With header linear with
> > the packet, you need two reads to start transmitting the packet.
> > 
> [I.L] The device can do a single large read and do the parsing afterword's.

For sure, but that wastes some PCIe bandwidth.

> You could also use the doorbell to tell the device how much to read.

We currently use that to pass the address of the last descriptor.


> 
> > Seems to look like the avail bit in the kvm forum presentation.
> > 	
> [I.L] I don't want to argue over the name. The main difference in my 
> proposal is that the device doesn't need to write to the descriptor.
> If it wants to you can define a separate bit for that.

A theoretical analysis shows fewer cache line bounces
if device writes and driver writes go to the same location.
A micro-benchmark and DPDK tests seem to match that.

If you want to split them, how about a test showing
either a benefit for software, or an explanation of why it's
significantly different for hardware than for software?


> > > Each device (or device class) can choose whether completions are reported
> > directly inside
> > > the descriptors in that ring or in a separate completion ring.
> > >
> > > completions rings can be implemented in an efficient manner with this
> > design.
> > > The driver will initialize a dedicated completion ring with empty completion
> > sized descriptors.
> > > And the device will write the completions directly into the ring.
> > 
> > I assume when you say completion you mean used entries, if I'm wrong
> > please correct me.  In fact with the proposal in the kvm forum
> > presentation it is possible to write used entries in a separate address
> > as opposed to overlapping the available entries.  If you are going to
> > support skipping writing back some used descriptors then accounting
> > would have to change slightly since driver won't be able to reset used
> > flags then.  But in the past in all tests I've written this separate
> > ring underperforms a shared ring.
> > 
> [I.L] A completion is a used entry + device specific metadata.
> I don't remember seeing an option to write used entries in a separate address,
> I'll appreciate it if you can point me to the right direction.

It wasn't described in the talk.

But it's simply this: the driver detects a used entry by detecting a used bit
flip.  If the device does not use the option to skip writing back some used
entries, then there's no need for used entries written by the device and by
the driver to overlap. If the device does skip, then we need them to overlap,
as the driver also needs to reset the used flag so that used != avail.
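
Roughly, the driver-side check is just this (a sketch; the exact flag bit is
not final):

#include <stdbool.h>
#include <stdint.h>

#define DESC_F_USED (1u << 15)  /* illustrative bit position */

struct pring_desc {
        uint64_t addr;
        uint32_t len;
        uint16_t id;
        uint16_t flags;
};

/* A descriptor is used when its used bit matches the value the driver
 * expects for the current pass over the ring; the expected value flips
 * on every wrap-around. */
static bool desc_is_used(const struct pring_desc *d, bool used_wrap)
{
        return !!(d->flags & DESC_F_USED) == used_wrap;
}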

> Regarding the shared ring vs separate ring, I can't really argue with you
> as I haven't done the relevant measurement.
> I'm just saying it might not always be optimal in all use case,
> So you should consider leaving both options open.
> 
> Its entirely possible that for virtio-net you want a single ring,
> Where for PV-RDMA you want separate rings. 

Well, the RDMA consortium decided that a low-level API for cards will help
application portability, and that spec has a concept of completion
queues which are shared between request queues.  So the combined-ring
optimization kind of goes out the window for that kind of device :) I'm
not sure just splitting out used rings will be enough though.

It's not a great fit for virtio right now; if someone's interested
in changing the rings to match that use case, I'm all ears.

-- 
MST


* RE: Virtio BoF minutes from KVM Forum 2017
  2017-11-02  3:40           ` [virtio-dev] " Michael S. Tsirkin
@ 2017-11-02  8:18           ` Ilya Lesokhin
From: Ilya Lesokhin @ 2017-11-02  8:18 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, virtualization

On Thursday, November 02, 2017 5:40 AM, Michael S. Tsirkin wrote:

> > [I.L] In the current proposal descriptor size == SGE (scatter gather entry) size.
> > I'm not sure that's a good idea.
> > For example we are considering having an RX ring were you just post a
> > list of PFN's so a sge is only 8 bytes.
> 
> You mean without length, flags etc?  So when you are concerned about memory
> usage because you have many users for buffers (like e.g. with Linux
> networking), then sizing buffers dynamically helps a lot. Single user cases like
> DPDK or more recently XDP are different and they can afford making all buffers
> same size.
> 
[I.L] Yes, no length or flags; we just fill it with packets back to back, so
memory-usage-wise it's quite efficient.

> For sure 8 byte entries would reduce cache pressure.  Question is how to we
> handle so much variety in the ring layouts. Thoughts?
> 
[I.L] I agree that there are downsides to too much variety in the ring layouts.

It does seem that getting the most out of the ring layout I suggested
would require changing the vring API to work with descriptors (== work requests)
rather than sg lists.

I guess I need to think about it a bit more.

> I don't think we focused on DPDK limitations.  For sure, lots of people use LSO or
> other buffers with many s/g entries. But it also works pretty well more or less
> whatever you do as you are able to pass a single large packet then - so the per-
> packet overhead is generally amortized.
> 
[I.L] I see your point, I'll have to think about it some more.

> > And the storage guys also complained about this issue.
> 
> Interesting. What was the complaint exactly.
> 
[I.L] That scatter-gather is important and that we shouldn't
optimize for the single-SGE case.
But the point you made earlier about amortization also applies here.


> > [I.L] The device can do a single large read and do the parsing afterword's.
> 
> For sure but that wastes some pcie bandwidth.
> 
> > You could also use the doorbell to tell the device how much to read.
> 
> We currently use that to pass address of last descriptor.
[I.L] If you read up to the last descriptor you don't waste any PCIe bandwidth.

> >
> > > Seems to look like the avail bit in the kvm forum presentation.
> > >
> > [I.L] I don't want to argue over the name. The main difference in my
> > proposal is that the device doesn't need to write to the descriptor.
> > If it wants to you can define a separate bit for that.
> 
> A theoretical analysis shows less cache line bounces if device writes and driver
> writes go to same location.
> A micro-benchmark and dpdk tests seem to match that.
> 
> If you want to split them, how about a test showing either a benefit for software
> or an explanation about why it's significantly different for hardware than
> software?
> 
[I.L] The separate bit can be in the same cache line.

> > I don't remember seeing an option to write used entries in a separate
> > address, I'll appreciate it if you can point me to the right direction.
> 
> It wasn't described in the talk.
> 
> But it's simply this: driver detects used entry by detecting a used bit flip.  If
> device does not use the option to skip writing back some used entries then
> there's no need for used entries written by device and by driver to overlap. If
> device does skip then we need them to overlap as driver also needs to reset the
> used flag so used != avail.
> 
[I.L] I don't follow: how does the device inform the driver that a descriptor
has been processed?

> > Regarding the shared ring vs separate ring, I can't really argue with
> > you as I haven't done the relevant measurement.
> > I'm just saying it might not always be optimal in all use case, So you
> > should consider leaving both options open.
> >
> > It's entirely possible that for virtio-net you want a single ring,
> > Where for PV-RDMA you want separate rings.
> 
> Well RDMA consortium decided a low-level API for cards will help application
> portability, and that spec has a concept of completion queues which are shared
> between request queues.  So the combined ring optimization kind of goes out
> the window for that kind of device :) I'm not sure just splitting out used rings will
> be enough though.
> 
I didn't say that splitting out the used ring is going to be enough. I suggested two things:
1. Don't force the device to write to the request queue.
2. Allow an efficient implementation of completion queues
through support for inline descriptors.


In any case, you've given me some valuable feedback
and I have a better understanding of why you went with the
current ring layout.

Thanks,
Ilya

