* thoughts stac/clac and get user for vhost
@ 2018-12-25 16:41 Michael S. Tsirkin
  2018-12-26  4:03 ` Jason Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2018-12-25 16:41 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev

Hi!
I was just wondering: packed ring batches things naturally.
E.g.

user_access_begin
check descriptor valid
smp_rmb
copy descriptor
user_access_end
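
In kernel-style C this could look roughly like the following sketch
(desc_is_avail() is a hypothetical validity check, and the error
handling is simplified compared to what real vhost code would need):

	#include <linux/uaccess.h>
	#include <uapi/linux/virtio_ring.h>

	/* Sketch only: fetch one packed-ring descriptor with a single
	 * stac/clac pair around all the userspace accesses.
	 */
	static int fetch_desc(struct vring_packed_desc __user *udesc,
			      struct vring_packed_desc *desc, bool wrap)
	{
		int ret = -EFAULT;
		u16 flags;

		if (!user_access_begin(udesc, sizeof(*udesc)))
			return -EFAULT;
		unsafe_get_user(flags, &udesc->flags, out);   /* check valid */
		if (!desc_is_avail(flags, wrap)) {            /* hypothetical */
			ret = -EAGAIN;
			goto out;
		}
		smp_rmb();                  /* flags before descriptor body */
		unsafe_get_user(desc->addr, &udesc->addr, out);
		unsafe_get_user(desc->len, &udesc->len, out);
		unsafe_get_user(desc->id, &udesc->id, out);
		ret = 0;
	out:
		user_access_end();          /* one stac/clac pair in total */
		return ret;
	}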

So packed layout should show the gain with this approach.
That could be motivation enough to finally enable vhost packed ring
support.

Thoughts?

-- 
MST


* Re: thoughts stac/clac and get user for vhost
  2018-12-25 16:41 thoughts stac/clac and get user for vhost Michael S. Tsirkin
@ 2018-12-26  4:03 ` Jason Wang
  2018-12-26 15:06   ` Michael S. Tsirkin
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Wang @ 2018-12-26  4:03 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev


On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
> Hi!
> I was just wondering: packed ring batches things naturally.
> E.g.
>
> user_access_begin
> check descriptor valid
> smp_rmb
> copy descriptor
> user_access_end


But that would require speculating on descriptors (which may only work
for in-order, or may even violate the spec). Without speculation, only
the two accesses to a single descriptor can be batched. For the split
ring, we can batch more, since we know how many descriptors are pending
(avail_idx - last_avail_idx).

Anything I'm missing?

Thanks


>
> So packed layout should show the gain with this approach.
> That could be motivation enough to finally enable vhost packed ring
> support.
>
> Thoughts?
>


* Re: thoughts stac/clac and get user for vhost
  2018-12-26  4:03 ` Jason Wang
@ 2018-12-26 15:06   ` Michael S. Tsirkin
  2018-12-27  9:55     ` Jason Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2018-12-26 15:06 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev

On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
> 
> On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
> > Hi!
> > I was just wondering: packed ring batches things naturally.
> > E.g.
> > 
> > user_access_begin
> > check descriptor valid
> > smp_rmb
> > copy descriptor
> > user_access_end
> 
> 
> But that would require speculating on descriptors (which may only work
> for in-order, or may even violate the spec). Without speculation, only
> the two accesses to a single descriptor can be batched. For the split
> ring, we can batch more, since we know how many descriptors are pending
> (avail_idx - last_avail_idx).
>
> Anything I'm missing?
> 
> Thanks
> 

just check more descriptors in a loop:

 user_access_begin
 for (i = 0; i < 16; ++i) {
	 if (!descriptor valid)
		break;
	 smp_rmb
	 copy descriptor
 }
 user_access_end

you don't really need to know how many there are ahead of time, since
you still copy them one by one.
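
As a kernel-style C sketch (desc_is_avail() is again a hypothetical
helper, and ring wrap-around handling is omitted):

	/* Sketch: fetch up to 'max' packed descriptors under a single
	 * stac/clac window instead of one window per descriptor.
	 */
	static int fetch_descs(struct vring_packed_desc __user *udesc,
			       struct vring_packed_desc *desc, int max, bool wrap)
	{
		int i, ret = -EFAULT;

		if (!user_access_begin(udesc, max * sizeof(*udesc)))
			return -EFAULT;
		for (i = 0; i < max; ++i) {
			u16 flags;

			unsafe_get_user(flags, &udesc[i].flags, out);
			if (!desc_is_avail(flags, wrap))
				break;  /* stop at the first invalid one */
			smp_rmb();      /* flags before descriptor body */
			unsafe_get_user(desc[i].addr, &udesc[i].addr, out);
			unsafe_get_user(desc[i].len, &udesc[i].len, out);
			unsafe_get_user(desc[i].id, &udesc[i].id, out);
		}
		ret = i;                /* number of descriptors fetched */
	out:
		user_access_end();
		return ret;
	}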


> > 
> > So packed layout should show the gain with this approach.
> > That could be motivation enough to finally enable vhost packed ring
> > support.
> > 
> > Thoughts?
> > 


* Re: thoughts stac/clac and get user for vhost
  2018-12-26 15:06   ` Michael S. Tsirkin
@ 2018-12-27  9:55     ` Jason Wang
  2018-12-30 18:40       ` Michael S. Tsirkin
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Wang @ 2018-12-27  9:55 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev


On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
>> On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
>>> Hi!
>>> I was just wondering: packed ring batches things naturally.
>>> E.g.
>>>
>>> user_access_begin
>>> check descriptor valid
>>> smp_rmb
>>> copy descriptor
>>> user_access_end
>>
>> But that would require speculating on descriptors (which may only work
>> for in-order, or may even violate the spec). Without speculation, only
>> the two accesses to a single descriptor can be batched. For the split
>> ring, we can batch more, since we know how many descriptors are pending
>> (avail_idx - last_avail_idx).
>>
>> Anything I'm missing?
>>
>> Thanks
>>
> just check more descriptors in a loop:
>
>   user_access_begin
>   for (i = 0; i < 16; ++i) {
> 	 if (!descriptor valid)
> 		break;
> 	 smp_rmb
> 	 copy descriptor
>   }
>   user_access_end
>
> you don't really need to know how many there are ahead of time, since
> you still copy them one by one.


So let's look at the case of the split ring:


user_access_begin

n = avail_idx - last_avail_idx (1)
n = MIN(n, 16)

smp_rmb

read n entries from avail_ring (2)

for (i = 0; i < n; i++)
    copy descriptor (3)

user_access_end
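
Rendered as a rough kernel-style C sketch (vhost_avail_size() is
hypothetical, endianness conversions are omitted, vq->num is assumed to
be a power of two, and a real implementation would also have to
validate the descriptor table range separately):

	/* Sketch: batched split-ring read, one stac/clac pair for
	 * steps (1), (2) and (3) together.
	 */
	static int fetch_split_descs(struct vhost_virtqueue *vq,
				     struct vring_desc *desc, u16 *head, u16 max)
	{
		u16 avail_idx, n, i;
		int ret = -EFAULT;

		if (!user_access_begin(vq->avail, vhost_avail_size(vq)))
			return -EFAULT;
		unsafe_get_user(avail_idx, &vq->avail->idx, out);        /* (1) */
		n = min_t(u16, (u16)(avail_idx - vq->last_avail_idx), max);
		smp_rmb();                  /* read idx before ring entries */
		for (i = 0; i < n; i++)                                  /* (2) */
			unsafe_get_user(head[i],
					&vq->avail->ring[(vq->last_avail_idx + i) &
							 (vq->num - 1)], out);
		for (i = 0; i < n; i++) {                                /* (3) */
			unsafe_get_user(desc[i].addr, &vq->desc[head[i]].addr, out);
			unsafe_get_user(desc[i].len, &vq->desc[head[i]].len, out);
			unsafe_get_user(desc[i].flags, &vq->desc[head[i]].flags, out);
			unsafe_get_user(desc[i].next, &vq->desc[head[i]].next, out);
		}
		ret = n;
	out:
		user_access_end();
		return ret;
	}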


Consider the case of a heavy workload. For the packed ring, we have 32
userspace accesses and 16 smp_rmb() calls.

For the split ring we have:

(1) 1 time

(2) 2 times at most

(3) 16 times

That is 19 userspace accesses and a single smp_rmb(). In fact, (2) could
be eliminated with in-order, and (3) could be batched completely with
in-order and partially when out of order.

I don't see how the packed ring helps here, especially considering that
lfence on x86 is more than a memory fence: it actually prevents
speculation.

Thanks


>
>
>>> So packed layout should show the gain with this approach.
>>> That could be motivation enough to finally enable vhost packed ring
>>> support.
>>>
>>> Thoughts?
>>>


* Re: thoughts stac/clac and get user for vhost
  2018-12-27  9:55     ` Jason Wang
@ 2018-12-30 18:40       ` Michael S. Tsirkin
  2019-01-02  3:25         ` Jason Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2018-12-30 18:40 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev

On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
> 
> On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
> > On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
> > > On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
> > > > Hi!
> > > > I was just wondering: packed ring batches things naturally.
> > > > E.g.
> > > > 
> > > > user_access_begin
> > > > check descriptor valid
> > > > smp_rmb
> > > > copy descriptor
> > > > user_access_end
> > > 
> > > But that would require speculating on descriptors (which may only work
> > > for in-order, or may even violate the spec). Without speculation, only
> > > the two accesses to a single descriptor can be batched. For the split
> > > ring, we can batch more, since we know how many descriptors are pending
> > > (avail_idx - last_avail_idx).
> > >
> > > Anything I'm missing?
> > > 
> > > Thanks
> > > 
> > just check more descriptors in a loop:
> > 
> >   user_access_begin
> >   for (i = 0; i < 16; ++i) {
> > 	 if (!descriptor valid)
> > 		break;
> > 	 smp_rmb
> > 	 copy descriptor
> >   }
> >   user_access_end
> > 
> > you don't really need to know how many there are ahead of time, since
> > you still copy them one by one.
> 
> 
> So let's look at the case of the split ring:
>
> user_access_begin
>
> n = avail_idx - last_avail_idx (1)
> n = MIN(n, 16)
>
> smp_rmb
>
> read n entries from avail_ring (2)
>
> for (i = 0; i < n; i++)
>     copy descriptor (3)
>
> user_access_end
>
> Consider the case of a heavy workload. For the packed ring, we have 32
> userspace accesses and 16 smp_rmb() calls.
>
> For the split ring we have:
>
> (1) 1 time
> (2) 2 times at most
> (3) 16 times
>
> That is 19 userspace accesses and a single smp_rmb(). In fact, (2) could
> be eliminated with in-order, and (3) could be batched completely with
> in-order and partially when out of order.
>
> I don't see how the packed ring helps here, especially considering that
> lfence on x86 is more than a memory fence: it actually prevents
> speculation.
> 
> Thanks

So on x86 at least RMB is free, which is why I never bothered optimizing
it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
more than the extra indirection in the split ring?

But my point was really fundamental - if ring accesses are expensive
then we should batch them. Right now we have an API that gets
an iovec directly. That limits the optimizations you can do.

The translation works like this:

ring -> valid descriptors -> iovecs

We should have APIs for each step that work in batches.
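
As a sketch, the steps could be split roughly like this (the names are
made up for illustration, not an existing vhost interface):

	/* ring -> valid descriptors: fetch up to 'max' of them. */
	int vhost_fetch_descs(struct vhost_virtqueue *vq,
			      struct vring_desc *descs, unsigned int max);

	/* valid descriptors -> iovecs: translate a fetched batch. */
	int vhost_descs_to_iov(struct vhost_virtqueue *vq,
			       const struct vring_desc *descs, unsigned int n,
			       struct iovec *iov, unsigned int iov_max);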



> 
> > 
> > 
> > > > So packed layout should show the gain with this approach.
> > > > That could be motivation enough to finally enable vhost packed ring
> > > > support.
> > > > 
> > > > Thoughts?
> > > > 


* Re: thoughts stac/clac and get user for vhost
  2018-12-30 18:40       ` Michael S. Tsirkin
@ 2019-01-02  3:25         ` Jason Wang
  2019-01-04 21:25           ` Michael S. Tsirkin
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Wang @ 2019-01-02  3:25 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev


On 2018/12/31 2:40 AM, Michael S. Tsirkin wrote:
> On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
>> On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
>>> On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
>>>> On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
>>>>> Hi!
>>>>> I was just wondering: packed ring batches things naturally.
>>>>> E.g.
>>>>>
>>>>> user_access_begin
>>>>> check descriptor valid
>>>>> smp_rmb
>>>>> copy descriptor
>>>>> user_access_end
>>>> But that would require speculating on descriptors (which may only work
>>>> for in-order, or may even violate the spec). Without speculation, only
>>>> the two accesses to a single descriptor can be batched. For the split
>>>> ring, we can batch more, since we know how many descriptors are pending
>>>> (avail_idx - last_avail_idx).
>>>>
>>>> Anything I'm missing?
>>>>
>>>> Thanks
>>>>
>>> just check more descriptors in a loop:
>>>
>>>    user_access_begin
>>>    for (i = 0; i < 16; ++i) {
>>> 	 if (!descriptor valid)
>>> 		break;
>>> 	 smp_rmb
>>> 	 copy descriptor
>>>    }
>>>    user_access_end
>>>
>>> you don't really need to know how many there are ahead of time, since
>>> you still copy them one by one.
>>
>> So let's look at the case of the split ring:
>>
>> user_access_begin
>>
>> n = avail_idx - last_avail_idx (1)
>> n = MIN(n, 16)
>>
>> smp_rmb
>>
>> read n entries from avail_ring (2)
>>
>> for (i = 0; i < n; i++)
>>     copy descriptor (3)
>>
>> user_access_end
>>
>> Consider the case of a heavy workload. For the packed ring, we have 32
>> userspace accesses and 16 smp_rmb() calls.
>>
>> For the split ring we have:
>>
>> (1) 1 time
>> (2) 2 times at most
>> (3) 16 times
>>
>> That is 19 userspace accesses and a single smp_rmb(). In fact, (2) could
>> be eliminated with in-order, and (3) could be batched completely with
>> in-order and partially when out of order.
>>
>> I don't see how the packed ring helps here, especially considering that
>> lfence on x86 is more than a memory fence: it actually prevents
>> speculation.
>>
>> Thanks
> So on x86 at least RMB is free, which is why I never bothered optimizing
> it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
> more than the extra indirection in the split ring?


I don't know, but obviously RMB has a chance to hurt performance to
some degree. But even on an arch where RMB is free, the packed ring
still does not show an obvious advantage.


>
> But my point was really fundamental - if ring accesses are expensive
> then we should batch them.


I don't object to batching; the reasons they are expensive could be:

1) unnecessary overhead caused by speculation barriers and checks like SMAP

2) cache contention

So it does not conflict with the effort I made to remove 1). My plan is:
for metadata, try to eliminate 1) completely. For data, we can do batch
copying to amortize its cost. Avail/descriptor batching can then be
tried on top.


>   Right now we have an API that gets
> an iovec directly. That limits the optimizations you can do.
>
> The translation works like this:
>
> ring -> valid descriptors -> iovecs
>
> We should have APIs for each step that work in batches.
>

Yes.

Thanks


>
>>>
>>>>> So packed layout should show the gain with this approach.
>>>>> That could be motivation enough to finally enable vhost packed ring
>>>>> support.
>>>>>
>>>>> Thoughts?
>>>>>


* Re: thoughts stac/clac and get user for vhost
  2019-01-02  3:25         ` Jason Wang
@ 2019-01-04 21:25           ` Michael S. Tsirkin
  2019-01-07  4:26             ` Jason Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2019-01-04 21:25 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev

On Wed, Jan 02, 2019 at 11:25:14AM +0800, Jason Wang wrote:
> 
> On 2018/12/31 2:40 AM, Michael S. Tsirkin wrote:
> > On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
> > > On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
> > > > > On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
> > > > > > Hi!
> > > > > > I was just wondering: packed ring batches things naturally.
> > > > > > E.g.
> > > > > > 
> > > > > > user_access_begin
> > > > > > check descriptor valid
> > > > > > smp_rmb
> > > > > > copy descriptor
> > > > > > user_access_end
> > > > > But that would require speculating on descriptors (which may only work
> > > > > for in-order, or may even violate the spec). Without speculation, only
> > > > > the two accesses to a single descriptor can be batched. For the split
> > > > > ring, we can batch more, since we know how many descriptors are pending
> > > > > (avail_idx - last_avail_idx).
> > > > >
> > > > > Anything I'm missing?
> > > > > 
> > > > > Thanks
> > > > > 
> > > > just check more descriptors in a loop:
> > > > 
> > > >    user_access_begin
> > > >    for (i = 0; i < 16; ++i) {
> > > > 	 if (!descriptor valid)
> > > > 		break;
> > > > 	 smp_rmb
> > > > 	 copy descriptor
> > > >    }
> > > >    user_access_end
> > > > 
> > > > you don't really need to know how many there are ahead of time, since
> > > > you still copy them one by one.
> > > 
> > > So let's look at the case of the split ring:
> > >
> > > user_access_begin
> > >
> > > n = avail_idx - last_avail_idx (1)
> > > n = MIN(n, 16)
> > >
> > > smp_rmb
> > >
> > > read n entries from avail_ring (2)
> > >
> > > for (i = 0; i < n; i++)
> > >     copy descriptor (3)
> > >
> > > user_access_end
> > >
> > > Consider the case of a heavy workload. For the packed ring, we have 32
> > > userspace accesses and 16 smp_rmb() calls.
> > >
> > > For the split ring we have:
> > >
> > > (1) 1 time
> > > (2) 2 times at most
> > > (3) 16 times
> > >
> > > That is 19 userspace accesses and a single smp_rmb(). In fact, (2) could
> > > be eliminated with in-order, and (3) could be batched completely with
> > > in-order and partially when out of order.
> > >
> > > I don't see how the packed ring helps here, especially considering that
> > > lfence on x86 is more than a memory fence: it actually prevents
> > > speculation.
> > > 
> > > Thanks
> > So on x86 at least RMB is free, which is why I never bothered optimizing
> > it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
> > more than the extra indirection in the split ring?
> 
> 
> I don't know, but obviously RMB has a chance to hurt performance to
> some degree. But even on an arch where RMB is free, the packed ring
> still does not show an obvious advantage.

People do measure gains with a PMD on host+guest.
So it's a question of optimizing the packed ring implementation in Linux.


> 
> > 
> > But my point was really fundamental - if ring accesses are expensive
> > then we should batch them.
> 
> 
> I don't object to batching; the reasons they are expensive could be:
>
> 1) unnecessary overhead caused by speculation barriers and checks like SMAP
> 2) cache contention
>
> So it does not conflict with the effort I made to remove 1). My plan is:
> for metadata, try to eliminate 1) completely. For data, we can do batch
> copying to amortize its cost. Avail/descriptor batching can then be
> tried on top.
> 
> 
> >   Right now we have an API that gets
> > an iovec directly. That limits the optimizations you can do.
> > 
> > The translation works like this:
> > 
> > ring -> valid descriptors -> iovecs
> > 
> > We should have APIs for each step that work in batches.
> > 
> 
> Yes.
> 
> Thanks
> 
> 
> > 
> > > > 
> > > > > > So packed layout should show the gain with this approach.
> > > > > > That could be motivation enough to finally enable vhost packed ring
> > > > > > support.
> > > > > > 
> > > > > > Thoughts?
> > > > > > 


* Re: thoughts stac/clac and get user for vhost
  2019-01-04 21:25           ` Michael S. Tsirkin
@ 2019-01-07  4:26             ` Jason Wang
  2019-01-07  5:42               ` Michael S. Tsirkin
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Wang @ 2019-01-07  4:26 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev


On 2019/1/5 5:25 AM, Michael S. Tsirkin wrote:
> On Wed, Jan 02, 2019 at 11:25:14AM +0800, Jason Wang wrote:
>> On 2018/12/31 2:40 AM, Michael S. Tsirkin wrote:
>>> On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
>>>> On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
>>>>>> On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
>>>>>>> Hi!
>>>>>>> I was just wondering: packed ring batches things naturally.
>>>>>>> E.g.
>>>>>>>
>>>>>>> user_access_begin
>>>>>>> check descriptor valid
>>>>>>> smp_rmb
>>>>>>> copy descriptor
>>>>>>> user_access_end
>>>>>> But that would require speculating on descriptors (which may only work
>>>>>> for in-order, or may even violate the spec). Without speculation, only
>>>>>> the two accesses to a single descriptor can be batched. For the split
>>>>>> ring, we can batch more, since we know how many descriptors are pending
>>>>>> (avail_idx - last_avail_idx).
>>>>>>
>>>>>> Anything I'm missing?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>> just check more descriptors in a loop:
>>>>>
>>>>>     user_access_begin
>>>>>     for (i = 0; i < 16; ++i) {
>>>>> 	 if (!descriptor valid)
>>>>> 		break;
>>>>> 	 smp_rmb
>>>>> 	 copy descriptor
>>>>>     }
>>>>>     user_access_end
>>>>>
>>>>> you don't really need to know how many there are ahead of time, since
>>>>> you still copy them one by one.
>>>> So let's look at the case of the split ring:
>>>>
>>>> user_access_begin
>>>>
>>>> n = avail_idx - last_avail_idx (1)
>>>> n = MIN(n, 16)
>>>>
>>>> smp_rmb
>>>>
>>>> read n entries from avail_ring (2)
>>>>
>>>> for (i = 0; i < n; i++)
>>>>     copy descriptor (3)
>>>>
>>>> user_access_end
>>>>
>>>> Consider the case of a heavy workload. For the packed ring, we have 32
>>>> userspace accesses and 16 smp_rmb() calls.
>>>>
>>>> For the split ring we have:
>>>>
>>>> (1) 1 time
>>>> (2) 2 times at most
>>>> (3) 16 times
>>>>
>>>> That is 19 userspace accesses and a single smp_rmb(). In fact, (2) could
>>>> be eliminated with in-order, and (3) could be batched completely with
>>>> in-order and partially when out of order.
>>>>
>>>> I don't see how the packed ring helps here, especially considering that
>>>> lfence on x86 is more than a memory fence: it actually prevents
>>>> speculation.
>>>>
>>>> Thanks
>>> So on x86 at least RMB is free, which is why I never bothered optimizing
>>> it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
>>> more than the extra indirection in the split ring?
>>
>> I don't know, but obviously RMB has a chance to hurt performance to
>> some degree. But even on an arch where RMB is free, the packed ring
>> still does not show an obvious advantage.
> People do measure gains with a PMD on host+guest.
> So it's a question of optimizing the packed ring implementation in Linux.


Well, a 2%-3% difference is not a lot.

I think it's not hard to make the split ring a bit faster with some
small optimizations to the code itself.

Thanks


>
>
>>> But my point was really fundamental - if ring accesses are expensive
>>> then we should batch them.
>>
>> I don't object to batching; the reasons they are expensive could be:
>>
>> 1) unnecessary overhead caused by speculation barriers and checks like SMAP
>> 2) cache contention
>>
>> So it does not conflict with the effort I made to remove 1). My plan is:
>> for metadata, try to eliminate 1) completely. For data, we can do batch
>> copying to amortize its cost. Avail/descriptor batching can then be
>> tried on top.
>>
>>
>>>    Right now we have an API that gets
>>> an iovec directly. That limits the optimizations you can do.
>>>
>>> The translation works like this:
>>>
>>> ring -> valid descriptors -> iovecs
>>>
>>> We should have APIs for each step that work in batches.
>>>
>> Yes.
>>
>> Thanks
>>
>>
>>>>>>> So packed layout should show the gain with this approach.
>>>>>>> That could be motivation enough to finally enable vhost packed ring
>>>>>>> support.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>


* Re: thoughts stac/clac and get user for vhost
  2019-01-07  4:26             ` Jason Wang
@ 2019-01-07  5:42               ` Michael S. Tsirkin
  2019-01-07  6:54                 ` Jason Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2019-01-07  5:42 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev

On Mon, Jan 07, 2019 at 12:26:51PM +0800, Jason Wang wrote:
> 
> On 2019/1/5 5:25 AM, Michael S. Tsirkin wrote:
> > On Wed, Jan 02, 2019 at 11:25:14AM +0800, Jason Wang wrote:
> > > On 2018/12/31 2:40 AM, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
> > > > > On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
> > > > > > On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
> > > > > > > On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
> > > > > > > > Hi!
> > > > > > > > I was just wondering: packed ring batches things naturally.
> > > > > > > > E.g.
> > > > > > > > 
> > > > > > > > user_access_begin
> > > > > > > > check descriptor valid
> > > > > > > > smp_rmb
> > > > > > > > copy descriptor
> > > > > > > > user_access_end
> > > > > > > But that would require speculating on descriptors (which may only work
> > > > > > > for in-order, or may even violate the spec). Without speculation, only
> > > > > > > the two accesses to a single descriptor can be batched. For the split
> > > > > > > ring, we can batch more, since we know how many descriptors are pending
> > > > > > > (avail_idx - last_avail_idx).
> > > > > > >
> > > > > > > Anything I'm missing?
> > > > > > > 
> > > > > > > Thanks
> > > > > > > 
> > > > > > just check more descriptors in a loop:
> > > > > > 
> > > > > >     user_access_begin
> > > > > >     for (i = 0; i < 16; ++i) {
> > > > > > 	 if (!descriptor valid)
> > > > > > 		break;
> > > > > > 	 smp_rmb
> > > > > > 	 copy descriptor
> > > > > >     }
> > > > > >     user_access_end
> > > > > > 
> > > > > > you don't really need to know how many there are ahead of time, since
> > > > > > you still copy them one by one.
> > > > > So let's look at the case of the split ring:
> > > > >
> > > > > user_access_begin
> > > > >
> > > > > n = avail_idx - last_avail_idx (1)
> > > > > n = MIN(n, 16)
> > > > >
> > > > > smp_rmb
> > > > >
> > > > > read n entries from avail_ring (2)
> > > > >
> > > > > for (i = 0; i < n; i++)
> > > > >     copy descriptor (3)
> > > > >
> > > > > user_access_end
> > > > >
> > > > > Consider the case of a heavy workload. For the packed ring, we have 32
> > > > > userspace accesses and 16 smp_rmb() calls.
> > > > >
> > > > > For the split ring we have:
> > > > >
> > > > > (1) 1 time
> > > > > (2) 2 times at most
> > > > > (3) 16 times
> > > > >
> > > > > That is 19 userspace accesses and a single smp_rmb(). In fact, (2) could
> > > > > be eliminated with in-order, and (3) could be batched completely with
> > > > > in-order and partially when out of order.
> > > > >
> > > > > I don't see how the packed ring helps here, especially considering that
> > > > > lfence on x86 is more than a memory fence: it actually prevents
> > > > > speculation.
> > > > > 
> > > > > Thanks
> > > > So on x86 at least RMB is free, which is why I never bothered optimizing
> > > > it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
> > > > more than the extra indirection in the split ring?
> > > 
> > > I don't know, but obviously RMB has a chance to hurt performance to
> > > some degree. But even on an arch where RMB is free, the packed ring
> > > still does not show an obvious advantage.
> > People do measure gains with a PMD on host+guest.
> > So it's a question of optimizing the packed ring implementation in Linux.
> 
> 
> Well, a 2%-3% difference is not a lot.

People reported a 10% gain with tiny packets; others reported more.

Again, the packed ring is sometimes faster by a factor of 3x, but
virtio is just virtio: there's a lot going on besides just passing
buffer addresses from guest to host, and a different ring layout won't
help with that.


> I think it's not hard to make the split ring a bit faster with some
> small optimizations to the code itself.
> 
> Thanks

Speed up the split ring support in the virtio pmd in dpdk? There have
been several people working on that for a while now. It seems more
likely that we can speed up the newer packed ring code. E.g. things
like prefetch have a much better chance of working well with the packed
layout; with the split one it was a wash IIRC.
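
The prefetch point comes down to address predictability: packed
descriptors sit in one contiguous array, so the next descriptor's
address is known before the current one is parsed. Roughly, as a sketch
(process_desc() is a stand-in):

	#include <linux/prefetch.h>

	for (i = 0; i < n; i++) {
		/* Packed ring: desc[i + 1] sits at a fixed offset, so the
		 * prefetch can be issued well before the data is needed.
		 * With the split ring the next address depends on
		 * avail->ring[], which has to be read first.
		 */
		prefetch(&desc[i + 1]);
		process_desc(&desc[i]);   /* hypothetical */
	}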

> 
> > 
> > 
> > > > But my point was really fundamental - if ring accesses are expensive
> > > > then we should batch them.
> > > 
> > > I don't object to batching; the reasons they are expensive could be:
> > >
> > > 1) unnecessary overhead caused by speculation barriers and checks like SMAP
> > > 2) cache contention
> > >
> > > So it does not conflict with the effort I made to remove 1). My plan is:
> > > for metadata, try to eliminate 1) completely. For data, we can do batch
> > > copying to amortize its cost. Avail/descriptor batching can then be
> > > tried on top.
> > > 
> > > 
> > > >    Right now we have an API that gets
> > > > an iovec directly. That limits the optimizations you can do.
> > > > 
> > > > The translation works like this:
> > > > 
> > > > ring -> valid descriptors -> iovecs
> > > > 
> > > > We should have APIs for each step that work in batches.
> > > > 
> > > Yes.
> > > 
> > > Thanks
> > > 
> > > 
> > > > > > > > So packed layout should show the gain with this approach.
> > > > > > > > That could be motivation enough to finally enable vhost packed ring
> > > > > > > > support.
> > > > > > > > 
> > > > > > > > Thoughts?
> > > > > > > > 


* Re: thoughts stac/clac and get user for vhost
  2019-01-07  5:42               ` Michael S. Tsirkin
@ 2019-01-07  6:54                 ` Jason Wang
  2019-01-07 14:45                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Wang @ 2019-01-07  6:54 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev


On 2019/1/7 1:42 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 12:26:51PM +0800, Jason Wang wrote:
>> On 2019/1/5 5:25 AM, Michael S. Tsirkin wrote:
>>> On Wed, Jan 02, 2019 at 11:25:14AM +0800, Jason Wang wrote:
>>>> On 2018/12/31 2:40 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
>>>>>> On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
>>>>>>>> On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
>>>>>>>>> Hi!
>>>>>>>>> I was just wondering: packed ring batches things naturally.
>>>>>>>>> E.g.
>>>>>>>>>
>>>>>>>>> user_access_begin
>>>>>>>>> check descriptor valid
>>>>>>>>> smp_rmb
>>>>>>>>> copy descriptor
>>>>>>>>> user_access_end
>>>>>>>> But that would require speculating on descriptors (which may only work
>>>>>>>> for in-order, or may even violate the spec). Without speculation, only
>>>>>>>> the two accesses to a single descriptor can be batched. For the split
>>>>>>>> ring, we can batch more, since we know how many descriptors are pending
>>>>>>>> (avail_idx - last_avail_idx).
>>>>>>>>
>>>>>>>> Anything I'm missing?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>> just check more descriptors in a loop:
>>>>>>>
>>>>>>>      user_access_begin
>>>>>>>      for (i = 0; i < 16; ++i) {
>>>>>>> 	 if (!descriptor valid)
>>>>>>> 		break;
>>>>>>> 	 smp_rmb
>>>>>>> 	 copy descriptor
>>>>>>>      }
>>>>>>>      user_access_end
>>>>>>>
>>>>>>> you don't really need to know how many there are ahead of time, since
>>>>>>> you still copy them one by one.
>>>>>> So let's look at the case of the split ring:
>>>>>>
>>>>>> user_access_begin
>>>>>>
>>>>>> n = avail_idx - last_avail_idx (1)
>>>>>> n = MIN(n, 16)
>>>>>>
>>>>>> smp_rmb
>>>>>>
>>>>>> read n entries from avail_ring (2)
>>>>>>
>>>>>> for (i = 0; i < n; i++)
>>>>>>     copy descriptor (3)
>>>>>>
>>>>>> user_access_end
>>>>>>
>>>>>> Consider the case of a heavy workload. For the packed ring, we have 32
>>>>>> userspace accesses and 16 smp_rmb() calls.
>>>>>>
>>>>>> For the split ring we have:
>>>>>>
>>>>>> (1) 1 time
>>>>>> (2) 2 times at most
>>>>>> (3) 16 times
>>>>>>
>>>>>> That is 19 userspace accesses and a single smp_rmb(). In fact, (2) could
>>>>>> be eliminated with in-order, and (3) could be batched completely with
>>>>>> in-order and partially when out of order.
>>>>>>
>>>>>> I don't see how the packed ring helps here, especially considering that
>>>>>> lfence on x86 is more than a memory fence: it actually prevents
>>>>>> speculation.
>>>>>>
>>>>>> Thanks
>>>>> So on x86 at least RMB is free, which is why I never bothered optimizing
>>>>> it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
>>>>> more than the extra indirection in the split ring?
>>>> I don't know, but obviously RMB has a chance to hurt performance to
>>>> some degree. But even on an arch where RMB is free, the packed ring
>>>> still does not show an obvious advantage.
>>> People do measure gains with a PMD on host+guest.
>>> So it's a question of optimizing the packed ring implementation in Linux.
>>
>> Well, a 2%-3% difference is not a lot.
> People reported a 10% gain with tiny packets; others reported more.


Good to know; any pointer? The 2%-3% is the number I got from Jens'
cover letter.


>
> Again, the packed ring is sometimes faster by a factor of 3x, but
> virtio is just virtio: there's a lot going on besides just passing
> buffer addresses from guest to host, and a different ring layout won't
> help with that.
>
>
>> I think it's not hard to make the split ring a bit faster with some
>> small optimizations to the code itself.
>>
>> Thanks
> Speed up the split ring support in the virtio pmd in dpdk? There have
> been several people working on that for a while now. It seems more
> likely that we can speed up the newer packed ring code. E.g. things
> like prefetch have a much better chance of working well with the packed
> layout; with the split one it was a wash IIRC.


But what happens when in-order is implemented for the packed ring?

I posted a patch that increases PPS by 10% with fewer than 10 lines of
code for vhost (bypassing the avail ring reading). I have a similar
patch for dpdk but just haven't had time to test it. A similar
optimization could easily be applied to the used ring for TX.
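
The idea, roughly (a sketch, not the exact patch; it assumes a
power-of-two ring and single-descriptor chains):

	/* With in-order, heads are consecutive: the i-th pending buffer
	 * starts at descriptor (last_avail_idx + i) % num, so step (2)
	 * of the split-ring sequence above (reading avail->ring[]) can
	 * be skipped entirely.
	 */
	for (i = 0; i < n; i++)
		head[i] = (vq->last_avail_idx + i) & (vq->num - 1);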

Thanks


>
>>>
>>>>> But my point was really fundamental - if ring accesses are expensive
>>>>> then we should batch them.
>>>> I don't object to batching; the reasons they are expensive could be:
>>>>
>>>> 1) unnecessary overhead caused by speculation barriers and checks like SMAP
>>>> 2) cache contention
>>>>
>>>> So it does not conflict with the effort I made to remove 1). My plan is:
>>>> for metadata, try to eliminate 1) completely. For data, we can do batch
>>>> copying to amortize its cost. Avail/descriptor batching can then be
>>>> tried on top.
>>>>
>>>>
>>>>>     Right now we have an API that gets
>>>>> an iovec directly. That limits the optimizations you can do.
>>>>>
>>>>> The translation works like this:
>>>>>
>>>>> ring -> valid descriptors -> iovecs
>>>>>
>>>>> We should have APIs for each step that work in batches.
>>>>>
>>>> Yes.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>>>>>> So packed layout should show the gain with this approach.
>>>>>>>>> That could be motivation enough to finally enable vhost packed ring
>>>>>>>>> support.
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>


* Re: thoughts stac/clac and get user for vhost
  2019-01-07  6:54                 ` Jason Wang
@ 2019-01-07 14:45                   ` Michael S. Tsirkin
  2019-01-08 10:09                     ` Jason Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2019-01-07 14:45 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev

On Mon, Jan 07, 2019 at 02:54:19PM +0800, Jason Wang wrote:
> 
> On 2019/1/7 1:42 PM, Michael S. Tsirkin wrote:
> > On Mon, Jan 07, 2019 at 12:26:51PM +0800, Jason Wang wrote:
> > > On 2019/1/5 5:25 AM, Michael S. Tsirkin wrote:
> > > > On Wed, Jan 02, 2019 at 11:25:14AM +0800, Jason Wang wrote:
> > > > > On 2018/12/31 2:40 AM, Michael S. Tsirkin wrote:
> > > > > > On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
> > > > > > > On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
> > > > > > > > On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
> > > > > > > > > On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > Hi!
> > > > > > > > > > I was just wondering: packed ring batches things naturally.
> > > > > > > > > > E.g.
> > > > > > > > > > 
> > > > > > > > > > user_access_begin
> > > > > > > > > > check descriptor valid
> > > > > > > > > > smp_rmb
> > > > > > > > > > copy descriptor
> > > > > > > > > > user_access_end
> > > > > > > > > But that would require speculating on descriptors (which may only work
> > > > > > > > > for in-order, or may even violate the spec). Without speculation, only
> > > > > > > > > the two accesses to a single descriptor can be batched. For the split
> > > > > > > > > ring, we can batch more, since we know how many descriptors are pending
> > > > > > > > > (avail_idx - last_avail_idx).
> > > > > > > > >
> > > > > > > > > Anything I'm missing?
> > > > > > > > > 
> > > > > > > > > Thanks
> > > > > > > > > 
> > > > > > > > just check more descriptors in a loop:
> > > > > > > > 
> > > > > > > >      user_access_begin
> > > > > > > >      for (i = 0; i < 16; ++i) {
> > > > > > > > 	 if (!descriptor valid)
> > > > > > > > 		break;
> > > > > > > > 	 smp_rmb
> > > > > > > > 	 copy descriptor
> > > > > > > >      }
> > > > > > > >      user_access_end
> > > > > > > > 
> > > > > > > > you don't really need to know how many there are ahead of time, since
> > > > > > > > you still copy them one by one.
> > > > > > > So let's look at the case of the split ring:
> > > > > > >
> > > > > > > user_access_begin
> > > > > > >
> > > > > > > n = avail_idx - last_avail_idx (1)
> > > > > > > n = MIN(n, 16)
> > > > > > >
> > > > > > > smp_rmb
> > > > > > >
> > > > > > > read n entries from avail_ring (2)
> > > > > > >
> > > > > > > for (i = 0; i < n; i++)
> > > > > > >     copy descriptor (3)
> > > > > > >
> > > > > > > user_access_end
> > > > > > >
> > > > > > > Consider the case of a heavy workload. For the packed ring, we have 32
> > > > > > > userspace accesses and 16 smp_rmb() calls.
> > > > > > >
> > > > > > > For the split ring we have:
> > > > > > >
> > > > > > > (1) 1 time
> > > > > > > (2) 2 times at most
> > > > > > > (3) 16 times
> > > > > > >
> > > > > > > That is 19 userspace accesses and a single smp_rmb(). In fact, (2) could
> > > > > > > be eliminated with in-order, and (3) could be batched completely with
> > > > > > > in-order and partially when out of order.
> > > > > > >
> > > > > > > I don't see how the packed ring helps here, especially considering that
> > > > > > > lfence on x86 is more than a memory fence: it actually prevents
> > > > > > > speculation.
> > > > > > > 
> > > > > > > Thanks
> > > > > > So on x86 at least RMB is free, which is why I never bothered optimizing
> > > > > > it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
> > > > > > more than the extra indirection in the split ring?
> > > > > I don't know, but obviously RMB has a chance to hurt performance to
> > > > > some degree. But even on an arch where RMB is free, the packed ring
> > > > > still does not show an obvious advantage.
> > > > People do measure gains with a PMD on host+guest.
> > > > So it's a question of optimizing the packed ring implementation in Linux.
> > > 
> > > Well, a 2%-3% difference is not a lot.
> > People reported a 10% gain with tiny packets; others reported more.
> 
> 
> Good to know; any pointer? The 2%-3% is the number I got from Jens'
> cover letter.

Oh, interesting. Also, Jens' cover letter is only from an earlier
version, Jan 29. What happened between those two dates I don't know;
worth investigating.

> 
> > 
> > Again, the packed ring is sometimes faster by a factor of 3x, but
> > virtio is just virtio: there's a lot going on besides just passing
> > buffer addresses from guest to host, and a different ring layout won't
> > help with that.
> > 
> > 
> > > I think it's not hard to make the split ring a bit faster with some
> > > small optimizations to the code itself.
> > > 
> > > Thanks
> > Speed up the split ring support in the virtio pmd in dpdk? There have
> > been several people working on that for a while now. It seems more
> > likely that we can speed up the newer packed ring code. E.g. things
> > like prefetch have a much better chance of working well with the packed
> > layout; with the split one it was a wash IIRC.
> 
> 
> But what happens when in-order is implemented for the packed ring?
>
> I posted a patch that increases PPS by 10% with fewer than 10 lines of
> code for vhost (bypassing the avail ring reading). I have a similar
> patch for dpdk but just haven't had time to test it. A similar
> optimization could easily be applied to the used ring for TX.
> 
> Thanks

Oh I have no doubt we can speed things up with interface extensions.


> 
> > 
> > > > 
> > > > > > But my point was really fundamental - if ring accesses are expensive
> > > > > > then we should batch them.
> > > > > I don't object to batching; the reasons they are expensive could be:
> > > > >
> > > > > 1) unnecessary overhead caused by speculation barriers and checks like SMAP
> > > > > 2) cache contention
> > > > >
> > > > > So it does not conflict with the effort I made to remove 1). My plan is:
> > > > > for metadata, try to eliminate 1) completely. For data, we can do batch
> > > > > copying to amortize its cost. Avail/descriptor batching can then be
> > > > > tried on top.
> > > > > 
> > > > > 
> > > > > >     Right now we have an API that gets
> > > > > > an iovec directly. That limits the optimizations you can do.
> > > > > > 
> > > > > > The translation works like this:
> > > > > > 
> > > > > > ring -> valid descriptors -> iovecs
> > > > > > 
> > > > > > We should have APIs for each step that work in batches.
> > > > > > 
> > > > > Yes.
> > > > > 
> > > > > Thanks
> > > > > 
> > > > > 
> > > > > > > > > > So packed layout should show the gain with this approach.
> > > > > > > > > > That could be motivation enough to finally enable vhost packed ring
> > > > > > > > > > support.
> > > > > > > > > > 
> > > > > > > > > > Thoughts?
> > > > > > > > > > 


* Re: thoughts stac/clac and get user for vhost
  2019-01-07 14:45                   ` Michael S. Tsirkin
@ 2019-01-08 10:09                     ` Jason Wang
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Wang @ 2019-01-08 10:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev


On 2019/1/7 10:45 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 02:54:19PM +0800, Jason Wang wrote:
>> On 2019/1/7 1:42 PM, Michael S. Tsirkin wrote:
>>> On Mon, Jan 07, 2019 at 12:26:51PM +0800, Jason Wang wrote:
>>>> On 2019/1/5 5:25 AM, Michael S. Tsirkin wrote:
>>>>> On Wed, Jan 02, 2019 at 11:25:14AM +0800, Jason Wang wrote:
>>>>>> On 2018/12/31 2:40 AM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
>>>>>>>> On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
>>>>>>>>>> On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>> Hi!
>>>>>>>>>>> I was just wondering: packed ring batches things naturally.
>>>>>>>>>>> E.g.
>>>>>>>>>>>
>>>>>>>>>>> user_access_begin
>>>>>>>>>>> check descriptor valid
>>>>>>>>>>> smp_rmb
>>>>>>>>>>> copy descriptor
>>>>>>>>>>> user_access_end
>>>>>>>>>> But that would require speculating on descriptors (which may only work
>>>>>>>>>> for in-order, or may even violate the spec). Without speculation, only
>>>>>>>>>> the two accesses to a single descriptor can be batched. For the split
>>>>>>>>>> ring, we can batch more, since we know how many descriptors are pending
>>>>>>>>>> (avail_idx - last_avail_idx).
>>>>>>>>>>
>>>>>>>>>> Anything I'm missing?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>> just check more descriptors in a loop:
>>>>>>>>>
>>>>>>>>>       user_access_begin
>>>>>>>>>       for (i = 0; i < 16; ++i) {
>>>>>>>>> 	 if (!descriptor valid)
>>>>>>>>> 		break;
>>>>>>>>> 	 smp_rmb
>>>>>>>>> 	 copy descriptor
>>>>>>>>>       }
>>>>>>>>>       user_access_end
>>>>>>>>>
>>>>>>>>> you don't really need to know how many there are ahead of time, since
>>>>>>>>> you still copy them one by one.
>>>>>>>> So let's look at the case of the split ring:
>>>>>>>>
>>>>>>>> user_access_begin
>>>>>>>>
>>>>>>>> n = avail_idx - last_avail_idx (1)
>>>>>>>> n = MIN(n, 16)
>>>>>>>>
>>>>>>>> smp_rmb
>>>>>>>>
>>>>>>>> read n entries from avail_ring (2)
>>>>>>>>
>>>>>>>> for (i = 0; i < n; i++)
>>>>>>>>     copy descriptor (3)
>>>>>>>>
>>>>>>>> user_access_end
>>>>>>>>
>>>>>>>> Consider the case of a heavy workload. For the packed ring, we have 32
>>>>>>>> userspace accesses and 16 smp_rmb() calls.
>>>>>>>>
>>>>>>>> For the split ring we have:
>>>>>>>>
>>>>>>>> (1) 1 time
>>>>>>>> (2) 2 times at most
>>>>>>>> (3) 16 times
>>>>>>>>
>>>>>>>> That is 19 userspace accesses and a single smp_rmb(). In fact, (2) could
>>>>>>>> be eliminated with in-order, and (3) could be batched completely with
>>>>>>>> in-order and partially when out of order.
>>>>>>>>
>>>>>>>> I don't see how the packed ring helps here, especially considering that
>>>>>>>> lfence on x86 is more than a memory fence: it actually prevents
>>>>>>>> speculation.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>> So on x86 at least RMB is free, which is why I never bothered optimizing
>>>>>>> it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
>>>>>>> more than the extra indirection in the split ring?
>>>>>> I don't know, but obviously RMB has a chance to hurt performance to
>>>>>> some degree. But even on an arch where RMB is free, the packed ring
>>>>>> still does not show an obvious advantage.
>>>>> People do measure gains with a PMD on host+guest.
>>>>> So it's a question of optimizing the packed ring implementation in Linux.
>>>> Well, a 2%-3% difference is not a lot.
>>> People reported a 10% gain with tiny packets; others reported more.
>>
>> Good to know; any pointer? The 2%-3% is the number I got from Jens'
>> cover letter.
> Oh, interesting. Also, Jens' cover letter is only from an earlier
> version, Jan 29. What happened between those two dates I don't know;
> worth investigating.


Btw, the increased number of userspace memory accesses looks like the
root cause of the regression in the packed ring implementation in
vhost. If I do some out-of-spec tricks to reduce them, the packed ring
ends up at most as fast as the split ring. I wonder whether the same
holds for hardware implementations, considering that each PCI
transaction is not free.


>
>>> Again, the packed ring is sometimes faster by a factor of 3x, but
>>> virtio is just virtio: there's a lot going on besides just passing
>>> buffer addresses from guest to host, and a different ring layout won't
>>> help with that.
>>>
>>>
>>>> I think it's not hard to make the split ring a bit faster with some
>>>> small optimizations to the code itself.
>>>>
>>>> Thanks
>>> Speed up the split ring support in the virtio pmd in dpdk? There have
>>> been several people working on that for a while now. It seems more
>>> likely that we can speed up the newer packed ring code. E.g. things
>>> like prefetch have a much better chance of working well with the packed
>>> layout; with the split one it was a wash IIRC.
>>
>> But what happens when in-order is implemented for the packed ring?
>>
>> I posted a patch that increases PPS by 10% with fewer than 10 lines of
>> code for vhost (bypassing the avail ring reading). I have a similar
>> patch for dpdk but just haven't had time to test it. A similar
>> optimization could easily be applied to the used ring for TX.
>>
>> Thanks
> Oh I have no doubt we can speed things up with interface extensions.
>

Ok, let me repost the in-order series for vhost.

Thanks


>>>>>>> But my point was really fundamental - if ring accesses are expensive
>>>>>>> then we should batch them.
>>>>>> I don't object to batching; the reasons they are expensive could be:
>>>>>>
>>>>>> 1) unnecessary overhead caused by speculation barriers and checks like SMAP
>>>>>> 2) cache contention
>>>>>>
>>>>>> So it does not conflict with the effort I made to remove 1). My plan is:
>>>>>> for metadata, try to eliminate 1) completely. For data, we can do batch
>>>>>> copying to amortize its cost. Avail/descriptor batching can then be
>>>>>> tried on top.
>>>>>>
>>>>>>
>>>>>>>      Right now we have an API that gets
>>>>>>> an iovec directly. That limits the optimizations you can do.
>>>>>>>
>>>>>>> The translation works like this:
>>>>>>>
>>>>>>> ring -> valid descriptors -> iovecs
>>>>>>>
>>>>>>> We should have APIs for each step that work in batches.
>>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>>>>>> So packed layout should show the gain with this approach.
>>>>>>>>>>> That could be motivation enough to finally enable vhost packed ring
>>>>>>>>>>> support.
>>>>>>>>>>>
>>>>>>>>>>> Thoughts?
>>>>>>>>>>>


Thread overview: 12 messages
2018-12-25 16:41 thoughts stac/clac and get user for vhost Michael S. Tsirkin
2018-12-26  4:03 ` Jason Wang
2018-12-26 15:06   ` Michael S. Tsirkin
2018-12-27  9:55     ` Jason Wang
2018-12-30 18:40       ` Michael S. Tsirkin
2019-01-02  3:25         ` Jason Wang
2019-01-04 21:25           ` Michael S. Tsirkin
2019-01-07  4:26             ` Jason Wang
2019-01-07  5:42               ` Michael S. Tsirkin
2019-01-07  6:54                 ` Jason Wang
2019-01-07 14:45                   ` Michael S. Tsirkin
2019-01-08 10:09                     ` Jason Wang
