Re: [PATCH] vhost: remove lockless enqueue to the virtio ring

From: "Polehn, Mike A" <mike.a.polehn@intel.com>
To: "Xie, Huawei" <huawei.xie@intel.com>,
	"Tan, Jianfeng" <jianfeng.tan@intel.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Cc: "ann.zhuangyanying@huawei.com" <ann.zhuangyanying@huawei.com>
Subject: Re: [PATCH] vhost: remove lockless enqueue to the virtio ring
Date: Tue, 19 Jan 2016 18:33:12 +0000
Message-ID: <745DB4B8861F8E4B9849C970520ABBF1498488E5@ORSMSX102.amr.corp.intel.com>
In-Reply-To: <C37D651A908B024F974696C65296B57B4C5A475A@SHSMSX101.ccr.corp.intel.com>

SMP operations can be very expensive and can sometimes cost hundreds to thousands of clock cycles, depending on the circumstances of the synchronization. How you arrange the SMP operations within the tasks at hand across the SMP cores is what determines top performance. Using traditional general-purpose SMP methods will result in traditional general-purpose performance. Migrating from expert methods (understood by a much smaller group of expert programmers focused on performance) to general libraries (understood by most general-purpose programmers) will greatly reduce the value of DPDK, since the end result will be lower performance and/or less predictable operation in a setting where rate performance, predictability, and low latency are the primary goals.

The best method to date for having multiple outputs to a single port is to use a DPDK queue with multiple producers and a single consumer: the queue performs the SMP operation, so that multiple sources feed a single non-SMP task that outputs to the port (that is why the ports are not SMP protected). Also, when considerable contention from multiple sources occurs often (data being fed at the same time), a DPDK queue with its input and output variables in separate cache lines can give a notable throughput improvement.
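
As a rough illustration of the multi-producer, single-consumer arrangement described above, here is a minimal sketch in plain C11. It is not the DPDK ring code; in DPDK itself this role would normally be filled by an rte_ring created with the RING_F_SC_DEQ flag. All names here (mpsc_ring, ring_enqueue, ring_dequeue, CACHE_LINE, RING_SIZE) are hypothetical, and the 64-byte cache line is an assumption. The producer position and the consumer position are placed in separate cache lines so the two sides do not invalidate each other's line.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size */
#define RING_SIZE  8    /* hypothetical capacity, must be a power of two */
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be a power of two");

struct slot {
    atomic_size_t seq;  /* tells producers and the consumer whose turn it is */
    void *pkt;
};

struct mpsc_ring {
    alignas(CACHE_LINE) atomic_size_t prod_pos; /* shared by all producers */
    alignas(CACHE_LINE) size_t cons_pos;        /* touched by the single consumer only */
    alignas(CACHE_LINE) struct slot slots[RING_SIZE];
};

void ring_init(struct mpsc_ring *r)
{
    atomic_init(&r->prod_pos, 0);
    r->cons_pos = 0;
    for (size_t i = 0; i < RING_SIZE; i++) {
        atomic_init(&r->slots[i].seq, i);
        r->slots[i].pkt = NULL;
    }
}

/* Multi-producer enqueue: returns 0 on success, -1 if the ring is full. */
int ring_enqueue(struct mpsc_ring *r, void *pkt)
{
    size_t pos = atomic_load_explicit(&r->prod_pos, memory_order_relaxed);
    for (;;) {
        struct slot *s = &r->slots[pos & (RING_SIZE - 1)];
        size_t seq = atomic_load_explicit(&s->seq, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)pos;
        if (diff == 0) {
            /* Slot is free for this position: try to claim it. */
            if (atomic_compare_exchange_weak_explicit(&r->prod_pos, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                s->pkt = pkt;
                atomic_store_explicit(&s->seq, pos + 1, memory_order_release);
                return 0;
            }
            /* CAS failed: pos now holds the current value, retry. */
        } else if (diff < 0) {
            return -1;  /* consumer has not freed this slot yet: full */
        } else {
            pos = atomic_load_explicit(&r->prod_pos, memory_order_relaxed);
        }
    }
}

/* Single-consumer dequeue: no atomic read-modify-write on this side. */
void *ring_dequeue(struct mpsc_ring *r)
{
    struct slot *s = &r->slots[r->cons_pos & (RING_SIZE - 1)];
    size_t seq = atomic_load_explicit(&s->seq, memory_order_acquire);
    if (seq != r->cons_pos + 1)
        return NULL;    /* nothing published at this position: empty */
    void *pkt = s->pkt;
    /* Hand the slot back to producers for the next lap. */
    atomic_store_explicit(&s->seq, r->cons_pos + RING_SIZE, memory_order_release);
    r->cons_pos++;
    return pkt;
}
```

In this arrangement the source threads call ring_enqueue(), and one dedicated thread calls ring_dequeue() and then transmits on the port, so the port itself needs no SMP protection.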

Mike 

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Xie, Huawei
Sent: Tuesday, January 19, 2016 8:44 AM
To: Tan, Jianfeng; dev@dpdk.org
Cc: ann.zhuangyanying@huawei.com
Subject: Re: [dpdk-dev] [PATCH] vhost: remove lockless enqueue to the virtio ring

On 1/20/2016 12:25 AM, Tan, Jianfeng wrote:
> Hi Huawei,
>
> On 1/4/2016 10:46 PM, Huawei Xie wrote:
>> This patch removes the internal lockless enqueue implementation.
>> DPDK doesn't support receiving/transmitting packets from/to the same 
>> queue. Vhost PMD wraps vhost device as normal DPDK port. DPDK 
>> applications normally have their own lock implementation when enqueuing
>> packets to the same queue of a port.
>>
>> The atomic cmpset is a costly operation. This patch should help 
>> performance a bit.
>>
>> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
>> ---
>>   lib/librte_vhost/vhost_rxtx.c | 86 +++++++++++++------------------------------
>>   1 file changed, 25 insertions(+), 61 deletions(-)
>>
>> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
>> index bbf3fac..26a1b9c 100644
>> --- a/lib/librte_vhost/vhost_rxtx.c
>> +++ b/lib/librte_vhost/vhost_rxtx.c
>
> I think the vhost example will not work well with this patch when
> vm2vm=software.
>
> Test case:
> Two virtio ports handled by two pmd threads. Thread 0 polls pkts from
> the physical NIC and sends them to virtio0, while thread 1 receives pkts
> from virtio1 and routes them to virtio0.

The vhost port will be wrapped as a port by the vhost PMD. A DPDK
application treats all physical and virtual ports equally as ports. When
two DPDK threads try to enqueue to the same port, the application needs
to handle the contention. None of the physical PMDs support concurrent
enqueuing/dequeuing. The vhost PMD should expose the same behavior
unless it is absolutely necessary to expose a difference between PMDs.
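
A minimal sketch of what "the application handles the contention" could look like: the application wraps a non-thread-safe transmit call in its own lock. Everything here is hypothetical (port_txq, pmd_tx_burst_unsafe, app_tx_burst_locked); the unsafe function only stands in for a PMD transmit burst and is not a DPDK API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue state owned by the application. */
struct port_txq {
    atomic_flag lock;       /* application-level spinlock */
    uint32_t    used_idx;   /* stand-in for unsynchronized PMD queue state */
};

void txq_init(struct port_txq *q)
{
    assert(q != NULL);
    atomic_flag_clear(&q->lock);
    q->used_idx = 0;
}

/* Stand-in for a PMD transmit burst that is NOT safe to call concurrently. */
static uint16_t pmd_tx_burst_unsafe(struct port_txq *q, uint16_t nb_pkts)
{
    q->used_idx += nb_pkts;     /* plain read-modify-write, no atomics */
    return nb_pkts;
}

/* The application serializes all senders to the same queue. */
uint16_t app_tx_burst_locked(struct port_txq *q, uint16_t nb_pkts)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;   /* spin until the other sender leaves the critical section */
    uint16_t sent = pmd_tx_burst_unsafe(q, nb_pkts);
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
    return sent;
}
```

With this approach the cost of synchronization is paid only by applications that actually share a queue between threads; an application with one sender per queue calls the unsafe path directly.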

>
>> -
>>           *(volatile uint16_t *)&vq->used->idx += entry_success;
>
> Another unrelated question: Have we ever tried to move this assignment
> out of the loop to save cost, since it is a point of data contention?

This operation itself is not that costly, but it has a side effect on
cache transfer.
It is outside of the loop for the non-mergeable case. For the mergeable
case, it is inside the loop.
Actually, there are pros and cons to doing this per burst or in smaller
steps. I prefer to move it outside of the loop. Let us address this later.
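
To make the trade-off concrete, here is a small sketch comparing the two placements of the index update. It is not the vhost_rxtx.c code; used_ring, publish_per_entry, and publish_per_burst are hypothetical names, and a single writer is assumed (as after the lockless enqueue is removed). Both variants end with the same index, but the per-burst variant writes the shared cache line once instead of once per entry, at the price of the other side seeing the entries later.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the used ring index polled by the other side. */
struct used_ring {
    _Atomic uint16_t idx;
};

/* Update inside the loop: one store to the shared line per entry.
 * Returns the number of stores made to idx. */
unsigned publish_per_entry(struct used_ring *u, uint16_t count)
{
    unsigned stores = 0;
    for (uint16_t i = 0; i < count; i++) {
        /* ... fill used entry i here ... */
        uint16_t cur = atomic_load_explicit(&u->idx, memory_order_relaxed);
        atomic_store_explicit(&u->idx, (uint16_t)(cur + 1), memory_order_release);
        stores++;
    }
    return stores;
}

/* Update outside the loop: entries are counted locally and published once.
 * Returns the number of stores made to idx. */
unsigned publish_per_burst(struct used_ring *u, uint16_t count)
{
    uint16_t entry_success = 0;
    for (uint16_t i = 0; i < count; i++) {
        /* ... fill used entry i here ... */
        entry_success++;
    }
    if (entry_success == 0)
        return 0;
    uint16_t cur = atomic_load_explicit(&u->idx, memory_order_relaxed);
    atomic_store_explicit(&u->idx, (uint16_t)(cur + entry_success),
                          memory_order_release);
    return 1;
}
```

The release store is there so that the filled entries are visible before the new index is; the real code would need an equivalent write barrier wherever the update is placed.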

>
> Thanks,
> Jianfeng
>
>