From mboxrd@z Thu Jan  1 00:00:00 1970
From: Maxime Coquelin <maxime.coquelin@redhat.com>
Subject: Re: [PATCH] vhost_user: protect active rings from async
	ring changes
Date: Tue, 12 Dec 2017 09:41:20 +0100
Message-ID: <84e9797b-2eee-4c5f-5f33-e497e20d5c1c@redhat.com>
References: <20171206135329.652-1-vkaplans@redhat.com>
 <ED26CBA2FAD1BF48A8719AEF02201E36513789F8@SHSMSX103.ccr.corp.intel.com>
 <00a522dc-2287-b3ed-7ca4-e666f6217c91@redhat.com>
 <ED26CBA2FAD1BF48A8719AEF02201E3651379F80@SHSMSX103.ccr.corp.intel.com>
 <7352a09d-f90d-5b3a-51f9-bfc82faf744e@redhat.com>
 <ED26CBA2FAD1BF48A8719AEF02201E365137ACAC@SHSMSX103.ccr.corp.intel.com>
 <ea080aa6-8f0b-4788-bc71-3fdf5b2d636e@redhat.com>
 <ED26CBA2FAD1BF48A8719AEF02201E365138349F@SHSMSX103.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "stable@dpdk.org" <stable@dpdk.org>,
 "jfreiman@redhat.com" <jfreiman@redhat.com>
To: "Tan, Jianfeng" <jianfeng.tan@intel.com>,
 Victor Kaplansky <vkaplans@redhat.com>, "dev@dpdk.org" <dev@dpdk.org>,
 "yliu@fridaylinux.org" <yliu@fridaylinux.org>,
 "Bie, Tiwei" <tiwei.bie@intel.com>
Return-path: <dev-bounces@dpdk.org>
In-Reply-To: <ED26CBA2FAD1BF48A8719AEF02201E365138349F@SHSMSX103.ccr.corp.intel.com>
Content-Language: en-US
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>


On 12/12/2017 06:25 AM, Tan, Jianfeng wrote:
> 
> 
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Friday, December 8, 2017 6:12 PM
>> To: Tan, Jianfeng; Victor Kaplansky; dev@dpdk.org; yliu@fridaylinux.org; Bie,
>> Tiwei
>> Cc: stable@dpdk.org; jfreiman@redhat.com
>> Subject: Re: [PATCH] vhost_user: protect active rings from async ring
>> changes
>>
>> Hi Jianfeng,
>>
>> On 12/08/2017 09:51 AM, Tan, Jianfeng wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>> Sent: Friday, December 8, 2017 4:36 PM
>>>> To: Tan, Jianfeng; Victor Kaplansky; dev@dpdk.org; yliu@fridaylinux.org;
>> Bie,
>>>> Tiwei
>>>> Cc: stable@dpdk.org; jfreiman@redhat.com
>>>> Subject: Re: [PATCH] vhost_user: protect active rings from async ring
>>>> changes
>>>>
>>>>
>>>>
>>>> On 12/08/2017 03:14 AM, Tan, Jianfeng wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>>>> Sent: Thursday, December 7, 2017 6:02 PM
>>>>>> To: Tan, Jianfeng; Victor Kaplansky; dev@dpdk.org;
>> yliu@fridaylinux.org;
>>>> Bie,
>>>>>> Tiwei
>>>>>> Cc: stable@dpdk.org; jfreiman@redhat.com
>>>>>> Subject: Re: [PATCH] vhost_user: protect active rings from async ring
>>>>>> changes
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 12/07/2017 10:33 AM, Tan, Jianfeng wrote:
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Victor Kaplansky [mailto:vkaplans@redhat.com]
>>>>>>>> Sent: Wednesday, December 6, 2017 9:56 PM
>>>>>>>> To: dev@dpdk.org; yliu@fridaylinux.org; Bie, Tiwei; Tan, Jianfeng;
>>>>>>>> vkaplans@redhat.com
>>>>>>>> Cc: stable@dpdk.org; jfreiman@redhat.com; Maxime Coquelin
>>>>>>>> Subject: [PATCH] vhost_user: protect active rings from async ring
>>>> changes
>>>>>>>>
>>>>>>>> When performing live migration or memory hot-plugging,
>>>>>>>> the changes to the device and vrings made by message handler
>>>>>>>> done independently from vring usage by PMD threads.
>>>>>>>>
>>>>>>>> This causes for example segfauls during live-migration
>>>>>>>
>>>>>>> segfauls ->segfaults?
>>>>>>>
>>>>>>>> with MQ enable, but in general virtually any request
>>>>>>>> sent by qemu changing the state of device can cause
>>>>>>>> problems.
>>>>>>>>
>>>>>>>> These patches fixes all above issues by adding a spinlock
>>>>>>>> to every vring and requiring message handler to start operation
>>>>>>>> only after ensuring that all PMD threads related to the divece
>>>>>>>
>>>>>>> Another typo: divece.
>>>>>>>
>>>>>>>> are out of critical section accessing the vring data.
>>>>>>>>
>>>>>>>> Each vring has its own lock in order to not create contention
>>>>>>>> between PMD threads of different vrings and to prevent
>>>>>>>> performance degradation by scaling queue pair number.
>>>>>>>
>>>>>>> Also wonder how much overhead it brings.
>>>>>>>
>>>>>>> Instead of locking each vring, can we just, waiting a while (10us for
>>>> example)
>>>>>> after call destroy_device() callback so that every PMD thread has
>> enough
>>>>>> time to skip out the criterial area?
>>>>>>
>>>>>> No, because we are not destroying the device when it is needed.
>>>>>> Actually, once destroy_device() is called, it is likely that the
>>>>>> application has taken care the ring aren't being processed anymore
>>>>>> before returning from the callback (This is at least the case with Vhost
>>>>>> PMD).
>>>>>
>>>>> OK, I did not put it right way as there are multiple cases above: migration
>>>> and memory hot plug. Let me try again:
>>>>>
>>>>> Whenever a vhost thread handles a message affecting PMD threads,
>> (like
>>>> SET_MEM_TABLE, GET_VRING_BASE, etc) we can remove the dev flag -
>>>> VIRTIO_DEV_RUNNING, and wait for a while so that PMD threads skip out
>> of
>>>> those criterial area. After message handling, reset the flag -
>>>> VIRTIO_DEV_RUNNING.
>>>>
>>>> I think you mean clearing vq's enabled flag, because PMD threads never
>>>> check the VIRTIO_DEV_RUNNING flag.
>>>
>>> Ah, yes.
>>>
>>>>
>>>>> I suppose it can work, basing on an assumption that PMD threads work
>> in
>>>> polling mode and can skip criterial area quickly and inevitably.
>>>>
>>>> That sounds very fragile, because if the CPU aren't perfectly isolated,
>>>> your PMD thread can be preempted for interrupt handling for example.
>>>>
>>>> Or what if for some reason the PMD thread CPU stalls for a short while?
>>>>
>>>> The later is unlikely, but if it happens, it will be hard to debug.
>>>>
>>>> Let's see first the performance impact of using the spinlock. It might
>>>> not be that important because 99.9999% of the times, it will not even
>>>> spin.
>>>
>>> Fair enough.
>>
>> I did some benchmarks on my Broadwell test bench (see patch below), and
>> it seems that depending on the benchmark, perfs are on par, or better
>> with the spinlock! I guess it explains because with the spinlock there
>> is a better batching and less concurrent accesses on the rings, but I'm
>> not sure.
>>
>> Please find my results below (CPU E5-2667 v4 @ 3.20GHz):
>>
>>         Bench         v17.11     v17.11 + spinlock
>>    ---------------- ----------- -------------------
>>     PVP Bidir run1   19.29Mpps   19.26Mpps
>>     PVP Bidir run2   19.26Mpps   19.28Mpps
>>     TxOnly           18.47Mpps   18.83Mpps
>>     RxOnly           13.83Mpps   13.83Mpps
>>     IO Loopback      7.94Mpps    7.97Mpps
>>
> 
> This number seems really good for throughput.
> 
> FYI, we are recently doing a test to measure how long it takes for a noop (means there is no packets on the virtqueue) vhost dequeue operation. It's like:
> 
>      start_tsc = rte_rdtsc();
>      for (j = 0; j < 10000000; ++j)
>              nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts_burst, nb_pkt_per_burst);
>      end_tsc = rte_rdtsc();
>      printf("%"PRIu64"\n", (end_tsc - start_tsc) / 10000000);

Thanks for sharing, it's an interesting benchmark.
Any chance it could land somewhere in DPDK upstream repo as a test 
application or something else
> Turns out that this patch will make it from 50 cycles -> 80 cycles for each noop rte_eth_rx_burst() operation on vhost port. My server is Haswell 2.3GHz.
I declared the spinlock at the end of the struct, so it may not be in 
the same cache line as enabled flag.
Maybe putting them on the same cache line could have a positive impact
in the vring empty case?

> Note that I'm not against this patch. Just that if such operation takes so many cycles, we might consider to introduce interrupt mode for vhost ports.

Indeed, that could make sense in case of lot of vhost ports processed on
a single CPU.

Thanks,
Maxime
> Thanks,
> Jianfeng
>