From: "Wang, Zhihong"
Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
Date: Sun, 25 Sep 2016 05:41:55 +0000
Message-ID: <8F6C2BD409508844A0EFC19955BE09414E7B6EA6@SHSMSX103.ccr.corp.intel.com>
References: <1471319402-112998-1-git-send-email-zhihong.wang@intel.com>
 <8F6C2BD409508844A0EFC19955BE09414E7B6204@SHSMSX103.ccr.corp.intel.com>
 <1536480.IYe8r5XoNN@xps13>
In-Reply-To: <1536480.IYe8r5XoNN@xps13>
To: Thomas Monjalon, Jianbo Liu
Cc: "dev@dpdk.org", Yuanhan Liu, Maxime Coquelin

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Friday, September 23, 2016 9:41 PM
> To: Jianbo Liu
> Cc: dev@dpdk.org; Wang, Zhihong; Yuanhan Liu; Maxime Coquelin
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> 2016-09-23 18:41, Jianbo Liu:
> > On 23 September 2016 at 10:56, Wang, Zhihong wrote:
> > .....
> > > This is expected because the 2nd patch is just a baseline and all optimization
> > > patches are organized in the rest of this patch set.
> > >
> > > I think you can do bottleneck analysis on ARM to see what's slowing down the
> > > perf, there might be some micro-arch complications there, most likely in
> > > memcpy.
> > >
> > > Do you use glibc's memcpy? I suggest hand-crafting one on your own.
> > >
> > > Could you publish the mrg_rxbuf=on data also? Since it's more widely used
> > > in terms of spec integrity.
> > >
> > I don't think it will be helpful for you, considering the differences
> > between x86 and arm.


Hi Jianbo,

This patch does help on ARM for small packets such as 64B ones, which
shows that x86 and ARM behave similarly with respect to the caching
optimization in this patch.

My estimation is based on:

 1. The last patch is for mrg_rxbuf=on, and since you said it helps
    perf, we can ignore it for now while we discuss mrg_rxbuf=off

 2. Vhost enqueue perf =
    Ring overhead + Virtio header overhead + Data memcpy overhead

 3. This patch helps small-packet traffic, which means it helps the
    ring + virtio header operations

 4. So, when you say perf drops when the packet size is larger than
    512B, this is most likely caused by memcpy on ARM not working
    well with this patch

I'm not saying glibc's memcpy is not good enough; it's just that this
is a rather special use case. And since we see specialized memcpy
+ this patch give significantly better performance than other
combinations on x86, we suggest hand-crafting a specialized memcpy
for it.

Of course on ARM this is still just my speculation, and we need to
either prove it or find the actual root cause.

It would be **REALLY HELPFUL** if you could test this patch on ARM
for the mrg_rxbuf=on cases to see whether this patch is in fact
helpful on ARM at all, since mrg_rxbuf=on is the more widely used
case.


Thanks
Zhihong


> > So please move on with this patchset...
>
> Jianbo,
> I don't understand.
> You said that the 2nd patch is a regression:
> -       volatile uint16_t       last_used_idx;
> +       uint16_t                last_used_idx;
>
> And the overall series leads to a performance regression
> for packets > 512 B, right?
> But we don't know whether you have tested the v6 or not.
>
> Zhihong talked about some improvements possible in rte_memcpy.
> ARM64 is using libc memcpy in rte_memcpy.
>
> Now you seem to give up.
> Does it mean you accept having a regression in 16.11 release?
> Are you working on rte_memcpy?
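
For reference, the cost breakdown in point 2 of the estimate above (ring overhead + virtio header overhead + data memcpy overhead) can be written as a small model. This is only a sketch: `struct enqueue_cost`, `enqueue_cycles()` and `copy_share()` are hypothetical names, not DPDK APIs, and the cycle figures are whatever the caller supplies, not measurements. It assumes the copy cost grows linearly with packet size while the ring and header costs are fixed per packet.

```c
#include <stddef.h>

/* Hypothetical per-packet cost model (not a DPDK structure).
 * All fields are caller-supplied estimates in CPU cycles. */
struct enqueue_cost {
    double ring_cycles;          /* avail/used ring bookkeeping, per packet */
    double hdr_cycles;           /* writing the virtio net header, per packet */
    double copy_cycles_per_byte; /* payload memcpy cost, per byte (assumed linear) */
};

/* Total per-packet cost = ring + virtio header + data memcpy. */
static double enqueue_cycles(const struct enqueue_cost *c, size_t pkt_len)
{
    return c->ring_cycles + c->hdr_cycles
         + c->copy_cycles_per_byte * (double)pkt_len;
}

/* Fraction of the per-packet cost spent in the payload copy (0.0 to 1.0). */
static double copy_share(const struct enqueue_cost *c, size_t pkt_len)
{
    double total = enqueue_cycles(c, pkt_len);

    if (total <= 0.0)
        return 0.0; /* degenerate model: nothing to attribute */
    return c->copy_cycles_per_byte * (double)pkt_len / total;
}
```

With any fixed per-packet terms, `copy_share()` rises with `pkt_len`, which is the reasoning behind points 3 and 4: a patch that trims the ring and header terms shows up at 64B, while behaviour above 512B is dominated by the copy term. Filling the struct with figures measured on the ARM platform would confirm or refute that for the real hardware.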