From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Wang, Zhihong" <zhihong.wang@intel.com>
Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
Date: Sun, 25 Sep 2016 05:41:55 +0000
Message-ID: <8F6C2BD409508844A0EFC19955BE09414E7B6EA6@SHSMSX103.ccr.corp.intel.com>
References: <1471319402-112998-1-git-send-email-zhihong.wang@intel.com>
 <8F6C2BD409508844A0EFC19955BE09414E7B6204@SHSMSX103.ccr.corp.intel.com>
 <CAP4Qi3_DxAnvs0jX1P=G_PiLnRRbP5Wty-eU-OPE_81RGCAuTA@mail.gmail.com>
 <1536480.IYe8r5XoNN@xps13>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Cc: "dev@dpdk.org" <dev@dpdk.org>, Yuanhan Liu <yuanhan.liu@linux.intel.com>,
 Maxime Coquelin <maxime.coquelin@redhat.com>
To: Thomas Monjalon <thomas.monjalon@6wind.com>, Jianbo Liu
 <jianbo.liu@linaro.org>
Return-path: <dev-bounces@dpdk.org>
Received: from mga03.intel.com (mga03.intel.com [134.134.136.65])
 by dpdk.org (Postfix) with ESMTP id B42F64B79
 for <dev@dpdk.org>; Sun, 25 Sep 2016 07:42:00 +0200 (CEST)
In-Reply-To: <1536480.IYe8r5XoNN@xps13>
Content-Language: en-US
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>


> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Friday, September 23, 2016 9:41 PM
> To: Jianbo Liu <jianbo.liu@linaro.org>
> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan Liu
> <yuanhan.liu@linux.intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> 2016-09-23 18:41, Jianbo Liu:
> > On 23 September 2016 at 10:56, Wang, Zhihong <zhihong.wang@intel.com>
> wrote:
> > .....
> > > > This is expected because the 2nd patch is just a baseline and all
> > > > optimization patches are organized in the rest of this patch set.
> > > >
> > > > I think you can do bottleneck analysis on ARM to see what's slowing
> > > > down the perf, there might be some micro-arch complications there,
> > > > most likely in memcpy.
> > >
> > > > Do you use glibc's memcpy? I suggest hand-crafting it on your own.
> > >
> > > > Could you publish the mrg_rxbuf=on data also? Since it's more widely
> > > > used in terms of spec integrity.
> > >
> > I don't think it will be helpful for you, considering the differences
> > between x86 and arm.


Hi Jianbo,

This patch does help on ARM for small packets such as 64B ones,
which shows that x86 and ARM behave similarly with respect to the
caching optimization in this patch.

My estimation is based on:

 1. The last patch is for mrg_rxbuf=on, and since you said it helps
    perf, we can ignore it for now when we discuss mrg_rxbuf=off

 2. Vhost enqueue perf =
    Ring overhead + Virtio header overhead + Data memcpy overhead

 3. This patch helps small packets traffic, which means it helps
    ring + virtio header operations

 4. So, when you say perf drops when the packet size is larger than
    512B, this is most likely caused by memcpy on ARM not working
    well with this patch
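
To make the reasoning in points 2 to 4 concrete, here is a minimal
sketch of the cost model in C. It assumes the ring and virtio header
work is a fixed per-packet cost and the data copy scales linearly with
packet size. The function name and both cost parameters are
hypothetical, chosen only for illustration, and are not measurements
from x86 or ARM.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: per-packet enqueue cost =
 *   fixed_cycles (ring + virtio header) + cycles_per_byte * len (memcpy).
 * Returns the fraction of the total cost spent in the data copy.
 * Both parameters are illustrative inputs, not measured values.
 */
static double
memcpy_share(double fixed_cycles, double cycles_per_byte, size_t len)
{
	double copy_cycles = cycles_per_byte * (double)len;
	double total = fixed_cycles + copy_cycles;

	assert(total > 0.0);
	return copy_cycles / total;
}
```

Whatever values are plugged in, the copy share grows with packet size,
so a gain on 64B packets points at the ring and header path, while a
drop above 512B points at the copy.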

I'm not saying glibc's memcpy is not good enough, it's just that
this is a rather special use case. And since we see specialized
memcpy + this patch give significantly better performance than other
combinations on x86, we suggest hand-crafting a specialized memcpy
for it.
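
As a rough illustration of what "specialized" could mean here, the
sketch below copies in fixed 16-byte blocks so that the compiler can
inline each block as plain loads and stores, and finishes with one
overlapping block for the tail. This is not DPDK's rte_memcpy and not
the code used in this patch set; pkt_copy() and pkt_copy_matches() are
hypothetical names, and the 16-byte block size is an assumption that
would need tuning per micro-architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy len bytes between non-overlapping buffers. */
static inline void
pkt_copy(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t n = len;

	/* Short copies: leave them to the generic routine. */
	if (len < 16) {
		memcpy(d, s, len);
		return;
	}

	/* Bulk: constant-size copies the compiler can inline. */
	while (n >= 16) {
		memcpy(d, s, 16);
		d += 16;
		s += 16;
		n -= 16;
	}

	/* Tail: one 16-byte copy that ends exactly at the last byte. */
	if (n > 0)
		memcpy(d + n - 16, s + n - 16, 16);
}

/* Returns 1 if pkt_copy() writes exactly the bytes memcpy() would. */
static int
pkt_copy_matches(size_t len)
{
	uint8_t src[2048], a[2048], b[2048];
	size_t i;

	assert(len <= sizeof(src));
	for (i = 0; i < sizeof(src); i++)
		src[i] = (uint8_t)(i * 31 + 7);
	memset(a, 0, sizeof(a));
	memset(b, 0, sizeof(b));
	pkt_copy(a, src, len);
	memcpy(b, src, len);
	/* Compare whole buffers so an overrun would also be caught. */
	return memcmp(a, b, sizeof(a)) == 0;
}
```

The overlapping tail avoids a byte-by-byte loop for lengths that are
not a multiple of 16, at the price of rewriting a few bytes twice.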

Of course on ARM this is still just my speculation, and we need to
either prove it or find the actual root cause.
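
One way to start checking this is a small standalone sweep of copy
cost over packet sizes, sketched below. It is only a first-order test
under stated assumptions: it uses the standard clock() timer, copies
between two static buffers that stay hot in cache, and ignores the
vhost ring entirely. copy_fn, ns_per_copy() and print_sweep() are
hypothetical names, not DPDK APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Any routine with memcpy's signature can be plugged in here. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t len);

static uint8_t src_buf[4096], dst_buf[4096];

/* Average CPU time per copy, in nanoseconds, for one packet size. */
static double
ns_per_copy(copy_fn fn, size_t len, unsigned long iters)
{
	volatile uint8_t sink = 0;
	clock_t start, end;
	unsigned long i;

	assert(len >= 1 && len <= sizeof(src_buf) && iters > 0);
	start = clock();
	for (i = 0; i < iters; i++) {
		src_buf[0] = (uint8_t)i;  /* vary the input each round */
		fn(dst_buf, src_buf, len);
		sink ^= dst_buf[0];       /* keep the result observable */
	}
	end = clock();
	(void)sink;
	return (double)(end - start) * 1e9 / CLOCKS_PER_SEC / (double)iters;
}

/* Print one line per packet size for the given copy routine. */
static void
print_sweep(const char *name, copy_fn fn)
{
	static const size_t sizes[] = { 64, 128, 256, 512, 1024, 1518 };
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%s %4zu B: %.1f ns/copy\n", name, sizes[i],
		       ns_per_copy(fn, sizes[i], 1000000UL));
}
```

Calling print_sweep("libc", memcpy) from main() on both x86 and ARM,
and again with a hand-crafted routine, would show whether the copy
itself changes behavior around 512B before any vhost code is involved.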

It would be **REALLY HELPFUL** if you could test this patch on ARM
for the mrg_rxbuf=on case to see whether it is in fact helpful on
ARM at all, since mrg_rxbuf=on is the more widely used case.


Thanks
Zhihong


> > So please move on with this patchset...
>
> Jianbo,
> I don't understand.
> You said that the 2nd patch is a regression:
> -       volatile uint16_t       last_used_idx;
> +       uint16_t                last_used_idx;
>
> And the overall series leads to a performance regression
> for packets > 512 B, right?
> But we don't know whether you have tested the v6 or not.
>
> Zhihong talked about some improvements possible in rte_memcpy.
> ARM64 is using libc memcpy in rte_memcpy.
>
> Now you seem to give up.
> Does it mean you accept having a regression in 16.11 release?
> Are you working on rte_memcpy?