From: "Wang, Zhihong"
Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
Date: Thu, 13 Oct 2016 06:02:04 +0000
To: Yuanhan Liu
Cc: Jianbo Liu, Thomas Monjalon, Maxime Coquelin, dev@dpdk.org

> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Thursday, October 13, 2016 1:33 PM
> To: Wang, Zhihong
> Cc: Jianbo Liu; Thomas Monjalon; Maxime Coquelin; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> On Wed, Oct 12, 2016 at 12:22:08PM +0000, Wang, Zhihong wrote:
> > > > > 3. What percentage of drop are you seeing?
> > > >
> > > > The test results:
> > > >
> > > > size (bytes)   improvement (%)
> > > > 64               3.92
> > > > 128             11.51
> > > > 256             24.16
> > > > 512            -13.79
> > > > 1024           -22.51
> > > > 1500           -12.22
> > > >
> > > > A correction: performance drops when the packet size is 512 bytes
> > > > or larger.
> > >
> > > I have thought about this twice. Unfortunately, I think I may need to
> > > NACK this series.
> > >
> > > Merging the two code paths into one is really good: as you stated, it
> > > improves maintainability. But only if we see no performance regression
> > > on either path after the refactor. Unfortunately, that's not the case
> > > here: it hurts the performance of one code path (non-mrg Rx).
> > >
> > > That makes me think we should perhaps not do the code path merge at
> > > all. I think that also aligns with what you said before (internally):
> > > we could do the merge only if it gives comparable performance before
> > > and after.
> > >
> > > Besides that, I don't quite like what you did in patch 2 (the enqueue
> > > rewrite): you made a lot of changes in one patch. That means if
> > > something goes wrong, it is hard to narrow down which change introduced
> > > the regression. Sadly, that's exactly what happened here. Weeks have
> > > passed, and I see no progress.
> > >
> > > That's the reason we like the idea of "one patch only does one thing,
> > > an atomic thing".
> >
> > Yuanhan, folks,
> >
> > Thanks for the analysis. I disagree here though.
> >
> > I analyze, develop, and benchmark on x86 platforms, where this patch
> > works great.
>
> Yes, that's a great effort! Thanks to your hard work, we know what the
> bottleneck is and how it can be improved.
>
> However, you don't have to refactor the code (merge the two code paths
> into one) to apply those improvements.
> From what I know, in this patchset there are two factors that could
> improve the performance:
>
> - copying the header together with the packet data
>
> - shadowing the used ring and updating it at once
>
> The overall performance boost I got with your v6 patchset on the mrg-Rx
> code path is about 27% (in the PVP case). I have just applied the 1st
> optimization alone, and it yields about a 20% boost. The rest could be
> covered if we apply the 2nd optimization (I guess).
>
> That would be a clean way to optimize the vhost mergeable Rx path:
>
> - you don't touch the non-mrg Rx path (well, you could probably apply
>   the shadow_used_ring trick to it as well)
>
>   This would at least make sure we see no performance regression like
>   the one reported by the ARM guys.
>
> - you don't refactor the code
>
>   The rewrite from scratch could introduce other issues besides the
>   performance regression. We may just not know about them yet.
>
> Does that make sense to you? If you agree, I think we could still make
> it into this release: these would be small changes after all. For
> example, below is a patch that applies the 1st optimization on top of
> dpdk-next-virtio.

Thanks for this great idea. I think it's a better way to do it.
I'll start making the patch then.
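
To make sure I read the second point (the shadow used ring) correctly:
the idea, as I understand it, is to queue used-ring updates in a private
array and publish them to the shared ring in one pass per burst. Below is
a rough sketch of that idea; the structure and helper names here are
simplified assumptions for illustration, not the actual DPDK code:

#include <stdint.h>

/*
 * Simplified stand-ins for the real structures: the actual code
 * works on struct vhost_virtqueue and its used ring.
 */
struct used_elem {
        uint32_t id;   /* head of the used descriptor chain */
        uint32_t len;  /* total bytes written to that chain */
};

struct vq_sketch {
        struct used_elem *used_ring;   /* shared with the guest */
        volatile uint16_t *used_idx;   /* shared used index */
        uint16_t size;                 /* ring size, power of two */
        uint16_t last_used_idx;
        struct used_elem shadow[256];  /* private per-queue shadow */
        uint16_t shadow_count;
};

/* Record one used-ring update locally; no shared memory is touched. */
static inline void
shadow_used_ring_add(struct vq_sketch *vq, uint32_t id, uint32_t len)
{
        vq->shadow[vq->shadow_count].id  = id;
        vq->shadow[vq->shadow_count].len = len;
        vq->shadow_count++;
}

/*
 * Flush all pending updates in one pass and publish the used index
 * once per burst instead of once per packet, so the shared cache
 * lines bounce between host and guest far less often.
 */
static inline void
shadow_used_ring_flush(struct vq_sketch *vq)
{
        uint16_t i;

        for (i = 0; i < vq->shadow_count; i++) {
                uint16_t slot = (vq->last_used_idx + i) & (vq->size - 1);

                vq->used_ring[slot] = vq->shadow[i];
        }

        /* The real code would need a write barrier (rte_smp_wmb())
         * here before letting the guest observe the new index. */
        vq->last_used_idx += vq->shadow_count;
        *vq->used_idx = vq->last_used_idx;
        vq->shadow_count = 0;
}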
>
> 	--yliu
>
> ---------------------------------------------------------------
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 8a151af..0ddb5af 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -379,7 +379,7 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  			    uint16_t end_idx, struct rte_mbuf *m,
>  			    struct buf_vector *buf_vec)
>  {
> -	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
> +	struct virtio_net_hdr_mrg_rxbuf *virtio_hdr;
>  	uint32_t vec_idx = 0;
>  	uint16_t start_idx = vq->last_used_idx;
>  	uint16_t cur_idx = start_idx;
> @@ -388,6 +388,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	uint32_t desc_offset, desc_avail;
>  	uint32_t cpy_len;
>  	uint16_t desc_idx, used_idx;
> +	uint16_t num_buffers = end_idx - start_idx;
> +	int hdr_copied = 0;
> 
>  	if (unlikely(m == NULL))
>  		return 0;
> @@ -399,16 +401,11 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	if (buf_vec[vec_idx].buf_len < dev->vhost_hlen || !desc_addr)
>  		return 0;
> 
> -	rte_prefetch0((void *)(uintptr_t)desc_addr);
> +	virtio_hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc_addr;
> +	rte_prefetch0(virtio_hdr);
> 
> -	virtio_hdr.num_buffers = end_idx - start_idx;
>  	LOG_DEBUG(VHOST_DATA, "(%d) RX: num merge buffers %d\n",
> -		dev->vid, virtio_hdr.num_buffers);
> -
> -	virtio_enqueue_offload(m, &virtio_hdr.hdr);
> -	copy_virtio_net_hdr(dev, desc_addr, virtio_hdr);
> -	vhost_log_write(dev, buf_vec[vec_idx].buf_addr, dev->vhost_hlen);
> -	PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
> +		dev->vid, num_buffers);
> 
>  	desc_avail = buf_vec[vec_idx].buf_len - dev->vhost_hlen;
>  	desc_offset = dev->vhost_hlen;
> @@ -450,6 +447,15 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  			mbuf_avail = rte_pktmbuf_data_len(m);
>  		}
> 
> +		if (hdr_copied == 0) {
> +			virtio_hdr->num_buffers = num_buffers;
> +			virtio_enqueue_offload(m, &virtio_hdr->hdr);
> +			vhost_log_write(dev, buf_vec[vec_idx].buf_addr, dev->vhost_hlen);
> +			PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
> +
> +			hdr_copied = 1;
> +		}
> +
>  		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
>  		rte_memcpy((void *)((uintptr_t)(desc_addr + desc_offset)),
>  			rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
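
To restate the first optimization in isolation: instead of preparing the
header on the stack and copying it into guest memory as a separate early
step, the header fields are written directly into the descriptor buffer,
right next to where the payload lands. A minimal standalone sketch, with
simplified types and a hypothetical function name (not the actual vhost
code):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct virtio_net_hdr_mrg_rxbuf; offload fields elided. */
struct hdr_sketch {
        uint16_t num_buffers;
};

/*
 * Hypothetical helper: write the header directly into the descriptor
 * buffer and copy the payload right behind it, so the buffer is
 * touched once, front to back, in a single pass.
 */
static size_t
enqueue_one(uint8_t *desc_buf, size_t desc_len,
            const uint8_t *pkt, size_t pkt_len, uint16_t num_buffers)
{
        struct hdr_sketch *hdr = (struct hdr_sketch *)desc_buf;
        size_t copy_len;

        if (desc_len < sizeof(*hdr))
                return 0;

        /* Header written in place, together with the data copy below,
         * instead of being built on the stack and copied separately. */
        hdr->num_buffers = num_buffers;

        copy_len = desc_len - sizeof(*hdr);
        if (copy_len > pkt_len)
                copy_len = pkt_len;
        memcpy(desc_buf + sizeof(*hdr), pkt, copy_len);

        return sizeof(*hdr) + copy_len;
}

The win, as I understand it, comes from writing the header at the same
time the first chunk of payload is copied, so the same hot cache lines
are touched once instead of twice.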