From: "Wang, Zhihong"
Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
Date: Thu, 13 Oct 2016 06:02:04 +0000
To: Yuanhan Liu
Cc: Jianbo Liu, Thomas Monjalon, Maxime Coquelin, dev@dpdk.org

> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Thursday, October 13, 2016 1:33 PM
> To: Wang, Zhihong
> Cc: Jianbo Liu; Thomas Monjalon; Maxime Coquelin; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> On Wed, Oct 12, 2016 at 12:22:08PM +0000, Wang, Zhihong wrote:
> > > > > 3. What percentage of drop are you seeing?
> > > >
> > > > The test results:
> > > >
> > > > size (bytes)   improvement (%)
> > > > 64               3.92
> > > > 128             11.51
> > > > 256             24.16
> > > > 512            -13.79
> > > > 1024           -22.51
> > > > 1500           -12.22
> > > >
> > > > A correction: performance drops when the packet size is 512 bytes
> > > > or larger.
> > >
> > > I have thought about this twice. Unfortunately, I think I may need to
> > > NACK this series.
> > >
> > > Merging the two code paths into one is really good: as you stated, it
> > > improves maintainability. But only if we see no performance regression
> > > on either path after the refactor. Unfortunately, that's not the case
> > > here: it hurts the performance of one code path (non-mrg Rx).
> > >
> > > That makes me think we should perhaps not do the code path merge at
> > > all. I think that also aligns with what you said before (internally):
> > > we could do the merge only if it gives comparable performance before
> > > and after.
> > >
> > > Besides that, I don't quite like what you did in patch 2 (the enqueue
> > > rewrite): you made a lot of changes in one patch. That means if
> > > something goes wrong, it is hard to narrow down which change introduced
> > > the regression. Sadly, that's exactly what happened here. Weeks have
> > > passed, and I see no progress.
> > >
> > > That's the reason we like the idea of "one patch only does one thing,
> > > an atomic thing".
> >
> > Yuanhan, folks,
> >
> > Thanks for the analysis. I disagree here though.
> >
> > I analyze, develop, and benchmark on x86 platforms, where this patch
> > works great.
>
> Yes, that's a great effort! Thanks to your hard work, we know what the
> bottleneck is and how it can be improved.
>
> However, you don't have to refactor the code (merge the two code paths
> into one) to apply those improvements.
> From what I know, in this patchset there are two factors that could
> improve the performance:
>
> - copying the header together with the packet data
>
> - shadowing the used ring and updating it at once
>
> The overall performance boost I got with your v6 patchset on the mrg-Rx
> code path is about 27% (in the PVP case). I have just applied the 1st
> optimization alone, and it yields about a 20% boost. The rest could be
> covered if we apply the 2nd optimization (I guess).
>
> That would be a clean way to optimize the vhost mergeable Rx path:
>
> - you don't touch the non-mrg Rx path (well, you could probably apply
>   the shadow_used_ring trick to it as well)
>
>   This would at least make sure we see no performance regression like
>   the one reported by the ARM guys.
>
> - you don't refactor the code
>
>   The rewrite from scratch could introduce other issues besides the
>   performance regression. We may just not know about them yet.
>
> Does that make sense to you? If you agree, I think we could still make
> it into this release: these would be small changes after all. For
> example, below is a patch that applies the 1st optimization on top of
> dpdk-next-virtio.

Thanks for this great idea. I think it's a better way to do it.
I'll start making the patch then.
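
To make sure I read the second point (the shadow used ring) correctly:
the idea, as I understand it, is to queue used-ring updates in a private
array and publish them to the shared ring in one pass per burst. Below is
a rough sketch of that idea; the structure and helper names here are
simplified assumptions for illustration, not the actual DPDK code:

#include <stdint.h>

/*
 * Simplified stand-ins for the real structures: the actual code
 * works on struct vhost_virtqueue and its used ring.
 */
struct used_elem {
        uint32_t id;   /* head of the used descriptor chain */
        uint32_t len;  /* total bytes written to that chain */
};

struct vq_sketch {
        struct used_elem *used_ring;   /* shared with the guest */
        volatile uint16_t *used_idx;   /* shared used index */
        uint16_t size;                 /* ring size, power of two */
        uint16_t last_used_idx;
        struct used_elem shadow[256];  /* private per-queue shadow */
        uint16_t shadow_count;
};

/* Record one used-ring update locally; no shared memory is touched. */
static inline void
shadow_used_ring_add(struct vq_sketch *vq, uint32_t id, uint32_t len)
{
        vq->shadow[vq->shadow_count].id  = id;
        vq->shadow[vq->shadow_count].len = len;
        vq->shadow_count++;
}

/*
 * Flush all pending updates in one pass and publish the used index
 * once per burst instead of once per packet, so the shared cache
 * lines bounce between host and guest far less often.
 */
static inline void
shadow_used_ring_flush(struct vq_sketch *vq)
{
        uint16_t i;

        for (i = 0; i < vq->shadow_count; i++) {
                uint16_t slot = (vq->last_used_idx + i) & (vq->size - 1);

                vq->used_ring[slot] = vq->shadow[i];
        }

        /* The real code would need a write barrier (rte_smp_wmb())
         * here before letting the guest observe the new index. */
        vq->last_used_idx += vq->shadow_count;
        *vq->used_idx = vq->last_used_idx;
        vq->shadow_count = 0;
}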
>
> 	--yliu
>
> ---------------------------------------------------------------
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 8a151af..0ddb5af 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -379,7 +379,7 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  			    uint16_t end_idx, struct rte_mbuf *m,
>  			    struct buf_vector *buf_vec)
>  {
> -	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
> +	struct virtio_net_hdr_mrg_rxbuf *virtio_hdr;
>  	uint32_t vec_idx = 0;
>  	uint16_t start_idx = vq->last_used_idx;
>  	uint16_t cur_idx = start_idx;
> @@ -388,6 +388,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	uint32_t desc_offset, desc_avail;
>  	uint32_t cpy_len;
>  	uint16_t desc_idx, used_idx;
> +	uint16_t num_buffers = end_idx - start_idx;
> +	int hdr_copied = 0;
> 
>  	if (unlikely(m == NULL))
>  		return 0;
> @@ -399,16 +401,11 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	if (buf_vec[vec_idx].buf_len < dev->vhost_hlen || !desc_addr)
>  		return 0;
> 
> -	rte_prefetch0((void *)(uintptr_t)desc_addr);
> +	virtio_hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc_addr;
> +	rte_prefetch0(virtio_hdr);
> 
> -	virtio_hdr.num_buffers = end_idx - start_idx;
>  	LOG_DEBUG(VHOST_DATA, "(%d) RX: num merge buffers %d\n",
> -		dev->vid, virtio_hdr.num_buffers);
> -
> -	virtio_enqueue_offload(m, &virtio_hdr.hdr);
> -	copy_virtio_net_hdr(dev, desc_addr, virtio_hdr);
> -	vhost_log_write(dev, buf_vec[vec_idx].buf_addr, dev->vhost_hlen);
> -	PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
> +		dev->vid, num_buffers);
> 
>  	desc_avail = buf_vec[vec_idx].buf_len - dev->vhost_hlen;
>  	desc_offset = dev->vhost_hlen;
> @@ -450,6 +447,15 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  			mbuf_avail = rte_pktmbuf_data_len(m);
>  		}
> 
> +		if (hdr_copied == 0) {
> +			virtio_hdr->num_buffers = num_buffers;
> +			virtio_enqueue_offload(m, &virtio_hdr->hdr);
> +			vhost_log_write(dev, buf_vec[vec_idx].buf_addr, dev->vhost_hlen);
> +			PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
> +
> +			hdr_copied = 1;
> +		}
> +
>  		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
>  		rte_memcpy((void *)((uintptr_t)(desc_addr + desc_offset)),
>  			rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
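
To restate the first optimization in isolation: instead of preparing the
header on the stack and copying it into guest memory as a separate early
step, the header fields are written directly into the descriptor buffer,
right next to where the payload lands. A minimal standalone sketch, with
simplified types and a hypothetical function name (not the actual vhost
code):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct virtio_net_hdr_mrg_rxbuf; offload fields elided. */
struct hdr_sketch {
        uint16_t num_buffers;
};

/*
 * Hypothetical helper: write the header directly into the descriptor
 * buffer and copy the payload right behind it, so the buffer is
 * touched once, front to back, in a single pass.
 */
static size_t
enqueue_one(uint8_t *desc_buf, size_t desc_len,
            const uint8_t *pkt, size_t pkt_len, uint16_t num_buffers)
{
        struct hdr_sketch *hdr = (struct hdr_sketch *)desc_buf;
        size_t copy_len;

        if (desc_len < sizeof(*hdr))
                return 0;

        /* Header written in place, together with the data copy below,
         * instead of being built on the stack and copied separately. */
        hdr->num_buffers = num_buffers;

        copy_len = desc_len - sizeof(*hdr);
        if (copy_len > pkt_len)
                copy_len = pkt_len;
        memcpy(desc_buf + sizeof(*hdr), pkt, copy_len);

        return sizeof(*hdr) + copy_len;
}

The win, as I understand it, comes from writing the header at the same
time the first chunk of payload is copied, so the same hot cache lines
are touched once instead of twice.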