From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yuanhan Liu
Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
Date: Tue, 27 Sep 2016 18:21:23 +0800
Message-ID: <20160927102123.GL25823@yliu-dev.sh.intel.com>
References: <1471319402-112998-1-git-send-email-zhihong.wang@intel.com>
 <1471585430-125925-1-git-send-email-zhihong.wang@intel.com>
 <8F6C2BD409508844A0EFC19955BE09414E7B5581@SHSMSX103.ccr.corp.intel.com>
 <20160922022903.GJ23158@yliu-dev.sh.intel.com>
 <8F6C2BD409508844A0EFC19955BE09414E7B5DAE@SHSMSX103.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="YiEDa0DAkWCtVeE4"
Cc: "Wang, Zhihong" , Maxime Coquelin , "dev@dpdk.org"
To: Jianbo Liu
Return-path:
Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) by dpdk.org
 (Postfix) with ESMTP id 58CC39E7 for ; Tue, 27 Sep 2016 12:20:56 +0200 (CEST)
Content-Disposition: inline
In-Reply-To:
List-Id: patches and discussions about DPDK
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: dev-bounces@dpdk.org
Sender: "dev"

--YiEDa0DAkWCtVeE4
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Thu, Sep 22, 2016 at 05:01:41PM +0800, Jianbo Liu wrote:
> On 22 September 2016 at 14:58, Wang, Zhihong wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> >> Sent: Thursday, September 22, 2016 1:48 PM
> >> To: Yuanhan Liu
> >> Cc: Wang, Zhihong ; Maxime Coquelin
> >> ; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 22 September 2016 at 10:29, Yuanhan Liu
> >> wrote:
> >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> >> > My setup consists of one host running a guest.
> >> >> >> > The guest generates as much 64bytes packets as possible using
> >> >> >>
> >> >> >> Have you tested with other different packet size?
> >> >> >> My testing shows that performance is dropping when packet size is
> >> more
> >> >> >> than 256.
> >> >> >
> >> >> >
> >> >> > Hi Jianbo,
> >> >> >
> >> >> > Thanks for reporting this.
> >> >> >
> >> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> >> >> >
> >> Yes, my testing is mrg_rxbuf=off, but not the vector frontend PMD.
> >>
> >> >> >  2. Could you please specify what CPU you're running? Is it Haswell
> >> >> >     or Ivy Bridge?
> >> >> >
> >> It's an ARM server.
> >>
> >> >> >  3. How many percentage of drop are you seeing?
> >> The testing result:
> >> size (bytes)     improvement (%)
> >> 64               3.92
> >> 128              11.51
> >> 256              24.16
> >> 512              -13.79
> >> 1024             -22.51
> >> 1500             -12.22
> >> A correction: performance is dropping if the packet size is larger than 512.
> >
> >
> > Jianbo,
> >
> > Could you please verify whether this patch really causes the enqueue perf
> > to drop?
> >
> > You can test the enqueue path only by setting the guest to do rxonly, and
> > comparing the mpps from "show port stats all" in the guest.
> >
> >
> Tested with testpmd, host: txonly, guest: rxonly
> size (bytes)     improvement (%)
> 64               4.12
> 128              6
> 256              2.65
> 512              -1.12
> 1024             -7.02

There is one difference between Zhihong's code and the old code that I
spotted right away: Zhihong removed the avail_idx prefetch. I understand
that the prefetch becomes a bit tricky when the mrg-rx code path is
considered; thus, I didn't comment on it at the time. That is, IMO, one
of the differences that could cause a regression.

I finally got a chance to add it back. A rough test shows that it greatly
improves performance for the 1400B packet size in the "txonly in host and
rxonly in guest" case: +33% is the number I get on my test server
(Ivy Bridge).

I guess this might help your case as well. Would you mind running a test
and telling me the results?

BTW, I made this in a rush; I haven't tested the mrg-rx code path yet.

Thanks.
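P.S. In case it makes review easier, here is the batching idea from the
attached diff reduced to a standalone toy. All names (toy_vq, toy_next_avail,
the ring size, the batch size) are made up for illustration; the real code
operates on struct vhost_virtqueue, and it also must not read past
avail->idx, which this sketch omits:

```c
#include <stdint.h>
#include <string.h>

#define RING_SIZE   8		/* must be a power of two */
#define NR_PREFETCH 4		/* avail entries copied per batch */

struct toy_vq {
	uint16_t ring[RING_SIZE];	/* stands in for vq->avail->ring */
	uint16_t last_avail_idx;	/* free-running avail index */
	uint16_t next;			/* cursor into the batch buffer */
	uint16_t buf[NR_PREFETCH];	/* batched copy of avail entries */
};

static void toy_init(struct toy_vq *vq)
{
	memset(vq, 0, sizeof(*vq));
	vq->next = NR_PREFETCH;	/* buffer empty: first use triggers a refill */
}

/* Copy the next NR_PREFETCH avail entries into a small local buffer, so
 * later lookups walk sequential private memory instead of re-reading the
 * shared avail ring one entry at a time. */
static void toy_refill(struct toy_vq *vq)
{
	int i;

	for (i = 0; i < NR_PREFETCH; i++)
		vq->buf[i] = vq->ring[(vq->last_avail_idx + i) &
				      (RING_SIZE - 1)];
}

/* Return the next avail entry, refilling the batch when it runs dry. */
static uint16_t toy_next_avail(struct toy_vq *vq)
{
	if (vq->next >= NR_PREFETCH) {
		toy_refill(vq);
		vq->last_avail_idx += NR_PREFETCH;
		vq->next = 0;
	}
	return vq->buf[vq->next++];
}
```

The trade is one burst of sequential reads every NR_PREFETCH packets
against a dependent load per packet in the hot loop.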
	--yliu

--YiEDa0DAkWCtVeE4
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=diff

commit e5852d04bf87c02d6d0d8e6d8ded4c33030b9c9e
Author: Yuanhan Liu
Date:   Tue Sep 27 17:51:15 2016 +0800

    xxxx

    Signed-off-by: Yuanhan Liu

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 381dc27..41bfeba 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -61,6 +61,8 @@ struct buf_vector {
 	uint32_t desc_idx;
 };
 
+#define NR_AVAIL_IDX_PREFETCH	32
+
 /**
  * Structure contains variables relevant to RX/TX virtqueues.
  */
@@ -70,7 +72,7 @@ struct vhost_virtqueue {
 	struct vring_used	*used;
 	uint32_t		size;
 
-	/* Last index used on the available ring */
+	uint16_t		last_avail_idx;
 	uint16_t		last_used_idx;
 #define VIRTIO_INVALID_EVENTFD		(-1)
 #define VIRTIO_UNINITIALIZED_EVENTFD	(-2)
@@ -89,6 +91,9 @@ struct vhost_virtqueue {
 	/* Shadow used ring for performance */
 	struct vring_used_elem	*shadow_used_ring;
 	uint32_t		shadow_used_idx;
+
+	uint16_t		next_avail_idx;
+	uint16_t		avail_idx_buf[NR_AVAIL_IDX_PREFETCH];
 } __rte_cache_aligned;
 
 /* Old kernels have no such macro defined */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 11a2c1a..1cc22fc 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -170,6 +170,41 @@ flush_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	}
 }
 
+/* Fetch NR_AVAIL_IDX_PREFETCH avail entries at once */
+static void
+prefetch_avail_idx(struct vhost_virtqueue *vq)
+{
+	int i;
+
+	for (i = 0; i < NR_AVAIL_IDX_PREFETCH; i++) {
+		vq->avail_idx_buf[i] = vq->avail->ring[(vq->last_avail_idx + i) &
+					(vq->size - 1)];
+	}
+}
+
+static uint16_t
+next_avail_idx(struct vhost_virtqueue *vq)
+{
+	if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
+		prefetch_avail_idx(vq);
+		vq->next_avail_idx = 0;
+		vq->last_avail_idx += NR_AVAIL_IDX_PREFETCH;
+	}
+
+	return vq->avail_idx_buf[vq->next_avail_idx++];
+}
+
+/*
+ * Just peek, but don't move forward the "next_avail_idx" pointer.
+ * The caller also has to make sure the pointer doesn't go beyond
+ * the array.
+ */
+static uint16_t
+peek_next_avail_idx(struct vhost_virtqueue *vq)
+{
+	return vq->avail_idx_buf[vq->next_avail_idx];
+}
+
 static inline int __attribute__((always_inline))
 enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	       uint16_t avail_idx, struct rte_mbuf *mbuf,
@@ -193,7 +228,7 @@ enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	mbuf_avail = mbuf_len;
 
 	/* get the current desc */
-	desc_current = vq->avail->ring[(vq->last_used_idx) & (vq->size - 1)];
+	desc_current = next_avail_idx(vq);
 	desc_chain_head = desc_current;
 	desc = &vq->desc[desc_current];
 	desc_addr = gpa_to_vva(dev, desc->addr);
@@ -235,9 +270,7 @@ enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			if (avail_idx == vq->last_used_idx)
 				goto error;
 
-			desc_current =
-				vq->avail->ring[(vq->last_used_idx) &
-				(vq->size - 1)];
+			desc_current = next_avail_idx(vq);
 			desc_chain_head = desc_current;
 			desc_chain_len = 0;
 		} else
@@ -298,6 +331,7 @@ notify_guest(struct virtio_net *dev, struct vhost_virtqueue *vq)
 		eventfd_write(vq->callfd, (eventfd_t)1);
 }
 
+
 uint16_t
 rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
 	struct rte_mbuf **pkts, uint16_t count)
@@ -331,14 +365,15 @@ rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
 
 	/* start enqueuing packets 1 by 1 */
 	vq->shadow_used_idx = 0;
+	vq->next_avail_idx = 0;
 	used_idx = vq->last_used_idx & (vq->size - 1);
 	avail_idx = *((volatile uint16_t *)&vq->avail->idx);
 	while (pkt_left && avail_idx != vq->last_used_idx) {
 		/* prefetch the next desc */
-		if (pkt_left > 1 && avail_idx != vq->last_used_idx + 1)
-			rte_prefetch0(&vq->desc[vq->avail->ring[
-					(vq->last_used_idx + 1) &
-					(vq->size - 1)]]);
+		if (pkt_left > 1 &&
+		    vq->next_avail_idx + 1 < NR_AVAIL_IDX_PREFETCH) {
+			rte_prefetch0(&vq->desc[peek_next_avail_idx(vq)]);
+		}
 
 		if (enqueue_packet(dev, vq, avail_idx, pkts[pkt_idx],
 				   is_mrg_rxbuf))
@@ -347,6 +382,7 @@ rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
 		pkt_idx++;
 		pkt_left--;
 	}
+	vq->last_avail_idx += vq->next_avail_idx;
 
 	/* batch update used ring for better performance */
 	if (likely(vq->shadow_used_idx > 0))

--YiEDa0DAkWCtVeE4--