Re: [PATCH v3 0/5] vhost: optimize enqueue

From: "Wang, Zhihong" <zhihong.wang@intel.com>
To: Jianbo Liu <jianbo.liu@linaro.org>,
	Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
Date: Mon, 10 Oct 2016 06:22:33 +0000	[thread overview]
Message-ID: <8F6C2BD409508844A0EFC19955BE09414E7BC1BD@SHSMSX103.ccr.corp.intel.com> (raw)
In-Reply-To: <CAP4Qi3_jk5+=diZhq8V7FecUkZge1_riEq-bopY35n71Q_pANQ@mail.gmail.com>

> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Monday, October 10, 2016 1:32 PM
> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 10 October 2016 at 10:44, Yuanhan Liu <yuanhan.liu@linux.intel.com>
> wrote:
> > On Sun, Oct 09, 2016 at 12:09:07PM +0000, Wang, Zhihong wrote:
> >> > > > Tested with testpmd, host: txonly, guest: rxonly
> >> > > > size (bytes)     improvement (%)
> >> > > > 64                    4.12
> >> > > > 128                   6
> >> > > > 256                   2.65
> >> > > > 512                   -1.12
> >> > > > 1024                 -7.02
> >> > >
> >> > > There is a difference between Zhihong's code and the old I spotted in
> >> > > the first time: Zhihong removed the avail_idx prefetch. I understand
> >> > > the prefetch becomes a bit tricky when mrg-rx code path is
> considered;
> >> > > thus, I didn't comment on that.
> >> > >
> >> > > That's one of the difference that, IMO, could drop a regression. I then
> >> > > finally got a chance to add it back.
> >> > >
> >> > > A rough test shows it improves the performance of 1400B packet size
> >> > greatly
> >> > > in the "txonly in host and rxonly in guest" case: +33% is the number I
> get
> >> > > with my test server (Ivybridge).
> >> >
> >> > Thanks Yuanhan! I'll validate this on x86.
> >>
> >> Hi Yuanhan,
> >>
> >> Seems your code doesn't perform correctly. I write a new version
> >> of avail idx prefetch but didn't see any perf benefit.
> >>
> >> To be honest I doubt the benefit of this idea. The previous mrg_off
> >> code has this method but doesn't give any benefits.
> >
> > Good point. I thought of that before, too. But you know that I made it
> > in rush, that I didn't think further and test more.
> >
> > I looked the code a bit closer this time, and spotted a bug: the prefetch
> > actually didn't happen, due to following code piece:
> >
> >         if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
> >                 prefetch_avail_idx(vq);
> >                 ...
> >         }
> >
> > Since vq->next_avail_idx is set to 0 at the entrance of enqueue path,
> > prefetch_avail_idx() will be called. The fix is easy though: just put
> > prefetch_avail_idx before invoking enqueue_packet.
> >
> > In summary, Zhihong is right, I see no more gains with that fix :(
> >
> > However, as stated, that's kind of the only difference I found between
> > yours and the old code, that maybe it's still worthwhile to have a
> > test on ARM, Jianbo?
> >
> I haven't tested it, but I think it could be no improvement for ARM either.
> 
> A smalll suggestion for enqueue_packet:
> 
> .....
> +       /* start copy from mbuf to desc */
> +       while (mbuf_avail || mbuf->next) {
> .....
> 
> Considering pkt_len is in the first cache line (same as data_len),
> while next pointer is in the second cache line,
> is it better to check the total packet len, instead of the last mbuf's
> next pointer to jump out of while loop and avoid possible cache miss?

Jianbo,

Thanks for the reply!

This idea sounds good, but it won't help the general perf in my
opinion, since the 2nd cache line is accessed anyway prior in
virtio_enqueue_offload.

Also this would bring a NULL check when actually access mbuf->next.

BTW, could you please publish the number of:

 1. mrg_rxbuf=on, comparison between original and original + this patch

 2. mrg_rxbuf=off, comparison between original and original + this patch

So we can have a whole picture of how this patch impact on ARM platform.

Thanks
Zhihong