From: Yuanhan Liu <yuanhan.liu@linux.intel.com>
To: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: cunming.liang@intel.com, jianfeng.tan@intel.com, dev@dpdk.org,
	"Wang, Zhihong" <zhihong.wang@intel.com>,
	"Yao, Lei A" <lei.a.yao@intel.com>
Subject: Re: [RFC PATCH] net/virtio: Align Virtio-net header on cache line in receive path
Date: Mon, 6 Mar 2017 16:46:49 +0800	[thread overview]
Message-ID: <20170306084649.GH18844@yliu-dev.sh.intel.com> (raw)
In-Reply-To: <349f9a71-7407-e45a-4687-a54fe7e778c8@redhat.com>

On Wed, Mar 01, 2017 at 08:36:24AM +0100, Maxime Coquelin wrote:
> 
> 
> On 02/23/2017 06:49 AM, Yuanhan Liu wrote:
> >On Wed, Feb 22, 2017 at 10:36:36AM +0100, Maxime Coquelin wrote:
> >>
> >>
> >>On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
> >>>On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
> >>>>This patch aligns the Virtio-net header on a cache-line boundary to
> >>>>optimize cache utilization, as it puts the Virtio-net header (which
> >>>>is always accessed) on the same cache line as the packet header.
> >>>>
> >>>>For example, with an application that forwards packets at the L2
> >>>>level, a single cache line will be accessed with this patch, instead
> >>>>of two before.
> >>>
> >>>I'm assuming you were testing pkt size <= (64 - hdr_size)?
> >>
> >>No, I tested with 64 bytes packets only.
> >
> >Oh, my bad, I overlooked it. When you said "a single cache line", I
> >thought you meant putting the virtio net hdr and the "whole" packet
> >data in a single cache line, which is not possible for a 64B pkt size.
> >
> >>I ran some more tests this morning with different packet sizes,
> >>and also changed the mbuf size on the guest side to get multi-
> >>buffer packets:
> >>
> >>+-------+--------+--------+-------------------------+
> >>| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> >>+-------+--------+--------+-------------------------+
> >>|    64 |   2048 |  11.05 |                   11.78 |
> >>|   128 |   2048 |  10.66 |                   11.48 |
> >>|   256 |   2048 |  10.47 |                   11.21 |
> >>|   512 |   2048 |  10.22 |                   10.88 |
> >>|  1024 |   2048 |   7.65 |                    7.84 |
> >>|  1500 |   2048 |   6.25 |                    6.45 |
> >>|  2000 |   2048 |   5.31 |                    5.43 |
> >>|  2048 |   2048 |   5.32 |                    4.25 |
> >>|  1500 |    512 |   3.89 |                    3.98 |
> >>|  2048 |    512 |   1.96 |                    2.02 |
> >>+-------+--------+--------+-------------------------+
> >
> >Could you share more info? Say, is it a PVP test? Is mergeable on?
> >What's the fwd mode?
> 
> No, this is not a PVP benchmark; I have neither another server nor a
> packet generator connected back-to-back to my Haswell machine.
> 
> This is a simple micro-benchmark: vhost PMD in txonly mode, Virtio PMD
> in rxonly mode. In this configuration, mergeable is ON and no offloads
> are disabled in the QEMU cmdline.

Okay, I see. So the boost, as you have stated, comes from reducing two
cache-line accesses to one. Before the patch, vhost writes 2 cache lines,
while the virtio PMD reads 2 cache lines: one for the virtio-net header,
and another for the Ethernet header, which is read only to update xstats
(there is no other Ethernet header access in the fwd mode you tested).
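
(To make the above concrete, here is a hedged, self-contained sketch of
why the rxonly path still touches the Ethernet header: the xstats update
classifies each packet as unicast/multicast/broadcast by reading the
destination MAC. The struct and function names below are illustrative,
not the actual virtio PMD code.)

/*
 * Illustrative sketch only (not the actual virtio PMD code): the xstats
 * update reads the destination MAC in the Ethernet header, so with the
 * vnet header on its own cache line this is a second line fetched per
 * packet; with the header sharing a line with the packet start, both
 * reads hit the same line.
 */
#include <stdint.h>
#include <string.h>

struct eth_hdr_sketch {           /* minimal Ethernet header layout */
	uint8_t dst[6];
	uint8_t src[6];
	uint16_t ether_type;
};

struct rx_stats_sketch {          /* hypothetical per-queue counters */
	uint64_t unicast;
	uint64_t multicast;
	uint64_t broadcast;
};

static void
update_rx_stats(struct rx_stats_sketch *st, const void *pkt_data)
{
	const struct eth_hdr_sketch *eh = pkt_data;
	static const uint8_t bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

	if (memcmp(eh->dst, bcast, sizeof(bcast)) == 0)
		st->broadcast++;
	else if (eh->dst[0] & 0x01)   /* group bit set -> multicast */
		st->multicast++;
	else
		st->unicast++;
}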

> That's why I would be interested in more testing on recent hardware
> with PVP benchmark. Is it something that could be run in Intel lab?

I think Yao Lei could help with that? But as stated, I think it may
hurt performance for big packets. And I also wouldn't expect a big
boost even for 64B in a PVP test, given that it's only a 6% boost in
micro-benchmarking.

	--yliu
> 
> I did some more trials, and I think that most of the gain seen in this
> micro-benchmark could in fact come from the vhost side.
> Indeed, I monitored the number of packets dequeued at each .rx_pkt_burst()
> call, and I can see there are packets in the vq only once every 20
> calls. On the vhost side, monitoring shows that it always succeeds in
> writing its bursts, i.e. the vq is never full.
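
(A hedged sketch of the kind of monitoring described above: tallying how
many packets each rx burst call actually returns. All names here are made
up for the example, not vhost/virtio PMD code.)

/* Illustrative only: counting how often an rx burst returns packets,
 * e.g. roughly 1 call in 20 in the test described above. */
#include <stdint.h>

struct burst_monitor {
	uint64_t calls;          /* total .rx_pkt_burst() invocations */
	uint64_t nonempty_calls; /* calls that returned >= 1 packet   */
	uint64_t pkts;           /* total packets dequeued            */
};

static inline void
burst_monitor_update(struct burst_monitor *m, uint16_t nb_rx)
{
	m->calls++;
	if (nb_rx > 0) {
		m->nonempty_calls++;
		m->pkts += nb_rx;
	}
}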
> 
> >>>>In the case of multi-buffer packets, the next segments will be
> >>>>aligned on a cache-line boundary, instead of on a cache-line boundary
> >>>>minus the size of the vnet header as before.
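
(For illustration, a rough sketch of the two header placements being
compared, assuming a 64B cache line and enough mbuf headroom for either
layout; the constants and helper names are made up for the example, this
is not the actual patch.)

#include <stdint.h>

#define CACHE_LINE    64u
#define VNET_HDR_SIZE 12u   /* mergeable vnet hdr size, illustrative */

/* Align 'addr' up to the next cache-line boundary. */
static inline uintptr_t
cl_align(uintptr_t addr)
{
	return (addr + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Old layout: the header ends where the cache-aligned packet data
 * begins, so header and data can sit on different cache lines. */
static inline uintptr_t
hdr_addr_old(uintptr_t buf_start)
{
	return cl_align(buf_start + VNET_HDR_SIZE) - VNET_HDR_SIZE;
}

/* New layout: the header itself starts on a cache-line boundary and
 * the packet data follows immediately, on the same line. */
static inline uintptr_t
hdr_addr_new(uintptr_t buf_start)
{
	return cl_align(buf_start);
}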
> >>>
> >>>The other thing is, this patch always leaves the pkt data cache-
> >>>unaligned for the first packet, which makes Zhihong's memcpy
> >>>optimization (for big packets) useless.
> >>>
> >>>   commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
> >>>   Author: Zhihong Wang <zhihong.wang@intel.com>
> >>>   Date:   Tue Dec 6 20:31:06 2016 -0500
> >>
> >>I did run some loopback tests with large packets also, and I see a
> >>small gain with my patch (fwd io on both ends):
> >>
> >>+-------+--------+--------+-------------------------+
> >>| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> >>+-------+--------+--------+-------------------------+
> >>|  1500 |   2048 |   4.05 |                    4.14 |
> >>+-------+--------+--------+-------------------------+
> >
> >Weird, that basically means Zhihong's patch doesn't work? Could you add
> >one more column here: what's the data when rolling back to before
> >Zhihong's commit?
> 
> I've added this to my ToDo list; don't expect results before next week.
> 
> >>>
> >>>       Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> >>>       Reviewed-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >>>       Tested-by: Lei Yao <lei.a.yao@intel.com>
> >>
> >>Does this need to be cache-line aligned?
> >
> >Nope, the alignment size differs across platforms. AVX512 needs 64B
> >alignment, while AVX2 needs 32B alignment.
> >
> >>I also tried to align the pkt on a 16-byte boundary, basically putting
> >>the header at a HEADROOM + 4 bytes offset, but I didn't measure any
> >>gain on Haswell,
> >
> >The fast rte_memcpy path (when dst & src are well aligned) on Haswell
> >(with AVX2) requires 32B alignment. Even a 16B boundary would still put
> >it into the slow path. From this point of view, the extra pad does not
> >change anything; thus, no gain is expected.
> >
> >>and even a drop on SandyBridge.
> >
> >That's weird: SandyBridge requires 16B alignment, meaning the extra
> >pad should put it into the fast path of rte_memcpy, yet the performance
> >is worse.
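
(Below is a small sketch of the alignment test being discussed, not
rte_memcpy itself: whether both addresses fall on the vector boundary the
platform's fast path expects, i.e. 16B on SandyBridge/SSE, 32B on
Haswell/AVX2, 64B with AVX512. The addresses are hypothetical.)

#include <stdint.h>
#include <stdio.h>

static int
takes_fast_path(uintptr_t dst, uintptr_t src, uintptr_t vec_align)
{
	/* Both pointers must be aligned to the vector width. */
	return ((dst | src) & (vec_align - 1)) == 0;
}

int
main(void)
{
	/* Hypothetical addresses: dst is 16B aligned, src is 64B aligned. */
	uintptr_t dst = 0x100010, src = 0x200000;

	printf("SSE  (16B): %d\n", takes_fast_path(dst, src, 16)); /* 1 */
	printf("AVX2 (32B): %d\n", takes_fast_path(dst, src, 32)); /* 0 */
	return 0;
}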
> 
> Thanks for the info, I will run more tests to explain this.
> 
> Cheers,
> Maxime
> >
> >	--yliu
> >
> >>I understand your point regarding aligned memcpy, but I'm surprised I
> >>don't see its expected superiority with my benchmarks.
> >>Any thoughts?


Thread overview: 11+ messages
2017-02-21 17:32 [RFC PATCH] net/virtio: Align Virtio-net header on cache line in receive path Maxime Coquelin
2017-02-22  1:37 ` Yuanhan Liu
2017-02-22  2:49   ` Yang, Zhiyong
2017-02-22  9:39     ` Maxime Coquelin
2017-02-22  9:36   ` Maxime Coquelin
2017-02-23  5:49     ` Yuanhan Liu
2017-03-01  7:36       ` Maxime Coquelin
2017-03-06  8:46         ` Yuanhan Liu [this message]
2017-03-06 14:11           ` Maxime Coquelin
2017-03-08  6:01             ` Yao, Lei A
2017-03-09 14:38               ` Maxime Coquelin
