From: Yuanhan Liu <yuanhan.liu@linux.intel.com>
To: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: cunming.liang@intel.com, jianfeng.tan@intel.com, dev@dpdk.org,
	"Wang, Zhihong" <zhihong.wang@intel.com>,
	"Yao, Lei A" <lei.a.yao@intel.com>
Subject: Re: [RFC PATCH] net/virtio: Align Virtio-net header on cache line in receive path
Date: Mon, 6 Mar 2017 16:46:49 +0800	[thread overview]
Message-ID: <20170306084649.GH18844@yliu-dev.sh.intel.com> (raw)
In-Reply-To: <349f9a71-7407-e45a-4687-a54fe7e778c8@redhat.com>

On Wed, Mar 01, 2017 at 08:36:24AM +0100, Maxime Coquelin wrote:
> 
> 
> On 02/23/2017 06:49 AM, Yuanhan Liu wrote:
> >On Wed, Feb 22, 2017 at 10:36:36AM +0100, Maxime Coquelin wrote:
> >>
> >>
> >>On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
> >>>On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
> >>>>This patch aligns the Virtio-net header on a cache-line boundary to
> >>>>optimize cache utilization, as it puts the Virtio-net header (which
> >>>>is always accessed) on the same cache line as the packet header.
> >>>>
> >>>>For example, with an application that forwards packets at the L2
> >>>>level, a single cache line will be accessed with this patch, instead
> >>>>of two before.
> >>>
> >>>I'm assuming you were testing pkt size <= (64 - hdr_size)?
> >>
> >>No, I tested with 64 bytes packets only.
> >
> >Oh, my bad, I overlooked it. When you said "a single cache line", I
> >thought you meant putting the virtio net hdr and the "whole" packet
> >data in a single cache line, which is not possible for a 64B pkt size.
> >
> >>I ran some more tests this morning with different packet sizes,
> >>and also changed the mbuf size on the guest side to get multi-
> >>buffer packets:
> >>
> >>+-------+--------+--------+-------------------------+
> >>| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> >>+-------+--------+--------+-------------------------+
> >>|    64 |   2048 |  11.05 |                   11.78 |
> >>|   128 |   2048 |  10.66 |                   11.48 |
> >>|   256 |   2048 |  10.47 |                   11.21 |
> >>|   512 |   2048 |  10.22 |                   10.88 |
> >>|  1024 |   2048 |   7.65 |                    7.84 |
> >>|  1500 |   2048 |   6.25 |                    6.45 |
> >>|  2000 |   2048 |   5.31 |                    5.43 |
> >>|  2048 |   2048 |   5.32 |                    4.25 |
> >>|  1500 |    512 |   3.89 |                    3.98 |
> >>|  2048 |    512 |   1.96 |                    2.02 |
> >>+-------+--------+--------+-------------------------+
> >
> >Could you share more info? Say, is it a PVP test? Is mergeable on?
> >What's the fwd mode?
> 
> No, this is not a PVP benchmark; I have neither another server nor a
> packet generator connected back-to-back to my Haswell machine.
> 
> This is a simple micro-benchmark: vhost PMD in txonly mode, Virtio PMD
> in rxonly mode. In this configuration, mergeable is ON and no offloads
> are disabled in the QEMU cmdline.

Okay, I see. So the boost, as you have stated, comes from reducing two
cache-line accesses to one. Before the patch, vhost writes 2 cache lines,
while the virtio PMD reads 2 cache lines: one for the virtio-net header,
and another for the Ethernet header, which is read only to update xstats
(there is no other Ethernet header access in the fwd mode you tested).
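
(To make the above concrete, here is a hedged, self-contained sketch of
why the rxonly path still touches the Ethernet header: the xstats update
classifies each packet as unicast/multicast/broadcast by reading the
destination MAC. The struct and function names below are illustrative,
not the actual virtio PMD code.)

/*
 * Illustrative sketch only (not the actual virtio PMD code): the xstats
 * update reads the destination MAC in the Ethernet header, so with the
 * vnet header on its own cache line this is a second line fetched per
 * packet; with the header sharing a line with the packet start, both
 * reads hit the same line.
 */
#include <stdint.h>
#include <string.h>

struct eth_hdr_sketch {           /* minimal Ethernet header layout */
	uint8_t dst[6];
	uint8_t src[6];
	uint16_t ether_type;
};

struct rx_stats_sketch {          /* hypothetical per-queue counters */
	uint64_t unicast;
	uint64_t multicast;
	uint64_t broadcast;
};

static void
update_rx_stats(struct rx_stats_sketch *st, const void *pkt_data)
{
	const struct eth_hdr_sketch *eh = pkt_data;
	static const uint8_t bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

	if (memcmp(eh->dst, bcast, sizeof(bcast)) == 0)
		st->broadcast++;
	else if (eh->dst[0] & 0x01)   /* group bit set -> multicast */
		st->multicast++;
	else
		st->unicast++;
}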

> That's why I would be interested in more testing on recent hardware
> with PVP benchmark. Is it something that could be run in Intel lab?

I think Yao Lei could help with that? But as stated, I think it may
hurt performance for big packets. And I also wouldn't expect a big
boost even for 64B in a PVP test, given that it's only a 6% boost in
micro-benchmarking.

	--yliu
> 
> I did some more trials, and I think that most of the gain seen in this
> micro-benchmark could in fact come from the vhost side.
> Indeed, I monitored the number of packets dequeued at each .rx_pkt_burst()
> call, and I can see there are packets in the vq only once every 20
> calls. On the vhost side, monitoring shows that it always succeeds in
> writing its bursts, i.e. the vq is never full.
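
(A hedged sketch of the kind of monitoring described above: tallying how
many packets each rx burst call actually returns. All names here are made
up for the example, not vhost/virtio PMD code.)

/* Illustrative only: counting how often an rx burst returns packets,
 * e.g. roughly 1 call in 20 in the test described above. */
#include <stdint.h>

struct burst_monitor {
	uint64_t calls;          /* total .rx_pkt_burst() invocations */
	uint64_t nonempty_calls; /* calls that returned >= 1 packet   */
	uint64_t pkts;           /* total packets dequeued            */
};

static inline void
burst_monitor_update(struct burst_monitor *m, uint16_t nb_rx)
{
	m->calls++;
	if (nb_rx > 0) {
		m->nonempty_calls++;
		m->pkts += nb_rx;
	}
}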
> 
> >>>>In the case of multi-buffer packets, the next segments will be
> >>>>aligned on a cache-line boundary, instead of on a cache-line boundary
> >>>>minus the size of the vnet header as before.
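
(For illustration, a rough sketch of the two header placements being
compared, assuming a 64B cache line and enough mbuf headroom for either
layout; the constants and helper names are made up for the example, this
is not the actual patch.)

#include <stdint.h>

#define CACHE_LINE    64u
#define VNET_HDR_SIZE 12u   /* mergeable vnet hdr size, illustrative */

/* Align 'addr' up to the next cache-line boundary. */
static inline uintptr_t
cl_align(uintptr_t addr)
{
	return (addr + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Old layout: the header ends where the cache-aligned packet data
 * begins, so header and data can sit on different cache lines. */
static inline uintptr_t
hdr_addr_old(uintptr_t buf_start)
{
	return cl_align(buf_start + VNET_HDR_SIZE) - VNET_HDR_SIZE;
}

/* New layout: the header itself starts on a cache-line boundary and
 * the packet data follows immediately, on the same line. */
static inline uintptr_t
hdr_addr_new(uintptr_t buf_start)
{
	return cl_align(buf_start);
}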
> >>>
> >>>The other thing is, this patch always leaves the pkt data cache-
> >>>unaligned for the first packet, which makes Zhihong's memcpy
> >>>optimization (for big packets) useless.
> >>>
> >>>   commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
> >>>   Author: Zhihong Wang <zhihong.wang@intel.com>
> >>>   Date:   Tue Dec 6 20:31:06 2016 -0500
> >>
> >>I did run some loopback tests with large packets also, and I see a
> >>small gain with my patch (fwd io on both ends):
> >>
> >>+-------+--------+--------+-------------------------+
> >>| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> >>+-------+--------+--------+-------------------------+
> >>|  1500 |   2048 |   4.05 |                    4.14 |
> >>+-------+--------+--------+-------------------------+
> >
> >Weird, that basically means Zhihong's patch doesn't work? Could you add
> >one more column here: what's the data when rolling back to before
> >Zhihong's commit?
> 
> I've added this to my ToDo list; don't expect results before next week.
> 
> >>>
> >>>       Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> >>>       Reviewed-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >>>       Tested-by: Lei Yao <lei.a.yao@intel.com>
> >>
> >>Does this need to be cache-line aligned?
> >
> >Nope, the alignment size differs across platforms. AVX512 needs 64B
> >alignment, while AVX2 needs 32B alignment.
> >
> >>I also tried to align the pkt on a 16-byte boundary, basically putting
> >>the header at a HEADROOM + 4 bytes offset, but I didn't measure any
> >>gain on Haswell,
> >
> >The fast rte_memcpy path (when dst & src are well aligned) on Haswell
> >(with AVX2) requires 32B alignment. Even a 16B boundary would still put
> >it into the slow path. From this point of view, the extra pad does not
> >change anything; thus, no gain is expected.
> >
> >>and even a drop on SandyBridge.
> >
> >That's weird: SandyBridge requires 16B alignment, meaning the extra
> >pad should put it into the fast path of rte_memcpy, yet the performance
> >is worse.
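
(Below is a small sketch of the alignment test being discussed, not
rte_memcpy itself: whether both addresses fall on the vector boundary the
platform's fast path expects, i.e. 16B on SandyBridge/SSE, 32B on
Haswell/AVX2, 64B with AVX512. The addresses are hypothetical.)

#include <stdint.h>
#include <stdio.h>

static int
takes_fast_path(uintptr_t dst, uintptr_t src, uintptr_t vec_align)
{
	/* Both pointers must be aligned to the vector width. */
	return ((dst | src) & (vec_align - 1)) == 0;
}

int
main(void)
{
	/* Hypothetical addresses: dst is 16B aligned, src is 64B aligned. */
	uintptr_t dst = 0x100010, src = 0x200000;

	printf("SSE  (16B): %d\n", takes_fast_path(dst, src, 16)); /* 1 */
	printf("AVX2 (32B): %d\n", takes_fast_path(dst, src, 32)); /* 0 */
	return 0;
}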
> 
> Thanks for the info, I will run more tests to explain this.
> 
> Cheers,
> Maxime
> >
> >	--yliu
> >
> >>I understand your point regarding aligned memcpy, but I'm surprised I
> >>don't see its expected superiority with my benchmarks.
> >>Any thoughts?


Thread overview: 11+ messages
2017-02-21 17:32 [RFC PATCH] net/virtio: Align Virtio-net header on cache line in receive path Maxime Coquelin
2017-02-22  1:37 ` Yuanhan Liu
2017-02-22  2:49   ` Yang, Zhiyong
2017-02-22  9:39     ` Maxime Coquelin
2017-02-22  9:36   ` Maxime Coquelin
2017-02-23  5:49     ` Yuanhan Liu
2017-03-01  7:36       ` Maxime Coquelin
2017-03-06  8:46         ` Yuanhan Liu [this message]
2017-03-06 14:11           ` Maxime Coquelin
2017-03-08  6:01             ` Yao, Lei A
2017-03-09 14:38               ` Maxime Coquelin
