From: Yunsheng Lin <linyunsheng@huawei.com>
To: David Ahern <dsahern@gmail.com>, <davem@davemloft.net>,
Cc: <alexander.duyck@gmail.com>, <linux@armlinux.org.uk>,
	<mw@semihalf.com>, <linuxarm@openeuler.org>,
	<yisen.zhuang@huawei.com>, <salil.mehta@huawei.com>,
	<thomas.petazzoni@bootlin.com>, <hawk@kernel.org>,
	<ilias.apalodimas@linaro.org>, <ast@kernel.org>,
	<daniel@iogearbox.net>, <john.fastabend@gmail.com>,
	<akpm@linux-foundation.org>, <peterz@infradead.org>,
	<will@kernel.org>, <willy@infradead.org>, <vbabka@suse.cz>,
	<fenghua.yu@intel.com>, <guro@fb.com>, <peterx@redhat.com>,
	<feng.tang@intel.com>, <jgg@ziepe.ca>, <mcroce@microsoft.com>,
	<hughd@google.com>, <jonathan.lemon@gmail.com>, <alobakin@pm.me>,
	<willemb@google.com>, <wenxu@ucloud.cn>,
	<cong.wang@bytedance.com>, <haokexin@gmail.com>,
	<nogikh@google.com>, <elver@google.com>, <yhs@fb.com>,
	<kpsingh@kernel.org>, <andrii@kernel.org>, <kafai@fb.com>,
	<songliubraving@fb.com>, <netdev@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <bpf@vger.kernel.org>,
	<chenhao288@hisilicon.com>, <edumazet@google.com>,
	<yoshfuji@linux-ipv6.org>, <dsahern@kernel.org>,
	<memxor@gmail.com>, <linux@rempel-privat.de>,
	<atenart@kernel.org>, <weiwan@google.com>, <ap420073@gmail.com>,
	<arnd@arndb.de>, <mathew.j.martineau@linux.intel.com>,
	<aahringo@redhat.com>, <ceggers@arri.de>, <yangbo.lu@nxp.com>,
	<fw@strlen.de>, <xiangxia.m.yue@gmail.com>,
Subject: Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
Date: Tue, 24 Aug 2021 16:41:34 +0800	[thread overview]
Message-ID: <71ee12f2-8b71-923d-f993-ad4a43b9802d@huawei.com> (raw)
In-Reply-To: <80701f7a-e7c6-eb86-4018-67033f0823bf@gmail.com>

On 2021/8/24 11:34, David Ahern wrote:
> On 8/22/21 9:32 PM, Yunsheng Lin wrote:
>> I assumed the "either Rx or Tx is cpu bound" meant either Rx or Tx is the
>> bottleneck?
> yes.
>> It seems iperf3 supports Tx ZC. I retested using iperf3; the Rx settings
>> were not changed when testing, and the MTU is 1500:
> -Z == sendfile API. That works fine to a point and that point is well
> below 100G.
>> IOMMU in strict mode:
>> 1. Tx ZC case:
>>    22Gbit with Tx being bottleneck(cpu bound)
>> 2. Tx non-ZC case with pfrag pool enabled:
>>    40Gbit with Rx being bottleneck(cpu bound)
>> 3. Tx non-ZC case with pfrag pool disabled:
>>    30Gbit, the bottleneck seems not to be cpu bound, as neither Rx nor Tx
>>    has a single CPU reaching about 100% usage.
>>> At 1500 MTU lowering CPU usage on the Tx side does not accomplish much
>>> on throughput since the Rx is 100% cpu.
>> As the above performance data shows, enabling ZC does not seem to help when
>> IOMMU is involved: it has about 30% performance degradation compared to the
>> pfrag pool disabled case and 50% compared to the pfrag pool enabled case.
> In a past response you showed numbers for Tx ZC API with a custom
> program. That program showed the dramatic reduction in CPU cycles for Tx
> with the ZC API.
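
For reference, the percentages in the quoted IOMMU strict mode results can be
rechecked with a small Python sketch. The throughput figures are the ones
quoted above; the helper name degrade_pct is made up for illustration only.

```python
# Throughput in Gbit/s from the IOMMU strict mode results quoted above.
TX_ZC = 22.0             # Tx ZC case
NON_ZC_PFRAG_ON = 40.0   # Tx non-ZC case, pfrag pool enabled
NON_ZC_PFRAG_OFF = 30.0  # Tx non-ZC case, pfrag pool disabled


def degrade_pct(baseline: float, measured: float) -> float:
    """Relative throughput loss of `measured` against `baseline`, in percent."""
    return (baseline - measured) / baseline * 100.0


if __name__ == "__main__":
    off = degrade_pct(NON_ZC_PFRAG_OFF, TX_ZC)
    on = degrade_pct(NON_ZC_PFRAG_ON, TX_ZC)
    print(f"ZC vs non-ZC, pfrag pool disabled: {off:.1f}%")
    print(f"ZC vs non-ZC, pfrag pool enabled:  {on:.1f}%")
```

This gives about 27% and 45%, which the quoted text rounds to 30% and 50%.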

I deduced the CPU usage from the cycles reported by "perf stat -e cycles XX",
which does not seem to include the cycles for NAPI polling. NAPI polling does
the Tx clean (including DMA unmapping) and does not run on the same CPU as
msg_zerocopy.

I retested it using msg_zerocopy:

             msg_zerocopy CPU usage    NAPI polling CPU usage
ZC:                  23%                        70%
non-ZC:              50%                        40%

So it seems to match now, sorry for the confusion.
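
To make the comparison explicit, here is a small sketch that adds up the two
columns of the table above. Adding usage percentages from two different CPUs
is only a rough proxy for the total Tx-side cost, and the names in the code
are made up for illustration.

```python
# CPU usage (percent) from the msg_zerocopy retest table above.
USAGE = {
    "ZC": {"msg_zerocopy": 23, "napi_polling": 70},
    "non-ZC": {"msg_zerocopy": 50, "napi_polling": 40},
}


def combined_usage(case: str) -> int:
    """Sum of the msg_zerocopy CPU and NAPI polling CPU usage for one case."""
    return sum(USAGE[case].values())


if __name__ == "__main__":
    for case in USAGE:
        print(f"{case}: combined {combined_usage(case)}%")
```

With these numbers ZC roughly halves the msg_zerocopy CPU usage, but the
combined figure stays about the same (93% against 90%) because the NAPI
polling side gets more expensive.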

>>> At 3300 MTU you have ~47% the pps for the same throughput. Lower pps
>>> reduces Rx processing and lower CPU to process the incoming stream. Then
>>> using the Tx ZC API you lower the Tx overhead allowing a single stream
>>> to go faster - sending more data which in the end results in much higher
>>> pps and throughput. At the limit you are CPU bound (both ends in my
>>> testing as Rx side approaches the max pps, and Tx side as it continually
>>> tries to send data).
>>> Lowering CPU usage on the Tx side is a win regardless of whether there
>>> is a big increase on the throughput at 1500 MTU since that configuration
>>> is an Rx CPU bound problem. Hence, my point that we have a good start
>>> point for lowering CPU usage on the Tx side; we should improve it rather
>>> than add per-socket page pools.
>> Actually it is not per-socket page pools; the page pool is still per
>> NAPI. This patchset adds multiple allocation contexts to the page pool, so
>> that Tx can reuse the same page pool as Rx, which is quite useful if
>> ARFS is enabled.
>>> You can stress the Tx side and emphasize its overhead by modifying the
>>> receiver to drop the data on Rx rather than copy to userspace which is a
>>> huge bottleneck (e.g., MSG_TRUNC on recv). This allows the single flow
>> As the frag page is supported in page pool for Rx, the Rx probably is not
>> a bottleneck any more, at least not for IOMMU in strict mode.
>> It seems iperf3 does not support MSG_TRUNC yet, any testing tool supporting
>> MSG_TRUNC? Or do I have to hack the kernel or iperf3 tool to do that?
> https://github.com/dsahern/iperf, mods branch
> --zc_api is the Tx ZC API; --rx_drop adds MSG_TRUNC to recv.

Thanks for sharing the tool.
I retested using the above iperf, and the result is similar to the previous
result.
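
For anyone without that iperf branch, the receive-side drop mentioned above
(MSG_TRUNC on recv) can be sketched in a few lines of Python. This is not how
iperf implements --rx_drop; it only shows the flag on a loopback TCP socket,
assuming Linux, where tcp(7) documents that MSG_TRUNC discards the received
bytes instead of copying them to the caller. The function names are made up.

```python
import socket
import threading


def drain_with_msg_trunc(conn: socket.socket, chunk: int = 65536) -> int:
    """Read until EOF, asking the kernel to discard the payload (Linux TCP)."""
    total = 0
    while True:
        # With MSG_TRUNC on a Linux TCP socket the bytes are dropped in the
        # kernel; only the length of the returned object is meaningful here.
        data = conn.recv(chunk, socket.MSG_TRUNC)
        if not data:
            return total
        total += len(data)


def run_demo(payload_len: int = 1 << 20) -> int:
    """Send payload_len bytes over loopback and return how many were drained."""
    with socket.create_server(("127.0.0.1", 0)) as srv:
        port = srv.getsockname()[1]

        def sender() -> None:
            with socket.create_connection(("127.0.0.1", port)) as tx:
                tx.sendall(b"\0" * payload_len)

        t = threading.Thread(target=sender)
        t.start()
        conn, _ = srv.accept()
        with conn:
            total = drain_with_msg_trunc(conn)
        t.join()
    return total


if __name__ == "__main__":
    print(run_demo())
```

The receiver still counts every byte, so throughput can be measured, but the
copy to userspace is skipped, which is what moves the bottleneck towards Tx.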

>>> stream to go faster and emphasize Tx bottlenecks as the pps at 3300
>>> approaches the top pps at 1500. e.g., doing this with iperf3 shows the
>>> spinlock overhead with tcp_sendmsg, overhead related to 'select' and
>>> then gup_pgd_range.
>> When IOMMU is in strict mode, the IOMMU overhead seems to be much
>> bigger than the spinlock overhead (23% versus 10%).
>> Anyway, I still think ZC mostly benefits packets bigger than a
>> specific size and the IOMMU disabled case.
>>> .
> .
