bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jesper Dangaard Brouer <jbrouer@redhat.com>
To: Lorenzo Bianconi <lorenzo@kernel.org>,
	Alexander Duyck <alexander.duyck@gmail.com>
Cc: brouer@redhat.com, Jesper Dangaard Brouer <jbrouer@redhat.com>,
	Yunsheng Lin <linyunsheng@huawei.com>,
	davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	Eric Dumazet <edumazet@google.com>,
	Maryam Tahhan <mtahhan@redhat.com>, bpf <bpf@vger.kernel.org>
Subject: Re: [PATCH net-next v3 3/4] page_pool: introduce page_pool_alloc() API
Date: Wed, 21 Jun 2023 13:55:00 +0200	[thread overview]
Message-ID: <3e6c191a-3be3-d6ff-92a2-2685bade2e66@redhat.com> (raw)
In-Reply-To: <ZJIXSyjxPf7FQQKo@lore-rh-laptop>



On 20/06/2023 23.16, Lorenzo Bianconi wrote:
> [...]
> 
>>> I did some experiments using page_frag_cache/page_frag_alloc() instead of
>>> page_pools in a simple environment I used to test XDP for veth driver.
>>> In particular, I allocate a new buffer in veth_convert_skb_to_xdp_buff() from
>>> the page_frag_cache in order to copy the full skb in the new one, actually
>>> "linearizing" the packet (since we know the original skb length).
>>> I run an iperf TCP connection over a veth pair where the
>>> remote device runs the xdp_rxq_info sample (available in the kernel source
>>> tree, with action XDP_PASS):
>>>
>>> TCP clietn -- v0 === v1 (xdp_rxq_info) -- TCP server
>>>
>>> net-next (page_pool):
>>> - MTU 1500B: ~  7.5 Gbps
>>> - MTU 8000B: ~ 15.3 Gbps
>>>
>>> net-next + page_frag_alloc:
>>> - MTU 1500B: ~  8.4 Gbps
>>> - MTU 8000B: ~ 14.7 Gbps
>>>
>>> It seems there is no a clear "win" situation here (at least in this environment
>>> and we this simple approach). Moreover:
>>
>> For the 1500B packets it is a win, but for 8000B it looks like there
>> is a regression. Any idea what is causing it?
> 
> nope, I have not looked into it yet.
> 

I think I can explain via using micro-benchmark numbers.
(Lorenzo and I have discussed this over IRC, so this is our summary)

*** MTU 1500***

* The MTU 1500 case, where page_frag_alloc is faster than PP (page_pool):

The PP alloc a 4K page for MTU 1500. The cost of alloc + recycle via
ptr_ring cost 48 cycles (page_pool02_ptr_ring Per elem: 48 cycles(tsc)).

The page_frag_alloc API allocates a 32KB order-3 page, and chops it up
for packets.  The order-3 alloc + free cost 514 cycles (page_bench01:
alloc_pages order:3(32768B) 514 cycles). The MTU 1500 needs alloc size
1514+320+256 = 2090 bytes.  In 32KB we can fit 15 packets.  Thus, the
amortized cost per packet is only 34.3 cycles (514/15).

Thus, this explains why page_frag_alloc API have an advantage here, as
amortized cost per packet is lower (for page_frag_alloc).


*** MTU 8000 ***

* The MTU 8000 case, where PP is faster than page_frag_alloc.

The page_frag_alloc API cannot slice the same 32KB into as many packets.
The MTU 8000 needs alloc size 8000+14+256+320 = 8590 bytes.  This is can
only store 3 full packets (32768/8590 = 3.81).
Thus, cost is 514/3 = 171 cycles.

The PP is actually challenged at MTU 8000, because it unfortunately
leads to allocating 3 full pages (12KiB), due to needed alloc size 8590
bytes. Thus cost is 3x 48 cycles = 144 cycles.
(There is also a chance of Jakubs "allow_direct" optimization in 
page_pool_return_skb_page to increase performance for PP).

Thus, this explains why PP is fastest in this case.


*** Surprising insights ***

My (maybe) surprising conclusion is that we should combine the two
approaches.  Which is basically what Lin's patchset is doing!
Thus, I'm actually suddenly become a fan of this patchset...

The insight is that PP can also work with higher-order pages and the
cost of PP recycles via ptr_ring will be the same, regardless of page
order size.  Thus, we can reduced the order-3 cost 514 cycles to
basically 48 cycles, and fit 15 packets (MTU 1500) resulting is
amortized allocator cost 48/15 = 3.2 cycles.

On the PP alloc-side this will be amazingly fast. When PP recycles frags
side, see page_pool_defrag_page() there is an atomic_sub operation.
I've measured atomic_inc to cost 17 cycles (for optimal non-contended
case), thus 3+17 = 20 cycles, it should still be a win.


--Jesper


  reply	other threads:[~2023-06-21 11:55 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-09 13:17 [PATCH net-next v3 0/4] introduce page_pool_alloc() API Yunsheng Lin
     [not found] ` <20230609131740.7496-4-linyunsheng@huawei.com>
     [not found]   ` <CAKgT0UfVwQ=ri7ZDNnsATH2RQpEz+zDBBb6YprvniMEWGdw+dQ@mail.gmail.com>
     [not found]     ` <36366741-8df2-1137-0dd9-d498d0f770e4@huawei.com>
     [not found]       ` <CAKgT0UdXTSv1fDHBX4UC6Ok9NXKMJ_9F88CEv5TK+mpzy0N21g@mail.gmail.com>
     [not found]         ` <c06f6f59-6c35-4944-8f7a-7f6f0e076649@huawei.com>
     [not found]           ` <CAKgT0UccmDe+CE6=zDYQHi1=3vXf5MptzDo+BsPrKdmP5j9kgQ@mail.gmail.com>
2023-06-15 16:19             ` [PATCH net-next v3 3/4] page_pool: " Jesper Dangaard Brouer
2023-06-16 11:57               ` Yunsheng Lin
2023-06-16 16:31                 ` Jesper Dangaard Brouer
2023-06-16 17:34                   ` Alexander Duyck
2023-06-16 18:41                     ` Jesper Dangaard Brouer
2023-06-16 18:47                       ` Jakub Kicinski
2023-06-16 19:50                       ` Alexander Duyck
2023-06-18 15:05                         ` Lorenzo Bianconi
2023-06-20 16:19                           ` Alexander Duyck
2023-06-20 21:16                             ` Lorenzo Bianconi
2023-06-21 11:55                               ` Jesper Dangaard Brouer [this message]
2023-06-24 14:44                                 ` Yunsheng Lin
2023-06-17 12:47                     ` Yunsheng Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3e6c191a-3be3-d6ff-92a2-2685bade2e66@redhat.com \
    --to=jbrouer@redhat.com \
    --cc=alexander.duyck@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=brouer@redhat.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=ilias.apalodimas@linaro.org \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linyunsheng@huawei.com \
    --cc=lorenzo@kernel.org \
    --cc=mtahhan@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).