archive mirror
 help / color / mirror / Atom feed
From: Yunsheng Lin <>
To: Alexander Duyck <>,
	Yunsheng Lin <>
Cc: David Miller <>,
	Jakub Kicinski <>,
	Russell King - ARM Linux <>,
	Marcin Wojtas <>, <>,
	<>, "Salil Mehta" <>,
	<>, <>,
	Ilias Apalodimas <>,
	"Alexei Starovoitov" <>,
	Daniel Borkmann <>,
	"John Fastabend" <>,
	Andrew Morton <>,
	Peter Zijlstra <>,
	"Will Deacon" <>,
	Matthew Wilcox <>,
	"Vlastimil Babka" <>, <>,
	<>, Peter Xu <>,
	Feng Tang <>, Jason Gunthorpe <>,
	Matteo Croce <>,
	Hugh Dickins <>,
	Jonathan Lemon <>,
	"Alexander Lobakin" <>,
	Willem de Bruijn <>, <>,
	Cong Wang <>,
	Kevin Hao <>, <>,
	Marco Elver <>, Yonghong Song <>,
	<>, <>,
	"Martin KaFai Lau" <>, <>,
	Netdev <>,
	LKML <>, bpf <>
Subject: Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
Date: Tue, 27 Jul 2021 15:54:07 +0800	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On 2021/7/26 0:49, Alexander Duyck wrote:
> On Sat, Jul 24, 2021 at 6:07 AM Yunsheng Lin <> wrote:
>> On Fri, Jul 23, 2021 at 09:08:00AM -0700, Alexander Duyck wrote:
>>> On Fri, Jul 23, 2021 at 4:12 AM Yunsheng Lin <> wrote:
>>>> On 2021/7/22 23:18, Alexander Duyck wrote:
>>>>>>>> You are right that that may cover up the reference count errors. How about
>>>>>>>> something like below:
>>>>>>>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>>>>>>>                                                           long nr)
>>>>>>>> {
>>>>>>>> #ifdef CONFIG_DEBUG_PAGE_REF
>>>>>>>>         long ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>>>         WARN_ON(ret < 0);
>>>>>>>>         return ret;
>>>>>>>> #else
>>>>>>>>         if (atomic_long_read(&page->pp_frag_count) == nr)
>>>>>>>>                 return 0;
>>>>>>>>         return atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>>> #end
>>>>>>>> }
>>>>>>>> Or any better suggestion?
>>>>>>> So the one thing I might change would be to make it so that you only
>>>>>>> do the atomic_long_read if nr is a constant via __builtin_constant_p.
>>>>>>> That way you would be performing the comparison in
>>>>>>> __page_pool_put_page and in the cases of freeing or draining the
>>>>>>> page_frags you would be using the atomic_long_sub_return which should
>>>>>>> be paths where you would not expect it to match or that are slowpath
>>>>>>> anyway.
>>>>>>> Also I would keep the WARN_ON in both paths just to be on the safe side.
>>>>>> If I understand it correctly, we should change it as below, right?
>>>>>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>>>>>                                                           long nr)
>>>>>> {
>>>>>>         long ret;
>>>>>>         /* As suggested by Alexander, atomic_long_read() may cover up the
>>>>>>          * reference count errors, so avoid calling atomic_long_read() in
>>>>>>          * the cases of freeing or draining the page_frags, where we would
>>>>>>          * not expect it to match or that are slowpath anyway.
>>>>>>          */
>>>>>>         if (__builtin_constant_p(nr) &&
>>>>>>             atomic_long_read(&page->pp_frag_count) == nr)
>>>>>>                 return 0;
>>>>>>         ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>         WARN_ON(ret < 0);
>>>>>>         return ret;
>>>>>> }
>>>>> Yes, that is what I had in mind.
>>>>> One thought I had for a future optimization is that we could look at
>>>>> reducing the count by 1 so that we could essentially combine the
>>>>> non-frag and frag cases.Then instead of testing for 1 we would test
>>>>> for 0 at thee start of the function and test for < 0 to decide if we
>>>>> want to free it or not instead of testing for 0. With that we can
>>>>> essentially reduce the calls to the WARN_ON since we should only have
>>>>> one case where we actually return a value < 0, and we can then check
>>>>> to see if we overshot -1 which would be the WARN_ON case.
>>>>> With that a value of 0 instead of 1 would indicate page frag is not in
>>>>> use for the page *AND/OR* that the page has reached the state where
>>>>> there are no other frags present so the page can be recycled. In
>>>>> effect it would allow us to mix page frags and no frags within the
>>>>> same pool. The added bonus would be we could get rid of the check for
>>>>> PP_FLAG_PAGE_FRAG flag check in the __page_pool_put_page function and
>>>>> replace it with a check for PAGE_POOL_DMA_USE_PP_FRAG_COUNT since we
>>>>> cannot read frag_count in that case.
>>>> Let's leave it for a future optimization.
>>>> I am not sure if there is use case to support both frag page and non-frag
>>>> page for the same page pool. If there is, maybe we can use "page->pp_frag_count
>>>>> 0" to indicate that the page is frag page, and "page->pp_frag_count == 0"
>>>> to indicate that the page is non-frag page, so that we can support frag page and
>>>> non-frag page for the same page pool instead of disabling non-frag page support
>>>> when PP_FLAG_PAGE_FRAG flag is set, which might be conflit with the above
>>>> optimization?
>>> As far as use case I can see a number of potential uses. For example
>>> in the case of drivers that do something like a header/data split I
>>> could see potentially having the header pages be frags while the data
>>> pages being 4K blocks. Basically the big optimization of the count ==
>>> 1/0/nr case is that you aren't increasing/decreasing the count and it
>>> is immediately being recycled/reused. So in such a case being able to
>>> add frag count some pages, and not to others would likely be quite
>>> useful.
>> I am not sure how the header/data split is implemented in hw, but it
>> seems the driver is not able to tell which desc will be filled with
>> header or data in advance, so it might need to allocate 4K block for
>> all desc?
> It all depends on the hardware config. In theory you could have
> anything from a single use for a page to multiple uses for a page in
> the case of headers and/or packets being small. The overhead for
> adding/removing the frag count could end up being more than what is
> needed if the page is only used once. That is why I was thinking it
> might make sense to allow both to coexist in the same pool.

I am agreed that there may be usecase of using both frag page and non-frag
page of the same page pool. Let's leave it for now.

>>> Basically by shifting the pool values by 1 you can have both in the
>>> same pool with little issue. However the big change is that instead of
>>> testing for count = nr it would end up being pp_frag_count = nr - 1.
>>> So in the case of the standard page pool pages being freed or the last
>>> frag you would be looking at pp_frag_count = 0. In addition we can
>>> mask the WARN_ON overhead as you would be using -1 as the point to
>>> free so you would only have to perform the WARN_ON check for the last
>>> frag instead of every frag.
>> Yes, it seems doable.
>>>> Also, I am prototyping the tx recycling based on page pool in order to see
>>>> if there is any value supporting the tx recycling.
>>> Just to clarify here when you say Tx recycling you are talking about
>>> socket to netdev correct? Just want to be certain since the netdev to
>>> netdev case should already have recycling for page pool pages as long
>>> as it follows a 1<->1 path.
>> Yes, the above Tx recycling meant socket to netdev.
>> Also, the above "netdev to netdev" only meant XDP now, but not the IP
>> forwarding path in the network stack, right?
>>>> As the busypoll has enable the one-to-one relation between NAPI and sock,
>>>> and there is one-to-one relation between NAPI and page pool, perhaps it make
>>>> senses that we use page pool to recycle the tx page too?
>>>> There are possibly below problems when doing that as I am aware of now:
>>>> 1. busypoll is for rx, and tx may not be using the same queue as rx even if
>>>>    there are *technically* the same flow, so I am not sure it is ok to use
>>>>    busypoll infrastructure to get the page pool ptr for a specific sock.
>>>> 2. There may be multi socks using the same page pool ptr to allocate page for
>>>>    multi flow, so we can not assume the same NAPI polling protection as rx,
>>>>    which might mean we can only use the recyclable page from pool->ring under the
>>>>    r->consumer_lock protection.
>>>> 3. Right now tcp_sendmsg_locked() use sk_page_frag_refill() to refill the page
>>>>    frag for tcp xmit, when implementing a similar sk_page_pool_frag_refill()
>>>>    based on page pool, I found that tcp coalesce in tcp_mtu_probe() and
>>>>    tcp fragment in tso_fragment() might mess with the page_ref_count directly.
>>>> As the above the problem I am aware of(I believe there are other problems I am not
>>>> aware of yet), I am not sure if the tcp tx page recycling based on page pool is
>>>> doable or not, I would like to hear about your opinion about tcp tx recycling support
>>>> based on page pool first, in case it is a dead end to support that.
>>> I'm honestly not sure there is much there to gain. Last I knew TCP was
>>> using order 3 pages for transmitting and as a result the overhead for
>>> the pages should already be greatly reduced. In addition one of the
>>> main reasons for page_pool  is the fact that the device has to DMA map
>>> the page and that can have very high overhead on systems with an
>>> IOMMU.
>> Yes, avoiding the IOMMU overhead is the main gain. and "order 3 pages"
>> seems to be disabled on defaut?
>>> Rather than trying to reuse the devices page pool it might make more
>>> sense to see if you couldn't have TCP just use some sort of circular
>>> buffer of memory that is directly mapped for the device that it is
>>> going to be transmitting to. Essentially what you would be doing is
>>> creating a pre-mapped page and would need to communicate that the
>>> memory is already mapped for the device you want to send it to so that
>>> it could skip that step.
>> IIUC sk_page_frag_refill() is already doing a similar reusing as the
>> rx reusing implemented in most driver except for the not pre-mapping
>> part.
>> And it seems that even if we pre-map the page and communicate that the
>> memory is already mapped to the driver, it is likely that we will not
>> be able to reuse the page when the circular buffer is not big enough
>> or tx completion/tcp ack is not happening quickly enough, which might
>> means unmapping/deallocating old circular buffer and allocating/mapping
>> new circular buffer.
>> Using page pool we might be able to alleviate the above problem as it
>> does for rx?
> I would say that instead of looking at going straight for the page
> pool it might make more sense to look at seeing if we can coalesce the
> DMA mapping of the pages first at the socket layer rather than trying
> to introduce the overhead for the page pool. In the case of sockets we
> already have the destructors that are called when the memory is freed,
> so instead of making sockets use page pool it might make more sense to
> extend the socket buffer allocation/freeing to incorporate bulk
> mapping and unmapping of pages to optimize the socket Tx path in the
> 32K page case.

I was able to enable tx recycling prototyping based on page pool to
run some performance test, the performance improvement is about +20%
(30Gbit -> 38Gbit) for single thread iperf tcp flow when IOMMU is in
strict mode. And CPU usage descreases about 10% for four threads iperf
tcp flow for line speed of 100Gbit when IOMMU is in strict mode.

Looking at the prototyping code, I am agreed that it is a bit controversial
to use the page pool for tx as the page pool is assuming NAPI polling
protection for allocation side.

So I will take a deeper look about your suggestion above to see how to
implement it.

Also, I am assuming the "destructors" means tcp_wfree() for TCP, right?
It seems tcp_wfree() is mainly used to do memory accounting and free
"struct sock" if necessary.
I am not so familiar with socket layer to understand how the "destructors"
will be helpful here, any detailed idea how to use "destructors" here?

> .

  reply	other threads:[~2021-07-27  7:54 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-20  3:35 [PATCH rfc v6 0/4] add frag page support in page pool Yunsheng Lin
2021-07-20  3:35 ` [PATCH rfc v6 1/4] page_pool: keep pp info as long as page pool owns the page Yunsheng Lin
2021-07-20  3:35 ` [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool Yunsheng Lin
2021-07-20 15:43   ` Alexander Duyck
2021-07-21  8:15     ` Yunsheng Lin
2021-07-21 14:06       ` Alexander Duyck
2021-07-22  8:07         ` Yunsheng Lin
2021-07-22 15:18           ` Alexander Duyck
2021-07-23 11:12             ` Yunsheng Lin
2021-07-23 16:08               ` Alexander Duyck
2021-07-24 13:07                 ` Yunsheng Lin
2021-07-25 16:49                   ` Alexander Duyck
2021-07-27  7:54                     ` Yunsheng Lin [this message]
2021-07-27 18:38                       ` Alexander Duyck
2021-08-02  9:17                         ` Yunsheng Lin
2021-07-20  3:35 ` [PATCH rfc v6 3/4] page_pool: add frag page recycling support " Yunsheng Lin
2021-07-20  3:35 ` [PATCH rfc v6 4/4] net: hns3: support skb's frag page recycling based on " Yunsheng Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).