All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jesper Dangaard Brouer <jbrouer@redhat.com>
To: Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	Yunsheng Lin <linyunsheng@huawei.com>
Cc: brouer@redhat.com, Alexander Duyck <alexander.duyck@gmail.com>,
	davem@davemloft.net, kuba@kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linuxarm@openeuler.org,
	hawk@kernel.org, jonathan.lemon@gmail.com, alobakin@pm.me,
	willemb@google.com, cong.wang@bytedance.com, pabeni@redhat.com,
	haokexin@gmail.com, nogikh@google.com, elver@google.com,
	memxor@gmail.com, edumazet@google.com, dsahern@gmail.com
Subject: Re: [PATCH net-next v2 3/3] skbuff: keep track of pp page when __skb_frag_ref() is called
Date: Wed, 15 Sep 2021 17:42:27 +0200	[thread overview]
Message-ID: <0337e2f6-5428-2c75-71a5-6db31c60650a@redhat.com> (raw)
In-Reply-To: <YUHtf+lI8ktBdjsQ@apalos.home>


On 15/09/2021 14.56, Ilias Apalodimas wrote:
> Hi Yunsheng,  Alexander,
> 
> On Wed, Sep 15, 2021 at 05:07:08PM +0800, Yunsheng Lin wrote:
>> On 2021/9/15 8:59, Alexander Duyck wrote:
>>> On Tue, Sep 14, 2021 at 5:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>
>>>> As the skb->pp_recycle and page->pp_magic may not be enough
>>>> to track if a frag page is from page pool after the calling
>>>> of __skb_frag_ref(), mostly because of a data race, see:
>>>> commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
>>>> recycling page_pool packets").
>>>
>>> I'm not sure how this comment actually applies. It is an issue that
>>> was fixed. If anything my concern is that this change will introduce
>>> new races instead of fixing any existing ones.
>>
>> My initial thinking about adding the above comment is to emphasize
>> that we might clear cloned skb's pp_recycle when doing head expanding,
>> and page pool might need to give up on that page if that cloned skb is
>> the last one to be freed.
>>
>>>
>>>> There may be clone and expand head case that might lose the
>>>> track if a frag page is from page pool or not.
>>>
>>> Can you explain how? If there is such a case we should fix it instead
>>> of trying to introduce new features to address it. This seems more
>>> like a performance optimization rather than a fix.
>>
>> Yes, I consider it an optimization too, that's why I am targetting
>> net-next.
>>
>> Even for the below skb_split() case in tso_fragment(), I am not sure
>> how can a rx pp page can go through the tcp stack yet.
> 
> I am bit confused :).  We don't have that problem *now* right?  This will
> appear if we try to pull in your patches on using page pool and recycling
> for Tx where TSO and skb_split are used?
> 
> I'll be honest, when I came up with the recycling idea for page pool, I
> never intended to support Tx.  I agree with Alexander here,  If people want
> to use it on Tx and think there's value,  we might need to go back to the
> drawing board and see what I've missed.  It's still early and there's a
> handful of drivers using it,  so it will less painful now.

I agree, page_pool is NOT designed or intended for TX support.
E.g. it doesn't make sense to allocate a page_pool instance per socket, 
as the backing memory structures for page_pool are too much.
As the number RX-queues are more limited it was deemed okay that we use 
page_pool per RX-queue, which sacrifice some memory to gain speed.


> The pp_recycle_bit was introduced to make the checking faster, instead of
> getting stuff into cache and check the page signature.  If that ends up
> being counterproductive, we could just replace the entire logic with the
> frag count and the page signature, couldn't we?  In that case we should be
> very cautious and measure potential regression on the standard path.

+1

> But in general,  I'd be happier if we only had a simple logic in our
> testing for the pages we have to recycle.  Debugging and understanding this
> otherwise will end up being a mess.


[...]
>>
>>>
>>>> For 32 bit systems with 64 bit dma, we preserve the orginial
>>>> behavior as frag count is used to trace how many time does a
>>>> frag page is called with __skb_frag_ref().
>>>>
>>>> We still use both skb->pp_recycle and page->pp_magic to decide
>>>> the head page for a skb is from page pool or not.
>>>>
> 
> [...]
> 
>>>>
>>>> +/**
>>>> + * skb_frag_is_pp_page - decide if a page is recyclable.
>>>> + * @page: frag page
>>>> + * @recycle: skb->pp_recycle
>>>> + *
>>>> + * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
>>>> + * also used to decide if a page can be recycled to the page
>>>> + * pool.
>>>> + */
>>>> +static inline bool skb_frag_is_pp_page(struct page *page,
>>>> +                                      bool recycle)
>>>> +{
>>>> +       return page_pool_is_pp_page(page) ||
>>>> +               (recycle && __page_pool_is_pp_page(page));
>>>> +}
>>>> +
>>>
>>> The logic for this check is ugly. You are essentially calling
>>> __page_pool_is_pp_page again if it fails the first check. It would
>>> probably make more sense to rearrange things and just call
>>> (!DMA_USE_PP_FRAG_COUNT || recycle)  && __page_pool_is_pp_page(). With
>>> that the check of recycle could be dropped entirely if frag count is
>>> valid to use, and in the non-fragcount case it reverts back to the
>>> original check.
>>
>> The reason I did not do that is I kind of did not want to use the
>> DMA_USE_PP_FRAG_COUNT outside of page pool.
>> I can use DMA_USE_PP_FRAG_COUNT directly in skbuff.h if the above
>> is considered harmless:)
>>
>> The 32 bit systems with 64 bit dma really seems a burden here, as
>> memtioned by Ilias [1], there seems to be no such system using page
>> pool, we might as well consider disabling page pool for such system?
>>
>> Ilias, Jesper, what do you think?
>>
>> 1. http://lkml.iu.edu/hypermail/linux/kernel/2107.1/06321.html
>>
> 
> Well I can't really disagree with myself too much :).  I still think we are
> carrying a lot of code and complexity for systems that don't exist.

I would be fine with rejecting such systems at page_pool setup time. 
Meaning that NIC drivers using page_pool for DMA-mapping, getting 
compiled on 32-bit systems and needing 64-bit DMA-mappings, will have 
their call to page_pool_create() fail (with something else than -EINVAL 
please).
If drivers really want work on such systems, they have to implement 
their own DMA-mapping fallback tracking outside page_pool.  Meaning it 
is only the keeping track of DMA-mapping part of page_pool they cannot use.

[...]
>>
>>>
>>>> +static inline bool __page_pool_is_pp_page(struct page *page)
>>>> +{
>>>> +       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>>>> +        * in order to preserve any existing bits, such as bit 0 for the
>>>> +        * head page of compound page and bit 1 for pfmemalloc page, so
>>>> +        * mask those bits for freeing side when doing below checking,
>>>> +        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>>>> +        * to avoid recycling the pfmemalloc page.
>>>> +        */
>>>> +       return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
>>>> +}
>>>> +
>>>> +static inline bool page_pool_is_pp_page(struct page *page)
>>>> +{
>>>> +       /* For systems with the same dma addr as the bus addr, we can use
>>>> +        * page->pp_magic to indicate a pp page uniquely.
>>>> +        */
>>>> +       return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
>>>> +                       __page_pool_is_pp_page(page);
>>>> +}
>>>> +
>>>
>>> We should really change the name of the #define. I keep reading it as
>>> we are using the PP_FRAG_COUNT, not that it is already in use. Maybe
>>> we should look at something like PP_FRAG_COUNT_VALID and just invert
>>> the logic for it.
>>
>> Yes, Jesper seems to have the similar confusion.
> 
> +1

+1


>> I seems better that we can remove that macro completely if the 32 bit
>> systems with 64 bit dma turn out to be not existing at all?
>>
>>>
>>> Also this function naming is really confusing. You don't have to have
>>> the frag count to be a page pool page. Maybe this should be something
>>> like page_pool_is_pp_frag_page.
>>
> 
> [...]
> 
> Regards
> /Ilias

--Jesper


  reply	other threads:[~2021-09-15 15:42 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-14 12:11 [PATCH net-next v2 0/3] some optimization for page pool Yunsheng Lin
2021-09-14 12:11 ` [PATCH net-next v2 1/3] page_pool: support non-split page with PP_FLAG_PAGE_FRAG Yunsheng Lin
2021-09-14 12:11 ` [PATCH net-next v2 2/3] pool_pool: avoid calling compound_head() for skb frag page Yunsheng Lin
2021-09-14 12:11 ` [PATCH net-next v2 3/3] skbuff: keep track of pp page when __skb_frag_ref() is called Yunsheng Lin
2021-09-15  0:59   ` Alexander Duyck
2021-09-15  9:07     ` Yunsheng Lin
2021-09-15 12:56       ` Ilias Apalodimas
2021-09-15 15:42         ` Jesper Dangaard Brouer [this message]
2021-09-16  2:05           ` [Linuxarm] " Yunsheng Lin
2021-09-16  8:44             ` Ilias Apalodimas
2021-09-16  9:33               ` Yunsheng Lin
2021-09-16 10:38                 ` Ilias Apalodimas
2021-09-16 11:04                   ` Yunsheng Lin
2021-09-16 11:21                     ` Yunsheng Lin
2021-09-16 11:57                     ` [Linuxarm] " Ilias Apalodimas
2021-09-17  3:57                       ` Yunsheng Lin
2021-09-17  6:38                         ` Ilias Apalodimas
2021-09-17  9:17                           ` Yunsheng Lin
2021-09-17 15:01                             ` Ilias Apalodimas
2021-09-18  1:43                               ` Yunsheng Lin
2021-09-18  9:23                                 ` Ilias Apalodimas
2021-09-22  3:38                                   ` Yunsheng Lin
2021-09-17 17:15             ` [Linuxarm] " Eric Dumazet
2021-09-18  2:42               ` Yunsheng Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0337e2f6-5428-2c75-71a5-6db31c60650a@redhat.com \
    --to=jbrouer@redhat.com \
    --cc=alexander.duyck@gmail.com \
    --cc=alobakin@pm.me \
    --cc=brouer@redhat.com \
    --cc=cong.wang@bytedance.com \
    --cc=davem@davemloft.net \
    --cc=dsahern@gmail.com \
    --cc=edumazet@google.com \
    --cc=elver@google.com \
    --cc=haokexin@gmail.com \
    --cc=hawk@kernel.org \
    --cc=ilias.apalodimas@linaro.org \
    --cc=jonathan.lemon@gmail.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxarm@openeuler.org \
    --cc=linyunsheng@huawei.com \
    --cc=memxor@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=nogikh@google.com \
    --cc=pabeni@redhat.com \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.