Re: [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called

From: Yunsheng Lin <linyunsheng@huawei.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: David Miller <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Netdev <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>, <linuxarm@openeuler.org>,
	<hawk@kernel.org>, Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	Jonathan Lemon <jonathan.lemon@gmail.com>,
	Alexander Lobakin <alobakin@pm.me>,
	Willem de Bruijn <willemb@google.com>,
	Cong Wang <cong.wang@bytedance.com>,
	Paolo Abeni <pabeni@redhat.com>, "Kevin Hao" <haokexin@gmail.com>,
	<nogikh@google.com>, Marco Elver <elver@google.com>,
	<memxor@gmail.com>, Eric Dumazet <edumazet@google.com>,
	David Ahern <dsahern@gmail.com>
Subject: Re: [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called
Date: Tue, 31 Aug 2021 15:20:20 +0800	[thread overview]
Message-ID: <9cf28179-0cd5-a8c5-2bfd-bd844315ad1a@huawei.com> (raw)
In-Reply-To: <CAKgT0UfmcB93Hn1AS_o2a_h98xxZMouTiGzJfG09qsWf+O6L1Q@mail.gmail.com>

On 2021/8/30 23:14, Alexander Duyck wrote:
> On Sun, Aug 29, 2021 at 6:19 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> As the skb->pp_recycle and page->pp_magic may not be enough
>> to track if a frag page is from page pool after the calling
>> of __skb_frag_ref(), mostly because of a data race, see:
>> commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
>> recycling page_pool packets").
>>
>> There may be clone and expand head case that might lose the
>> track if a frag page is from page pool or not.
>>
>> So increment the frag count when __skb_frag_ref() is called,
>> and only use page->pp_magic to indicate if a frag page is from
>> page pool, to avoid the above data race.
>>
>> For 32 bit systems with 64 bit dma, we preserve the orginial
>> behavior as frag count is used to trace how many time does a
>> frag page is called with __skb_frag_ref().
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> 
> Is this really a common enough case to justify adding this extra overhead?

I am not sure I understand what does extra overhead mean here.
But it seems this patch does not add any explicit overhead?
As the added page_pool_is_pp_page() checking in __skb_frag_ref() is
neutralized by avoiding the recycle checking in __skb_frag_unref(),
and the atomic operation is with either pp_frag_count or _refcount?

> 
>> ---
>>  include/linux/skbuff.h  | 13 ++++++++++++-
>>  include/net/page_pool.h | 17 +++++++++++++++++
>>  net/core/page_pool.c    | 12 ++----------
>>  3 files changed, 31 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 6bdb0db..8311482 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -3073,6 +3073,16 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
>>   */
>>  static inline void __skb_frag_ref(skb_frag_t *frag)
>>  {
>> +       struct page *page = skb_frag_page(frag);
>> +
>> +#ifdef CONFIG_PAGE_POOL
>> +       if (!PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
>> +           page_pool_is_pp_page(page)) {
>> +               page_pool_atomic_inc_frag_count(page);
>> +               return;
>> +       }
>> +#endif
>> +
>>         get_page(skb_frag_page(frag));
>>  }
>>
> 
> This just seems like a bad idea in general. We are likely increasing
> the potential for issues with this patch instead of avoiding them. I

Yes, I am agreed that calling the __skb_frag_ref() without calling the
__skb_frag_unref() for the same page might be more likely to cause problem
for this patch. But we are already depending on the calling of
__skb_frag_unref() to free the pp page, making it more likely just enable
us to catch the bug more quickly?

Or is there other situation that I am not awared of, which might cause
issues?

> really feel it would be better for us to just give up on the page and
> kick it out of the page pool if we are cloning frames and multiple
> references are being taken on the pages.

For Rx, it seems fine for normal case.
For Tx, it seems the cloning and multiple references happens when
tso_fragment() is called in tcp_write_xmit(), and the driver need to
reliable way to tell if a page is from the page pool, so that the
dma mapping can be avoided for Tx too.

> 
>> @@ -3101,7 +3111,8 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
>>         struct page *page = skb_frag_page(frag);
>>
>>  #ifdef CONFIG_PAGE_POOL
>> -       if (recycle && page_pool_return_skb_page(page))
>> +       if ((!PAGE_POOL_DMA_USE_PP_FRAG_COUNT || recycle) &&
>> +           page_pool_return_skb_page(page))
>>                 return;
>>  #endif
>>         put_page(page);
>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>> index 2ad0706..8b43e3d9 100644
>> --- a/include/net/page_pool.h
>> +++ b/include/net/page_pool.h
>> @@ -244,6 +244,23 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
>>         atomic_long_set(&page->pp_frag_count, nr);
>>  }
>>
>> +static inline void page_pool_atomic_inc_frag_count(struct page *page)
>> +{
>> +       atomic_long_inc(&page->pp_frag_count);
>> +}
>> +
>> +static inline bool page_pool_is_pp_page(struct page *page)
>> +{
>> +       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>> +        * in order to preserve any existing bits, such as bit 0 for the
>> +        * head page of compound page and bit 1 for pfmemalloc page, so
>> +        * mask those bits for freeing side when doing below checking,
>> +        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>> +        * to avoid recycling the pfmemalloc page.
>> +        */
>> +       return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
>> +}
>> +
>>  static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>                                                           long nr)
>>  {
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index ba9f14d..442d37b 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -24,7 +24,7 @@
>>  #define DEFER_TIME (msecs_to_jiffies(1000))
>>  #define DEFER_WARN_INTERVAL (60 * HZ)
>>
>> -#define BIAS_MAX       LONG_MAX
>> +#define BIAS_MAX       (LONG_MAX / 2)
> 
> This piece needs some explaining in the patch. Why are you changing
> the BIAS_MAX?

When __skb_frag_ref() is called for the pp page that is not drained yet,
the pp_frag_count could be overflowed if the BIAS is too big.

> 
>>  static int page_pool_init(struct page_pool *pool,
>>                           const struct page_pool_params *params)
>> @@ -741,15 +741,7 @@ bool page_pool_return_skb_page(struct page *page)
>>         struct page_pool *pp;
>>
>>         page = compound_head(page);
>> -
>> -       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>> -        * in order to preserve any existing bits, such as bit 0 for the
>> -        * head page of compound page and bit 1 for pfmemalloc page, so
>> -        * mask those bits for freeing side when doing below checking,
>> -        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>> -        * to avoid recycling the pfmemalloc page.
>> -        */
>> -       if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>> +       if (!page_pool_is_pp_page(page))
>>                 return false;
>>
>>         pp = page->pp;
>> --
>> 2.7.4
>>
> .
>