* [PATCH net-next 0/2] some optimization for page pool
@ 2021-08-30  1:18 Yunsheng Lin
  2021-08-30  1:18 ` [PATCH net-next 1/2] page_pool: support non-split page with PP_FLAG_PAGE_FRAG Yunsheng Lin
  2021-08-30  1:18 ` [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called Yunsheng Lin
  0 siblings, 2 replies; 11+ messages in thread
From: Yunsheng Lin @ 2021-08-30  1:18 UTC (permalink / raw)
  To: davem, kuba
  Cc: netdev, linux-kernel, linuxarm, hawk, ilias.apalodimas,
	jonathan.lemon, alobakin, willemb, cong.wang, pabeni, haokexin,
	nogikh, elver, memxor, edumazet, alexander.duyck, dsahern

Patch 1: support non-split page when PP_FLAG_PAGE_FRAG is set.
Patch 2: keep track of pp page when __skb_frag_ref() is called.

Yunsheng Lin (2):
  page_pool: support non-split page with PP_FLAG_PAGE_FRAG
  skbuff: keep track of pp page when __skb_frag_ref() is called

 include/linux/skbuff.h  | 13 ++++++++++++-
 include/net/page_pool.h | 23 +++++++++++++++++++++++
 net/core/page_pool.c    | 24 +++++++++---------------
 3 files changed, 44 insertions(+), 16 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH net-next 1/2] page_pool: support non-split page with PP_FLAG_PAGE_FRAG
  2021-08-30  1:18 [PATCH net-next 0/2] some optimization for page pool Yunsheng Lin
@ 2021-08-30  1:18 ` Yunsheng Lin
  2021-08-30 15:05   ` Alexander Duyck
  2021-08-30  1:18 ` [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called Yunsheng Lin
  1 sibling, 1 reply; 11+ messages in thread
From: Yunsheng Lin @ 2021-08-30  1:18 UTC (permalink / raw)
  To: davem, kuba
  Cc: netdev, linux-kernel, linuxarm, hawk, ilias.apalodimas,
	jonathan.lemon, alobakin, willemb, cong.wang, pabeni, haokexin,
	nogikh, elver, memxor, edumazet, alexander.duyck, dsahern

Currently, when PP_FLAG_PAGE_FRAG is set, the caller is not
expected to call page_pool_alloc_pages() directly because of
the PP_FLAG_PAGE_FRAG check in __page_pool_put_page().

This patch removes that check to enable non-split page
support when PP_FLAG_PAGE_FRAG is set.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/page_pool.h |  6 ++++++
 net/core/page_pool.c    | 12 +++++++-----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index a408240..2ad0706 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -238,6 +238,9 @@ static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 
 static inline void page_pool_set_frag_count(struct page *page, long nr)
 {
+	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
+		return;
+
 	atomic_long_set(&page->pp_frag_count, nr);
 }
 
@@ -246,6 +249,9 @@ static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
 {
 	long ret;
 
+	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
+		return 0;
+
 	/* As suggested by Alexander, atomic_long_read() may cover up the
 	 * reference count errors, so avoid calling atomic_long_read() in
 	 * the cases of freeing or draining the page_frags, where we would
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 1a69784..ba9f14d 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -313,11 +313,14 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
 
 	/* Fast-path: Get a page from cache */
 	page = __page_pool_get_cached(pool);
-	if (page)
-		return page;
 
 	/* Slow-path: cache empty, do real allocation */
-	page = __page_pool_alloc_pages_slow(pool, gfp);
+	if (!page)
+		page = __page_pool_alloc_pages_slow(pool, gfp);
+
+	if (likely(page))
+		page_pool_set_frag_count(page, 1);
+
 	return page;
 }
 EXPORT_SYMBOL(page_pool_alloc_pages);
@@ -426,8 +429,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 		     unsigned int dma_sync_size, bool allow_direct)
 {
 	/* It is not the last user for the page frag case */
-	if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
-	    page_pool_atomic_sub_frag_count_return(page, 1))
+	if (page_pool_atomic_sub_frag_count_return(page, 1))
 		return NULL;
 
 	/* This allocator is optimized for the XDP mode that uses
-- 
2.7.4



* [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called
  2021-08-30  1:18 [PATCH net-next 0/2] some optimization for page pool Yunsheng Lin
  2021-08-30  1:18 ` [PATCH net-next 1/2] page_pool: support non-split page with PP_FLAG_PAGE_FRAG Yunsheng Lin
@ 2021-08-30  1:18 ` Yunsheng Lin
  2021-08-30  4:50   ` kernel test robot
  2021-08-30 15:14   ` Alexander Duyck
  1 sibling, 2 replies; 11+ messages in thread
From: Yunsheng Lin @ 2021-08-30  1:18 UTC (permalink / raw)
  To: davem, kuba
  Cc: netdev, linux-kernel, linuxarm, hawk, ilias.apalodimas,
	jonathan.lemon, alobakin, willemb, cong.wang, pabeni, haokexin,
	nogikh, elver, memxor, edumazet, alexander.duyck, dsahern

skb->pp_recycle and page->pp_magic may not be enough to
track whether a frag page is from the page pool after
__skb_frag_ref() is called, mostly because of a data race, see:
commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
recycling page_pool packets").

The clone and expand-head cases may also lose track of
whether a frag page is from the page pool or not.

So increment the frag count when __skb_frag_ref() is called,
and only use page->pp_magic to indicate whether a frag page is
from the page pool, to avoid the above data race.

For 32-bit systems with 64-bit DMA, we preserve the original
behavior, as the frag count cannot be used there to track how
many times __skb_frag_ref() has been called on a frag page.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/linux/skbuff.h  | 13 ++++++++++++-
 include/net/page_pool.h | 17 +++++++++++++++++
 net/core/page_pool.c    | 12 ++----------
 3 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6bdb0db..8311482 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3073,6 +3073,16 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
+	struct page *page = skb_frag_page(frag);
+
+#ifdef CONFIG_PAGE_POOL
+	if (!PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
+	    page_pool_is_pp_page(page)) {
+		page_pool_atomic_inc_frag_count(page);
+		return;
+	}
+#endif
+
 	get_page(skb_frag_page(frag));
 }
 
@@ -3101,7 +3111,8 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
 	struct page *page = skb_frag_page(frag);
 
 #ifdef CONFIG_PAGE_POOL
-	if (recycle && page_pool_return_skb_page(page))
+	if ((!PAGE_POOL_DMA_USE_PP_FRAG_COUNT || recycle) &&
+	    page_pool_return_skb_page(page))
 		return;
 #endif
 	put_page(page);
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 2ad0706..8b43e3d9 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -244,6 +244,23 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
 	atomic_long_set(&page->pp_frag_count, nr);
 }
 
+static inline void page_pool_atomic_inc_frag_count(struct page *page)
+{
+	atomic_long_inc(&page->pp_frag_count);
+}
+
+static inline bool page_pool_is_pp_page(struct page *page)
+{
+	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+	 * in order to preserve any existing bits, such as bit 0 for the
+	 * head page of compound page and bit 1 for pfmemalloc page, so
+	 * mask those bits for freeing side when doing below checking,
+	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+	 * to avoid recycling the pfmemalloc page.
+	 */
+	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+}
+
 static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
 							  long nr)
 {
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ba9f14d..442d37b 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -24,7 +24,7 @@
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
-#define BIAS_MAX	LONG_MAX
+#define BIAS_MAX	(LONG_MAX / 2)
 
 static int page_pool_init(struct page_pool *pool,
 			  const struct page_pool_params *params)
@@ -741,15 +741,7 @@ bool page_pool_return_skb_page(struct page *page)
 	struct page_pool *pp;
 
 	page = compound_head(page);
-
-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+	if (!page_pool_is_pp_page(page))
 		return false;
 
 	pp = page->pp;
-- 
2.7.4



* Re: [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called
  2021-08-30  1:18 ` [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called Yunsheng Lin
@ 2021-08-30  4:50   ` kernel test robot
  2021-08-30 15:14   ` Alexander Duyck
  1 sibling, 0 replies; 11+ messages in thread
From: kernel test robot @ 2021-08-30  4:50 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba
  Cc: kbuild-all, netdev, linux-kernel, linuxarm, hawk,
	ilias.apalodimas, jonathan.lemon, alobakin, willemb

[-- Attachment #1: Type: text/plain, Size: 2918 bytes --]

Hi Yunsheng,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Yunsheng-Lin/some-optimization-for-page-pool/20210830-092301
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git eaf2aaec0be4623b1d19f5c6ef770a78a91cf460
config: sparc64-randconfig-r035-20210829 (attached as .config)
compiler: sparc64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/8d228f2ead7e21fe408d1e970132df9d13562a80
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Yunsheng-Lin/some-optimization-for-page-pool/20210830-092301
        git checkout 8d228f2ead7e21fe408d1e970132df9d13562a80
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=sparc64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from include/net/net_namespace.h:39,
                    from include/linux/inet.h:42,
                    from arch/sparc/kernel/setup_64.c:27:
   include/linux/skbuff.h: In function '__skb_frag_ref':
>> include/linux/skbuff.h:3076:22: error: unused variable 'page' [-Werror=unused-variable]
    3076 |         struct page *page = skb_frag_page(frag);
         |                      ^~~~
   arch/sparc/kernel/setup_64.c: At top level:
   arch/sparc/kernel/setup_64.c:615:13: error: no previous prototype for 'alloc_irqstack_bootmem' [-Werror=missing-prototypes]
     615 | void __init alloc_irqstack_bootmem(void)
         |             ^~~~~~~~~~~~~~~~~~~~~~
   cc1: all warnings being treated as errors
--
   In file included from include/linux/if_ether.h:19,
                    from include/linux/etherdevice.h:20,
                    from arch/sparc/kernel/idprom.c:13:
   include/linux/skbuff.h: In function '__skb_frag_ref':
>> include/linux/skbuff.h:3076:22: error: unused variable 'page' [-Werror=unused-variable]
    3076 |         struct page *page = skb_frag_page(frag);
         |                      ^~~~
   cc1: all warnings being treated as errors


vim +/page +3076 include/linux/skbuff.h

  3067	
  3068	/**
  3069	 * __skb_frag_ref - take an addition reference on a paged fragment.
  3070	 * @frag: the paged fragment
  3071	 *
  3072	 * Takes an additional reference on the paged fragment @frag.
  3073	 */
  3074	static inline void __skb_frag_ref(skb_frag_t *frag)
  3075	{
> 3076		struct page *page = skb_frag_page(frag);
  3077	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 38625 bytes --]


* Re: [PATCH net-next 1/2] page_pool: support non-split page with PP_FLAG_PAGE_FRAG
  2021-08-30  1:18 ` [PATCH net-next 1/2] page_pool: support non-split page with PP_FLAG_PAGE_FRAG Yunsheng Lin
@ 2021-08-30 15:05   ` Alexander Duyck
  2021-08-31  6:14     ` Yunsheng Lin
  0 siblings, 1 reply; 11+ messages in thread
From: Alexander Duyck @ 2021-08-30 15:05 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Netdev, LKML, linuxarm, hawk,
	Ilias Apalodimas, Jonathan Lemon, Alexander Lobakin,
	Willem de Bruijn, Cong Wang, Paolo Abeni, Kevin Hao, nogikh,
	Marco Elver, memxor, Eric Dumazet, David Ahern

On Sun, Aug 29, 2021 at 6:19 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> Currently, when PP_FLAG_PAGE_FRAG is set, the caller is not
> expected to call page_pool_alloc_pages() directly because of
> the PP_FLAG_PAGE_FRAG check in __page_pool_put_page().
>
> This patch removes that check to enable non-split page
> support when PP_FLAG_PAGE_FRAG is set.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  include/net/page_pool.h |  6 ++++++
>  net/core/page_pool.c    | 12 +++++++-----
>  2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index a408240..2ad0706 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -238,6 +238,9 @@ static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>
>  static inline void page_pool_set_frag_count(struct page *page, long nr)
>  {
> +       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> +               return;
> +
>         atomic_long_set(&page->pp_frag_count, nr);
>  }
>
> @@ -246,6 +249,9 @@ static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>  {
>         long ret;
>
> +       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> +               return 0;
> +
>         /* As suggested by Alexander, atomic_long_read() may cover up the
>          * reference count errors, so avoid calling atomic_long_read() in
>          * the cases of freeing or draining the page_frags, where we would
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 1a69784..ba9f14d 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -313,11 +313,14 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
>
>         /* Fast-path: Get a page from cache */
>         page = __page_pool_get_cached(pool);
> -       if (page)
> -               return page;
>
>         /* Slow-path: cache empty, do real allocation */
> -       page = __page_pool_alloc_pages_slow(pool, gfp);
> +       if (!page)
> +               page = __page_pool_alloc_pages_slow(pool, gfp);
> +
> +       if (likely(page))
> +               page_pool_set_frag_count(page, 1);
> +
>         return page;
>  }
>  EXPORT_SYMBOL(page_pool_alloc_pages);
> @@ -426,8 +429,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>                      unsigned int dma_sync_size, bool allow_direct)
>  {
>         /* It is not the last user for the page frag case */
> -       if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
> -           page_pool_atomic_sub_frag_count_return(page, 1))
> +       if (page_pool_atomic_sub_frag_count_return(page, 1))
>                 return NULL;

Isn't this going to have a negative performance impact on page pool
pages in general? Essentially you are adding an extra atomic operation
for all the non-frag pages.

It would work better if this did a check against 1 to determine
whether it is okay for this page to be freed here, performing the
atomic sub_return only if that check fails.
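The suggested ordering can be sketched in plain C11 with a stand-in
struct instead of the kernel's struct page (all names here are
illustrative, not the in-tree API): non-frag pages always carry a count
of 1, so a plain read against the expected count lets them skip the
atomic read-modify-write entirely.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for struct page; pp_count mimics page->pp_frag_count. */
struct fake_page {
	atomic_long pp_count;
};

/* Sketch of the suggested fast path.  Returns the remaining count
 * (0 means the page can be freed or recycled).  The plain read is
 * only safe when a match means the caller is the last user; the
 * slow path falls back to the atomic sub_return. */
static long frag_count_sub_return(struct fake_page *page, long nr)
{
	if (atomic_load(&page->pp_count) == nr) {
		atomic_store(&page->pp_count, 0);
		return 0;
	}
	return atomic_fetch_sub(&page->pp_count, nr) - nr;
}
```

With this ordering, the common non-frag case (count == 1, nr == 1)
costs one atomic load instead of an atomic read-modify-write, which is
the gist of the objection above.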


* Re: [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called
  2021-08-30  1:18 ` [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called Yunsheng Lin
  2021-08-30  4:50   ` kernel test robot
@ 2021-08-30 15:14   ` Alexander Duyck
  2021-08-31  7:20     ` Yunsheng Lin
  1 sibling, 1 reply; 11+ messages in thread
From: Alexander Duyck @ 2021-08-30 15:14 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Netdev, LKML, linuxarm, hawk,
	Ilias Apalodimas, Jonathan Lemon, Alexander Lobakin,
	Willem de Bruijn, Cong Wang, Paolo Abeni, Kevin Hao, nogikh,
	Marco Elver, memxor, Eric Dumazet, David Ahern

On Sun, Aug 29, 2021 at 6:19 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> skb->pp_recycle and page->pp_magic may not be enough to
> track whether a frag page is from the page pool after
> __skb_frag_ref() is called, mostly because of a data race, see:
> commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
> recycling page_pool packets").
>
> The clone and expand-head cases may also lose track of
> whether a frag page is from the page pool or not.
>
> So increment the frag count when __skb_frag_ref() is called,
> and only use page->pp_magic to indicate whether a frag page is
> from the page pool, to avoid the above data race.
>
> For 32-bit systems with 64-bit DMA, we preserve the original
> behavior, as the frag count cannot be used there to track how
> many times __skb_frag_ref() has been called on a frag page.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>

Is this really a common enough case to justify adding this extra overhead?

> ---
>  include/linux/skbuff.h  | 13 ++++++++++++-
>  include/net/page_pool.h | 17 +++++++++++++++++
>  net/core/page_pool.c    | 12 ++----------
>  3 files changed, 31 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 6bdb0db..8311482 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3073,6 +3073,16 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
>   */
>  static inline void __skb_frag_ref(skb_frag_t *frag)
>  {
> +       struct page *page = skb_frag_page(frag);
> +
> +#ifdef CONFIG_PAGE_POOL
> +       if (!PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
> +           page_pool_is_pp_page(page)) {
> +               page_pool_atomic_inc_frag_count(page);
> +               return;
> +       }
> +#endif
> +
>         get_page(skb_frag_page(frag));
>  }
>

This just seems like a bad idea in general. We are likely increasing
the potential for issues with this patch instead of avoiding them. I
really feel it would be better for us to just give up on the page and
kick it out of the page pool if we are cloning frames and multiple
references are being taken on the pages.

> @@ -3101,7 +3111,8 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
>         struct page *page = skb_frag_page(frag);
>
>  #ifdef CONFIG_PAGE_POOL
> -       if (recycle && page_pool_return_skb_page(page))
> +       if ((!PAGE_POOL_DMA_USE_PP_FRAG_COUNT || recycle) &&
> +           page_pool_return_skb_page(page))
>                 return;
>  #endif
>         put_page(page);
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 2ad0706..8b43e3d9 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -244,6 +244,23 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
>         atomic_long_set(&page->pp_frag_count, nr);
>  }
>
> +static inline void page_pool_atomic_inc_frag_count(struct page *page)
> +{
> +       atomic_long_inc(&page->pp_frag_count);
> +}
> +
> +static inline bool page_pool_is_pp_page(struct page *page)
> +{
> +       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> +        * in order to preserve any existing bits, such as bit 0 for the
> +        * head page of compound page and bit 1 for pfmemalloc page, so
> +        * mask those bits for freeing side when doing below checking,
> +        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> +        * to avoid recycling the pfmemalloc page.
> +        */
> +       return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
> +}
> +
>  static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>                                                           long nr)
>  {
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ba9f14d..442d37b 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -24,7 +24,7 @@
>  #define DEFER_TIME (msecs_to_jiffies(1000))
>  #define DEFER_WARN_INTERVAL (60 * HZ)
>
> -#define BIAS_MAX       LONG_MAX
> +#define BIAS_MAX       (LONG_MAX / 2)

This piece needs some explaining in the patch. Why are you changing
the BIAS_MAX?

>  static int page_pool_init(struct page_pool *pool,
>                           const struct page_pool_params *params)
> @@ -741,15 +741,7 @@ bool page_pool_return_skb_page(struct page *page)
>         struct page_pool *pp;
>
>         page = compound_head(page);
> -
> -       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> -        * in order to preserve any existing bits, such as bit 0 for the
> -        * head page of compound page and bit 1 for pfmemalloc page, so
> -        * mask those bits for freeing side when doing below checking,
> -        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> -        * to avoid recycling the pfmemalloc page.
> -        */
> -       if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> +       if (!page_pool_is_pp_page(page))
>                 return false;
>
>         pp = page->pp;
> --
> 2.7.4
>


* Re: [PATCH net-next 1/2] page_pool: support non-split page with PP_FLAG_PAGE_FRAG
  2021-08-30 15:05   ` Alexander Duyck
@ 2021-08-31  6:14     ` Yunsheng Lin
  2021-08-31 13:43       ` Alexander Duyck
  0 siblings, 1 reply; 11+ messages in thread
From: Yunsheng Lin @ 2021-08-31  6:14 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Jakub Kicinski, Netdev, LKML, linuxarm, hawk,
	Ilias Apalodimas, Jonathan Lemon, Alexander Lobakin,
	Willem de Bruijn, Cong Wang, Paolo Abeni, Kevin Hao, nogikh,
	Marco Elver, memxor, Eric Dumazet, David Ahern

On 2021/8/30 23:05, Alexander Duyck wrote:
> On Sun, Aug 29, 2021 at 6:19 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> Currently, when PP_FLAG_PAGE_FRAG is set, the caller is not
>> expected to call page_pool_alloc_pages() directly because of
>> the PP_FLAG_PAGE_FRAG check in __page_pool_put_page().
>>
>> This patch removes that check to enable non-split page
>> support when PP_FLAG_PAGE_FRAG is set.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>>  include/net/page_pool.h |  6 ++++++
>>  net/core/page_pool.c    | 12 +++++++-----
>>  2 files changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>> index a408240..2ad0706 100644
>> --- a/include/net/page_pool.h
>> +++ b/include/net/page_pool.h
>> @@ -238,6 +238,9 @@ static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>>
>>  static inline void page_pool_set_frag_count(struct page *page, long nr)
>>  {
>> +       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>> +               return;
>> +
>>         atomic_long_set(&page->pp_frag_count, nr);
>>  }
>>
>> @@ -246,6 +249,9 @@ static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>  {
>>         long ret;
>>
>> +       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>> +               return 0;
>> +
>>         /* As suggested by Alexander, atomic_long_read() may cover up the
>>          * reference count errors, so avoid calling atomic_long_read() in
>>          * the cases of freeing or draining the page_frags, where we would
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index 1a69784..ba9f14d 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -313,11 +313,14 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>
>>         /* Fast-path: Get a page from cache */
>>         page = __page_pool_get_cached(pool);
>> -       if (page)
>> -               return page;
>>
>>         /* Slow-path: cache empty, do real allocation */
>> -       page = __page_pool_alloc_pages_slow(pool, gfp);
>> +       if (!page)
>> +               page = __page_pool_alloc_pages_slow(pool, gfp);
>> +
>> +       if (likely(page))
>> +               page_pool_set_frag_count(page, 1);
>> +
>>         return page;
>>  }
>>  EXPORT_SYMBOL(page_pool_alloc_pages);
>> @@ -426,8 +429,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>>                      unsigned int dma_sync_size, bool allow_direct)
>>  {
>>         /* It is not the last user for the page frag case */
>> -       if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
>> -           page_pool_atomic_sub_frag_count_return(page, 1))
>> +       if (page_pool_atomic_sub_frag_count_return(page, 1))
>>                 return NULL;
> 
> Isn't this going to have a negative performance impact on page pool
> pages in general? Essentially you are adding an extra atomic operation
> for all the non-frag pages.
> 
> It would work better if this did a check against 1 to determine
> whether it is okay for this page to be freed here, performing the
> atomic sub_return only if that check fails.

page_pool_atomic_sub_frag_count_return() already includes the
optimization of skipping the atomic sub_return when the caller is
the last user of the page:

	/* As suggested by Alexander, atomic_long_read() may cover up the
	 * reference count errors, so avoid calling atomic_long_read() in
	 * the cases of freeing or draining the page_frags, where we would
	 * not expect it to match or that are slowpath anyway.
	 */
        if (__builtin_constant_p(nr) &&
            atomic_long_read(&page->pp_frag_count) == nr)
                return 0;

So the check against 1 is not needed here?

> .
> 


* Re: [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called
  2021-08-30 15:14   ` Alexander Duyck
@ 2021-08-31  7:20     ` Yunsheng Lin
  2021-08-31 14:30       ` Alexander Duyck
  0 siblings, 1 reply; 11+ messages in thread
From: Yunsheng Lin @ 2021-08-31  7:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Jakub Kicinski, Netdev, LKML, linuxarm, hawk,
	Ilias Apalodimas, Jonathan Lemon, Alexander Lobakin,
	Willem de Bruijn, Cong Wang, Paolo Abeni, Kevin Hao, nogikh,
	Marco Elver, memxor, Eric Dumazet, David Ahern

On 2021/8/30 23:14, Alexander Duyck wrote:
> On Sun, Aug 29, 2021 at 6:19 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> skb->pp_recycle and page->pp_magic may not be enough to
>> track whether a frag page is from the page pool after
>> __skb_frag_ref() is called, mostly because of a data race, see:
>> commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
>> recycling page_pool packets").
>>
>> The clone and expand-head cases may also lose track of
>> whether a frag page is from the page pool or not.
>>
>> So increment the frag count when __skb_frag_ref() is called,
>> and only use page->pp_magic to indicate whether a frag page is
>> from the page pool, to avoid the above data race.
>>
>> For 32-bit systems with 64-bit DMA, we preserve the original
>> behavior, as the frag count cannot be used there to track how
>> many times __skb_frag_ref() has been called on a frag page.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> 
> Is this really a common enough case to justify adding this extra overhead?

I am not sure I understand what "extra overhead" refers to here.
This patch does not seem to add any explicit overhead: the added
page_pool_is_pp_page() check in __skb_frag_ref() is offset by
avoiding the recycle check in __skb_frag_unref(), and the atomic
operation is on either pp_frag_count or _refcount either way?

> 
>> ---
>>  include/linux/skbuff.h  | 13 ++++++++++++-
>>  include/net/page_pool.h | 17 +++++++++++++++++
>>  net/core/page_pool.c    | 12 ++----------
>>  3 files changed, 31 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 6bdb0db..8311482 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -3073,6 +3073,16 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
>>   */
>>  static inline void __skb_frag_ref(skb_frag_t *frag)
>>  {
>> +       struct page *page = skb_frag_page(frag);
>> +
>> +#ifdef CONFIG_PAGE_POOL
>> +       if (!PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
>> +           page_pool_is_pp_page(page)) {
>> +               page_pool_atomic_inc_frag_count(page);
>> +               return;
>> +       }
>> +#endif
>> +
>>         get_page(skb_frag_page(frag));
>>  }
>>
> 
> This just seems like a bad idea in general. We are likely increasing
> the potential for issues with this patch instead of avoiding them. I

Yes, I agree that calling __skb_frag_ref() without a matching
__skb_frag_unref() for the same page might be more likely to cause
problems with this patch. But we already depend on __skb_frag_unref()
being called to free the pp page, so doesn't making that more likely
just enable us to catch the bug more quickly?

Or is there another situation I am not aware of that might cause
issues?

> really feel it would be better for us to just give up on the page and
> kick it out of the page pool if we are cloning frames and multiple
> references are being taken on the pages.

For Rx, it seems fine for the normal case.
For Tx, it seems the cloning and multiple referencing happen when
tso_fragment() is called in tcp_write_xmit(), and the driver needs a
reliable way to tell whether a page is from the page pool, so that
the DMA mapping can be avoided for Tx too.
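For reference, the signature test a driver would rely on can be
sketched outside the kernel like this (the struct and the value of
PP_SIGNATURE are stand-ins for illustration; the real check is
page_pool_is_pp_page() from patch 2):

```c
#include <assert.h>

#define PP_SIGNATURE 0x40UL	/* illustrative value, not the kernel's */

/* Stand-in for struct page. */
struct fake_page {
	unsigned long pp_magic;
};

/* Mirrors page_pool_is_pp_page(): bit 0 (compound-head marker) and
 * bit 1 (pfmemalloc marker) are masked off before comparing, since
 * pp_magic is OR'ed with PP_SIGNATURE on top of any existing bits. */
static int is_pp_page(const struct fake_page *page)
{
	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
}
```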

> 
>> @@ -3101,7 +3111,8 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
>>         struct page *page = skb_frag_page(frag);
>>
>>  #ifdef CONFIG_PAGE_POOL
>> -       if (recycle && page_pool_return_skb_page(page))
>> +       if ((!PAGE_POOL_DMA_USE_PP_FRAG_COUNT || recycle) &&
>> +           page_pool_return_skb_page(page))
>>                 return;
>>  #endif
>>         put_page(page);
>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>> index 2ad0706..8b43e3d9 100644
>> --- a/include/net/page_pool.h
>> +++ b/include/net/page_pool.h
>> @@ -244,6 +244,23 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
>>         atomic_long_set(&page->pp_frag_count, nr);
>>  }
>>
>> +static inline void page_pool_atomic_inc_frag_count(struct page *page)
>> +{
>> +       atomic_long_inc(&page->pp_frag_count);
>> +}
>> +
>> +static inline bool page_pool_is_pp_page(struct page *page)
>> +{
>> +       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>> +        * in order to preserve any existing bits, such as bit 0 for the
>> +        * head page of compound page and bit 1 for pfmemalloc page, so
>> +        * mask those bits for freeing side when doing below checking,
>> +        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>> +        * to avoid recycling the pfmemalloc page.
>> +        */
>> +       return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
>> +}
>> +
>>  static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>                                                           long nr)
>>  {
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index ba9f14d..442d37b 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -24,7 +24,7 @@
>>  #define DEFER_TIME (msecs_to_jiffies(1000))
>>  #define DEFER_WARN_INTERVAL (60 * HZ)
>>
>> -#define BIAS_MAX       LONG_MAX
>> +#define BIAS_MAX       (LONG_MAX / 2)
> 
> This piece needs some explaining in the patch. Why are you changing
> the BIAS_MAX?

When __skb_frag_ref() is called for a pp page that has not been drained
yet, the pp_frag_count could overflow if the bias is too big.

> 
>>  static int page_pool_init(struct page_pool *pool,
>>                           const struct page_pool_params *params)
>> @@ -741,15 +741,7 @@ bool page_pool_return_skb_page(struct page *page)
>>         struct page_pool *pp;
>>
>>         page = compound_head(page);
>> -
>> -       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>> -        * in order to preserve any existing bits, such as bit 0 for the
>> -        * head page of compound page and bit 1 for pfmemalloc page, so
>> -        * mask those bits for freeing side when doing below checking,
>> -        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>> -        * to avoid recycling the pfmemalloc page.
>> -        */
>> -       if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>> +       if (!page_pool_is_pp_page(page))
>>                 return false;
>>
>>         pp = page->pp;
>> --
>> 2.7.4
>>
> .
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next 1/2] page_pool: support non-split page with PP_FLAG_PAGE_FRAG
  2021-08-31  6:14     ` Yunsheng Lin
@ 2021-08-31 13:43       ` Alexander Duyck
  0 siblings, 0 replies; 11+ messages in thread
From: Alexander Duyck @ 2021-08-31 13:43 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Netdev, LKML, linuxarm, hawk,
	Ilias Apalodimas, Jonathan Lemon, Alexander Lobakin,
	Willem de Bruijn, Cong Wang, Paolo Abeni, Kevin Hao, nogikh,
	Marco Elver, memxor, Eric Dumazet, David Ahern

On Mon, Aug 30, 2021 at 11:14 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2021/8/30 23:05, Alexander Duyck wrote:
> > On Sun, Aug 29, 2021 at 6:19 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> Currently when PP_FLAG_PAGE_FRAG is set, the caller is not
> >> expected to call page_pool_alloc_pages() directly because of
> >> the PP_FLAG_PAGE_FRAG checking in __page_pool_put_page().
> >>
> >> The patch removes the above checking to enable non-split page
> >> support when PP_FLAG_PAGE_FRAG is set.
> >>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >> ---
> >>  include/net/page_pool.h |  6 ++++++
> >>  net/core/page_pool.c    | 12 +++++++-----
> >>  2 files changed, 13 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> >> index a408240..2ad0706 100644
> >> --- a/include/net/page_pool.h
> >> +++ b/include/net/page_pool.h
> >> @@ -238,6 +238,9 @@ static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> >>
> >>  static inline void page_pool_set_frag_count(struct page *page, long nr)
> >>  {
> >> +       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >> +               return;
> >> +
> >>         atomic_long_set(&page->pp_frag_count, nr);
> >>  }
> >>
> >> @@ -246,6 +249,9 @@ static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
> >>  {
> >>         long ret;
> >>
> >> +       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >> +               return 0;
> >> +
> >>         /* As suggested by Alexander, atomic_long_read() may cover up the
> >>          * reference count errors, so avoid calling atomic_long_read() in
> >>          * the cases of freeing or draining the page_frags, where we would
> >> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> >> index 1a69784..ba9f14d 100644
> >> --- a/net/core/page_pool.c
> >> +++ b/net/core/page_pool.c
> >> @@ -313,11 +313,14 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >>
> >>         /* Fast-path: Get a page from cache */
> >>         page = __page_pool_get_cached(pool);
> >> -       if (page)
> >> -               return page;
> >>
> >>         /* Slow-path: cache empty, do real allocation */
> >> -       page = __page_pool_alloc_pages_slow(pool, gfp);
> >> +       if (!page)
> >> +               page = __page_pool_alloc_pages_slow(pool, gfp);
> >> +
> >> +       if (likely(page))
> >> +               page_pool_set_frag_count(page, 1);
> >> +
> >>         return page;
> >>  }
> >>  EXPORT_SYMBOL(page_pool_alloc_pages);
> >> @@ -426,8 +429,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >>                      unsigned int dma_sync_size, bool allow_direct)
> >>  {
> >>         /* It is not the last user for the page frag case */
> >> -       if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
> >> -           page_pool_atomic_sub_frag_count_return(page, 1))
> >> +       if (page_pool_atomic_sub_frag_count_return(page, 1))
> >>                 return NULL;
> >
> > Isn't this going to have a negative performance impact on page pool
> > pages in general? Essentially you are adding an extra atomic operation
> > for all the non-frag pages.
> >
> > It would work better if this was doing a check against 1 to determine
> > if it is okay for this page to be freed here and only if the check
> > fails then you perform the atomic sub_return.
>
> The page_pool_atomic_sub_frag_count_return() already has an optimization
> to avoid the atomic sub_return when the caller is the last user of the
> page, see:
>
>         /* As suggested by Alexander, atomic_long_read() may cover up the
>          * reference count errors, so avoid calling atomic_long_read() in
>          * the cases of freeing or draining the page_frags, where we would
>          * not expect it to match or that are slowpath anyway.
>          */
>         if (__builtin_constant_p(nr) &&
>             atomic_long_read(&page->pp_frag_count) == nr)
>                 return 0;
>
> So the check against 1 is not needed here?

Ah, okay. I hadn't seen that part. So yeah, this should be mostly
harmless, since 1 falls into the category of a builtin constant: the
standard case is the frag count being set to 1 and then read, which
should be minimal overhead.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>


* Re: [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called
  2021-08-31  7:20     ` Yunsheng Lin
@ 2021-08-31 14:30       ` Alexander Duyck
  2021-09-01  3:10         ` Yunsheng Lin
  0 siblings, 1 reply; 11+ messages in thread
From: Alexander Duyck @ 2021-08-31 14:30 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Netdev, LKML, linuxarm, hawk,
	Ilias Apalodimas, Jonathan Lemon, Alexander Lobakin,
	Willem de Bruijn, Cong Wang, Paolo Abeni, Kevin Hao, nogikh,
	Marco Elver, memxor, Eric Dumazet, David Ahern

On Tue, Aug 31, 2021 at 12:20 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2021/8/30 23:14, Alexander Duyck wrote:
> > On Sun, Aug 29, 2021 at 6:19 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> As the skb->pp_recycle and page->pp_magic may not be enough
> >> to track if a frag page is from page pool after the calling
> >> of __skb_frag_ref(), mostly because of a data race, see:
> >> commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
> >> recycling page_pool packets").
> >>
> >> There may be clone and expand-head cases that lose track of
> >> whether a frag page is from the page pool or not.
> >>
> >> So increment the frag count when __skb_frag_ref() is called,
> >> and only use page->pp_magic to indicate if a frag page is from
> >> page pool, to avoid the above data race.
> >>
> >> For 32 bit systems with 64 bit DMA, we preserve the original
> >> behavior, as the frag count is used to track how many times
> >> __skb_frag_ref() is called on a frag page.
> >>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >
> > Is this really a common enough case to justify adding this extra overhead?
>
> I am not sure I understand what the extra overhead means here.
> But it seems this patch does not add any explicit overhead?
> As the added page_pool_is_pp_page() checking in __skb_frag_ref() is
> neutralized by avoiding the recycle checking in __skb_frag_unref(),
> and the atomic operation is with either pp_frag_count or _refcount?

My concern is maintenance overhead. Specifically what you are doing is
forking the code path in __skb_frag_ref so there are cases where it
will increment the page reference count and there are others where it
will increment the frag count. Changing things like this means we have
to be certain that any paths that are expecting the reference count to
be updated have been addressed and it means that any code dealing with
the reference count in the future will be that much more complex.

> >
> >> ---
> >>  include/linux/skbuff.h  | 13 ++++++++++++-
> >>  include/net/page_pool.h | 17 +++++++++++++++++
> >>  net/core/page_pool.c    | 12 ++----------
> >>  3 files changed, 31 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >> index 6bdb0db..8311482 100644
> >> --- a/include/linux/skbuff.h
> >> +++ b/include/linux/skbuff.h
> >> @@ -3073,6 +3073,16 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
> >>   */
> >>  static inline void __skb_frag_ref(skb_frag_t *frag)
> >>  {
> >> +       struct page *page = skb_frag_page(frag);
> >> +
> >> +#ifdef CONFIG_PAGE_POOL
> >> +       if (!PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
> >> +           page_pool_is_pp_page(page)) {
> >> +               page_pool_atomic_inc_frag_count(page);
> >> +               return;
> >> +       }
> >> +#endif
> >> +
> >>         get_page(skb_frag_page(frag));
> >>  }
> >>
> >
> > This just seems like a bad idea in general. We are likely increasing
> > the potential for issues with this patch instead of avoiding them. I
>
> Yes, I agree that calling __skb_frag_ref() without calling
> __skb_frag_unref() for the same page might be more likely to cause
> problems for this patch. But we are already depending on the call to
> __skb_frag_unref() to free the pp page, so making that more likely just
> enables us to catch the bug more quickly?

The problem is that one of the things our earlier fix changed is that
you could only have one skb running around with skb->pp_recycle set
after cloning the head. So skb_frag_unref will not do the
page_pool_return_skb_page because pp_recycle is cleared, and it will
likely trigger a reference undercount since we incremented the frag
count instead of the reference count.

> Or are there other situations that I am not aware of, which might
> cause issues?

I'm pretty sure this breaks the case that was already called out in
commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling
page_pool packets"). The problem is the clone would then need to
convert over the frag count to page references in order to move
forward. What it effectively does is lock the pages into the skb and
prevent them from being used elsewhere.

> > really feel it would be better for us to just give up on the page and
> > kick it out of the page pool if we are cloning frames and multiple
> > references are being taken on the pages.
>
> For Rx, it seems fine for the normal case.
> For Tx, it seems the cloning and multiple references happen when
> tso_fragment() is called in tcp_write_xmit(), and the driver needs a
> reliable way to tell if a page is from the page pool, so that the
> DMA mapping can be avoided for Tx too.

The problem is cloning and page pool do not play well together. We are
better off just avoiding the page pool entirely for Tx. Now if we are
wanting to store the DMA for the page until it is freed that is one
thing, but the current page pool doesn't work well with cloning and
such because of all the refcount tricks that have to be played.

The main issue is that page pool assumes single producer, single
consumer. In the Tx path that isn't necessarily guaranteed since for
things like TCP we end up having to hold onto clones of the packets
until the transmission is completed.

> >
> >> @@ -3101,7 +3111,8 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
> >>         struct page *page = skb_frag_page(frag);
> >>
> >>  #ifdef CONFIG_PAGE_POOL
> >> -       if (recycle && page_pool_return_skb_page(page))
> >> +       if ((!PAGE_POOL_DMA_USE_PP_FRAG_COUNT || recycle) &&
> >> +           page_pool_return_skb_page(page))
> >>                 return;
> >>  #endif
> >>         put_page(page);
> >> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> >> index 2ad0706..8b43e3d9 100644
> >> --- a/include/net/page_pool.h
> >> +++ b/include/net/page_pool.h
> >> @@ -244,6 +244,23 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
> >>         atomic_long_set(&page->pp_frag_count, nr);
> >>  }
> >>
> >> +static inline void page_pool_atomic_inc_frag_count(struct page *page)
> >> +{
> >> +       atomic_long_inc(&page->pp_frag_count);
> >> +}
> >> +
> >> +static inline bool page_pool_is_pp_page(struct page *page)
> >> +{
> >> +       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> >> +        * in order to preserve any existing bits, such as bit 0 for the
> >> +        * head page of compound page and bit 1 for pfmemalloc page, so
> >> +        * mask those bits for freeing side when doing below checking,
> >> +        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> >> +        * to avoid recycling the pfmemalloc page.
> >> +        */
> >> +       return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
> >> +}
> >> +
> >>  static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
> >>                                                           long nr)
> >>  {
> >> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> >> index ba9f14d..442d37b 100644
> >> --- a/net/core/page_pool.c
> >> +++ b/net/core/page_pool.c
> >> @@ -24,7 +24,7 @@
> >>  #define DEFER_TIME (msecs_to_jiffies(1000))
> >>  #define DEFER_WARN_INTERVAL (60 * HZ)
> >>
> >> -#define BIAS_MAX       LONG_MAX
> >> +#define BIAS_MAX       (LONG_MAX / 2)
> >
> > This piece needs some explaining in the patch. Why are you changing
> > the BIAS_MAX?
>
> When __skb_frag_ref() is called for a pp page that has not been drained
> yet, the pp_frag_count could overflow if the bias is too big.

Aren't we only checking against 0 though? We are calling LONG_MAX
which is already half the maximum possible value since it is dropping
the signed bit. If we are treating the value like it is unsigned and
only testing against 0 that would leave half the space still available
anyway.


* Re: [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called
  2021-08-31 14:30       ` Alexander Duyck
@ 2021-09-01  3:10         ` Yunsheng Lin
  0 siblings, 0 replies; 11+ messages in thread
From: Yunsheng Lin @ 2021-09-01  3:10 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Jakub Kicinski, Netdev, LKML, linuxarm, hawk,
	Ilias Apalodimas, Jonathan Lemon, Alexander Lobakin,
	Willem de Bruijn, Cong Wang, Paolo Abeni, Kevin Hao, nogikh,
	Marco Elver, memxor, Eric Dumazet, David Ahern

On 2021/8/31 22:30, Alexander Duyck wrote:
> On Tue, Aug 31, 2021 at 12:20 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2021/8/30 23:14, Alexander Duyck wrote:
>>> On Sun, Aug 29, 2021 at 6:19 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>
>>>> As the skb->pp_recycle and page->pp_magic may not be enough
>>>> to track if a frag page is from page pool after the calling
>>>> of __skb_frag_ref(), mostly because of a data race, see:
>>>> commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
>>>> recycling page_pool packets").
>>>>
>>>> There may be clone and expand-head cases that lose track of
>>>> whether a frag page is from the page pool or not.
>>>>
>>>> So increment the frag count when __skb_frag_ref() is called,
>>>> and only use page->pp_magic to indicate if a frag page is from
>>>> page pool, to avoid the above data race.
>>>>
>>>> For 32 bit systems with 64 bit DMA, we preserve the original
>>>> behavior, as the frag count is used to track how many times
>>>> __skb_frag_ref() is called on a frag page.
>>>>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>>
>>> Is this really a common enough case to justify adding this extra overhead?
>>
>> I am not sure I understand what the extra overhead means here.
>> But it seems this patch does not add any explicit overhead?
>> As the added page_pool_is_pp_page() checking in __skb_frag_ref() is
>> neutralized by avoiding the recycle checking in __skb_frag_unref(),
>> and the atomic operation is with either pp_frag_count or _refcount?
> 
> My concern is maintenance overhead. Specifically what you are doing is
> forking the code path in __skb_frag_ref so there are cases where it
> will increment the page reference count and there are others where it
> will increment the frag count. Changing things like this means we have
> to be certain that any paths that are expecting the reference count to
> be updated have been addressed and it means that any code dealing with
> the reference count in the future will be that much more complex.

But it seems we are already doing that (calling page_pool_put_full_page()
or put_page() depending on skb->pp_recycle and page->pp_magic) in
__skb_frag_unref(), so it seems symmetric to do a similar thing in
__skb_frag_ref()?

> 
>>>
>>>> ---
>>>>  include/linux/skbuff.h  | 13 ++++++++++++-
>>>>  include/net/page_pool.h | 17 +++++++++++++++++
>>>>  net/core/page_pool.c    | 12 ++----------
>>>>  3 files changed, 31 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>>> index 6bdb0db..8311482 100644
>>>> --- a/include/linux/skbuff.h
>>>> +++ b/include/linux/skbuff.h
>>>> @@ -3073,6 +3073,16 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
>>>>   */
>>>>  static inline void __skb_frag_ref(skb_frag_t *frag)
>>>>  {
>>>> +       struct page *page = skb_frag_page(frag);
>>>> +
>>>> +#ifdef CONFIG_PAGE_POOL
>>>> +       if (!PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
>>>> +           page_pool_is_pp_page(page)) {
>>>> +               page_pool_atomic_inc_frag_count(page);
>>>> +               return;
>>>> +       }
>>>> +#endif
>>>> +
>>>>         get_page(skb_frag_page(frag));
>>>>  }
>>>>
>>>
>>> This just seems like a bad idea in general. We are likely increasing
>>> the potential for issues with this patch instead of avoiding them. I
>>
>> Yes, I agree that calling __skb_frag_ref() without calling
>> __skb_frag_unref() for the same page might be more likely to cause
>> problems for this patch. But we are already depending on the call to
>> __skb_frag_unref() to free the pp page, so making that more likely just
>> enables us to catch the bug more quickly?
> 
> The problem is that one of the things our earlier fix changed is that
> you could only have one skb running around with skb->pp_recycle set
> after cloning the head. So skb_frag_unref will not do the
> page_pool_return_skb_page because pp_recycle is cleared, and it will
> likely trigger a reference undercount since we incremented the frag
> count instead of the reference count.

As per the change below in this patch, __skb_frag_unref() has been changed
to bypass the skb->pp_recycle check and always call
page_pool_return_skb_page() for normal systems; page->pp_magic is the only
check indicating whether a page is from the page pool, which is symmetric
to the check in __skb_frag_ref().


@@ -3101,7 +3111,8 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
        struct page *page = skb_frag_page(frag);

#ifdef CONFIG_PAGE_POOL
-       if (recycle && page_pool_return_skb_page(page))
+       if ((!PAGE_POOL_DMA_USE_PP_FRAG_COUNT || recycle) &&
+           page_pool_return_skb_page(page))
                 return;
#endif
         put_page(page);

> 
>> Or are there other situations that I am not aware of, which might
>> cause issues?
> 
> I'm pretty sure this breaks the case that was already called out in
> commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling
> page_pool packets"). The problem is the clone would then need to
> convert over the frag count to page references in order to move
> forward. What it effectively does is lock the pages into the skb and
> prevent them from being used elsewhere.

As I understand it, after this patch skb->pp_recycle is only used to
indicate whether a head page is a pp page; for a frag page this is decided
using only page->pp_magic (we might be able to do that for the head page
too, and remove skb->pp_recycle completely).

And cloning happens in the netstack, so it seems it is in our hands to
make sure it won't break the page pool reference counting.

So this patch does not seem to break the above commit?

> 
>>> really feel it would be better for us to just give up on the page and
>>> kick it out of the page pool if we are cloning frames and multiple
>>> references are being taken on the pages.
>>
>> For Rx, it seems fine for the normal case.
>> For Tx, it seems the cloning and multiple references happen when
>> tso_fragment() is called in tcp_write_xmit(), and the driver needs a
>> reliable way to tell if a page is from the page pool, so that the
>> DMA mapping can be avoided for Tx too.
> 
> The problem is cloning and page pool do not play well together. We are

Yes, this patch is trying to make sure cloning and page pool play well
together.

For the skb_split() case in tso_fragment(), if an skb coming from the
wire (supposing all frag pages of the skb are from a page pool) happens
to be retransmitted back to the wire using TCP, it seems we might have
a bug here?

Supposing skb1 has 3 frag pages, all coming from the page pool, and is
split into skb2 and skb3:
skb2: first frag page + first half of second frag page
skb3: second half of second frag page + third frag page

How do we set the skb->pp_recycle of skb2 and skb3?
1. If we set both of them to 1, then we may have a similar race as in
   the above commit for the second frag page.
2. If we set only one of them to 1, then we may have a resource leaking
   problem, as both the first frag page and the third frag page are
   indeed from the page pool.

> better off just avoiding the page pool entirely for Tx. Now if we are
> wanting to store the DMA for the page until it is freed that is one
> thing, but the current page pool doesn't work well with cloning and
> such because of all the refcount tricks that have to be played.

Storing the DMA address for the page until it is freed means we need to
do the mapping/unmapping every time a page is used, with no recycling?

As per the discussion with Eric in the thread below, it seems:
"So it seems the IOMMU overhead is not only related to how many frags
an skb has, but also to the length of each frag, as the IOMMU mapping
is based on 4K/2M granularity (for arm64), so it may still take a lot
of time to write each 4K page entry to the page table when mapping and
to invalidate each 4K page entry when unmapping."

https://www.spinics.net/lists/kernel/msg4053130.html


And on my arm64 system:
memcpy of a 32768-byte buffer took ~2337 ns

And doing the mapping/unmapping for the same buffer took about the same
time as the memcpy in the one-thread case in strict mode (and the IOMMU
is known to scale poorly in the multi-thread case):
root@(none):~# ./dma_map_benchmark -t 1 -s 10 -n 0 -g 8
dma mapping benchmark: threads:1 seconds:10 node:0 dir:BIDIRECTIONAL granule: 8
average map latency(us):0.7 standard deviation:0.1
average unmap latency(us):1.6 standard deviation:0.4

So it seems the IOMMU overhead is still there even when reducing the
number of mappings/unmappings or with Tx zero-copy?

> 
> The main issue is that page pool assumes single producer, single
> consumer. In the Tx path that isn't necessarily guaranteed since for
> things like TCP we end up having to hold onto clones of the packets
> until the transmission is completed.

The main bottleneck for the page pool seems to be the scalability of
ptr_ring for multiple threads.
How about making ptr_ring support lockless multi-producer and
multi-consumer operation?

> 
>>>
>>>> @@ -3101,7 +3111,8 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
>>>>         struct page *page = skb_frag_page(frag);
>>>>
>>>>  #ifdef CONFIG_PAGE_POOL
>>>> -       if (recycle && page_pool_return_skb_page(page))
>>>> +       if ((!PAGE_POOL_DMA_USE_PP_FRAG_COUNT || recycle) &&
>>>> +           page_pool_return_skb_page(page))
>>>>                 return;
>>>>  #endif
>>>>         put_page(page);
>>>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>>>> index 2ad0706..8b43e3d9 100644
>>>> --- a/include/net/page_pool.h
>>>> +++ b/include/net/page_pool.h
>>>> @@ -244,6 +244,23 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
>>>>         atomic_long_set(&page->pp_frag_count, nr);
>>>>  }
>>>>
>>>> +static inline void page_pool_atomic_inc_frag_count(struct page *page)
>>>> +{
>>>> +       atomic_long_inc(&page->pp_frag_count);
>>>> +}
>>>> +
>>>> +static inline bool page_pool_is_pp_page(struct page *page)
>>>> +{
>>>> +       /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>>>> +        * in order to preserve any existing bits, such as bit 0 for the
>>>> +        * head page of compound page and bit 1 for pfmemalloc page, so
>>>> +        * mask those bits for freeing side when doing below checking,
>>>> +        * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>>>> +        * to avoid recycling the pfmemalloc page.
>>>> +        */
>>>> +       return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
>>>> +}
>>>> +
>>>>  static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>>>                                                           long nr)
>>>>  {
>>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>>> index ba9f14d..442d37b 100644
>>>> --- a/net/core/page_pool.c
>>>> +++ b/net/core/page_pool.c
>>>> @@ -24,7 +24,7 @@
>>>>  #define DEFER_TIME (msecs_to_jiffies(1000))
>>>>  #define DEFER_WARN_INTERVAL (60 * HZ)
>>>>
>>>> -#define BIAS_MAX       LONG_MAX
>>>> +#define BIAS_MAX       (LONG_MAX / 2)
>>>
>>> This piece needs some explaining in the patch. Why are you changing
>>> the BIAS_MAX?
>>
>> When __skb_frag_ref() is called for a pp page that has not been drained
>> yet, the pp_frag_count could overflow if the bias is too big.
> 
> Aren't we only checking against 0 though? We are calling LONG_MAX
> which is already half the maximum possible value since it is dropping
> the signed bit. If we are treating the value like it is unsigned and
> only testing against 0 that would leave half the space still available
> anyway.

Yes, if we are treating the value like it is unsigned, LONG_MAX does not
need changing.

But we are treating the value like it is signed now, and (LONG_MAX / 2)
seems large enough for the number of page users and the frag count?

> .
> 



Thread overview: 11+ messages
2021-08-30  1:18 [PATCH net-next 0/2] some optimization for page pool Yunsheng Lin
2021-08-30  1:18 ` [PATCH net-next 1/2] page_pool: support non-split page with PP_FLAG_PAGE_FRAG Yunsheng Lin
2021-08-30 15:05   ` Alexander Duyck
2021-08-31  6:14     ` Yunsheng Lin
2021-08-31 13:43       ` Alexander Duyck
2021-08-30  1:18 ` [PATCH net-next 2/2] skbuff: keep track of pp page when __skb_frag_ref() is called Yunsheng Lin
2021-08-30  4:50   ` kernel test robot
2021-08-30 15:14   ` Alexander Duyck
2021-08-31  7:20     ` Yunsheng Lin
2021-08-31 14:30       ` Alexander Duyck
2021-09-01  3:10         ` Yunsheng Lin
