All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH rfc v3 0/4] add frag page support in page pool
@ 2021-07-12 12:16 Yunsheng Lin
  2021-07-12 12:16 ` [PATCH rfc v3 1/4] page_pool: keep pp info as long as page pool owns the page Yunsheng Lin
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Yunsheng Lin @ 2021-07-12 12:16 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf

This patchset adds frag page support in page pool and
enable skb's page frag recycling based on page pool in
hns3 drvier.

RFC v3:
1. Implement the semantic of "page recycling only wait for the
   page pool user instead of all user of a page" 
2. Support the frag allocation of different sizes
3. Merge patch 4 & 5 to one patch as it does not make sense to
   use page_pool_dev_alloc_pages() API directly with elevated
   refcnt.
4. other minor comment suggested by Alexander.

RFC v2:
1. Split patch 1 to more reviewable one.
2. Repurpose the lower 12 bits of the dma address to store the
   pagecnt_bias as suggested by Alexander.
3. support recycling to pool->alloc for elevated refcnt case
   too.

Yunsheng Lin (4):
  page_pool: keep pp info as long as page pool owns the page
  page_pool: add interface for getting and setting pagecnt_bias
  page_pool: add frag page recycling support in page pool
  net: hns3: support skb's frag page recycling based on page pool

 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c |  79 +++++++++++-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h |   3 +
 drivers/net/ethernet/marvell/mvneta.c           |   6 +-
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c |   2 +-
 drivers/net/ethernet/ti/cpsw.c                  |   2 +-
 drivers/net/ethernet/ti/cpsw_new.c              |   2 +-
 include/linux/skbuff.h                          |   4 +-
 include/net/page_pool.h                         |  58 +++++++--
 net/core/page_pool.c                            | 155 +++++++++++++++++++++---
 9 files changed, 267 insertions(+), 44 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rfc v3 1/4] page_pool: keep pp info as long as page pool owns the page
  2021-07-12 12:16 [PATCH rfc v3 0/4] add frag page support in page pool Yunsheng Lin
@ 2021-07-12 12:16 ` Yunsheng Lin
  2021-07-12 12:16 ` [PATCH rfc v3 2/4] page_pool: add interface for getting and setting pagecnt_bias Yunsheng Lin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Yunsheng Lin @ 2021-07-12 12:16 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf

Currently, page->pp is cleared and set everytime the page
is recycled, which is unnecessary.

So only set the page->pp when the page is added to the page
pool and only clear it when the page is released from the
page pool.

This is also a preparation to support allocating frag page
in page pool.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 drivers/net/ethernet/marvell/mvneta.c           |  6 +-----
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c |  2 +-
 drivers/net/ethernet/ti/cpsw.c                  |  2 +-
 drivers/net/ethernet/ti/cpsw_new.c              |  2 +-
 include/linux/skbuff.h                          |  4 +---
 include/net/page_pool.h                         |  7 -------
 net/core/page_pool.c                            | 21 +++++++++++++++++----
 7 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 361bc4f..89bf31fd 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2327,7 +2327,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool,
 	if (!skb)
 		return ERR_PTR(-ENOMEM);
 
-	skb_mark_for_recycle(skb, virt_to_page(xdp->data), pool);
+	skb_mark_for_recycle(skb);
 
 	skb_reserve(skb, xdp->data - xdp->data_hard_start);
 	skb_put(skb, xdp->data_end - xdp->data);
@@ -2339,10 +2339,6 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool,
 		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
 				skb_frag_page(frag), skb_frag_off(frag),
 				skb_frag_size(frag), PAGE_SIZE);
-		/* We don't need to reset pp_recycle here. It's already set, so
-		 * just mark fragments for recycling.
-		 */
-		page_pool_store_mem_info(skb_frag_page(frag), pool);
 	}
 
 	return skb;
diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index 3229baf..320eddb 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -3995,7 +3995,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi,
 		}
 
 		if (pp)
-			skb_mark_for_recycle(skb, page, pp);
+			skb_mark_for_recycle(skb);
 		else
 			dma_unmap_single_attrs(dev->dev.parent, dma_addr,
 					       bm_pool->buf_size, DMA_FROM_DEVICE,
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index cbbd0f6..9d59143 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -431,7 +431,7 @@ static void cpsw_rx_handler(void *token, int len, int status)
 	skb->protocol = eth_type_trans(skb, ndev);
 
 	/* mark skb for recycling */
-	skb_mark_for_recycle(skb, page, pool);
+	skb_mark_for_recycle(skb);
 	netif_receive_skb(skb);
 
 	ndev->stats.rx_bytes += len;
diff --git a/drivers/net/ethernet/ti/cpsw_new.c b/drivers/net/ethernet/ti/cpsw_new.c
index 57d279f..a4234a3 100644
--- a/drivers/net/ethernet/ti/cpsw_new.c
+++ b/drivers/net/ethernet/ti/cpsw_new.c
@@ -374,7 +374,7 @@ static void cpsw_rx_handler(void *token, int len, int status)
 	skb->protocol = eth_type_trans(skb, ndev);
 
 	/* mark skb for recycling */
-	skb_mark_for_recycle(skb, page, pool);
+	skb_mark_for_recycle(skb);
 	netif_receive_skb(skb);
 
 	ndev->stats.rx_bytes += len;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b2db9cd..7795979 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4711,11 +4711,9 @@ static inline u64 skb_get_kcov_handle(struct sk_buff *skb)
 }
 
 #ifdef CONFIG_PAGE_POOL
-static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page,
-					struct page_pool *pp)
+static inline void skb_mark_for_recycle(struct sk_buff *skb)
 {
 	skb->pp_recycle = 1;
-	page_pool_store_mem_info(page, pp);
 }
 #endif
 
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 3dd62dd..8d7744d 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -253,11 +253,4 @@ static inline void page_pool_ring_unlock(struct page_pool *pool)
 		spin_unlock_bh(&pool->ring.producer_lock);
 }
 
-/* Store mem_info on struct page and use it while recycling skb frags */
-static inline
-void page_pool_store_mem_info(struct page *page, struct page_pool *pp)
-{
-	page->pp = pp;
-}
-
 #endif /* _NET_PAGE_POOL_H */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 5e4eb45..78838c6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -206,6 +206,19 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 	return true;
 }
 
+static void page_pool_set_pp_info(struct page_pool *pool,
+				  struct page *page)
+{
+	page->pp = pool;
+	page->pp_magic |= PP_SIGNATURE;
+}
+
+static void page_pool_clear_pp_info(struct page *page)
+{
+	page->pp_magic = 0;
+	page->pp = NULL;
+}
+
 static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 						 gfp_t gfp)
 {
@@ -222,7 +235,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 		return NULL;
 	}
 
-	page->pp_magic |= PP_SIGNATURE;
+	page_pool_set_pp_info(pool, page);
 
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
@@ -266,7 +279,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 			put_page(page);
 			continue;
 		}
-		page->pp_magic |= PP_SIGNATURE;
+
+		page_pool_set_pp_info(pool, page);
 		pool->alloc.cache[pool->alloc.count++] = page;
 		/* Track how many pages are held 'in-flight' */
 		pool->pages_state_hold_cnt++;
@@ -345,7 +359,7 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
 			     DMA_ATTR_SKIP_CPU_SYNC);
 	page_pool_set_dma_addr(page, 0);
 skip_dma_unmap:
-	page->pp_magic = 0;
+	page_pool_clear_pp_info(page);
 
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
@@ -644,7 +658,6 @@ bool page_pool_return_skb_page(struct page *page)
 	 * The page will be returned to the pool here regardless of the
 	 * 'flipped' fragment being in use or not.
 	 */
-	page->pp = NULL;
 	page_pool_put_full_page(pp, page, false);
 
 	return true;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH rfc v3 2/4] page_pool: add interface for getting and setting pagecnt_bias
  2021-07-12 12:16 [PATCH rfc v3 0/4] add frag page support in page pool Yunsheng Lin
  2021-07-12 12:16 ` [PATCH rfc v3 1/4] page_pool: keep pp info as long as page pool owns the page Yunsheng Lin
@ 2021-07-12 12:16 ` Yunsheng Lin
  2021-07-12 16:02   ` Alexander Duyck
  2021-07-12 12:16 ` [PATCH rfc v3 3/4] page_pool: add frag page recycling support in page pool Yunsheng Lin
  2021-07-12 12:16 ` [PATCH rfc v3 4/4] net: hns3: support skb's frag page recycling based on " Yunsheng Lin
  3 siblings, 1 reply; 12+ messages in thread
From: Yunsheng Lin @ 2021-07-12 12:16 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf

As suggested by Alexander, "A DMA mapping should be page
aligned anyway so the lower 12 bits would be reserved 0",
so it might make more sense to repurpose the lower 12 bits
of the dma address to store the pagecnt_bias for frag page
support in page pool.

As newly added page_pool_get_pagecnt_bias() may be called
outside of the softirq context, so annotate the access to
page->dma_addr[0] with READ_ONCE() and WRITE_ONCE().

And page_pool_get_pagecnt_bias_ptr() is added to implement
the pagecnt_bias atomic updating when a page is passsed to
the user.

Other three interfaces using page->dma_addr[0] is only called
in the softirq context during normal rx processing, hopefully
the barrier in the rx processing will ensure the correct order
between getting and setting pagecnt_bias.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/page_pool.h | 29 +++++++++++++++++++++++++++--
 net/core/page_pool.c    |  8 +++++++-
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 8d7744d..84cd972 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -200,17 +200,42 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
 
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-	dma_addr_t ret = page->dma_addr[0];
+	dma_addr_t ret = READ_ONCE(page->dma_addr[0]) & PAGE_MASK;
 	if (sizeof(dma_addr_t) > sizeof(unsigned long))
 		ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;
 	return ret;
 }
 
-static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
+static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 {
+	if (WARN_ON(addr & ~PAGE_MASK))
+		return false;
+
 	page->dma_addr[0] = addr;
 	if (sizeof(dma_addr_t) > sizeof(unsigned long))
 		page->dma_addr[1] = upper_32_bits(addr);
+
+	return true;
+}
+
+static inline int page_pool_get_pagecnt_bias(struct page *page)
+{
+	return READ_ONCE(page->dma_addr[0]) & ~PAGE_MASK;
+}
+
+static inline unsigned long *page_pool_pagecnt_bias_ptr(struct page *page)
+{
+	return page->dma_addr;
+}
+
+static inline void page_pool_set_pagecnt_bias(struct page *page, int bias)
+{
+	unsigned long dma_addr_0 = READ_ONCE(page->dma_addr[0]);
+
+	dma_addr_0 &= PAGE_MASK;
+	dma_addr_0 |= bias;
+
+	WRITE_ONCE(page->dma_addr[0], dma_addr_0);
 }
 
 static inline bool is_page_pool_compiled_in(void)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 78838c6..1abefc6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -198,7 +198,13 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 	if (dma_mapping_error(pool->p.dev, dma))
 		return false;
 
-	page_pool_set_dma_addr(page, dma);
+	if (unlikely(!page_pool_set_dma_addr(page, dma))) {
+		dma_unmap_page_attrs(pool->p.dev, dma,
+				     PAGE_SIZE << pool->p.order,
+				     pool->p.dma_dir,
+				     DMA_ATTR_SKIP_CPU_SYNC);
+		return false;
+	}
 
 	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
 		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH rfc v3 3/4] page_pool: add frag page recycling support in page pool
  2021-07-12 12:16 [PATCH rfc v3 0/4] add frag page support in page pool Yunsheng Lin
  2021-07-12 12:16 ` [PATCH rfc v3 1/4] page_pool: keep pp info as long as page pool owns the page Yunsheng Lin
  2021-07-12 12:16 ` [PATCH rfc v3 2/4] page_pool: add interface for getting and setting pagecnt_bias Yunsheng Lin
@ 2021-07-12 12:16 ` Yunsheng Lin
  2021-07-12 19:46   ` Alexander Duyck
  2021-07-12 12:16 ` [PATCH rfc v3 4/4] net: hns3: support skb's frag page recycling based on " Yunsheng Lin
  3 siblings, 1 reply; 12+ messages in thread
From: Yunsheng Lin @ 2021-07-12 12:16 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf

Currently page pool only support page recycling only when
there is only one user of the page, and the split page
reusing implemented in the most driver can not use the
page pool as bing-pong way of reusing requires the multi
user support in page pool.

Those reusing or recycling has below limitations:
1. page from page pool can only be used be one user in order
   for the page recycling to happen.
2. Bing-pong way of reusing in most driver does not support
   multi desc using different part of the same page in order
   to save memory.

So add multi-users support and frag page recycling in page pool
to overcome the above limitation.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/page_pool.h |  22 ++++++++-
 net/core/page_pool.c    | 126 +++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 134 insertions(+), 14 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 84cd972..d9a736f 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -45,7 +45,10 @@
 					* Please note DMA-sync-for-CPU is still
 					* device driver responsibility
 					*/
-#define PP_FLAG_ALL		(PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV)
+#define PP_FLAG_PAGE_FRAG	BIT(2)	/* for page frag feature */
+#define PP_FLAG_ALL		(PP_FLAG_DMA_MAP |\
+				 PP_FLAG_DMA_SYNC_DEV |\
+				 PP_FLAG_PAGE_FRAG)
 
 /*
  * Fast allocation side cache array/stack
@@ -88,6 +91,9 @@ struct page_pool {
 	unsigned long defer_warn;
 
 	u32 pages_state_hold_cnt;
+	unsigned int frag_offset;
+	int frag_bias;
+	struct page *frag_page;
 
 	/*
 	 * Data structure for allocation side
@@ -137,6 +143,20 @@ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
 	return page_pool_alloc_pages(pool, gfp);
 }
 
+struct page *page_pool_alloc_frag(struct page_pool *pool,
+				  unsigned int *offset,
+				  unsigned int size,
+				  gfp_t gfp);
+
+static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool,
+						    unsigned int *offset,
+						    unsigned int size)
+{
+	gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
+
+	return page_pool_alloc_frag(pool, offset, size, gfp);
+}
+
 /* get the stored dma direction. A driver might decide to treat this locally and
  * avoid the extra cache line from page_pool to determine the direction
  */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 1abefc6..39d5156 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -24,6 +24,8 @@
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
+#define BIAS_MAX	(PAGE_SIZE - 1)
+
 static int page_pool_init(struct page_pool *pool,
 			  const struct page_pool_params *params)
 {
@@ -304,6 +306,33 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	return page;
 }
 
+/* nr could be negative */
+static int page_pool_atomic_add_bias(struct page *page, int nr)
+{
+	unsigned long *bias_ptr = page_pool_pagecnt_bias_ptr(page);
+	unsigned long old_bias = READ_ONCE(*bias_ptr);
+	unsigned long new_bias;
+
+	do {
+		int bias = (int)(old_bias & ~PAGE_MASK);
+
+		/* Warn when page_pool_dev_alloc_pages() is called
+		 * with PP_FLAG_PAGE_FRAG flag in driver.
+		 */
+		WARN_ON(!bias);
+
+		/* already the last user */
+		if (!(bias + nr))
+			return 0;
+
+		new_bias = old_bias + nr;
+	} while (!try_cmpxchg(bias_ptr, &old_bias, new_bias));
+
+	WARN_ON((new_bias & PAGE_MASK) != (old_bias & PAGE_MASK));
+
+	return new_bias & ~PAGE_MASK;
+}
+
 /* For using page_pool replace: alloc_pages() API calls, but provide
  * synchronization guarantee for allocation side.
  */
@@ -425,6 +454,11 @@ static __always_inline struct page *
 __page_pool_put_page(struct page_pool *pool, struct page *page,
 		     unsigned int dma_sync_size, bool allow_direct)
 {
+	/* It is not the last user for the page frag case */
+	if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
+	    page_pool_atomic_add_bias(page, -1))
+		return NULL;
+
 	/* This allocator is optimized for the XDP mode that uses
 	 * one-frame-per-page, but have fallbacks that act like the
 	 * regular page allocator APIs.
@@ -448,19 +482,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 		/* Page found as candidate for recycling */
 		return page;
 	}
-	/* Fallback/non-XDP mode: API user have elevated refcnt.
-	 *
-	 * Many drivers split up the page into fragments, and some
-	 * want to keep doing this to save memory and do refcnt based
-	 * recycling. Support this use case too, to ease drivers
-	 * switching between XDP/non-XDP.
-	 *
-	 * In-case page_pool maintains the DMA mapping, API user must
-	 * call page_pool_put_page once.  In this elevated refcnt
-	 * case, the DMA is unmapped/released, as driver is likely
-	 * doing refcnt based recycle tricks, meaning another process
-	 * will be invoking put_page.
-	 */
+
 	/* Do not replace this with page_pool_return_page() */
 	page_pool_release_page(pool, page);
 	put_page(page);
@@ -517,6 +539,82 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 }
 EXPORT_SYMBOL(page_pool_put_page_bulk);
 
+/* When BIAS_RESERVE to avoid frag page being recycled back to
+ * page pool while the frag page is still in pool->frag_page
+ * waiting for more user. As minimum align size for DMA seems to
+ * be 32, so we support max size of 2047 * 32 for 4K page size.
+ */
+#define BIAS_RESERVE		((int)(BIAS_MAX / 2 + 1))
+#define BIAS_NEGATIVE_RESERVE	(0 - BIAS_RESERVE)
+
+static struct page *page_pool_drain_frag(struct page_pool *pool,
+					 struct page *page)
+{
+	/* page pool is not the last user */
+	if (page_pool_atomic_add_bias(page, pool->frag_bias +
+				      BIAS_NEGATIVE_RESERVE))
+		return NULL;
+	else
+		return page;
+}
+
+static void page_pool_free_frag(struct page_pool *pool)
+{
+	struct page *page = pool->frag_page;
+
+	if (!page ||
+	    page_pool_atomic_add_bias(page, pool->frag_bias +
+				      BIAS_NEGATIVE_RESERVE))
+		return;
+
+	page_pool_return_page(pool, page);
+	pool->frag_page = NULL;
+}
+
+struct page *page_pool_alloc_frag(struct page_pool *pool,
+				  unsigned int *offset,
+				  unsigned int size,
+				  gfp_t gfp)
+{
+	unsigned int max_size = PAGE_SIZE << pool->p.order;
+	unsigned int frag_offset = pool->frag_offset;
+	struct page *frag_page = pool->frag_page;
+
+	if (WARN_ON(!(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
+		    size > max_size))
+		return NULL;
+
+	size = ALIGN(size, dma_get_cache_alignment());
+
+	if (frag_page && frag_offset + size > max_size) {
+		frag_page = page_pool_drain_frag(pool, frag_page);
+		if (frag_page)
+			goto frag_reset;
+	}
+
+	if (!frag_page) {
+		frag_page = page_pool_alloc_pages(pool, gfp);
+		if (unlikely(!frag_page)) {
+			pool->frag_page = NULL;
+			return NULL;
+		}
+
+		pool->frag_page = frag_page;
+
+frag_reset:
+		pool->frag_bias = 0;
+		frag_offset = 0;
+		page_pool_set_pagecnt_bias(frag_page, BIAS_RESERVE);
+	}
+
+	pool->frag_bias++;
+	*offset = frag_offset;
+	pool->frag_offset = frag_offset + size;
+
+	return frag_page;
+}
+EXPORT_SYMBOL(page_pool_alloc_frag);
+
 static void page_pool_empty_ring(struct page_pool *pool)
 {
 	struct page *page;
@@ -622,6 +720,8 @@ void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_put(pool))
 		return;
 
+	page_pool_free_frag(pool);
+
 	if (!page_pool_release(pool))
 		return;
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH rfc v3 4/4] net: hns3: support skb's frag page recycling based on page pool
  2021-07-12 12:16 [PATCH rfc v3 0/4] add frag page support in page pool Yunsheng Lin
                   ` (2 preceding siblings ...)
  2021-07-12 12:16 ` [PATCH rfc v3 3/4] page_pool: add frag page recycling support in page pool Yunsheng Lin
@ 2021-07-12 12:16 ` Yunsheng Lin
  3 siblings, 0 replies; 12+ messages in thread
From: Yunsheng Lin @ 2021-07-12 12:16 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf

This patch adds skb's frag page recycling support based on
the elevated refcnt support in page pool.

The performance improves above 10~20% with IOMMU disabled.
The performance improves about 200% when IOMMU is enabled
and iperf server shares the same cpu with irq/NAPI.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 79 +++++++++++++++++++++++--
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h |  3 +
 2 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index cdb5f14..c799129 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -3205,6 +3205,21 @@ static int hns3_alloc_buffer(struct hns3_enet_ring *ring,
 	unsigned int order = hns3_page_order(ring);
 	struct page *p;
 
+	if (ring->page_pool) {
+		p = page_pool_dev_alloc_frag(ring->page_pool,
+					     &cb->page_offset,
+					     hns3_buf_size(ring));
+		if (unlikely(!p))
+			return -ENOMEM;
+
+		cb->priv = p;
+		cb->buf = page_address(p);
+		cb->dma = page_pool_get_dma_addr(p);
+		cb->type = DESC_TYPE_FRAG;
+		cb->reuse_flag = 0;
+		return 0;
+	}
+
 	p = dev_alloc_pages(order);
 	if (!p)
 		return -ENOMEM;
@@ -3227,8 +3242,12 @@ static void hns3_free_buffer(struct hns3_enet_ring *ring,
 	if (cb->type & (DESC_TYPE_SKB | DESC_TYPE_BOUNCE_HEAD |
 			DESC_TYPE_BOUNCE_ALL | DESC_TYPE_SGL_SKB))
 		napi_consume_skb(cb->priv, budget);
-	else if (!HNAE3_IS_TX_RING(ring) && cb->pagecnt_bias)
-		__page_frag_cache_drain(cb->priv, cb->pagecnt_bias);
+	else if (!HNAE3_IS_TX_RING(ring)) {
+		if (cb->type & DESC_TYPE_PAGE && cb->pagecnt_bias)
+			__page_frag_cache_drain(cb->priv, cb->pagecnt_bias);
+		else if (cb->type & DESC_TYPE_FRAG)
+			page_pool_put_full_page(ring->page_pool, cb->priv, false);
+	}
 	memset(cb, 0, sizeof(*cb));
 }
 
@@ -3315,7 +3334,7 @@ static int hns3_alloc_and_map_buffer(struct hns3_enet_ring *ring,
 	int ret;
 
 	ret = hns3_alloc_buffer(ring, cb);
-	if (ret)
+	if (ret || ring->page_pool)
 		goto out;
 
 	ret = hns3_map_buffer(ring, cb);
@@ -3337,7 +3356,8 @@ static int hns3_alloc_and_attach_buffer(struct hns3_enet_ring *ring, int i)
 	if (ret)
 		return ret;
 
-	ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma);
+	ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma +
+					 ring->desc_cb[i].page_offset);
 
 	return 0;
 }
@@ -3367,7 +3387,8 @@ static void hns3_replace_buffer(struct hns3_enet_ring *ring, int i,
 {
 	hns3_unmap_buffer(ring, &ring->desc_cb[i]);
 	ring->desc_cb[i] = *res_cb;
-	ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma);
+	ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma +
+					 ring->desc_cb[i].page_offset);
 	ring->desc[i].rx.bd_base_info = 0;
 }
 
@@ -3539,6 +3560,12 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i,
 	u32 frag_size = size - pull_len;
 	bool reused;
 
+	if (ring->page_pool) {
+		skb_add_rx_frag(skb, i, desc_cb->priv, frag_offset,
+				frag_size, truesize);
+		return;
+	}
+
 	/* Avoid re-using remote or pfmem page */
 	if (unlikely(!dev_page_is_reusable(desc_cb->priv)))
 		goto out;
@@ -3856,6 +3883,9 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length,
 		/* We can reuse buffer as-is, just make sure it is reusable */
 		if (dev_page_is_reusable(desc_cb->priv))
 			desc_cb->reuse_flag = 1;
+		else if (desc_cb->type & DESC_TYPE_FRAG)
+			page_pool_put_full_page(ring->page_pool, desc_cb->priv,
+						false);
 		else /* This page cannot be reused so discard it */
 			__page_frag_cache_drain(desc_cb->priv,
 						desc_cb->pagecnt_bias);
@@ -3863,6 +3893,10 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length,
 		hns3_rx_ring_move_fw(ring);
 		return 0;
 	}
+
+	if (ring->page_pool)
+		skb_mark_for_recycle(skb);
+
 	u64_stats_update_begin(&ring->syncp);
 	ring->stats.seg_pkt_cnt++;
 	u64_stats_update_end(&ring->syncp);
@@ -3901,6 +3935,10 @@ static int hns3_add_frag(struct hns3_enet_ring *ring)
 					    "alloc rx fraglist skb fail\n");
 				return -ENXIO;
 			}
+
+			if (ring->page_pool)
+				skb_mark_for_recycle(new_skb);
+
 			ring->frag_num = 0;
 
 			if (ring->tail_skb) {
@@ -4705,6 +4743,29 @@ static void hns3_put_ring_config(struct hns3_nic_priv *priv)
 	priv->ring = NULL;
 }
 
+static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
+{
+	struct page_pool_params pp_params = {
+		.flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG,
+		.order = hns3_page_order(ring),
+		.pool_size = ring->desc_num * hns3_buf_size(ring) / PAGE_SIZE,
+		.nid = dev_to_node(ring_to_dev(ring)),
+		.dev = ring_to_dev(ring),
+		.dma_dir = DMA_FROM_DEVICE,
+		.offset = 0,
+		.max_len = 0,
+	};
+
+	ring->page_pool = page_pool_create(&pp_params);
+	if (IS_ERR(ring->page_pool)) {
+		dev_warn(ring_to_dev(ring), "page pool creation failed: %ld\n",
+			 PTR_ERR(ring->page_pool));
+		ring->page_pool = NULL;
+	} else {
+		dev_info(ring_to_dev(ring), "page pool creation succeeded\n");
+	}
+}
+
 static int hns3_alloc_ring_memory(struct hns3_enet_ring *ring)
 {
 	int ret;
@@ -4724,6 +4785,8 @@ static int hns3_alloc_ring_memory(struct hns3_enet_ring *ring)
 		goto out_with_desc_cb;
 
 	if (!HNAE3_IS_TX_RING(ring)) {
+		hns3_alloc_page_pool(ring);
+
 		ret = hns3_alloc_ring_buffers(ring);
 		if (ret)
 			goto out_with_desc;
@@ -4764,6 +4827,12 @@ void hns3_fini_ring(struct hns3_enet_ring *ring)
 		devm_kfree(ring_to_dev(ring), tx_spare);
 		ring->tx_spare = NULL;
 	}
+
+	if (!HNAE3_IS_TX_RING(ring) && ring->page_pool) {
+		page_pool_destroy(ring->page_pool);
+		ring->page_pool = NULL;
+		dev_info(ring_to_dev(ring), "page pool destroyed\n");
+	}
 }
 
 static int hns3_buf_size2type(u32 buf_size)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index 15af3d9..115c0ce 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -6,6 +6,7 @@
 
 #include <linux/dim.h>
 #include <linux/if_vlan.h>
+#include <net/page_pool.h>
 
 #include "hnae3.h"
 
@@ -307,6 +308,7 @@ enum hns3_desc_type {
 	DESC_TYPE_BOUNCE_ALL		= 1 << 3,
 	DESC_TYPE_BOUNCE_HEAD		= 1 << 4,
 	DESC_TYPE_SGL_SKB		= 1 << 5,
+	DESC_TYPE_FRAG			= 1 << 6,
 };
 
 struct hns3_desc_cb {
@@ -451,6 +453,7 @@ struct hns3_enet_ring {
 	struct hnae3_queue *tqp;
 	int queue_index;
 	struct device *dev; /* will be used for DMA mapping of descriptors */
+	struct page_pool *page_pool;
 
 	/* statistic */
 	struct ring_stats stats;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH rfc v3 2/4] page_pool: add interface for getting and setting pagecnt_bias
  2021-07-12 12:16 ` [PATCH rfc v3 2/4] page_pool: add interface for getting and setting pagecnt_bias Yunsheng Lin
@ 2021-07-12 16:02   ` Alexander Duyck
  2021-07-12 17:20     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Duyck @ 2021-07-12 16:02 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Russell King - ARM Linux,
	Marcin Wojtas, linuxarm, yisen.zhuang, Salil Mehta,
	thomas.petazzoni, hawk, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf

On Mon, Jul 12, 2021 at 5:17 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> As suggested by Alexander, "A DMA mapping should be page
> aligned anyway so the lower 12 bits would be reserved 0",
> so it might make more sense to repurpose the lower 12 bits
> of the dma address to store the pagecnt_bias for frag page
> support in page pool.
>
> As newly added page_pool_get_pagecnt_bias() may be called
> outside of the softirq context, so annotate the access to
> page->dma_addr[0] with READ_ONCE() and WRITE_ONCE().
>
> And page_pool_get_pagecnt_bias_ptr() is added to implement
> the pagecnt_bias atomic updating when a page is passsed to
> the user.
>
> Other three interfaces using page->dma_addr[0] is only called
> in the softirq context during normal rx processing, hopefully
> the barrier in the rx processing will ensure the correct order
> between getting and setting pagecnt_bias.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  include/net/page_pool.h | 29 +++++++++++++++++++++++++++--
>  net/core/page_pool.c    |  8 +++++++-
>  2 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 8d7744d..84cd972 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -200,17 +200,42 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
>
>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  {
> -       dma_addr_t ret = page->dma_addr[0];
> +       dma_addr_t ret = READ_ONCE(page->dma_addr[0]) & PAGE_MASK;
>         if (sizeof(dma_addr_t) > sizeof(unsigned long))
>                 ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;
>         return ret;
>  }
>
> -static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> +static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>  {
> +       if (WARN_ON(addr & ~PAGE_MASK))
> +               return false;
> +
>         page->dma_addr[0] = addr;
>         if (sizeof(dma_addr_t) > sizeof(unsigned long))
>                 page->dma_addr[1] = upper_32_bits(addr);
> +
> +       return true;
> +}
> +

Rather than making this a part of the check here it might make more
sense to pull this out and perform the WARN_ON after the check for
dma_mapping_error.

Also it occurs to me that we only really have to do this in the case
where dma_addr_t is larger than the size of a long. Otherwise we could
just have the code split things so that dma_addr[0] is the dma_addr
and dma_addr[1] is our pagecnt_bias value in which case we could
probably just skip the check.

> +static inline int page_pool_get_pagecnt_bias(struct page *page)
> +{
> +       return READ_ONCE(page->dma_addr[0]) & ~PAGE_MASK;
> +}
> +
> +static inline unsigned long *page_pool_pagecnt_bias_ptr(struct page *page)
> +{
> +       return page->dma_addr;
> +}
> +
> +static inline void page_pool_set_pagecnt_bias(struct page *page, int bias)
> +{
> +       unsigned long dma_addr_0 = READ_ONCE(page->dma_addr[0]);
> +
> +       dma_addr_0 &= PAGE_MASK;
> +       dma_addr_0 |= bias;
> +
> +       WRITE_ONCE(page->dma_addr[0], dma_addr_0);
>  }
>
>  static inline bool is_page_pool_compiled_in(void)
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 78838c6..1abefc6 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -198,7 +198,13 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>         if (dma_mapping_error(pool->p.dev, dma))
>                 return false;
>

So instead of adding to the function below you could just add your
WARN_ON check here with the unmapping call.

> -       page_pool_set_dma_addr(page, dma);
> +       if (unlikely(!page_pool_set_dma_addr(page, dma))) {
> +               dma_unmap_page_attrs(pool->p.dev, dma,
> +                                    PAGE_SIZE << pool->p.order,
> +                                    pool->p.dma_dir,
> +                                    DMA_ATTR_SKIP_CPU_SYNC);
> +               return false;
> +       }
>
>         if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>                 page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH rfc v3 2/4] page_pool: add interface for getting and setting pagecnt_bias
  2021-07-12 16:02   ` Alexander Duyck
@ 2021-07-12 17:20     ` Jesper Dangaard Brouer
  2021-07-12 18:08       ` Ilias Apalodimas
  0 siblings, 1 reply; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2021-07-12 17:20 UTC (permalink / raw)
  To: Alexander Duyck, Yunsheng Lin
  Cc: brouer, David Miller, Jakub Kicinski, Russell King - ARM Linux,
	Marcin Wojtas, linuxarm, yisen.zhuang, Salil Mehta,
	thomas.petazzoni, hawk, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf



On 12/07/2021 18.02, Alexander Duyck wrote:
> On Mon, Jul 12, 2021 at 5:17 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> As suggested by Alexander, "A DMA mapping should be page
>> aligned anyway so the lower 12 bits would be reserved 0",
>> so it might make more sense to repurpose the lower 12 bits
>> of the dma address to store the pagecnt_bias for frag page
>> support in page pool.
>>
>> As newly added page_pool_get_pagecnt_bias() may be called
>> outside of the softirq context, so annotate the access to
>> page->dma_addr[0] with READ_ONCE() and WRITE_ONCE().
>>
>> And page_pool_get_pagecnt_bias_ptr() is added to implement
>> the pagecnt_bias atomic updating when a page is passsed to
>> the user.
>>
>> Other three interfaces using page->dma_addr[0] is only called
>> in the softirq context during normal rx processing, hopefully
>> the barrier in the rx processing will ensure the correct order
>> between getting and setting pagecnt_bias.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>>   include/net/page_pool.h | 29 +++++++++++++++++++++++++++--
>>   net/core/page_pool.c    |  8 +++++++-
>>   2 files changed, 34 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>> index 8d7744d..84cd972 100644
>> --- a/include/net/page_pool.h
>> +++ b/include/net/page_pool.h
>> @@ -200,17 +200,42 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
>>
>>   static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>   {
>> -       dma_addr_t ret = page->dma_addr[0];
>> +       dma_addr_t ret = READ_ONCE(page->dma_addr[0]) & PAGE_MASK;
>>          if (sizeof(dma_addr_t) > sizeof(unsigned long))
>>                  ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;
>>          return ret;
>>   }
>>
>> -static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>> +static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>>   {
>> +       if (WARN_ON(addr & ~PAGE_MASK))
>> +               return false;
>> +
>>          page->dma_addr[0] = addr;
>>          if (sizeof(dma_addr_t) > sizeof(unsigned long))
>>                  page->dma_addr[1] = upper_32_bits(addr);
>> +
>> +       return true;
>> +}
>> +
> 
> Rather than making this a part of the check here it might make more
> sense to pull this out and perform the WARN_ON after the check for
> dma_mapping_error.

I need to point out that I don't like WARN_ON and BUG_ON code in 
fast-path code, because compiler adds 'ud2' assembler instructions that 
influences the instruction-cache fetching in the CPU.  Yes, I have seen 
a measuresable impact from this before.


> Also it occurs to me that we only really have to do this in the case
> where dma_addr_t is larger than the size of a long. Otherwise we could
> just have the code split things so that dma_addr[0] is the dma_addr
> and dma_addr[1] is our pagecnt_bias value in which case we could
> probably just skip the check.

The dance to get 64-bit DMA addr on 32-bit systems is rather ugly and 
confusing, sadly.  We could take advantage of this, I just hope this 
will not make it uglier.


>> +static inline int page_pool_get_pagecnt_bias(struct page *page)
>> +{
>> +       return READ_ONCE(page->dma_addr[0]) & ~PAGE_MASK;
>> +}
>> +
>> +static inline unsigned long *page_pool_pagecnt_bias_ptr(struct page *page)
>> +{
>> +       return page->dma_addr;
>> +}
>> +
>> +static inline void page_pool_set_pagecnt_bias(struct page *page, int bias)
>> +{
>> +       unsigned long dma_addr_0 = READ_ONCE(page->dma_addr[0]);
>> +
>> +       dma_addr_0 &= PAGE_MASK;
>> +       dma_addr_0 |= bias;
>> +
>> +       WRITE_ONCE(page->dma_addr[0], dma_addr_0);
>>   }
>>
>>   static inline bool is_page_pool_compiled_in(void)
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index 78838c6..1abefc6 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -198,7 +198,13 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>          if (dma_mapping_error(pool->p.dev, dma))
>>                  return false;
>>
> 
> So instead of adding to the function below you could just add your
> WARN_ON check here with the unmapping call.
> 
>> -       page_pool_set_dma_addr(page, dma);
>> +       if (unlikely(!page_pool_set_dma_addr(page, dma))) {
>> +               dma_unmap_page_attrs(pool->p.dev, dma,
>> +                                    PAGE_SIZE << pool->p.order,
>> +                                    pool->p.dma_dir,
>> +                                    DMA_ATTR_SKIP_CPU_SYNC);
>> +               return false;
>> +       }
>>
>>          if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>                  page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>> --
>> 2.7.4
>>
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH rfc v3 2/4] page_pool: add interface for getting and setting pagecnt_bias
  2021-07-12 17:20     ` Jesper Dangaard Brouer
@ 2021-07-12 18:08       ` Ilias Apalodimas
  2021-07-13  6:38         ` Yunsheng Lin
  0 siblings, 1 reply; 12+ messages in thread
From: Ilias Apalodimas @ 2021-07-12 18:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, Yunsheng Lin, brouer, David Miller,
	Jakub Kicinski, Russell King - ARM Linux, Marcin Wojtas,
	linuxarm, yisen.zhuang, Salil Mehta, thomas.petazzoni, hawk,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, fenghua.yu, guro, Peter Xu, Feng Tang,
	Jason Gunthorpe, Matteo Croce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	nogikh, Marco Elver, Yonghong Song, kpsingh, andrii,
	Martin KaFai Lau, songliubraving, Netdev, LKML, bpf

[...]
> > > +static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> > >   {
> > > +       if (WARN_ON(addr & ~PAGE_MASK))
> > > +               return false;
> > > +
> > >          page->dma_addr[0] = addr;
> > >          if (sizeof(dma_addr_t) > sizeof(unsigned long))
> > >                  page->dma_addr[1] = upper_32_bits(addr);
> > > +
> > > +       return true;
> > > +}
> > > +
> > 
> > Rather than making this a part of the check here it might make more
> > sense to pull this out and perform the WARN_ON after the check for
> > dma_mapping_error.
> 
> I need to point out that I don't like WARN_ON and BUG_ON code in fast-path
> code, because compiler adds 'ud2' assembler instructions that influences the
> instruction-cache fetching in the CPU.  Yes, I have seen a measuresable
> impact from this before.
> 
> 
> > Also it occurs to me that we only really have to do this in the case
> > where dma_addr_t is larger than the size of a long. Otherwise we could
> > just have the code split things so that dma_addr[0] is the dma_addr
> > and dma_addr[1] is our pagecnt_bias value in which case we could
> > probably just skip the check.
> 
> The dance to get 64-bit DMA addr on 32-bit systems is rather ugly and
> confusing, sadly.  We could take advantage of this, I just hope this will
> not make it uglier.

Note here that we can only use this because dma_addr is not aliased to
compound page anymore (after the initial page_pool recycling patchset). 
We must keep this in mind if we even restructure struct page.

Can we do something more radical for this? The 64/32 bit dance is only
there for 32 bit systems with 64 bit dma.  Since the last time we asked
about this no one seemed to care about these, and I really doubt we'll get
an ethernet driver for them (that needs recycling....), can we *only* support 
frag allocation and recycling for 'normal' systems? We could always just r
e-purpose dma_addr[1] for those.

Regards
/Ilias

> 
> 
> > > +static inline int page_pool_get_pagecnt_bias(struct page *page)
> > > +{
> > > +       return READ_ONCE(page->dma_addr[0]) & ~PAGE_MASK;
> > > +}
> > > +
> > > +static inline unsigned long *page_pool_pagecnt_bias_ptr(struct page *page)
> > > +{
> > > +       return page->dma_addr;
> > > +}
> > > +
> > > +static inline void page_pool_set_pagecnt_bias(struct page *page, int bias)
> > > +{
> > > +       unsigned long dma_addr_0 = READ_ONCE(page->dma_addr[0]);
> > > +
> > > +       dma_addr_0 &= PAGE_MASK;
> > > +       dma_addr_0 |= bias;
> > > +
> > > +       WRITE_ONCE(page->dma_addr[0], dma_addr_0);
> > >   }
> > > 
> > >   static inline bool is_page_pool_compiled_in(void)
> > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > index 78838c6..1abefc6 100644
> > > --- a/net/core/page_pool.c
> > > +++ b/net/core/page_pool.c
> > > @@ -198,7 +198,13 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> > >          if (dma_mapping_error(pool->p.dev, dma))
> > >                  return false;
> > > 
> > 
> > So instead of adding to the function below you could just add your
> > WARN_ON check here with the unmapping call.
> > 
> > > -       page_pool_set_dma_addr(page, dma);
> > > +       if (unlikely(!page_pool_set_dma_addr(page, dma))) {
> > > +               dma_unmap_page_attrs(pool->p.dev, dma,
> > > +                                    PAGE_SIZE << pool->p.order,
> > > +                                    pool->p.dma_dir,
> > > +                                    DMA_ATTR_SKIP_CPU_SYNC);
> > > +               return false;
> > > +       }
> > > 
> > >          if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> > >                  page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> > > --
> > > 2.7.4
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH rfc v3 3/4] page_pool: add frag page recycling support in page pool
  2021-07-12 12:16 ` [PATCH rfc v3 3/4] page_pool: add frag page recycling support in page pool Yunsheng Lin
@ 2021-07-12 19:46   ` Alexander Duyck
  2021-07-13  7:20     ` Yunsheng Lin
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Duyck @ 2021-07-12 19:46 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Russell King - ARM Linux,
	Marcin Wojtas, linuxarm, yisen.zhuang, Salil Mehta,
	thomas.petazzoni, hawk, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf

On Mon, Jul 12, 2021 at 5:17 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> Currently page pool only support page recycling only when
> there is only one user of the page, and the split page
> reusing implemented in the most driver can not use the
> page pool as bing-pong way of reusing requires the multi
> user support in page pool.
>
> Those reusing or recycling has below limitations:
> 1. page from page pool can only be used be one user in order
>    for the page recycling to happen.
> 2. Bing-pong way of reusing in most driver does not support
>    multi desc using different part of the same page in order
>    to save memory.
>
> So add multi-users support and frag page recycling in page pool
> to overcome the above limitation.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  include/net/page_pool.h |  22 ++++++++-
>  net/core/page_pool.c    | 126 +++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 134 insertions(+), 14 deletions(-)
>
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 84cd972..d9a736f 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -45,7 +45,10 @@
>                                         * Please note DMA-sync-for-CPU is still
>                                         * device driver responsibility
>                                         */
> -#define PP_FLAG_ALL            (PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV)
> +#define PP_FLAG_PAGE_FRAG      BIT(2)  /* for page frag feature */
> +#define PP_FLAG_ALL            (PP_FLAG_DMA_MAP |\
> +                                PP_FLAG_DMA_SYNC_DEV |\
> +                                PP_FLAG_PAGE_FRAG)
>
>  /*
>   * Fast allocation side cache array/stack
> @@ -88,6 +91,9 @@ struct page_pool {
>         unsigned long defer_warn;
>
>         u32 pages_state_hold_cnt;
> +       unsigned int frag_offset;
> +       int frag_bias;
> +       struct page *frag_page;
>
>         /*
>          * Data structure for allocation side
> @@ -137,6 +143,20 @@ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
>         return page_pool_alloc_pages(pool, gfp);
>  }
>
> +struct page *page_pool_alloc_frag(struct page_pool *pool,
> +                                 unsigned int *offset,
> +                                 unsigned int size,
> +                                 gfp_t gfp);
> +
> +static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool,
> +                                                   unsigned int *offset,
> +                                                   unsigned int size)
> +{
> +       gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
> +
> +       return page_pool_alloc_frag(pool, offset, size, gfp);
> +}
> +
>  /* get the stored dma direction. A driver might decide to treat this locally and
>   * avoid the extra cache line from page_pool to determine the direction
>   */
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 1abefc6..39d5156 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -24,6 +24,8 @@
>  #define DEFER_TIME (msecs_to_jiffies(1000))
>  #define DEFER_WARN_INTERVAL (60 * HZ)
>
> +#define BIAS_MAX       (PAGE_SIZE - 1)
> +
>  static int page_pool_init(struct page_pool *pool,
>                           const struct page_pool_params *params)
>  {
> @@ -304,6 +306,33 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>         return page;
>  }
>
> +/* nr could be negative */
> +static int page_pool_atomic_add_bias(struct page *page, int nr)
> +{
> +       unsigned long *bias_ptr = page_pool_pagecnt_bias_ptr(page);
> +       unsigned long old_bias = READ_ONCE(*bias_ptr);
> +       unsigned long new_bias;
> +
> +       do {
> +               int bias = (int)(old_bias & ~PAGE_MASK);
> +
> +               /* Warn when page_pool_dev_alloc_pages() is called
> +                * with PP_FLAG_PAGE_FRAG flag in driver.
> +                */
> +               WARN_ON(!bias);
> +
> +               /* already the last user */
> +               if (!(bias + nr))
> +                       return 0;
> +
> +               new_bias = old_bias + nr;
> +       } while (!try_cmpxchg(bias_ptr, &old_bias, new_bias));
> +
> +       WARN_ON((new_bias & PAGE_MASK) != (old_bias & PAGE_MASK));
> +
> +       return new_bias & ~PAGE_MASK;
> +}
> +

So instead of having a function to add bias it might make more sense
to have a function that does a subtract and test since that is what we
are really doing. So in a way you could think of it as a countdown
until unmap or recycle.

For the frags case we could probably default to one of two values. For
the variable sized frags we default to a pagecnt_bias of PAGE_SIZE - 1
and then eventually do the subtraction at the end to free up unused
count when the page has been fully used. For fixed sized frags you
could theoretically just store off the PAGE_SIZE / frag_size - 1 and
use that to initialize the pagecnt_bias of the pages as you use them.

Also one thing we may want to do is look at renaming this since we
aren't messing with the page_ref_count anymore. It might make more
sense to refer to this as something such as pp_frag_refcount. Also as
per the other patch if we can get away from having to share the same
space as DMA that would be even better.

>  /* For using page_pool replace: alloc_pages() API calls, but provide
>   * synchronization guarantee for allocation side.
>   */
> @@ -425,6 +454,11 @@ static __always_inline struct page *
>  __page_pool_put_page(struct page_pool *pool, struct page *page,
>                      unsigned int dma_sync_size, bool allow_direct)
>  {
> +       /* It is not the last user for the page frag case */
> +       if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
> +           page_pool_atomic_add_bias(page, -1))
> +               return NULL;
> +
>         /* This allocator is optimized for the XDP mode that uses
>          * one-frame-per-page, but have fallbacks that act like the
>          * regular page allocator APIs.
> @@ -448,19 +482,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>                 /* Page found as candidate for recycling */
>                 return page;
>         }
> -       /* Fallback/non-XDP mode: API user have elevated refcnt.
> -        *
> -        * Many drivers split up the page into fragments, and some
> -        * want to keep doing this to save memory and do refcnt based
> -        * recycling. Support this use case too, to ease drivers
> -        * switching between XDP/non-XDP.
> -        *
> -        * In-case page_pool maintains the DMA mapping, API user must
> -        * call page_pool_put_page once.  In this elevated refcnt
> -        * case, the DMA is unmapped/released, as driver is likely
> -        * doing refcnt based recycle tricks, meaning another process
> -        * will be invoking put_page.
> -        */
> +
>         /* Do not replace this with page_pool_return_page() */
>         page_pool_release_page(pool, page);
>         put_page(page);
> @@ -517,6 +539,82 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
>  }
>  EXPORT_SYMBOL(page_pool_put_page_bulk);
>
> +/* When BIAS_RESERVE to avoid frag page being recycled back to
> + * page pool while the frag page is still in pool->frag_page
> + * waiting for more user. As minimum align size for DMA seems to
> + * be 32, so we support max size of 2047 * 32 for 4K page size.
> + */
> +#define BIAS_RESERVE           ((int)(BIAS_MAX / 2 + 1))
> +#define BIAS_NEGATIVE_RESERVE  (0 - BIAS_RESERVE)

The explanation and these values don't make sense to me. BIAS_RESERVE
would be 4095 / 2 + 1 which should come out to 2048, and
BIAS_NEGATIVE_REVERSE is -2048. Should these values have been 127/-127
which would have been 4096 / 32 - 1?

> +static struct page *page_pool_drain_frag(struct page_pool *pool,
> +                                        struct page *page)
> +{
> +       /* page pool is not the last user */
> +       if (page_pool_atomic_add_bias(page, pool->frag_bias +
> +                                     BIAS_NEGATIVE_RESERVE))
> +               return NULL;
> +       else
> +               return page;
> +}
> +

So this check isn't quite complete, and I probably wouldn't use it
this way. To be complete you would need to check the page_ref_count
and pfmemalloc flags. Also, we already have recycling into the page
pool so we might as well just use that rather than trying to introduce
yet another layer of recycling.

> +static void page_pool_free_frag(struct page_pool *pool)
> +{
> +       struct page *page = pool->frag_page;
> +
> +       if (!page ||
> +           page_pool_atomic_add_bias(page, pool->frag_bias +
> +                                     BIAS_NEGATIVE_RESERVE))
> +               return;
> +
> +       page_pool_return_page(pool, page);
> +       pool->frag_page = NULL;
> +}
> +

If we make the add_bias into a sub_bias then we are just needing to
flip the values so that you are subtracting BIAS_RESERVE -
pool->frag_bias which is much more readable in my opinion.

> +struct page *page_pool_alloc_frag(struct page_pool *pool,
> +                                 unsigned int *offset,
> +                                 unsigned int size,
> +                                 gfp_t gfp)
> +{
> +       unsigned int max_size = PAGE_SIZE << pool->p.order;
> +       unsigned int frag_offset = pool->frag_offset;
> +       struct page *frag_page = pool->frag_page;
> +
> +       if (WARN_ON(!(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
> +                   size > max_size))
> +               return NULL;
> +
> +       size = ALIGN(size, dma_get_cache_alignment());
> +
> +       if (frag_page && frag_offset + size > max_size) {
> +               frag_page = page_pool_drain_frag(pool, frag_page);
> +               if (frag_page)
> +                       goto frag_reset;
> +       }
> +
> +       if (!frag_page) {
> +               frag_page = page_pool_alloc_pages(pool, gfp);
> +               if (unlikely(!frag_page)) {
> +                       pool->frag_page = NULL;
> +                       return NULL;
> +               }
> +
> +               pool->frag_page = frag_page;
> +
> +frag_reset:
> +               pool->frag_bias = 0;
> +               frag_offset = 0;
> +               page_pool_set_pagecnt_bias(frag_page, BIAS_RESERVE);
> +       }
> +
> +       pool->frag_bias++;
> +       *offset = frag_offset;
> +       pool->frag_offset = frag_offset + size;
> +
> +       return frag_page;
> +}
> +EXPORT_SYMBOL(page_pool_alloc_frag);
> +
>  static void page_pool_empty_ring(struct page_pool *pool)
>  {
>         struct page *page;
> @@ -622,6 +720,8 @@ void page_pool_destroy(struct page_pool *pool)
>         if (!page_pool_put(pool))
>                 return;
>
> +       page_pool_free_frag(pool);
> +
>         if (!page_pool_release(pool))
>                 return;
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH rfc v3 2/4] page_pool: add interface for getting and setting pagecnt_bias
  2021-07-12 18:08       ` Ilias Apalodimas
@ 2021-07-13  6:38         ` Yunsheng Lin
  0 siblings, 0 replies; 12+ messages in thread
From: Yunsheng Lin @ 2021-07-13  6:38 UTC (permalink / raw)
  To: Ilias Apalodimas, Jesper Dangaard Brouer
  Cc: Alexander Duyck, brouer, David Miller, Jakub Kicinski,
	Russell King - ARM Linux, Marcin Wojtas, linuxarm, yisen.zhuang,
	Salil Mehta, thomas.petazzoni, hawk, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf

On 2021/7/13 2:08, Ilias Apalodimas wrote:
> [...]
>>>> +static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>>>>   {
>>>> +       if (WARN_ON(addr & ~PAGE_MASK))
>>>> +               return false;
>>>> +
>>>>          page->dma_addr[0] = addr;
>>>>          if (sizeof(dma_addr_t) > sizeof(unsigned long))
>>>>                  page->dma_addr[1] = upper_32_bits(addr);
>>>> +
>>>> +       return true;
>>>> +}
>>>> +
>>>
>>> Rather than making this a part of the check here it might make more
>>> sense to pull this out and perform the WARN_ON after the check for
>>> dma_mapping_error.
>>
>> I need to point out that I don't like WARN_ON and BUG_ON code in fast-path
>> code, because compiler adds 'ud2' assembler instructions that influences the
>> instruction-cache fetching in the CPU.  Yes, I have seen a measuresable
>> impact from this before.
>>
>>
>>> Also it occurs to me that we only really have to do this in the case
>>> where dma_addr_t is larger than the size of a long. Otherwise we could
>>> just have the code split things so that dma_addr[0] is the dma_addr
>>> and dma_addr[1] is our pagecnt_bias value in which case we could
>>> probably just skip the check.
>>
>> The dance to get 64-bit DMA addr on 32-bit systems is rather ugly and
>> confusing, sadly.  We could take advantage of this, I just hope this will
>> not make it uglier.
> 
> Note here that we can only use this because dma_addr is not aliased to
> compound page anymore (after the initial page_pool recycling patchset). 
> We must keep this in mind if we even restructure struct page.
> 
> Can we do something more radical for this? The 64/32 bit dance is only
> there for 32 bit systems with 64 bit dma.  Since the last time we asked
> about this no one seemed to care about these, and I really doubt we'll get
> an ethernet driver for them (that needs recycling....), can we *only* support 
> frag allocation and recycling for 'normal' systems? We could always just r
> e-purpose dma_addr[1] for those.

Will define a macro for "sizeof(dma_addr_t) > sizeof(unsigned long)" to
decide whether to use the dma_addr[1], hopefully the compiler will optimize
out the unused code in a specific system.

> 
> Regards
> /Ilias
> 
>>
>>
>>>> +static inline int page_pool_get_pagecnt_bias(struct page *page)
>>>> +{
>>>> +       return READ_ONCE(page->dma_addr[0]) & ~PAGE_MASK;
>>>> +}
>>>> +
>>>> +static inline unsigned long *page_pool_pagecnt_bias_ptr(struct page *page)
>>>> +{
>>>> +       return page->dma_addr;
>>>> +}
>>>> +
>>>> +static inline void page_pool_set_pagecnt_bias(struct page *page, int bias)
>>>> +{
>>>> +       unsigned long dma_addr_0 = READ_ONCE(page->dma_addr[0]);
>>>> +
>>>> +       dma_addr_0 &= PAGE_MASK;
>>>> +       dma_addr_0 |= bias;
>>>> +
>>>> +       WRITE_ONCE(page->dma_addr[0], dma_addr_0);
>>>>   }
>>>>
>>>>   static inline bool is_page_pool_compiled_in(void)
>>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>>> index 78838c6..1abefc6 100644
>>>> --- a/net/core/page_pool.c
>>>> +++ b/net/core/page_pool.c
>>>> @@ -198,7 +198,13 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>>          if (dma_mapping_error(pool->p.dev, dma))
>>>>                  return false;
>>>>
>>>
>>> So instead of adding to the function below you could just add your
>>> WARN_ON check here with the unmapping call.

Ok.

>>>
>>>> -       page_pool_set_dma_addr(page, dma);
>>>> +       if (unlikely(!page_pool_set_dma_addr(page, dma))) {
>>>> +               dma_unmap_page_attrs(pool->p.dev, dma,
>>>> +                                    PAGE_SIZE << pool->p.order,
>>>> +                                    pool->p.dma_dir,
>>>> +                                    DMA_ATTR_SKIP_CPU_SYNC);
>>>> +               return false;
>>>> +       }
>>>>
>>>>          if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>>>                  page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>>>> --
>>>> 2.7.4
>>>>
>>>
>>
> .
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH rfc v3 3/4] page_pool: add frag page recycling support in page pool
  2021-07-12 19:46   ` Alexander Duyck
@ 2021-07-13  7:20     ` Yunsheng Lin
  0 siblings, 0 replies; 12+ messages in thread
From: Yunsheng Lin @ 2021-07-13  7:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Jakub Kicinski, Russell King - ARM Linux,
	Marcin Wojtas, linuxarm, yisen.zhuang, Salil Mehta,
	thomas.petazzoni, hawk, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf

On 2021/7/13 3:46, Alexander Duyck wrote:
> On Mon, Jul 12, 2021 at 5:17 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> Currently page pool only support page recycling only when
>> there is only one user of the page, and the split page
>> reusing implemented in the most driver can not use the
>> page pool as bing-pong way of reusing requires the multi
>> user support in page pool.
>>
>> Those reusing or recycling has below limitations:
>> 1. page from page pool can only be used be one user in order
>>    for the page recycling to happen.
>> 2. Bing-pong way of reusing in most driver does not support
>>    multi desc using different part of the same page in order
>>    to save memory.
>>
>> So add multi-users support and frag page recycling in page pool
>> to overcome the above limitation.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>>  include/net/page_pool.h |  22 ++++++++-
>>  net/core/page_pool.c    | 126 +++++++++++++++++++++++++++++++++++++++++++-----
>>  2 files changed, 134 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>> index 84cd972..d9a736f 100644
>> --- a/include/net/page_pool.h
>> +++ b/include/net/page_pool.h
>> @@ -45,7 +45,10 @@
>>                                         * Please note DMA-sync-for-CPU is still
>>                                         * device driver responsibility
>>                                         */
>> -#define PP_FLAG_ALL            (PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV)
>> +#define PP_FLAG_PAGE_FRAG      BIT(2)  /* for page frag feature */
>> +#define PP_FLAG_ALL            (PP_FLAG_DMA_MAP |\
>> +                                PP_FLAG_DMA_SYNC_DEV |\
>> +                                PP_FLAG_PAGE_FRAG)
>>
>>  /*
>>   * Fast allocation side cache array/stack
>> @@ -88,6 +91,9 @@ struct page_pool {
>>         unsigned long defer_warn;
>>
>>         u32 pages_state_hold_cnt;
>> +       unsigned int frag_offset;
>> +       int frag_bias;
>> +       struct page *frag_page;
>>
>>         /*
>>          * Data structure for allocation side
>> @@ -137,6 +143,20 @@ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
>>         return page_pool_alloc_pages(pool, gfp);
>>  }
>>
>> +struct page *page_pool_alloc_frag(struct page_pool *pool,
>> +                                 unsigned int *offset,
>> +                                 unsigned int size,
>> +                                 gfp_t gfp);
>> +
>> +static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool,
>> +                                                   unsigned int *offset,
>> +                                                   unsigned int size)
>> +{
>> +       gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
>> +
>> +       return page_pool_alloc_frag(pool, offset, size, gfp);
>> +}
>> +
>>  /* get the stored dma direction. A driver might decide to treat this locally and
>>   * avoid the extra cache line from page_pool to determine the direction
>>   */
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index 1abefc6..39d5156 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -24,6 +24,8 @@
>>  #define DEFER_TIME (msecs_to_jiffies(1000))
>>  #define DEFER_WARN_INTERVAL (60 * HZ)
>>
>> +#define BIAS_MAX       (PAGE_SIZE - 1)
>> +
>>  static int page_pool_init(struct page_pool *pool,
>>                           const struct page_pool_params *params)
>>  {
>> @@ -304,6 +306,33 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>>         return page;
>>  }
>>
>> +/* nr could be negative */
>> +static int page_pool_atomic_add_bias(struct page *page, int nr)
>> +{
>> +       unsigned long *bias_ptr = page_pool_pagecnt_bias_ptr(page);
>> +       unsigned long old_bias = READ_ONCE(*bias_ptr);
>> +       unsigned long new_bias;
>> +
>> +       do {
>> +               int bias = (int)(old_bias & ~PAGE_MASK);
>> +
>> +               /* Warn when page_pool_dev_alloc_pages() is called
>> +                * with PP_FLAG_PAGE_FRAG flag in driver.
>> +                */
>> +               WARN_ON(!bias);
>> +
>> +               /* already the last user */
>> +               if (!(bias + nr))
>> +                       return 0;
>> +
>> +               new_bias = old_bias + nr;
>> +       } while (!try_cmpxchg(bias_ptr, &old_bias, new_bias));
>> +
>> +       WARN_ON((new_bias & PAGE_MASK) != (old_bias & PAGE_MASK));
>> +
>> +       return new_bias & ~PAGE_MASK;
>> +}
>> +
> 
> So instead of having a function to add bias it might make more sense
> to have a function that does a subtract and test since that is what we
> are really doing. So in a way you could think of it as a countdown
> until unmap or recycle.

ok

> 
> For the frags case we could probably default to one of two values. For
> the variable sized frags we default to a pagecnt_bias of PAGE_SIZE - 1
> and then eventually do the subtraction at the end to free up unused
> count when the page has been fully used. For fixed sized frags you
> could theoretically just store off the PAGE_SIZE / frag_size - 1 and
> use that to initialize the pagecnt_bias of the pages as you use them.

Will default to a pagecnt_bias of PAGE_SIZE - 1.

> 
> Also one thing we may want to do is look at renaming this since we
> aren't messing with the page_ref_count anymore. It might make more
> sense to refer to this as something such as pp_frag_refcount. Also as
> per the other patch if we can get away from having to share the same
> space as DMA that would be even better.

page_pool_pagecnt_bias_ptr() is gone in the new version when we get away
from having to share the same space as DMA.

> 
>>  /* For using page_pool replace: alloc_pages() API calls, but provide
>>   * synchronization guarantee for allocation side.
>>   */
>> @@ -425,6 +454,11 @@ static __always_inline struct page *
>>  __page_pool_put_page(struct page_pool *pool, struct page *page,
>>                      unsigned int dma_sync_size, bool allow_direct)
>>  {
>> +       /* It is not the last user for the page frag case */
>> +       if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
>> +           page_pool_atomic_add_bias(page, -1))
>> +               return NULL;
>> +
>>         /* This allocator is optimized for the XDP mode that uses
>>          * one-frame-per-page, but have fallbacks that act like the
>>          * regular page allocator APIs.
>> @@ -448,19 +482,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>>                 /* Page found as candidate for recycling */
>>                 return page;
>>         }
>> -       /* Fallback/non-XDP mode: API user have elevated refcnt.
>> -        *
>> -        * Many drivers split up the page into fragments, and some
>> -        * want to keep doing this to save memory and do refcnt based
>> -        * recycling. Support this use case too, to ease drivers
>> -        * switching between XDP/non-XDP.
>> -        *
>> -        * In-case page_pool maintains the DMA mapping, API user must
>> -        * call page_pool_put_page once.  In this elevated refcnt
>> -        * case, the DMA is unmapped/released, as driver is likely
>> -        * doing refcnt based recycle tricks, meaning another process
>> -        * will be invoking put_page.
>> -        */
>> +
>>         /* Do not replace this with page_pool_return_page() */
>>         page_pool_release_page(pool, page);
>>         put_page(page);
>> @@ -517,6 +539,82 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
>>  }
>>  EXPORT_SYMBOL(page_pool_put_page_bulk);
>>
>> +/* When BIAS_RESERVE to avoid frag page being recycled back to
>> + * page pool while the frag page is still in pool->frag_page
>> + * waiting for more user. As minimum align size for DMA seems to
>> + * be 32, so we support max size of 2047 * 32 for 4K page size.
>> + */
>> +#define BIAS_RESERVE           ((int)(BIAS_MAX / 2 + 1))
>> +#define BIAS_NEGATIVE_RESERVE  (0 - BIAS_RESERVE)
> 
> The explanation and these values don't make sense to me. BIAS_RESERVE
> would be 4095 / 2 + 1 which should come out to 2048, and
> BIAS_NEGATIVE_REVERSE is -2048. Should these values have been 127/-127
> which would have been 4096 / 32 - 1?

The idea is to prevent the bias from being subtracted to zeor while
the frag page is sitting in the page pool waiting for more users.

Anyway, the above comment is not needed if I take the "default to a
pagecnt_bias of PAGE_SIZE - 1" suggestion.

> 
>> +static struct page *page_pool_drain_frag(struct page_pool *pool,
>> +                                        struct page *page)
>> +{
>> +       /* page pool is not the last user */
>> +       if (page_pool_atomic_add_bias(page, pool->frag_bias +
>> +                                     BIAS_NEGATIVE_RESERVE))
>> +               return NULL;
>> +       else
>> +               return page;
>> +}
>> +
> 
> So this check isn't quite complete, and I probably wouldn't use it
> this way. To be complete you would need to check the page_ref_count
> and pfmemalloc flags. Also, we already have recycling into the page
> pool so we might as well just use that rather than trying to introduce
> yet another layer of recycling.

Yes, will add the the pfmemalloc and page_ref_count checking.

As introducing another layer of recycling, it seems we can not call
page_pool_put_page() directly, and if the page is reusable, adding
back to the pool and taking one from the pool seems unnecessary.

> 
>> +static void page_pool_free_frag(struct page_pool *pool)
>> +{
>> +       struct page *page = pool->frag_page;
>> +
>> +       if (!page ||
>> +           page_pool_atomic_add_bias(page, pool->frag_bias +
>> +                                     BIAS_NEGATIVE_RESERVE))
>> +               return;
>> +
>> +       page_pool_return_page(pool, page);
>> +       pool->frag_page = NULL;
>> +}
>> +
> 
> If we make the add_bias into a sub_bias then we are just needing to
> flip the values so that you are subtracting BIAS_RESERVE -
> pool->frag_bias which is much more readable in my opinion.

ok.

> 
>> +struct page *page_pool_alloc_frag(struct page_pool *pool,
>> +                                 unsigned int *offset,
>> +                                 unsigned int size,
>> +                                 gfp_t gfp)
>> +{
>> +       unsigned int max_size = PAGE_SIZE << pool->p.order;
>> +       unsigned int frag_offset = pool->frag_offset;
>> +       struct page *frag_page = pool->frag_page;
>> +
>> +       if (WARN_ON(!(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
>> +                   size > max_size))
>> +               return NULL;
>> +
>> +       size = ALIGN(size, dma_get_cache_alignment());
>> +
>> +       if (frag_page && frag_offset + size > max_size) {
>> +               frag_page = page_pool_drain_frag(pool, frag_page);
>> +               if (frag_page)
>> +                       goto frag_reset;
>> +       }
>> +
>> +       if (!frag_page) {
>> +               frag_page = page_pool_alloc_pages(pool, gfp);
>> +               if (unlikely(!frag_page)) {
>> +                       pool->frag_page = NULL;
>> +                       return NULL;
>> +               }
>> +
>> +               pool->frag_page = frag_page;
>> +
>> +frag_reset:
>> +               pool->frag_bias = 0;
>> +               frag_offset = 0;
>> +               page_pool_set_pagecnt_bias(frag_page, BIAS_RESERVE);
>> +       }
>> +
>> +       pool->frag_bias++;
>> +       *offset = frag_offset;
>> +       pool->frag_offset = frag_offset + size;
>> +
>> +       return frag_page;
>> +}
>> +EXPORT_SYMBOL(page_pool_alloc_frag);
>> +
>>  static void page_pool_empty_ring(struct page_pool *pool)
>>  {
>>         struct page *page;
>> @@ -622,6 +720,8 @@ void page_pool_destroy(struct page_pool *pool)
>>         if (!page_pool_put(pool))
>>                 return;
>>
>> +       page_pool_free_frag(pool);
>> +
>>         if (!page_pool_release(pool))
>>                 return;
>>
>> --
>> 2.7.4
>>
> .
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rfc v3 2/4] page_pool: add interface for getting and setting pagecnt_bias
  2021-07-12  9:19 [PATCH rfc v3 0/4] add frag page support in " Yunsheng Lin
@ 2021-07-12  9:19 ` Yunsheng Lin
  0 siblings, 0 replies; 12+ messages in thread
From: Yunsheng Lin @ 2021-07-12  9:19 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf

As suggested by Alexander, "A DMA mapping should be page
aligned anyway so the lower 12 bits would be reserved 0",
so it might make more sense to repurpose the lower 12 bits
of the dma address to store the pagecnt_bias for elevated
refcnt case in page pool.

As newly added page_pool_get_pagecnt_bias() may be called
outside of the softirq context, so annotate the access to
page->dma_addr[0] with READ_ONCE() and WRITE_ONCE().

And page_pool_get_pagecnt_bias_ptr() is added to implement
the pagecnt_bias atomic updating.

Other three interfaces using page->dma_addr[0] is only called
in the softirq context during normal rx processing, hopefully
the barrier in the rx processing will ensure the correct order
between getting and setting pagecnt_bias.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/page_pool.h | 29 +++++++++++++++++++++++++++--
 net/core/page_pool.c    |  8 +++++++-
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 8d7744d..84cd972 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -200,17 +200,42 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
 
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-	dma_addr_t ret = page->dma_addr[0];
+	dma_addr_t ret = READ_ONCE(page->dma_addr[0]) & PAGE_MASK;
 	if (sizeof(dma_addr_t) > sizeof(unsigned long))
 		ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;
 	return ret;
 }
 
-static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
+static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 {
+	if (WARN_ON(addr & ~PAGE_MASK))
+		return false;
+
 	page->dma_addr[0] = addr;
 	if (sizeof(dma_addr_t) > sizeof(unsigned long))
 		page->dma_addr[1] = upper_32_bits(addr);
+
+	return true;
+}
+
+static inline int page_pool_get_pagecnt_bias(struct page *page)
+{
+	return READ_ONCE(page->dma_addr[0]) & ~PAGE_MASK;
+}
+
+static inline unsigned long *page_pool_pagecnt_bias_ptr(struct page *page)
+{
+	return page->dma_addr;
+}
+
+static inline void page_pool_set_pagecnt_bias(struct page *page, int bias)
+{
+	unsigned long dma_addr_0 = READ_ONCE(page->dma_addr[0]);
+
+	dma_addr_0 &= PAGE_MASK;
+	dma_addr_0 |= bias;
+
+	WRITE_ONCE(page->dma_addr[0], dma_addr_0);
 }
 
 static inline bool is_page_pool_compiled_in(void)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 78838c6..1abefc6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -198,7 +198,13 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 	if (dma_mapping_error(pool->p.dev, dma))
 		return false;
 
-	page_pool_set_dma_addr(page, dma);
+	if (unlikely(!page_pool_set_dma_addr(page, dma))) {
+		dma_unmap_page_attrs(pool->p.dev, dma,
+				     PAGE_SIZE << pool->p.order,
+				     pool->p.dma_dir,
+				     DMA_ATTR_SKIP_CPU_SYNC);
+		return false;
+	}
 
 	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
 		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-07-13  7:20 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-12 12:16 [PATCH rfc v3 0/4] add frag page support in page pool Yunsheng Lin
2021-07-12 12:16 ` [PATCH rfc v3 1/4] page_pool: keep pp info as long as page pool owns the page Yunsheng Lin
2021-07-12 12:16 ` [PATCH rfc v3 2/4] page_pool: add interface for getting and setting pagecnt_bias Yunsheng Lin
2021-07-12 16:02   ` Alexander Duyck
2021-07-12 17:20     ` Jesper Dangaard Brouer
2021-07-12 18:08       ` Ilias Apalodimas
2021-07-13  6:38         ` Yunsheng Lin
2021-07-12 12:16 ` [PATCH rfc v3 3/4] page_pool: add frag page recycling support in page pool Yunsheng Lin
2021-07-12 19:46   ` Alexander Duyck
2021-07-13  7:20     ` Yunsheng Lin
2021-07-12 12:16 ` [PATCH rfc v3 4/4] net: hns3: support skb's frag page recycling based on " Yunsheng Lin
  -- strict thread matches above, loose matches on Subject: below --
2021-07-12  9:19 [PATCH rfc v3 0/4] add frag page support in " Yunsheng Lin
2021-07-12  9:19 ` [PATCH rfc v3 2/4] page_pool: add interface for getting and setting pagecnt_bias Yunsheng Lin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.