* [PATCH rfc v6 0/4] add frag page support in page pool
@ 2021-07-20  3:35 Yunsheng Lin
  2021-07-20  3:35 ` [PATCH rfc v6 1/4] page_pool: keep pp info as long as page pool owns the page Yunsheng Lin
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Yunsheng Lin @ 2021-07-20  3:35 UTC
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf

This patchset adds frag page support in page pool and
enables skb's page frag recycling based on page pool in
the hns3 driver.

RFC v6:
1. Disable frag page support on systems with a 32-bit arch and
   64-bit DMA.

RFC v5:
1. Rename dma_addr[0] to pp_frag_count and adjust the code
   according to the rename.

RFC v4:
1. Use the dma_addr[1] to store bias.
2. Default to a pagecnt_bias of PAGE_SIZE - 1.
3. Other minor changes suggested by Alexander.

RFC v3:
1. Implement the semantics of "page recycling only waits for the
   page pool user instead of all users of a page".
2. Support frag allocation of different sizes.
3. Merge patches 4 & 5 into one patch, as it does not make sense to
   use the page_pool_dev_alloc_pages() API directly with an elevated
   refcnt.
4. Other minor changes suggested by Alexander.

RFC v2:
1. Split patch 1 into more reviewable ones.
2. Repurpose the lower 12 bits of the dma address to store the
   pagecnt_bias as suggested by Alexander.
3. Support recycling to pool->alloc for the elevated refcnt case
   too.

Yunsheng Lin (4):
  page_pool: keep pp info as long as page pool owns the page
  page_pool: add interface to manipulate frag count in page pool
  page_pool: add frag page recycling support in page pool
  net: hns3: support skb's frag page recycling based on page pool

 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c |  82 +++++++++++++++--
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h |   3 +
 drivers/net/ethernet/marvell/mvneta.c           |   6 +-
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c |   2 +-
 drivers/net/ethernet/ti/cpsw.c                  |   2 +-
 drivers/net/ethernet/ti/cpsw_new.c              |   2 +-
 include/linux/mm_types.h                        |  18 ++--
 include/linux/skbuff.h                          |   4 +-
 include/net/page_pool.h                         |  65 +++++++++++---
 net/core/page_pool.c                            | 112 +++++++++++++++++++++++-
 10 files changed, 257 insertions(+), 39 deletions(-)

-- 
2.7.4



* [PATCH rfc v6 1/4] page_pool: keep pp info as long as page pool owns the page
  2021-07-20  3:35 [PATCH rfc v6 0/4] add frag page support in page pool Yunsheng Lin
@ 2021-07-20  3:35 ` Yunsheng Lin
  2021-07-20  3:35 ` [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool Yunsheng Lin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Yunsheng Lin @ 2021-07-20  3:35 UTC
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf

Currently, page->pp is cleared and set every time the page
is recycled, which is unnecessary.

So set page->pp only when the page is added to the page
pool, and clear it only when the page is released from the
page pool.

This is also a preparation for supporting frag page
allocation in page pool.

Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 drivers/net/ethernet/marvell/mvneta.c           |  6 +-----
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c |  2 +-
 drivers/net/ethernet/ti/cpsw.c                  |  2 +-
 drivers/net/ethernet/ti/cpsw_new.c              |  2 +-
 include/linux/skbuff.h                          |  4 +---
 include/net/page_pool.h                         |  7 -------
 net/core/page_pool.c                            | 21 +++++++++++++++++----
 7 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 361bc4f..89bf31fd 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2327,7 +2327,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool,
 	if (!skb)
 		return ERR_PTR(-ENOMEM);
 
-	skb_mark_for_recycle(skb, virt_to_page(xdp->data), pool);
+	skb_mark_for_recycle(skb);
 
 	skb_reserve(skb, xdp->data - xdp->data_hard_start);
 	skb_put(skb, xdp->data_end - xdp->data);
@@ -2339,10 +2339,6 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool,
 		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
 				skb_frag_page(frag), skb_frag_off(frag),
 				skb_frag_size(frag), PAGE_SIZE);
-		/* We don't need to reset pp_recycle here. It's already set, so
-		 * just mark fragments for recycling.
-		 */
-		page_pool_store_mem_info(skb_frag_page(frag), pool);
 	}
 
 	return skb;
diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index 3229baf..320eddb 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -3995,7 +3995,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi,
 		}
 
 		if (pp)
-			skb_mark_for_recycle(skb, page, pp);
+			skb_mark_for_recycle(skb);
 		else
 			dma_unmap_single_attrs(dev->dev.parent, dma_addr,
 					       bm_pool->buf_size, DMA_FROM_DEVICE,
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index cbbd0f6..9d59143 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -431,7 +431,7 @@ static void cpsw_rx_handler(void *token, int len, int status)
 	skb->protocol = eth_type_trans(skb, ndev);
 
 	/* mark skb for recycling */
-	skb_mark_for_recycle(skb, page, pool);
+	skb_mark_for_recycle(skb);
 	netif_receive_skb(skb);
 
 	ndev->stats.rx_bytes += len;
diff --git a/drivers/net/ethernet/ti/cpsw_new.c b/drivers/net/ethernet/ti/cpsw_new.c
index 57d279f..a4234a3 100644
--- a/drivers/net/ethernet/ti/cpsw_new.c
+++ b/drivers/net/ethernet/ti/cpsw_new.c
@@ -374,7 +374,7 @@ static void cpsw_rx_handler(void *token, int len, int status)
 	skb->protocol = eth_type_trans(skb, ndev);
 
 	/* mark skb for recycling */
-	skb_mark_for_recycle(skb, page, pool);
+	skb_mark_for_recycle(skb);
 	netif_receive_skb(skb);
 
 	ndev->stats.rx_bytes += len;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b2db9cd..7795979 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4711,11 +4711,9 @@ static inline u64 skb_get_kcov_handle(struct sk_buff *skb)
 }
 
 #ifdef CONFIG_PAGE_POOL
-static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page,
-					struct page_pool *pp)
+static inline void skb_mark_for_recycle(struct sk_buff *skb)
 {
 	skb->pp_recycle = 1;
-	page_pool_store_mem_info(page, pp);
 }
 #endif
 
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 3dd62dd..8d7744d 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -253,11 +253,4 @@ static inline void page_pool_ring_unlock(struct page_pool *pool)
 		spin_unlock_bh(&pool->ring.producer_lock);
 }
 
-/* Store mem_info on struct page and use it while recycling skb frags */
-static inline
-void page_pool_store_mem_info(struct page *page, struct page_pool *pp)
-{
-	page->pp = pp;
-}
-
 #endif /* _NET_PAGE_POOL_H */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 5e4eb45..78838c6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -206,6 +206,19 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 	return true;
 }
 
+static void page_pool_set_pp_info(struct page_pool *pool,
+				  struct page *page)
+{
+	page->pp = pool;
+	page->pp_magic |= PP_SIGNATURE;
+}
+
+static void page_pool_clear_pp_info(struct page *page)
+{
+	page->pp_magic = 0;
+	page->pp = NULL;
+}
+
 static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 						 gfp_t gfp)
 {
@@ -222,7 +235,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 		return NULL;
 	}
 
-	page->pp_magic |= PP_SIGNATURE;
+	page_pool_set_pp_info(pool, page);
 
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
@@ -266,7 +279,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 			put_page(page);
 			continue;
 		}
-		page->pp_magic |= PP_SIGNATURE;
+
+		page_pool_set_pp_info(pool, page);
 		pool->alloc.cache[pool->alloc.count++] = page;
 		/* Track how many pages are held 'in-flight' */
 		pool->pages_state_hold_cnt++;
@@ -345,7 +359,7 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
 			     DMA_ATTR_SKIP_CPU_SYNC);
 	page_pool_set_dma_addr(page, 0);
 skip_dma_unmap:
-	page->pp_magic = 0;
+	page_pool_clear_pp_info(page);
 
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
@@ -644,7 +658,6 @@ bool page_pool_return_skb_page(struct page *page)
 	 * The page will be returned to the pool here regardless of the
 	 * 'flipped' fragment being in use or not.
 	 */
-	page->pp = NULL;
 	page_pool_put_full_page(pp, page, false);
 
 	return true;
-- 
2.7.4



* [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-20  3:35 [PATCH rfc v6 0/4] add frag page support in page pool Yunsheng Lin
  2021-07-20  3:35 ` [PATCH rfc v6 1/4] page_pool: keep pp info as long as page pool owns the page Yunsheng Lin
@ 2021-07-20  3:35 ` Yunsheng Lin
  2021-07-20 15:43   ` Alexander Duyck
  2021-07-20  3:35 ` [PATCH rfc v6 3/4] page_pool: add frag page recycling support " Yunsheng Lin
  2021-07-20  3:35 ` [PATCH rfc v6 4/4] net: hns3: support skb's frag page recycling based on " Yunsheng Lin
  3 siblings, 1 reply; 17+ messages in thread
From: Yunsheng Lin @ 2021-07-20  3:35 UTC
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf

For 32-bit systems with 64-bit DMA, dma_addr[1] is used to
store the upper 32 bits of the DMA address; such systems
should be rare these days.

For a normal system, dma_addr[1] in 'struct page' is not
used, so we can reuse it for storing the frag count, which
means how many frags this page might be split into.

In order to simplify the page frag support in the page pool,
the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate
32-bit systems with 64-bit DMA, and the page frag support
in page pool is disabled for such systems.

The newly added page_pool_set_frag_count() is called to reserve
the maximum frag count before any page frag is passed to the
user. The page_pool_atomic_sub_frag_count_return() is called
when the user is done with the page frag.
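
As a rough sketch of how the two calls are meant to pair up (the
reserve below mirrors what patch 3 does with BIAS_MAX; recycle_page()
is only a placeholder, not a real API):

	/* Pool side, when a new frag page is handed out: reserve the
	 * maximum frag count up front.
	 */
	page_pool_set_frag_count(page, BIAS_MAX);

	/* Release side, once per user when its frag is done: only the
	 * caller that brings the count to zero is the last user, and
	 * only that caller may recycle or free the page.
	 */
	if (page_pool_atomic_sub_frag_count_return(page, 1) == 0)
		recycle_page(pool, page);	/* placeholder */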

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/linux/mm_types.h | 18 +++++++++++++-----
 include/net/page_pool.h  | 41 ++++++++++++++++++++++++++++++++++-------
 net/core/page_pool.c     |  4 ++++
 3 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d33d97c..3e339dd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -103,11 +103,19 @@ struct page {
 			unsigned long pp_magic;
 			struct page_pool *pp;
 			unsigned long _pp_mapping_pad;
-			/**
-			 * @dma_addr: might require a 64-bit value on
-			 * 32-bit architectures.
-			 */
-			unsigned long dma_addr[2];
+			unsigned long dma_addr;
+			union {
+				/**
+				 * dma_addr_upper: might require a 64-bit
+				 * value on 32-bit architectures.
+				 */
+				unsigned long dma_addr_upper;
+				/**
+				 * For frag page support, not supported in
+				 * 32-bit architectures with 64-bit DMA.
+				 */
+				atomic_long_t pp_frag_count;
+			};
 		};
 		struct {	/* slab, slob and slub */
 			union {
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 8d7744d..517d89f 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -45,7 +45,10 @@
 					* Please note DMA-sync-for-CPU is still
 					* device driver responsibility
 					*/
-#define PP_FLAG_ALL		(PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV)
+#define PP_FLAG_PAGE_FRAG	BIT(2) /* for page frag feature */
+#define PP_FLAG_ALL		(PP_FLAG_DMA_MAP |\
+				 PP_FLAG_DMA_SYNC_DEV |\
+				 PP_FLAG_PAGE_FRAG)
 
 /*
  * Fast allocation side cache array/stack
@@ -198,19 +201,43 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
 	page_pool_put_full_page(pool, page, true);
 }
 
+#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT	\
+		(sizeof(dma_addr_t) > sizeof(unsigned long))
+
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-	dma_addr_t ret = page->dma_addr[0];
-	if (sizeof(dma_addr_t) > sizeof(unsigned long))
-		ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;
+	dma_addr_t ret = page->dma_addr;
+
+	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
+		ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
+
 	return ret;
 }
 
 static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 {
-	page->dma_addr[0] = addr;
-	if (sizeof(dma_addr_t) > sizeof(unsigned long))
-		page->dma_addr[1] = upper_32_bits(addr);
+	page->dma_addr = addr;
+	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
+		page->dma_addr_upper = upper_32_bits(addr);
+}
+
+static inline void page_pool_set_frag_count(struct page *page, long nr)
+{
+	atomic_long_set(&page->pp_frag_count, nr);
+}
+
+static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
+							  long nr)
+{
+	long frag_count = atomic_long_read(&page->pp_frag_count);
+	long ret;
+
+	if (frag_count == nr)
+		return 0;
+
+	ret = atomic_long_sub_return(nr, &page->pp_frag_count);
+	WARN_ON(ret < 0);
+	return ret;
 }
 
 static inline bool is_page_pool_compiled_in(void)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 78838c6..68fab94 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -67,6 +67,10 @@ static int page_pool_init(struct page_pool *pool,
 		 */
 	}
 
+	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
+	    pool->p.flags & PP_FLAG_PAGE_FRAG)
+		return -EINVAL;
+
 	if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0)
 		return -ENOMEM;
 
-- 
2.7.4



* [PATCH rfc v6 3/4] page_pool: add frag page recycling support in page pool
  2021-07-20  3:35 [PATCH rfc v6 0/4] add frag page support in page pool Yunsheng Lin
  2021-07-20  3:35 ` [PATCH rfc v6 1/4] page_pool: keep pp info as long as page pool owns the page Yunsheng Lin
  2021-07-20  3:35 ` [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool Yunsheng Lin
@ 2021-07-20  3:35 ` Yunsheng Lin
  2021-07-20  3:35 ` [PATCH rfc v6 4/4] net: hns3: support skb's frag page recycling based on " Yunsheng Lin
  3 siblings, 0 replies; 17+ messages in thread
From: Yunsheng Lin @ 2021-07-20  3:35 UTC
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf

Currently page pool only supports page recycling when there
is only one user of the page, and the split-page reusing
implemented in most drivers can not use the page pool, as
the ping-pong way of reusing requires multi-user support in
page pool.

Such reusing or recycling has the below limitations:
1. A page from page pool can only be used by one user in order
   for the page recycling to happen.
2. The ping-pong way of reusing in most drivers does not support
   multiple descs using different parts of the same page in order
   to save memory.

So add multi-user support and frag page recycling in page
pool to overcome the above limitations.
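
For reference, the intended driver-side usage looks roughly like the
below (patch 4 does the equivalent in hns3; 'buf' and 'buf_size' are
illustrative placeholders):

	unsigned int offset;
	struct page *page;

	/* The pool hands out buf_size-sized slices of one underlying
	 * page until that page is used up, then moves on to a fresh
	 * page.
	 */
	page = page_pool_dev_alloc_frag(pool, &offset, buf_size);
	if (unlikely(!page))
		return -ENOMEM;

	buf->page = page;
	buf->page_offset = offset;
	buf->dma = page_pool_get_dma_addr(page) + offset;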

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/page_pool.h | 17 ++++++++++
 net/core/page_pool.c    | 87 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 104 insertions(+)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 517d89f..2186771 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -91,6 +91,9 @@ struct page_pool {
 	unsigned long defer_warn;
 
 	u32 pages_state_hold_cnt;
+	unsigned int frag_offset;
+	struct page *frag_page;
+	long frag_users;
 
 	/*
 	 * Data structure for allocation side
@@ -140,6 +143,20 @@ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
 	return page_pool_alloc_pages(pool, gfp);
 }
 
+struct page *page_pool_alloc_frag(struct page_pool *pool,
+				  unsigned int *offset,
+				  unsigned int size,
+				  gfp_t gfp);
+
+static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool,
+						    unsigned int *offset,
+						    unsigned int size)
+{
+	gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
+
+	return page_pool_alloc_frag(pool, offset, size, gfp);
+}
+
 /* get the stored dma direction. A driver might decide to treat this locally and
  * avoid the extra cache line from page_pool to determine the direction
  */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 68fab94..a81a5bd 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -24,6 +24,8 @@
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
+#define BIAS_MAX	LONG_MAX
+
 static int page_pool_init(struct page_pool *pool,
 			  const struct page_pool_params *params)
 {
@@ -423,6 +425,11 @@ static __always_inline struct page *
 __page_pool_put_page(struct page_pool *pool, struct page *page,
 		     unsigned int dma_sync_size, bool allow_direct)
 {
+	/* It is not the last user for the page frag case */
+	if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
+	    page_pool_atomic_sub_frag_count_return(page, 1))
+		return NULL;
+
 	/* This allocator is optimized for the XDP mode that uses
 	 * one-frame-per-page, but have fallbacks that act like the
 	 * regular page allocator APIs.
@@ -515,6 +522,84 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 }
 EXPORT_SYMBOL(page_pool_put_page_bulk);
 
+static struct page *page_pool_drain_frag(struct page_pool *pool,
+					 struct page *page)
+{
+	long drain_count = BIAS_MAX - pool->frag_users;
+
+	/* page pool is not the last user */
+	if (page_pool_atomic_sub_frag_count_return(page, drain_count))
+		return NULL;
+
+	if (likely(page_ref_count(page) == 1 &&
+		   !page_is_pfmemalloc(page))) {
+		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+			page_pool_dma_sync_for_device(pool, page, -1);
+
+		return page;
+	}
+
+	page_pool_return_page(pool, page);
+	return NULL;
+}
+
+static void page_pool_free_frag(struct page_pool *pool)
+{
+	long drain_count = BIAS_MAX - pool->frag_users;
+	struct page *page = pool->frag_page;
+
+	pool->frag_page = NULL;
+
+	if (!page ||
+	    page_pool_atomic_sub_frag_count_return(page, drain_count))
+		return;
+
+	page_pool_return_page(pool, page);
+}
+
+struct page *page_pool_alloc_frag(struct page_pool *pool,
+				  unsigned int *offset,
+				  unsigned int size, gfp_t gfp)
+{
+	unsigned int max_size = PAGE_SIZE << pool->p.order;
+	struct page *page = pool->frag_page;
+
+	if (WARN_ON(!(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
+		    size > max_size))
+		return NULL;
+
+	size = ALIGN(size, dma_get_cache_alignment());
+	*offset = pool->frag_offset;
+
+	if (page && *offset + size > max_size) {
+		page = page_pool_drain_frag(pool, page);
+		if (page)
+			goto frag_reset;
+	}
+
+	if (!page) {
+		page = page_pool_alloc_pages(pool, gfp);
+		if (unlikely(!page)) {
+			pool->frag_page = NULL;
+			return NULL;
+		}
+
+		pool->frag_page = page;
+
+frag_reset:
+		pool->frag_users = 1;
+		*offset = 0;
+		pool->frag_offset = size;
+		page_pool_set_frag_count(page, BIAS_MAX);
+		return page;
+	}
+
+	pool->frag_users++;
+	pool->frag_offset = *offset + size;
+	return page;
+}
+EXPORT_SYMBOL(page_pool_alloc_frag);
+
 static void page_pool_empty_ring(struct page_pool *pool)
 {
 	struct page *page;
@@ -620,6 +705,8 @@ void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_put(pool))
 		return;
 
+	page_pool_free_frag(pool);
+
 	if (!page_pool_release(pool))
 		return;
 
-- 
2.7.4



* [PATCH rfc v6 4/4] net: hns3: support skb's frag page recycling based on page pool
  2021-07-20  3:35 [PATCH rfc v6 0/4] add frag page support in page pool Yunsheng Lin
                   ` (2 preceding siblings ...)
  2021-07-20  3:35 ` [PATCH rfc v6 3/4] page_pool: add frag page recycling support " Yunsheng Lin
@ 2021-07-20  3:35 ` Yunsheng Lin
  3 siblings, 0 replies; 17+ messages in thread
From: Yunsheng Lin @ 2021-07-20  3:35 UTC
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf

This patch adds skb's frag page recycling support based on
the frag page support in page pool.

The performance improves by about 10~20% for a single-thread
iperf TCP flow with IOMMU disabled, when the iperf server and
irq/NAPI run on different CPUs.

The performance improves by about 135% (14Gbit to 33Gbit) for a
single-thread iperf TCP flow when the IOMMU is in strict mode and
the iperf server shares the same CPU with irq/NAPI.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 82 +++++++++++++++++++++++--
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h |  3 +
 2 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index cdb5f14..f3f9b13 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -3205,6 +3205,21 @@ static int hns3_alloc_buffer(struct hns3_enet_ring *ring,
 	unsigned int order = hns3_page_order(ring);
 	struct page *p;
 
+	if (ring->page_pool) {
+		p = page_pool_dev_alloc_frag(ring->page_pool,
+					     &cb->page_offset,
+					     hns3_buf_size(ring));
+		if (unlikely(!p))
+			return -ENOMEM;
+
+		cb->priv = p;
+		cb->buf = page_address(p);
+		cb->dma = page_pool_get_dma_addr(p);
+		cb->type = DESC_TYPE_FRAG;
+		cb->reuse_flag = 0;
+		return 0;
+	}
+
 	p = dev_alloc_pages(order);
 	if (!p)
 		return -ENOMEM;
@@ -3227,8 +3242,13 @@ static void hns3_free_buffer(struct hns3_enet_ring *ring,
 	if (cb->type & (DESC_TYPE_SKB | DESC_TYPE_BOUNCE_HEAD |
 			DESC_TYPE_BOUNCE_ALL | DESC_TYPE_SGL_SKB))
 		napi_consume_skb(cb->priv, budget);
-	else if (!HNAE3_IS_TX_RING(ring) && cb->pagecnt_bias)
-		__page_frag_cache_drain(cb->priv, cb->pagecnt_bias);
+	else if (!HNAE3_IS_TX_RING(ring)) {
+		if (cb->type & DESC_TYPE_PAGE && cb->pagecnt_bias)
+			__page_frag_cache_drain(cb->priv, cb->pagecnt_bias);
+		else if (cb->type & DESC_TYPE_FRAG)
+			page_pool_put_full_page(ring->page_pool, cb->priv,
+						false);
+	}
 	memset(cb, 0, sizeof(*cb));
 }
 
@@ -3315,7 +3335,7 @@ static int hns3_alloc_and_map_buffer(struct hns3_enet_ring *ring,
 	int ret;
 
 	ret = hns3_alloc_buffer(ring, cb);
-	if (ret)
+	if (ret || ring->page_pool)
 		goto out;
 
 	ret = hns3_map_buffer(ring, cb);
@@ -3337,7 +3357,8 @@ static int hns3_alloc_and_attach_buffer(struct hns3_enet_ring *ring, int i)
 	if (ret)
 		return ret;
 
-	ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma);
+	ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma +
+					 ring->desc_cb[i].page_offset);
 
 	return 0;
 }
@@ -3367,7 +3388,8 @@ static void hns3_replace_buffer(struct hns3_enet_ring *ring, int i,
 {
 	hns3_unmap_buffer(ring, &ring->desc_cb[i]);
 	ring->desc_cb[i] = *res_cb;
-	ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma);
+	ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma +
+					 ring->desc_cb[i].page_offset);
 	ring->desc[i].rx.bd_base_info = 0;
 }
 
@@ -3539,6 +3561,12 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i,
 	u32 frag_size = size - pull_len;
 	bool reused;
 
+	if (ring->page_pool) {
+		skb_add_rx_frag(skb, i, desc_cb->priv, frag_offset,
+				frag_size, truesize);
+		return;
+	}
+
 	/* Avoid re-using remote or pfmem page */
 	if (unlikely(!dev_page_is_reusable(desc_cb->priv)))
 		goto out;
@@ -3856,6 +3884,9 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length,
 		/* We can reuse buffer as-is, just make sure it is reusable */
 		if (dev_page_is_reusable(desc_cb->priv))
 			desc_cb->reuse_flag = 1;
+		else if (desc_cb->type & DESC_TYPE_FRAG)
+			page_pool_put_full_page(ring->page_pool, desc_cb->priv,
+						false);
 		else /* This page cannot be reused so discard it */
 			__page_frag_cache_drain(desc_cb->priv,
 						desc_cb->pagecnt_bias);
@@ -3863,6 +3894,10 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length,
 		hns3_rx_ring_move_fw(ring);
 		return 0;
 	}
+
+	if (ring->page_pool)
+		skb_mark_for_recycle(skb);
+
 	u64_stats_update_begin(&ring->syncp);
 	ring->stats.seg_pkt_cnt++;
 	u64_stats_update_end(&ring->syncp);
@@ -3901,6 +3936,10 @@ static int hns3_add_frag(struct hns3_enet_ring *ring)
 					    "alloc rx fraglist skb fail\n");
 				return -ENXIO;
 			}
+
+			if (ring->page_pool)
+				skb_mark_for_recycle(new_skb);
+
 			ring->frag_num = 0;
 
 			if (ring->tail_skb) {
@@ -4705,6 +4744,31 @@ static void hns3_put_ring_config(struct hns3_nic_priv *priv)
 	priv->ring = NULL;
 }
 
+static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
+{
+	struct page_pool_params pp_params = {
+		.flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
+				PP_FLAG_DMA_SYNC_DEV,
+		.order = hns3_page_order(ring),
+		.pool_size = ring->desc_num * hns3_buf_size(ring) /
+				(PAGE_SIZE << hns3_page_order(ring)),
+		.nid = dev_to_node(ring_to_dev(ring)),
+		.dev = ring_to_dev(ring),
+		.dma_dir = DMA_FROM_DEVICE,
+		.offset = 0,
+		.max_len = PAGE_SIZE << hns3_page_order(ring),
+	};
+
+	ring->page_pool = page_pool_create(&pp_params);
+	if (IS_ERR(ring->page_pool)) {
+		dev_warn(ring_to_dev(ring), "page pool creation failed: %ld\n",
+			 PTR_ERR(ring->page_pool));
+		ring->page_pool = NULL;
+	} else {
+		dev_info(ring_to_dev(ring), "page pool creation succeeded\n");
+	}
+}
+
 static int hns3_alloc_ring_memory(struct hns3_enet_ring *ring)
 {
 	int ret;
@@ -4724,6 +4788,8 @@ static int hns3_alloc_ring_memory(struct hns3_enet_ring *ring)
 		goto out_with_desc_cb;
 
 	if (!HNAE3_IS_TX_RING(ring)) {
+		hns3_alloc_page_pool(ring);
+
 		ret = hns3_alloc_ring_buffers(ring);
 		if (ret)
 			goto out_with_desc;
@@ -4764,6 +4830,12 @@ void hns3_fini_ring(struct hns3_enet_ring *ring)
 		devm_kfree(ring_to_dev(ring), tx_spare);
 		ring->tx_spare = NULL;
 	}
+
+	if (!HNAE3_IS_TX_RING(ring) && ring->page_pool) {
+		page_pool_destroy(ring->page_pool);
+		ring->page_pool = NULL;
+		dev_info(ring_to_dev(ring), "page pool destroyed\n");
+	}
 }
 
 static int hns3_buf_size2type(u32 buf_size)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index 15af3d9..115c0ce 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -6,6 +6,7 @@
 
 #include <linux/dim.h>
 #include <linux/if_vlan.h>
+#include <net/page_pool.h>
 
 #include "hnae3.h"
 
@@ -307,6 +308,7 @@ enum hns3_desc_type {
 	DESC_TYPE_BOUNCE_ALL		= 1 << 3,
 	DESC_TYPE_BOUNCE_HEAD		= 1 << 4,
 	DESC_TYPE_SGL_SKB		= 1 << 5,
+	DESC_TYPE_FRAG			= 1 << 6,
 };
 
 struct hns3_desc_cb {
@@ -451,6 +453,7 @@ struct hns3_enet_ring {
 	struct hnae3_queue *tqp;
 	int queue_index;
 	struct device *dev; /* will be used for DMA mapping of descriptors */
+	struct page_pool *page_pool;
 
 	/* statistic */
 	struct ring_stats stats;
-- 
2.7.4



* Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-20  3:35 ` [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool Yunsheng Lin
@ 2021-07-20 15:43   ` Alexander Duyck
  2021-07-21  8:15     ` Yunsheng Lin
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Duyck @ 2021-07-20 15:43 UTC
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Russell King - ARM Linux,
	Marcin Wojtas, linuxarm, yisen.zhuang, Salil Mehta,
	thomas.petazzoni, hawk, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf

On Mon, Jul 19, 2021 at 8:36 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> For 32-bit systems with 64-bit DMA, dma_addr[1] is used to
> store the upper 32 bits of the DMA address; such systems
> should be rare these days.
>
> For a normal system, dma_addr[1] in 'struct page' is not
> used, so we can reuse it for storing the frag count, which
> means how many frags this page might be split into.
>
> In order to simplify the page frag support in the page pool,
> the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate
> 32-bit systems with 64-bit DMA, and the page frag support
> in page pool is disabled for such systems.
>
> The newly added page_pool_set_frag_count() is called to reserve
> the maximum frag count before any page frag is passed to the
> user. The page_pool_atomic_sub_frag_count_return() is called
> when the user is done with the page frag.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  include/linux/mm_types.h | 18 +++++++++++++-----
>  include/net/page_pool.h  | 41 ++++++++++++++++++++++++++++++++++-------
>  net/core/page_pool.c     |  4 ++++
>  3 files changed, 51 insertions(+), 12 deletions(-)
>

<snip>

> +static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
> +                                                         long nr)
> +{
> +       long frag_count = atomic_long_read(&page->pp_frag_count);
> +       long ret;
> +
> +       if (frag_count == nr)
> +               return 0;
> +
> +       ret = atomic_long_sub_return(nr, &page->pp_frag_count);
> +       WARN_ON(ret < 0);
> +       return ret;
>  }
>

So this should just be an atomic_long_sub_return call. You should get
rid of the atomic_long_read portion of this as it can cover up
reference count errors.


* Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-20 15:43   ` Alexander Duyck
@ 2021-07-21  8:15     ` Yunsheng Lin
  2021-07-21 14:06       ` Alexander Duyck
  0 siblings, 1 reply; 17+ messages in thread
From: Yunsheng Lin @ 2021-07-21  8:15 UTC
  To: Alexander Duyck
  Cc: David Miller, Jakub Kicinski, Russell King - ARM Linux,
	Marcin Wojtas, linuxarm, yisen.zhuang, Salil Mehta,
	thomas.petazzoni, hawk, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf

On 2021/7/20 23:43, Alexander Duyck wrote:
> On Mon, Jul 19, 2021 at 8:36 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> For 32-bit systems with 64-bit DMA, dma_addr[1] is used to
>> store the upper 32 bits of the DMA address; such systems
>> should be rare these days.
>>
>> For a normal system, dma_addr[1] in 'struct page' is not
>> used, so we can reuse it for storing the frag count, which
>> means how many frags this page might be split into.
>>
>> In order to simplify the page frag support in the page pool,
>> the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate
>> 32-bit systems with 64-bit DMA, and the page frag support
>> in page pool is disabled for such systems.
>>
>> The newly added page_pool_set_frag_count() is called to reserve
>> the maximum frag count before any page frag is passed to the
>> user. The page_pool_atomic_sub_frag_count_return() is called
>> when the user is done with the page frag.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>>  include/linux/mm_types.h | 18 +++++++++++++-----
>>  include/net/page_pool.h  | 41 ++++++++++++++++++++++++++++++++++-------
>>  net/core/page_pool.c     |  4 ++++
>>  3 files changed, 51 insertions(+), 12 deletions(-)
>>
> 
> <snip>
> 
>> +static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>> +                                                         long nr)
>> +{
>> +       long frag_count = atomic_long_read(&page->pp_frag_count);
>> +       long ret;
>> +
>> +       if (frag_count == nr)
>> +               return 0;
>> +
>> +       ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>> +       WARN_ON(ret < 0);
>> +       return ret;
>>  }
>>
> 
> So this should just be an atomic_long_sub_return call. You should get
> rid of the atomic_long_read portion of this as it can cover up
> reference count errors.

atomic_long_sub_return() is used to avoid one possible cache bouncing and
barrier caused by the last user.

You are right that that may cover up the reference count errors. How about
something like below:

static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
							  long nr)
{
#ifdef CONFIG_DEBUG_PAGE_REF
	long ret = atomic_long_sub_return(nr, &page->pp_frag_count);

	WARN_ON(ret < 0);

	return ret;
#else
	if (atomic_long_read(&page->pp_frag_count) == nr)
		return 0;

	return atomic_long_sub_return(nr, &page->pp_frag_count);
#endif
}

Or any better suggestion?


> .
> 


* Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-21  8:15     ` Yunsheng Lin
@ 2021-07-21 14:06       ` Alexander Duyck
  2021-07-22  8:07         ` Yunsheng Lin
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Duyck @ 2021-07-21 14:06 UTC
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Russell King - ARM Linux,
	Marcin Wojtas, linuxarm, yisen.zhuang, Salil Mehta,
	thomas.petazzoni, hawk, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf

On Wed, Jul 21, 2021 at 1:15 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2021/7/20 23:43, Alexander Duyck wrote:
> > On Mon, Jul 19, 2021 at 8:36 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> For 32-bit systems with 64-bit DMA, dma_addr[1] is used to
> >> store the upper 32 bits of the DMA address; such systems
> >> should be rare these days.
> >>
> >> For a normal system, dma_addr[1] in 'struct page' is not
> >> used, so we can reuse it for storing the frag count, which
> >> means how many frags this page might be split into.
> >>
> >> In order to simplify the page frag support in the page pool,
> >> the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate
> >> 32-bit systems with 64-bit DMA, and the page frag support
> >> in page pool is disabled for such systems.
> >>
> >> The newly added page_pool_set_frag_count() is called to reserve
> >> the maximum frag count before any page frag is passed to the
> >> user. The page_pool_atomic_sub_frag_count_return() is called
> >> when the user is done with the page frag.
> >>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >> ---
> >>  include/linux/mm_types.h | 18 +++++++++++++-----
> >>  include/net/page_pool.h  | 41 ++++++++++++++++++++++++++++++++++-------
> >>  net/core/page_pool.c     |  4 ++++
> >>  3 files changed, 51 insertions(+), 12 deletions(-)
> >>
> >
> > <snip>
> >
> >> +static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
> >> +                                                         long nr)
> >> +{
> >> +       long frag_count = atomic_long_read(&page->pp_frag_count);
> >> +       long ret;
> >> +
> >> +       if (frag_count == nr)
> >> +               return 0;
> >> +
> >> +       ret = atomic_long_sub_return(nr, &page->pp_frag_count);
> >> +       WARN_ON(ret < 0);
> >> +       return ret;
> >>  }
> >>
> >
> > So this should just be an atomic_long_sub_return call. You should get
> > rid of the atomic_long_read portion of this as it can cover up
> > reference count errors.
>
> atomic_long_sub_return() is used to avoid one possible cache bouncing and
> barrier caused by the last user.

I assume you mean "atomic_long_read()" here.

> You are right that that may cover up the reference count errors. How about
> something like below:
>
> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>                                                           long nr)
> {
> #ifdef CONFIG_DEBUG_PAGE_REF
>         long ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>
>         WARN_ON(ret < 0);
>
>         return ret;
> #else
>         if (atomic_long_read(&page->pp_frag_count) == nr)
>                 return 0;
>
>         return atomic_long_sub_return(nr, &page->pp_frag_count);
> #endif
> }
>
> Or any better suggestion?

So the one thing I might change would be to make it so that you only
do the atomic_long_read if nr is a constant via __builtin_constant_p.
That way you would be performing the comparison in
__page_pool_put_page and in the cases of freeing or draining the
page_frags you would be using the atomic_long_sub_return which should
be paths where you would not expect it to match or that are slowpath
anyway.

Also I would keep the WARN_ON in both paths just to be on the safe side.


* Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-21 14:06       ` Alexander Duyck
@ 2021-07-22  8:07         ` Yunsheng Lin
  2021-07-22 15:18           ` Alexander Duyck
  0 siblings, 1 reply; 17+ messages in thread
From: Yunsheng Lin @ 2021-07-22  8:07 UTC
  To: Alexander Duyck
  Cc: David Miller, Jakub Kicinski, Russell King - ARM Linux,
	Marcin Wojtas, linuxarm, yisen.zhuang, Salil Mehta,
	thomas.petazzoni, hawk, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf

On 2021/7/21 22:06, Alexander Duyck wrote:
> On Wed, Jul 21, 2021 at 1:15 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2021/7/20 23:43, Alexander Duyck wrote:
>>> On Mon, Jul 19, 2021 at 8:36 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>
>>>> For 32-bit systems with 64-bit DMA, dma_addr[1] is used to
>>>> store the upper 32 bits of the DMA address; such systems
>>>> should be rare these days.
>>>>
>>>> For a normal system, dma_addr[1] in 'struct page' is not
>>>> used, so we can reuse it for storing the frag count, which
>>>> means how many frags this page might be split into.
>>>>
>>>> In order to simplify the page frag support in the page pool,
>>>> the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate
>>>> 32-bit systems with 64-bit DMA, and the page frag support
>>>> in page pool is disabled for such systems.
>>>>
>>>> The newly added page_pool_set_frag_count() is called to reserve
>>>> the maximum frag count before any page frag is passed to the
>>>> user. The page_pool_atomic_sub_frag_count_return() is called
>>>> when the user is done with the page frag.
>>>>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>>> ---
>>>>  include/linux/mm_types.h | 18 +++++++++++++-----
>>>>  include/net/page_pool.h  | 41 ++++++++++++++++++++++++++++++++++-------
>>>>  net/core/page_pool.c     |  4 ++++
>>>>  3 files changed, 51 insertions(+), 12 deletions(-)
>>>>
>>>
>>> <snip>
>>>
>>>> +static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>>> +                                                         long nr)
>>>> +{
>>>> +       long frag_count = atomic_long_read(&page->pp_frag_count);
>>>> +       long ret;
>>>> +
>>>> +       if (frag_count == nr)
>>>> +               return 0;
>>>> +
>>>> +       ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>> +       WARN_ON(ret < 0);
>>>> +       return ret;
>>>>  }
>>>>
>>>
>>> So this should just be an atomic_long_sub_return call. You should get
>>> rid of the atomic_long_read portion of this as it can cover up
>>> reference count errors.
>>
>> atomic_long_sub_return() is used to avoid one possible cache bouncing and
>> barrier caused by the last user.
> 
> I assume you mean "atomic_long_read()" here.

Yes, sorry for the confusion.

> 
>> You are right that that may cover up the reference count errors. How about
>> something like below:
>>
>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>                                                           long nr)
>> {
>> #ifdef CONFIG_DEBUG_PAGE_REF
>>         long ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>
>>         WARN_ON(ret < 0);
>>
>>         return ret;
>> #else
>>         if (atomic_long_read(&page->pp_frag_count) == nr)
>>                 return 0;
>>
>>         return atomic_long_sub_return(nr, &page->pp_frag_count);
>> #endif
>> }
>>
>> Or any better suggestion?
> 
> So the one thing I might change would be to make it so that you only
> do the atomic_long_read if nr is a constant via __builtin_constant_p.
> That way you would be performing the comparison in
> __page_pool_put_page and in the cases of freeing or draining the
> page_frags you would be using the atomic_long_sub_return which should
> be paths where you would not expect it to match or that are slowpath
> anyway.
> 
> Also I would keep the WARN_ON in both paths just to be on the safe side.

If I understand it correctly, we should change it as below, right?

static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
							  long nr)
{
	long ret;

	/* As suggested by Alexander, atomic_long_read() may cover up the
	 * reference count errors, so avoid calling atomic_long_read() in
	 * the cases of freeing or draining the page_frags, where we would
	 * not expect it to match or that are slowpath anyway.
	 */
	if (__builtin_constant_p(nr) &&
	    atomic_long_read(&page->pp_frag_count) == nr)
		return 0;

	ret = atomic_long_sub_return(nr, &page->pp_frag_count);
	WARN_ON(ret < 0);
	return ret;
}


> .
> 


* Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-22  8:07         ` Yunsheng Lin
@ 2021-07-22 15:18           ` Alexander Duyck
  2021-07-23 11:12             ` Yunsheng Lin
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Duyck @ 2021-07-22 15:18 UTC
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Russell King - ARM Linux,
	Marcin Wojtas, linuxarm, yisen.zhuang, Salil Mehta,
	thomas.petazzoni, hawk, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf

> >
> >> You are right that that may cover up the reference count errors. How about
> >> something like below:
> >>
> >> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
> >>                                                           long nr)
> >> {
> >> #ifdef CONFIG_DEBUG_PAGE_REF
> >>         long ret = atomic_long_sub_return(nr, &page->pp_frag_count);
> >>
> >>         WARN_ON(ret < 0);
> >>
> >>         return ret;
> >> #else
> >>         if (atomic_long_read(&page->pp_frag_count) == nr)
> >>                 return 0;
> >>
> >>         return atomic_long_sub_return(nr, &page->pp_frag_count);
> >> #endif
> >> }
> >>
> >> Or any better suggestion?
> >
> > So the one thing I might change would be to make it so that you only
> > do the atomic_long_read if nr is a constant via __builtin_constant_p.
> > That way you would be performing the comparison in
> > __page_pool_put_page and in the cases of freeing or draining the
> > page_frags you would be using the atomic_long_sub_return which should
> > be paths where you would not expect it to match or that are slowpath
> > anyway.
> >
> > Also I would keep the WARN_ON in both paths just to be on the safe side.
>
> If I understand it correctly, we should change it as below, right?
>
> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>                                                           long nr)
> {
>         long ret;
>
>         /* As suggested by Alexander, atomic_long_read() may cover up the
>          * reference count errors, so avoid calling atomic_long_read() in
>          * the cases of freeing or draining the page_frags, where we would
>          * not expect it to match or that are slowpath anyway.
>          */
>         if (__builtin_constant_p(nr) &&
>             atomic_long_read(&page->pp_frag_count) == nr)
>                 return 0;
>
>         ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>         WARN_ON(ret < 0);
>         return ret;
> }

Yes, that is what I had in mind.

One thought I had for a future optimization is that we could look at
reducing the count by 1 so that we could essentially combine the
non-frag and frag cases. Then instead of testing for 1 we would test
for 0 at the start of the function and test for < 0 to decide if we
want to free it or not instead of testing for 0. With that we can
essentially reduce the calls to the WARN_ON since we should only have
one case where we actually return a value < 0, and we can then check
to see if we overshot -1 which would be the WARN_ON case.

With that a value of 0 instead of 1 would indicate page frag is not in
use for the page *AND/OR* that the page has reached the state where
there are no other frags present so the page can be recycled. In
effect it would allow us to mix page frags and no frags within the
same pool. The added bonus would be we could get rid of the
PP_FLAG_PAGE_FRAG flag check in the __page_pool_put_page function and
replace it with a check for PAGE_POOL_DMA_USE_PP_FRAG_COUNT since we
cannot read frag_count in that case.
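
Roughly, a sketch of that idea (the helper name and the biased-by-one
semantics below are just an illustration of the above, not tested
code):

static inline bool page_pool_sub_frag_count_and_test(struct page *page,
						     long nr)
{
	long ret;

	/* 0 means "no other frag users", for frag and non-frag pages
	 * alike, so the sole owner can recycle right away.
	 */
	if (atomic_long_read(&page->pp_frag_count) == 0)
		return true;

	ret = atomic_long_sub_return(nr, &page->pp_frag_count);

	/* Only the final sub may cross below zero; overshooting -1
	 * would be a reference count bug.
	 */
	WARN_ON(ret < -1);

	return ret < 0;	/* true only for the last user */
}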


* Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-22 15:18           ` Alexander Duyck
@ 2021-07-23 11:12             ` Yunsheng Lin
  2021-07-23 16:08               ` Alexander Duyck
  0 siblings, 1 reply; 17+ messages in thread
From: Yunsheng Lin @ 2021-07-23 11:12 UTC
  To: Alexander Duyck
  Cc: David Miller, Jakub Kicinski, Russell King - ARM Linux,
	Marcin Wojtas, linuxarm, yisen.zhuang, Salil Mehta,
	thomas.petazzoni, hawk, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf

On 2021/7/22 23:18, Alexander Duyck wrote:
>>>
>>>> You are right that that may cover up the reference count errors. How about
>>>> something like below:
>>>>
>>>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>>>                                                           long nr)
>>>> {
>>>> #ifdef CONFIG_DEBUG_PAGE_REF
>>>>         long ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>
>>>>         WARN_ON(ret < 0);
>>>>
>>>>         return ret;
>>>> #else
>>>>         if (atomic_long_read(&page->pp_frag_count) == nr)
>>>>                 return 0;
>>>>
>>>>         return atomic_long_sub_return(nr, &page->pp_frag_count);
>>>> #endif
>>>> }
>>>>
>>>> Or any better suggestion?
>>>
>>> So the one thing I might change would be to make it so that you only
>>> do the atomic_long_read if nr is a constant via __builtin_constant_p.
>>> That way you would be performing the comparison in
>>> __page_pool_put_page and in the cases of freeing or draining the
>>> page_frags you would be using the atomic_long_sub_return which should
>>> be paths where you would not expect it to match or that are slowpath
>>> anyway.
>>>
>>> Also I would keep the WARN_ON in both paths just to be on the safe side.
>>
>> If I understand it correctly, we should change it as below, right?
>>
>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>                                                           long nr)
>> {
>>         long ret;
>>
>>         /* As suggested by Alexander, atomic_long_read() may cover up the
>>          * reference count errors, so avoid calling atomic_long_read() in
>>          * the cases of freeing or draining the page_frags, where we would
>>          * not expect it to match or that are slowpath anyway.
>>          */
>>         if (__builtin_constant_p(nr) &&
>>             atomic_long_read(&page->pp_frag_count) == nr)
>>                 return 0;
>>
>>         ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>         WARN_ON(ret < 0);
>>         return ret;
>> }
> 
> Yes, that is what I had in mind.
> 
> One thought I had for a future optimization is that we could look at
> reducing the count by 1 so that we could essentially combine the
> non-frag and frag cases. Then instead of testing for 1 we would test
> for 0 at the start of the function and test for < 0 to decide if we
> want to free it or not instead of testing for 0. With that we can
> essentially reduce the calls to the WARN_ON since we should only have
> one case where we actually return a value < 0, and we can then check
> to see if we overshot -1 which would be the WARN_ON case.
> 
> With that a value of 0 instead of 1 would indicate page frag is not in
> use for the page *AND/OR* that the page has reached the state where
> there are no other frags present so the page can be recycled. In
> effect it would allow us to mix page frags and no frags within the
> same pool. The added bonus would be we could get rid of the
> PP_FLAG_PAGE_FRAG flag check in the __page_pool_put_page function and
> replace it with a check for PAGE_POOL_DMA_USE_PP_FRAG_COUNT since we
> cannot read frag_count in that case.

Let's leave it for a future optimization.
I am not sure if there is a use case to support both frag pages and
non-frag pages for the same page pool. If there is, maybe we can use
"page->pp_frag_count > 0" to indicate that the page is a frag page,
and "page->pp_frag_count == 0" to indicate that the page is a non-frag
page, so that we can support frag pages and non-frag pages for the
same page pool instead of disabling non-frag page support when the
PP_FLAG_PAGE_FRAG flag is set. Might that conflict with the above
optimization?


Also, I am prototyping tx recycling based on page pool in order to see
if there is any value in supporting tx recycling.
As busypoll has enabled a one-to-one relation between NAPI and sock,
and there is a one-to-one relation between NAPI and page pool, perhaps
it makes sense to use page pool to recycle the tx pages too?

There are possibly the below problems when doing that, as far as I am
aware now:
1. busypoll is for rx, and tx may not be using the same queue as rx
   even if they are *technically* the same flow, so I am not sure it
   is ok to use the busypoll infrastructure to get the page pool ptr
   for a specific sock.

2. There may be multiple socks using the same page pool ptr to
   allocate pages for multiple flows, so we can not assume the same
   NAPI polling protection as rx, which might mean we can only use the
   recyclable pages from pool->ring under the r->consumer_lock
   protection.

3. Right now tcp_sendmsg_locked() uses sk_page_frag_refill() to refill
   the page frag for tcp xmit; when implementing a similar
   sk_page_pool_frag_refill() based on page pool (see the rough sketch
   below), I found that the tcp coalescing in tcp_mtu_probe() and the
   tcp fragmenting in tso_fragment() might mess with page_ref_count
   directly.

Given the above problems I am aware of (I believe there are other
problems I am not aware of yet), I am not sure if tcp tx page
recycling based on page pool is doable or not. I would like to hear
your opinion about tcp tx recycling support based on page pool first,
in case supporting it is a dead end.
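
The sk_page_pool_frag_refill() I am prototyping looks roughly like the
below (just a sketch: the pool argument stands in for whatever
sock-to-pool lookup would solve problem 1, and the half-page frag size
is arbitrary):

static bool sk_page_pool_frag_refill(struct page_pool *pool,
				     struct sock *sk,
				     struct page_frag *pfrag)
{
	unsigned int offset;

	/* Still room in the current frag?  Here pfrag->offset is the
	 * write position and pfrag->size is the end of the frag.
	 */
	if (pfrag->page && pfrag->offset < pfrag->size)
		return true;

	pfrag->page = page_pool_dev_alloc_frag(pool, &offset,
					       PAGE_SIZE / 2);
	if (unlikely(!pfrag->page))
		return false;	/* sk would enter memory pressure here */

	pfrag->offset = offset;
	pfrag->size = offset + PAGE_SIZE / 2;
	return true;
}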

> .
> 


* Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-23 11:12             ` Yunsheng Lin
@ 2021-07-23 16:08               ` Alexander Duyck
  2021-07-24 13:07                 ` Yunsheng Lin
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Duyck @ 2021-07-23 16:08 UTC
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Russell King - ARM Linux,
	Marcin Wojtas, linuxarm, yisen.zhuang, Salil Mehta,
	thomas.petazzoni, hawk, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf

On Fri, Jul 23, 2021 at 4:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2021/7/22 23:18, Alexander Duyck wrote:
> >>>
> >>>> You are right that that may cover up the reference count errors. How about
> >>>> something like below:
> >>>>
> >>>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
> >>>>                                                           long nr)
> >>>> {
> >>>> #ifdef CONFIG_DEBUG_PAGE_REF
> >>>>         long ret = atomic_long_sub_return(nr, &page->pp_frag_count);
> >>>>
> >>>>         WARN_ON(ret < 0);
> >>>>
> >>>>         return ret;
> >>>> #else
> >>>>         if (atomic_long_read(&page->pp_frag_count) == nr)
> >>>>                 return 0;
> >>>>
> >>>>         return atomic_long_sub_return(nr, &page->pp_frag_count);
> >>>> #endif
> >>>> }
> >>>>
> >>>> Or any better suggestion?
> >>>
> >>> So the one thing I might change would be to make it so that you only
> >>> do the atomic_long_read if nr is a constant via __builtin_constant_p.
> >>> That way you would be performing the comparison in
> >>> __page_pool_put_page and in the cases of freeing or draining the
> >>> page_frags you would be using the atomic_long_sub_return which should
> >>> be paths where you would not expect it to match or that are slowpath
> >>> anyway.
> >>>
> >>> Also I would keep the WARN_ON in both paths just to be on the safe side.
> >>
> >> If I understand it correctly, we should change it as below, right?
> >>
> >> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
> >>                                                           long nr)
> >> {
> >>         long ret;
> >>
> >>         /* As suggested by Alexander, atomic_long_read() may cover up the
> >>          * reference count errors, so avoid calling atomic_long_read() in
> >>          * the cases of freeing or draining the page_frags, where we would
> >>          * not expect it to match or that are slowpath anyway.
> >>          */
> >>         if (__builtin_constant_p(nr) &&
> >>             atomic_long_read(&page->pp_frag_count) == nr)
> >>                 return 0;
> >>
> >>         ret = atomic_long_sub_return(nr, &page->pp_frag_count);
> >>         WARN_ON(ret < 0);
> >>         return ret;
> >> }
> >
> > Yes, that is what I had in mind.
> >
> > One thought I had for a future optimization is that we could look at
> > reducing the count by 1 so that we could essentially combine the
> > non-frag and frag cases. Then instead of testing for 1 we would test
> > for 0 at the start of the function and test for < 0 to decide if we
> > want to free it or not instead of testing for 0. With that we can
> > essentially reduce the calls to the WARN_ON since we should only have
> > one case where we actually return a value < 0, and we can then check
> > to see if we overshot -1 which would be the WARN_ON case.
> >
> > With that a value of 0 instead of 1 would indicate page frag is not in
> > use for the page *AND/OR* that the page has reached the state where
> > there are no other frags present so the page can be recycled. In
> > effect it would allow us to mix page frags and no frags within the
> > same pool. The added bonus would be we could get rid of the check for
> > PP_FLAG_PAGE_FRAG flag check in the __page_pool_put_page function and
> > replace it with a check for PAGE_POOL_DMA_USE_PP_FRAG_COUNT since we
> > cannot read frag_count in that case.
>
> Let's leave it for a future optimization.
> I am not sure if there is use case to support both frag page and non-frag
> page for the same page pool. If there is, maybe we can use "page->pp_frag_count
> > 0" to indicate that the page is frag page, and "page->pp_frag_count == 0"
> to indicate that the page is non-frag page, so that we can support frag page and
> non-frag page for the same page pool instead of disabling non-frag page support
> when PP_FLAG_PAGE_FRAG flag is set, which might conflict with the above
> optimization?

As far as use cases, I can see a number of potential ones. For example,
in the case of drivers that do something like a header/data split I
could see potentially having the header pages be frags while the data
pages are 4K blocks. Basically the big optimization of the count ==
1/0/nr case is that you aren't increasing/decreasing the count and the
page is immediately being recycled/reused. So in such a case being able
to add a frag count to some pages, and not to others, would likely be
quite useful.

Basically by shifting the pool values by 1 you can have both in the
same pool with little issue. However, the big change is that instead of
testing for count == nr it would end up being pp_frag_count == nr - 1.
So in the case of the standard page pool pages being freed, or the last
frag, you would be looking at pp_frag_count == 0. In addition we can
mask the WARN_ON overhead: as you would be using -1 as the point to
free, you would only have to perform the WARN_ON check for the last
frag instead of for every frag.
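
Roughly, the biased helper might look like the below (untested sketch
only, reusing the names from this RFC; callers would treat a negative
return as "free/recycle the page"):

static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
                                                          long nr)
{
        long ret;

        /* pp_frag_count is biased by 1, so a non-frag page sits at 0
         * and the last user of a frag page reads nr - 1 here.
         */
        if (__builtin_constant_p(nr) &&
            atomic_long_read(&page->pp_frag_count) == nr - 1)
                return -1;

        ret = atomic_long_sub_return(nr, &page->pp_frag_count);

        /* -1 is the expected free point, so the sanity check only
         * runs on the path that actually frees the page.
         */
        if (unlikely(ret < 0))
                WARN_ON(ret < -1);

        return ret;
}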

> Also, I am prototyping the tx recycling based on page pool in order to see
> if there is any value supporting the tx recycling.

Just to clarify, when you say Tx recycling you are talking about
socket to netdev, correct? Just want to be certain, since the netdev to
netdev case should already have recycling for page pool pages as long
as it follows a 1<->1 path.

> As the busypoll has enable the one-to-one relation between NAPI and sock,
> and there is one-to-one relation between NAPI and page pool, perhaps it make
> senses that we use page pool to recycle the tx page too?
>
> There are possibly below problems when doing that as I am aware of now:
> 1. busypoll is for rx, and tx may not be using the same queue as rx even if
>    there are *technically* the same flow, so I am not sure it is ok to use
>    busypoll infrastructure to get the page pool ptr for a specific sock.
>
> 2. There may be multi socks using the same page pool ptr to allocate page for
>    multi flow, so we can not assume the same NAPI polling protection as rx,
>    which might mean we can only use the recyclable page from pool->ring under the
>    r->consumer_lock protection.
>
> 3. Right now tcp_sendmsg_locked() use sk_page_frag_refill() to refill the page
>    frag for tcp xmit, when implementing a similar sk_page_pool_frag_refill()
>    based on page pool, I found that tcp coalesce in tcp_mtu_probe() and
>    tcp fragment in tso_fragment() might mess with the page_ref_count directly.
>
> As the above the problem I am aware of(I believe there are other problems I am not
> aware of yet), I am not sure if the tcp tx page recycling based on page pool is
> doable or not, I would like to hear about your opinion about tcp tx recycling support
> based on page pool first, in case it is a dead end to support that.

I'm honestly not sure there is much there to gain. Last I knew, TCP was
using order-3 pages for transmitting, and as a result the overhead for
the pages should already be greatly reduced. In addition, one of the
main reasons for page_pool is the fact that the device has to DMA map
the page, and that can have very high overhead on systems with an
IOMMU.

Rather than trying to reuse the device's page pool, it might make more
sense to see if you couldn't have TCP just use some sort of circular
buffer of memory that is directly mapped for the device that it is
going to be transmitting to. Essentially what you would be doing is
creating a pre-mapped page, and you would need to communicate that the
memory is already mapped for the device you want to send it to so that
it could skip that step.
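
Purely as an illustration of the idea (every name below is made up,
nothing like this exists today), the socket side could own something
like:

struct tcp_tx_ring {
        struct device *dev;     /* device the pages are mapped for */
        struct page *pages[16]; /* ring size is an arbitrary pick */
        dma_addr_t dma[16];     /* filled once via dma_map_page() */
        unsigned int head;      /* next slot handed out for xmit */
        unsigned int tail;      /* advanced on tx completion */
};

and the skb path would only need a way to tell the driver that frag N
is already mapped at dma[N] + offset, so that the per-frag
dma_map_page() call can be skipped.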

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-23 16:08               ` Alexander Duyck
@ 2021-07-24 13:07                 ` Yunsheng Lin
  2021-07-25 16:49                   ` Alexander Duyck
  0 siblings, 1 reply; 17+ messages in thread
From: Yunsheng Lin @ 2021-07-24 13:07 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yunsheng Lin, David Miller, Jakub Kicinski,
	Russell King - ARM Linux, Marcin Wojtas, linuxarm, yisen.zhuang,
	Salil Mehta, thomas.petazzoni, hawk, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, fenghua.yu, guro, Peter Xu, Feng Tang,
	Jason Gunthorpe, Matteo Croce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	nogikh, Marco Elver, Yonghong Song, kpsingh, andrii,
	Martin KaFai Lau, songliubraving, Netdev, LKML, bpf

On Fri, Jul 23, 2021 at 09:08:00AM -0700, Alexander Duyck wrote:
> On Fri, Jul 23, 2021 at 4:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >
> > On 2021/7/22 23:18, Alexander Duyck wrote:
> > >>>
> > >>>> You are right that that may cover up the reference count errors. How about
> > >>>> something like below:
> > >>>>
> > >>>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
> > >>>>                                                           long nr)
> > >>>> {
> > >>>> #ifdef CONFIG_DEBUG_PAGE_REF
> > >>>>         long ret = atomic_long_sub_return(nr, &page->pp_frag_count);
> > >>>>
> > >>>>         WARN_ON(ret < 0);
> > >>>>
> > >>>>         return ret;
> > >>>> #else
> > >>>>         if (atomic_long_read(&page->pp_frag_count) == nr)
> > >>>>                 return 0;
> > >>>>
> > >>>>         return atomic_long_sub_return(nr, &page->pp_frag_count);
> > >>>> #endif
> > >>>> }
> > >>>>
> > >>>> Or any better suggestion?
> > >>>
> > >>> So the one thing I might change would be to make it so that you only
> > >>> do the atomic_long_read if nr is a constant via __builtin_constant_p.
> > >>> That way you would be performing the comparison in
> > >>> __page_pool_put_page and in the cases of freeing or draining the
> > >>> page_frags you would be using the atomic_long_sub_return which should
> > >>> be paths where you would not expect it to match or that are slowpath
> > >>> anyway.
> > >>>
> > >>> Also I would keep the WARN_ON in both paths just to be on the safe side.
> > >>
> > >> If I understand it correctly, we should change it as below, right?
> > >>
> > >> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
> > >>                                                           long nr)
> > >> {
> > >>         long ret;
> > >>
> > >>         /* As suggested by Alexander, atomic_long_read() may cover up the
> > >>          * reference count errors, so avoid calling atomic_long_read() in
> > >>          * the cases of freeing or draining the page_frags, where we would
> > >>          * not expect it to match or that are slowpath anyway.
> > >>          */
> > >>         if (__builtin_constant_p(nr) &&
> > >>             atomic_long_read(&page->pp_frag_count) == nr)
> > >>                 return 0;
> > >>
> > >>         ret = atomic_long_sub_return(nr, &page->pp_frag_count);
> > >>         WARN_ON(ret < 0);
> > >>         return ret;
> > >> }
> > >
> > > Yes, that is what I had in mind.
> > >
> > > One thought I had for a future optimization is that we could look at
> > > reducing the count by 1 so that we could essentially combine the
> > > non-frag and frag cases. Then instead of testing for 1 we would test
> > > for 0 at the start of the function and test for < 0 to decide if we
> > > want to free it or not instead of testing for 0. With that we can
> > > essentially reduce the calls to the WARN_ON since we should only have
> > > one case where we actually return a value < 0, and we can then check
> > > to see if we overshot -1 which would be the WARN_ON case.
> > >
> > > With that a value of 0 instead of 1 would indicate page frag is not in
> > > use for the page *AND/OR* that the page has reached the state where
> > > there are no other frags present so the page can be recycled. In
> > > effect it would allow us to mix page frags and no frags within the
> > > same pool. The added bonus would be we could get rid of the check for
> > > PP_FLAG_PAGE_FRAG flag check in the __page_pool_put_page function and
> > > replace it with a check for PAGE_POOL_DMA_USE_PP_FRAG_COUNT since we
> > > cannot read frag_count in that case.
> >
> > Let's leave it for a future optimization.
> > I am not sure if there is use case to support both frag page and non-frag
> > page for the same page pool. If there is, maybe we can use "page->pp_frag_count
> > > 0" to indicate that the page is frag page, and "page->pp_frag_count == 0"
> > to indicate that the page is non-frag page, so that we can support frag page and
> > non-frag page for the same page pool instead of disabling non-frag page support
> > when PP_FLAG_PAGE_FRAG flag is set, which might conflict with the above
> > optimization?
> 
> As far as use case I can see a number of potential uses. For example
> in the case of drivers that do something like a header/data split I
> could see potentially having the header pages be frags while the data
> pages being 4K blocks. Basically the big optimization of the count ==
> 1/0/nr case is that you aren't increasing/decreasing the count and it
> is immediately being recycled/reused. So in such a case being able to
> add frag count some pages, and not to others would likely be quite
> useful.

I am not sure how the header/data split is implemented in hw, but it
seems the driver is not able to tell in advance which desc will be
filled with header or data, so it might need to allocate a 4K block for
every desc?

> 
> Basically by shifting the pool values by 1 you can have both in the
> same pool with little issue. However the big change is that instead of
> testing for count = nr it would end up being pp_frag_count = nr - 1.
> So in the case of the standard page pool pages being freed or the last
> frag you would be looking at pp_frag_count = 0. In addition we can
> mask the WARN_ON overhead as you would be using -1 as the point to
> free so you would only have to perform the WARN_ON check for the last
> frag instead of every frag.

Yes, it seems doable.

> 
> > Also, I am prototyping the tx recycling based on page pool in order to see
> > if there is any value supporting the tx recycling.
> 
> Just to clarify here when you say Tx recycling you are talking about
> socket to netdev correct? Just want to be certain since the netdev to
> netdev case should already have recycling for page pool pages as long
> as it follows a 1<->1 path.

Yes, the above Tx recycling meant socket to netdev.
Also, the above "netdev to netdev" only means XDP for now, and not the
IP forwarding path in the network stack, right?

> 
> > As the busypoll has enable the one-to-one relation between NAPI and sock,
> > and there is one-to-one relation between NAPI and page pool, perhaps it make
> > senses that we use page pool to recycle the tx page too?
> >
> > There are possibly below problems when doing that as I am aware of now:
> > 1. busypoll is for rx, and tx may not be using the same queue as rx even if
> >    there are *technically* the same flow, so I am not sure it is ok to use
> >    busypoll infrastructure to get the page pool ptr for a specific sock.
> >
> > 2. There may be multi socks using the same page pool ptr to allocate page for
> >    multi flow, so we can not assume the same NAPI polling protection as rx,
> >    which might mean we can only use the recyclable page from pool->ring under the
> >    r->consumer_lock protection.
> >
> > 3. Right now tcp_sendmsg_locked() use sk_page_frag_refill() to refill the page
> >    frag for tcp xmit, when implementing a similar sk_page_pool_frag_refill()
> >    based on page pool, I found that tcp coalesce in tcp_mtu_probe() and
> >    tcp fragment in tso_fragment() might mess with the page_ref_count directly.
> >
> > As the above the problem I am aware of(I believe there are other problems I am not
> > aware of yet), I am not sure if the tcp tx page recycling based on page pool is
> > doable or not, I would like to hear about your opinion about tcp tx recycling support
> > based on page pool first, in case it is a dead end to support that.
> 
> I'm honestly not sure there is much there to gain. Last I knew TCP was
> using order 3 pages for transmitting and as a result the overhead for
> the pages should already be greatly reduced. In addition one of the
> main reasons for page_pool  is the fact that the device has to DMA map
> the page and that can have very high overhead on systems with an
> IOMMU.

Yes, avoiding the IOMMU overhead is the main gain. And "order 3 pages"
seems to be disabled by default?

> 
> Rather than trying to reuse the devices page pool it might make more
> sense to see if you couldn't have TCP just use some sort of circular
> buffer of memory that is directly mapped for the device that it is
> going to be transmitting to. Essentially what you would be doing is
> creating a pre-mapped page and would need to communicate that the
> memory is already mapped for the device you want to send it to so that
> it could skip that step.

IIUC sk_page_frag_refill() is already doing a similar reuse to the rx
reuse implemented in most drivers, except that it does not do the
pre-mapping.

And it seems that even if we pre-map the page and communicate that the
memory is already mapped to the driver, it is likely that we will not
be able to reuse the page when the circular buffer is not big enough
or the tx completion/tcp ack is not happening quickly enough, which
might mean unmapping/deallocating the old circular buffer and
allocating/mapping a new circular buffer.

Using a page pool we might be able to alleviate the above problem, as
it does for rx?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-24 13:07                 ` Yunsheng Lin
@ 2021-07-25 16:49                   ` Alexander Duyck
  2021-07-27  7:54                     ` Yunsheng Lin
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Duyck @ 2021-07-25 16:49 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Yunsheng Lin, David Miller, Jakub Kicinski,
	Russell King - ARM Linux, Marcin Wojtas, linuxarm, yisen.zhuang,
	Salil Mehta, thomas.petazzoni, hawk, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, fenghua.yu, guro, Peter Xu, Feng Tang,
	Jason Gunthorpe, Matteo Croce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	nogikh, Marco Elver, Yonghong Song, kpsingh, andrii,
	Martin KaFai Lau, songliubraving, Netdev, LKML, bpf

On Sat, Jul 24, 2021 at 6:07 AM Yunsheng Lin <yunshenglin0825@gmail.com> wrote:
>
> On Fri, Jul 23, 2021 at 09:08:00AM -0700, Alexander Duyck wrote:
> > On Fri, Jul 23, 2021 at 4:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> > >
> > > On 2021/7/22 23:18, Alexander Duyck wrote:
> > > >>>
> > > >>>> You are right that that may cover up the reference count errors. How about
> > > >>>> something like below:
> > > >>>>
> > > >>>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
> > > >>>>                                                           long nr)
> > > >>>> {
> > > >>>> #ifdef CONFIG_DEBUG_PAGE_REF
> > > >>>>         long ret = atomic_long_sub_return(nr, &page->pp_frag_count);
> > > >>>>
> > > >>>>         WARN_ON(ret < 0);
> > > >>>>
> > > >>>>         return ret;
> > > >>>> #else
> > > >>>>         if (atomic_long_read(&page->pp_frag_count) == nr)
> > > >>>>                 return 0;
> > > >>>>
> > > >>>>         return atomic_long_sub_return(nr, &page->pp_frag_count);
> > > >>>> #endif
> > > >>>> }
> > > >>>>
> > > >>>> Or any better suggestion?
> > > >>>
> > > >>> So the one thing I might change would be to make it so that you only
> > > >>> do the atomic_long_read if nr is a constant via __builtin_constant_p.
> > > >>> That way you would be performing the comparison in
> > > >>> __page_pool_put_page and in the cases of freeing or draining the
> > > >>> page_frags you would be using the atomic_long_sub_return which should
> > > >>> be paths where you would not expect it to match or that are slowpath
> > > >>> anyway.
> > > >>>
> > > >>> Also I would keep the WARN_ON in both paths just to be on the safe side.
> > > >>
> > > >> If I understand it correctly, we should change it as below, right?
> > > >>
> > > >> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
> > > >>                                                           long nr)
> > > >> {
> > > >>         long ret;
> > > >>
> > > >>         /* As suggested by Alexander, atomic_long_read() may cover up the
> > > >>          * reference count errors, so avoid calling atomic_long_read() in
> > > >>          * the cases of freeing or draining the page_frags, where we would
> > > >>          * not expect it to match or that are slowpath anyway.
> > > >>          */
> > > >>         if (__builtin_constant_p(nr) &&
> > > >>             atomic_long_read(&page->pp_frag_count) == nr)
> > > >>                 return 0;
> > > >>
> > > >>         ret = atomic_long_sub_return(nr, &page->pp_frag_count);
> > > >>         WARN_ON(ret < 0);
> > > >>         return ret;
> > > >> }
> > > >
> > > > Yes, that is what I had in mind.
> > > >
> > > > One thought I had for a future optimization is that we could look at
> > > > reducing the count by 1 so that we could essentially combine the
> > > > non-frag and frag cases. Then instead of testing for 1 we would test
> > > > for 0 at the start of the function and test for < 0 to decide if we
> > > > want to free it or not instead of testing for 0. With that we can
> > > > essentially reduce the calls to the WARN_ON since we should only have
> > > > one case where we actually return a value < 0, and we can then check
> > > > to see if we overshot -1 which would be the WARN_ON case.
> > > >
> > > > With that a value of 0 instead of 1 would indicate page frag is not in
> > > > use for the page *AND/OR* that the page has reached the state where
> > > > there are no other frags present so the page can be recycled. In
> > > > effect it would allow us to mix page frags and no frags within the
> > > > same pool. The added bonus would be we could get rid of the check for
> > > > PP_FLAG_PAGE_FRAG flag check in the __page_pool_put_page function and
> > > > replace it with a check for PAGE_POOL_DMA_USE_PP_FRAG_COUNT since we
> > > > cannot read frag_count in that case.
> > >
> > > Let's leave it for a future optimization.
> > > I am not sure if there is use case to support both frag page and non-frag
> > > page for the same page pool. If there is, maybe we can use "page->pp_frag_count
> > > > 0" to indicate that the page is frag page, and "page->pp_frag_count == 0"
> > > to indicate that the page is non-frag page, so that we can support frag page and
> > > non-frag page for the same page pool instead of disabling non-frag page support
> > > when PP_FLAG_PAGE_FRAG flag is set, which might conflict with the above
> > > optimization?
> >
> > As far as use case I can see a number of potential uses. For example
> > in the case of drivers that do something like a header/data split I
> > could see potentially having the header pages be frags while the data
> > pages being 4K blocks. Basically the big optimization of the count ==
> > 1/0/nr case is that you aren't increasing/decreasing the count and it
> > is immediately being recycled/reused. So in such a case being able to
> > add frag count some pages, and not to others would likely be quite
> > useful.
>
> I am not sure how the header/data split is implemented in hw, but it
> seems the driver is not able to tell which desc will be filled with
> header or data in advance, so it might need to allocate 4K block for
> all desc?

It all depends on the hardware config. In theory you could have
anything from a single use for a page to multiple uses for a page in
the case of headers and/or packets being small. The overhead for
adding/removing the frag count could end up being more than what is
needed if the page is only used once. That is why I was thinking it
might make sense to allow both to coexist in the same pool.

> >
> > Basically by shifting the pool values by 1 you can have both in the
> > same pool with little issue. However the big change is that instead of
> > testing for count = nr it would end up being pp_frag_count = nr - 1.
> > So in the case of the standard page pool pages being freed or the last
> > frag you would be looking at pp_frag_count = 0. In addition we can
> > mask the WARN_ON overhead as you would be using -1 as the point to
> > free so you would only have to perform the WARN_ON check for the last
> > frag instead of every frag.
>
> Yes, it seems doable.
>
> >
> > > Also, I am prototyping the tx recycling based on page pool in order to see
> > > if there is any value supporting the tx recycling.
> >
> > Just to clarify here when you say Tx recycling you are talking about
> > socket to netdev correct? Just want to be certain since the netdev to
> > netdev case should already have recycling for page pool pages as long
> > as it follows a 1<->1 path.
>
> Yes, the above Tx recycling meant socket to netdev.
> Also, the above "netdev to netdev" only meant XDP now, but not the IP
> forwarding path in the network stack, right?
>
> >
> > > As the busypoll has enable the one-to-one relation between NAPI and sock,
> > > and there is one-to-one relation between NAPI and page pool, perhaps it make
> > > senses that we use page pool to recycle the tx page too?
> > >
> > > There are possibly below problems when doing that as I am aware of now:
> > > 1. busypoll is for rx, and tx may not be using the same queue as rx even if
> > >    there are *technically* the same flow, so I am not sure it is ok to use
> > >    busypoll infrastructure to get the page pool ptr for a specific sock.
> > >
> > > 2. There may be multi socks using the same page pool ptr to allocate page for
> > >    multi flow, so we can not assume the same NAPI polling protection as rx,
> > >    which might mean we can only use the recyclable page from pool->ring under the
> > >    r->consumer_lock protection.
> > >
> > > 3. Right now tcp_sendmsg_locked() use sk_page_frag_refill() to refill the page
> > >    frag for tcp xmit, when implementing a similar sk_page_pool_frag_refill()
> > >    based on page pool, I found that tcp coalesce in tcp_mtu_probe() and
> > >    tcp fragment in tso_fragment() might mess with the page_ref_count directly.
> > >
> > > As the above the problem I am aware of(I believe there are other problems I am not
> > > aware of yet), I am not sure if the tcp tx page recycling based on page pool is
> > > doable or not, I would like to hear about your opinion about tcp tx recycling support
> > > based on page pool first, in case it is a dead end to support that.
> >
> > I'm honestly not sure there is much there to gain. Last I knew TCP was
> > using order 3 pages for transmitting and as a result the overhead for
> > the pages should already be greatly reduced. In addition one of the
> > main reasons for page_pool  is the fact that the device has to DMA map
> > the page and that can have very high overhead on systems with an
> > IOMMU.
>
> Yes, avoiding the IOMMU overhead is the main gain. and "order 3 pages"
> seems to be disabled by default?
>
> >
> > Rather than trying to reuse the devices page pool it might make more
> > sense to see if you couldn't have TCP just use some sort of circular
> > buffer of memory that is directly mapped for the device that it is
> > going to be transmitting to. Essentially what you would be doing is
> > creating a pre-mapped page and would need to communicate that the
> > memory is already mapped for the device you want to send it to so that
> > it could skip that step.
>
> IIUC sk_page_frag_refill() is already doing a similar reusing as the
> rx reusing implemented in most driver except for the not pre-mapping
> part.
>
> And it seems that even if we pre-map the page and communicate that the
> memory is already mapped to the driver, it is likely that we will not
> be able to reuse the page when the circular buffer is not big enough
> or tx completion/tcp ack is not happening quickly enough, which might
> means unmapping/deallocating old circular buffer and allocating/mapping
> new circular buffer.
>
> Using page pool we might be able to alleviate the above problem as it
> does for rx?

I would say that instead of looking at going straight for the page
pool it might make more sense to look at seeing if we can coalesce the
DMA mapping of the pages first at the socket layer, rather than trying
to introduce the overhead of the page pool. In the case of sockets we
already have the destructors that are called when the memory is freed,
so instead of making sockets use the page pool it might make more sense
to extend the socket buffer allocation/freeing to incorporate bulk
mapping and unmapping of pages to optimize the socket Tx path in the
32K page case.
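
As a rough sketch of what I mean (the names here are assumptions, not
existing API): map the order-3 page once when it is refilled, and let
every frag carved out of it reuse that single mapping:

struct sk_mapped_frag {
        struct page *page;      /* order-3 (32K) page backing the frag */
        dma_addr_t base;        /* one dma_map_page() for all 32K */
        unsigned int offset;    /* current fill offset into the page */
};

/* one mapping amortized over every frag cut from the page */
static dma_addr_t sk_mapped_frag_dma(const struct sk_mapped_frag *mf,
                                     unsigned int frag_offset)
{
        return mf->base + frag_offset;
}

The unmap would then happen once per 32K page, for example from the skb
destructor, instead of once per frag in the driver.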

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-25 16:49                   ` Alexander Duyck
@ 2021-07-27  7:54                     ` Yunsheng Lin
  2021-07-27 18:38                       ` Alexander Duyck
  0 siblings, 1 reply; 17+ messages in thread
From: Yunsheng Lin @ 2021-07-27  7:54 UTC (permalink / raw)
  To: Alexander Duyck, Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Russell King - ARM Linux,
	Marcin Wojtas, linuxarm, yisen.zhuang, Salil Mehta,
	thomas.petazzoni, hawk, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Morton, Peter Zijlstra,
	Will Deacon, Matthew Wilcox, Vlastimil Babka, fenghua.yu, guro,
	Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce, Hugh Dickins,
	Jonathan Lemon, Alexander Lobakin, Willem de Bruijn, wenxu,
	Cong Wang, Kevin Hao, nogikh, Marco Elver, Yonghong Song,
	kpsingh, andrii, Martin KaFai Lau, songliubraving, Netdev, LKML,
	bpf

On 2021/7/26 0:49, Alexander Duyck wrote:
> On Sat, Jul 24, 2021 at 6:07 AM Yunsheng Lin <yunshenglin0825@gmail.com> wrote:
>>
>> On Fri, Jul 23, 2021 at 09:08:00AM -0700, Alexander Duyck wrote:
>>> On Fri, Jul 23, 2021 at 4:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>
>>>> On 2021/7/22 23:18, Alexander Duyck wrote:
>>>>>>>
>>>>>>>> You are right that that may cover up the reference count errors. How about
>>>>>>>> something like below:
>>>>>>>>
>>>>>>>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>>>>>>>                                                           long nr)
>>>>>>>> {
>>>>>>>> #ifdef CONFIG_DEBUG_PAGE_REF
>>>>>>>>         long ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>>>
>>>>>>>>         WARN_ON(ret < 0);
>>>>>>>>
>>>>>>>>         return ret;
>>>>>>>> #else
>>>>>>>>         if (atomic_long_read(&page->pp_frag_count) == nr)
>>>>>>>>                 return 0;
>>>>>>>>
>>>>>>>>         return atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>>> #endif
>>>>>>>> }
>>>>>>>>
>>>>>>>> Or any better suggestion?
>>>>>>>
>>>>>>> So the one thing I might change would be to make it so that you only
>>>>>>> do the atomic_long_read if nr is a constant via __builtin_constant_p.
>>>>>>> That way you would be performing the comparison in
>>>>>>> __page_pool_put_page and in the cases of freeing or draining the
>>>>>>> page_frags you would be using the atomic_long_sub_return which should
>>>>>>> be paths where you would not expect it to match or that are slowpath
>>>>>>> anyway.
>>>>>>>
>>>>>>> Also I would keep the WARN_ON in both paths just to be on the safe side.
>>>>>>
>>>>>> If I understand it correctly, we should change it as below, right?
>>>>>>
>>>>>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>>>>>                                                           long nr)
>>>>>> {
>>>>>>         long ret;
>>>>>>
>>>>>>         /* As suggested by Alexander, atomic_long_read() may cover up the
>>>>>>          * reference count errors, so avoid calling atomic_long_read() in
>>>>>>          * the cases of freeing or draining the page_frags, where we would
>>>>>>          * not expect it to match or that are slowpath anyway.
>>>>>>          */
>>>>>>         if (__builtin_constant_p(nr) &&
>>>>>>             atomic_long_read(&page->pp_frag_count) == nr)
>>>>>>                 return 0;
>>>>>>
>>>>>>         ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>         WARN_ON(ret < 0);
>>>>>>         return ret;
>>>>>> }
>>>>>
>>>>> Yes, that is what I had in mind.
>>>>>
>>>>> One thought I had for a future optimization is that we could look at
>>>>> reducing the count by 1 so that we could essentially combine the
>>>>> non-frag and frag cases. Then instead of testing for 1 we would test
>>>>> for 0 at the start of the function and test for < 0 to decide if we
>>>>> want to free it or not instead of testing for 0. With that we can
>>>>> essentially reduce the calls to the WARN_ON since we should only have
>>>>> one case where we actually return a value < 0, and we can then check
>>>>> to see if we overshot -1 which would be the WARN_ON case.
>>>>>
>>>>> With that a value of 0 instead of 1 would indicate page frag is not in
>>>>> use for the page *AND/OR* that the page has reached the state where
>>>>> there are no other frags present so the page can be recycled. In
>>>>> effect it would allow us to mix page frags and no frags within the
>>>>> same pool. The added bonus would be we could get rid of the check for
>>>>> PP_FLAG_PAGE_FRAG flag check in the __page_pool_put_page function and
>>>>> replace it with a check for PAGE_POOL_DMA_USE_PP_FRAG_COUNT since we
>>>>> cannot read frag_count in that case.
>>>>
>>>> Let's leave it for a future optimization.
>>>> I am not sure if there is use case to support both frag page and non-frag
>>>> page for the same page pool. If there is, maybe we can use "page->pp_frag_count
>>>>> 0" to indicate that the page is frag page, and "page->pp_frag_count == 0"
>>>> to indicate that the page is non-frag page, so that we can support frag page and
>>>> non-frag page for the same page pool instead of disabling non-frag page support
>>>> when PP_FLAG_PAGE_FRAG flag is set, which might conflict with the above
>>>> optimization?
>>>
>>> As far as use case I can see a number of potential uses. For example
>>> in the case of drivers that do something like a header/data split I
>>> could see potentially having the header pages be frags while the data
>>> pages being 4K blocks. Basically the big optimization of the count ==
>>> 1/0/nr case is that you aren't increasing/decreasing the count and it
>>> is immediately being recycled/reused. So in such a case being able to
>>> add frag count some pages, and not to others would likely be quite
>>> useful.
>>
>> I am not sure how the header/data split is implemented in hw, but it
>> seems the driver is not able to tell which desc will be filled with
>> header or data in advance, so it might need to allocate 4K block for
>> all desc?
> 
> It all depends on the hardware config. In theory you could have
> anything from a single use for a page to multiple uses for a page in
> the case of headers and/or packets being small. The overhead for
> adding/removing the frag count could end up being more than what is
> needed if the page is only used once. That is why I was thinking it
> might make sense to allow both to coexist in the same pool.

I agree that there may be a use case for using both frag pages and non-frag
pages in the same page pool. Let's leave it for now.

> 
>>>
>>> Basically by shifting the pool values by 1 you can have both in the
>>> same pool with little issue. However the big change is that instead of
>>> testing for count = nr it would end up being pp_frag_count = nr - 1.
>>> So in the case of the standard page pool pages being freed or the last
>>> frag you would be looking at pp_frag_count = 0. In addition we can
>>> mask the WARN_ON overhead as you would be using -1 as the point to
>>> free so you would only have to perform the WARN_ON check for the last
>>> frag instead of every frag.
>>
>> Yes, it seems doable.
>>
>>>
>>>> Also, I am prototyping the tx recycling based on page pool in order to see
>>>> if there is any value supporting the tx recycling.
>>>
>>> Just to clarify here when you say Tx recycling you are talking about
>>> socket to netdev correct? Just want to be certain since the netdev to
>>> netdev case should already have recycling for page pool pages as long
>>> as it follows a 1<->1 path.
>>
>> Yes, the above Tx recycling meant socket to netdev.
>> Also, the above "netdev to netdev" only meant XDP now, but not the IP
>> forwarding path in the network stack, right?
>>
>>>
>>>> As the busypoll has enable the one-to-one relation between NAPI and sock,
>>>> and there is one-to-one relation between NAPI and page pool, perhaps it make
>>>> senses that we use page pool to recycle the tx page too?
>>>>
>>>> There are possibly below problems when doing that as I am aware of now:
>>>> 1. busypoll is for rx, and tx may not be using the same queue as rx even if
>>>>    there are *technically* the same flow, so I am not sure it is ok to use
>>>>    busypoll infrastructure to get the page pool ptr for a specific sock.
>>>>
>>>> 2. There may be multi socks using the same page pool ptr to allocate page for
>>>>    multi flow, so we can not assume the same NAPI polling protection as rx,
>>>>    which might mean we can only use the recyclable page from pool->ring under the
>>>>    r->consumer_lock protection.
>>>>
>>>> 3. Right now tcp_sendmsg_locked() use sk_page_frag_refill() to refill the page
>>>>    frag for tcp xmit, when implementing a similar sk_page_pool_frag_refill()
>>>>    based on page pool, I found that tcp coalesce in tcp_mtu_probe() and
>>>>    tcp fragment in tso_fragment() might mess with the page_ref_count directly.
>>>>
>>>> As the above the problem I am aware of(I believe there are other problems I am not
>>>> aware of yet), I am not sure if the tcp tx page recycling based on page pool is
>>>> doable or not, I would like to hear about your opinion about tcp tx recycling support
>>>> based on page pool first, in case it is a dead end to support that.
>>>
>>> I'm honestly not sure there is much there to gain. Last I knew TCP was
>>> using order 3 pages for transmitting and as a result the overhead for
>>> the pages should already be greatly reduced. In addition one of the
>>> main reasons for page_pool  is the fact that the device has to DMA map
>>> the page and that can have very high overhead on systems with an
>>> IOMMU.
>>
>> Yes, avoiding the IOMMU overhead is the main gain. and "order 3 pages"
>> seems to be disabled by default?
>>
>>>
>>> Rather than trying to reuse the devices page pool it might make more
>>> sense to see if you couldn't have TCP just use some sort of circular
>>> buffer of memory that is directly mapped for the device that it is
>>> going to be transmitting to. Essentially what you would be doing is
>>> creating a pre-mapped page and would need to communicate that the
>>> memory is already mapped for the device you want to send it to so that
>>> it could skip that step.
>>
>> IIUC sk_page_frag_refill() is already doing a similar reusing as the
>> rx reusing implemented in most driver except for the not pre-mapping
>> part.
>>
>> And it seems that even if we pre-map the page and communicate that the
>> memory is already mapped to the driver, it is likely that we will not
>> be able to reuse the page when the circular buffer is not big enough
>> or tx completion/tcp ack is not happening quickly enough, which might
>> means unmapping/deallocating old circular buffer and allocating/mapping
>> new circular buffer.
>>
>> Using page pool we might be able to alleviate the above problem as it
>> does for rx?
> 
> I would say that instead of looking at going straight for the page
> pool it might make more sense to look at seeing if we can coalesce the
> DMA mapping of the pages first at the socket layer rather than trying
> to introduce the overhead for the page pool. In the case of sockets we
> already have the destructors that are called when the memory is freed,
> so instead of making sockets use page pool it might make more sense to
> extend the socket buffer allocation/freeing to incorporate bulk
> mapping and unmapping of pages to optimize the socket Tx path in the
> 32K page case.

I was able to enable tx recycling prototyping based on page pool and
run some performance tests; the performance improvement is about +20%
(30Gbit -> 38Gbit) for a single-thread iperf tcp flow when the IOMMU is
in strict mode. And CPU usage decreases about 10% for a four-thread
iperf tcp flow at a line speed of 100Gbit when the IOMMU is in strict
mode.

Looking at the prototyping code, I agree that it is a bit controversial
to use the page pool for tx, as the page pool assumes NAPI polling
protection on the allocation side.

So I will take a deeper look at your suggestion above to see how to
implement it.

Also, I am assuming the "destructors" means tcp_wfree() for TCP, right?
It seems tcp_wfree() is mainly used to do memory accounting and free
"struct sock" if necessary.
I am not so familiar with socket layer to understand how the "destructors"
will be helpful here, any detailed idea how to use "destructors" here?

> .
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-27  7:54                     ` Yunsheng Lin
@ 2021-07-27 18:38                       ` Alexander Duyck
  2021-08-02  9:17                         ` Yunsheng Lin
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Duyck @ 2021-07-27 18:38 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Yunsheng Lin, David Miller, Jakub Kicinski,
	Russell King - ARM Linux, Marcin Wojtas, linuxarm, yisen.zhuang,
	Salil Mehta, thomas.petazzoni, hawk, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, fenghua.yu, guro, Peter Xu, Feng Tang,
	Jason Gunthorpe, Matteo Croce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	nogikh, Marco Elver, Yonghong Song, kpsingh, andrii,
	Martin KaFai Lau, songliubraving, Netdev, LKML, bpf

On Tue, Jul 27, 2021 at 12:54 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2021/7/26 0:49, Alexander Duyck wrote:
> > On Sat, Jul 24, 2021 at 6:07 AM Yunsheng Lin <yunshenglin0825@gmail.com> wrote:
> >>
> >> On Fri, Jul 23, 2021 at 09:08:00AM -0700, Alexander Duyck wrote:
> >>> On Fri, Jul 23, 2021 at 4:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>>
> >>>> On 2021/7/22 23:18, Alexander Duyck wrote:
> >>>>>>>

<snip>

> >>>
> >>> Rather than trying to reuse the devices page pool it might make more
> >>> sense to see if you couldn't have TCP just use some sort of circular
> >>> buffer of memory that is directly mapped for the device that it is
> >>> going to be transmitting to. Essentially what you would be doing is
> >>> creating a pre-mapped page and would need to communicate that the
> >>> memory is already mapped for the device you want to send it to so that
> >>> it could skip that step.
> >>
> >> IIUC sk_page_frag_refill() is already doing a similar reusing as the
> >> rx reusing implemented in most driver except for the not pre-mapping
> >> part.
> >>
> >> And it seems that even if we pre-map the page and communicate that the
> >> memory is already mapped to the driver, it is likely that we will not
> >> be able to reuse the page when the circular buffer is not big enough
> >> or tx completion/tcp ack is not happening quickly enough, which might
> >> means unmapping/deallocating old circular buffer and allocating/mapping
> >> new circular buffer.
> >>
> >> Using page pool we might be able to alleviate the above problem as it
> >> does for rx?
> >
> > I would say that instead of looking at going straight for the page
> > pool it might make more sense to look at seeing if we can coalesce the
> > DMA mapping of the pages first at the socket layer rather than trying
> > to introduce the overhead for the page pool. In the case of sockets we
> > already have the destructors that are called when the memory is freed,
> > so instead of making sockets use page pool it might make more sense to
> > extend the socket buffer allocation/freeing to incorporate bulk
> > mapping and unmapping of pages to optimize the socket Tx path in the
> > 32K page case.
>
> I was able to enable tx recycling prototyping based on page pool to
> run some performance test, the performance improvement is about +20%
> (30Gbit -> 38Gbit) for single thread iperf tcp flow when IOMMU is in
> strict mode. And CPU usage decreases about 10% for four threads iperf
> tcp flow for line speed of 100Gbit when IOMMU is in strict mode.

That isn't surprising, given that for most devices the IOMMU will be
called per frag, which can add a fair bit of overhead.

> Looking at the prototyping code, I am agreed that it is a bit controversial
> to use the page pool for tx as the page pool is assuming NAPI polling
> protection for allocation side.
>
> So I will take a deeper look about your suggestion above to see how to
> implement it.
>
> Also, I am assuming the "destructors" means tcp_wfree() for TCP, right?
> It seems tcp_wfree() is mainly used to do memory accounting and free
> "struct sock" if necessary.

Yes, that is what I was thinking. If we had some way, such as an
argument, to push the information about where the skbs are being freed
back to the socket, the socket could then look at pre-mapping the pages
for the device, assuming a 1:1 mapping from the socket to the device.

> I am not so familiar with socket layer to understand how the "destructors"
> will be helpful here, any detailed idea how to use "destructors" here?

The basic idea is the destructors are called when the skb is orphaned
or freed. So it might be a good spot to put in any logic to free pages
from your special pool. The only thing you would need to sort out is
making certain to bump reference counts appropriately if the skb is
cloned and the destructor is copied.
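
As a sketch of the shape of it (sk_tx_pool_put_frag() and the
socket-owned pool are made up; only sock_wfree() and the shinfo
accessors are existing kernel API):

static void sk_tx_pool_destructor(struct sk_buff *skb)
{
        struct sock *sk = skb->sk;
        int i;

        /* hand each frag back to a socket-owned pre-mapped pool */
        for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
                sk_tx_pool_put_frag(sk, &skb_shinfo(skb)->frags[i]);

        sock_wfree(skb);        /* keep the normal wmem accounting */
}

The cloning case is the tricky part: whoever holds the second reference
to a frag has to keep the page alive until both the clone and the
original are freed.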

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool
  2021-07-27 18:38                       ` Alexander Duyck
@ 2021-08-02  9:17                         ` Yunsheng Lin
  0 siblings, 0 replies; 17+ messages in thread
From: Yunsheng Lin @ 2021-08-02  9:17 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yunsheng Lin, David Miller, Jakub Kicinski,
	Russell King - ARM Linux, Marcin Wojtas, linuxarm, yisen.zhuang,
	Salil Mehta, thomas.petazzoni, hawk, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, fenghua.yu, guro, Peter Xu, Feng Tang,
	Jason Gunthorpe, Matteo Croce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	nogikh, Marco Elver, Yonghong Song, kpsingh, andrii,
	Martin KaFai Lau, songliubraving, Netdev, LKML, bpf

On 2021/7/28 2:38, Alexander Duyck wrote:
> On Tue, Jul 27, 2021 at 12:54 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2021/7/26 0:49, Alexander Duyck wrote:
>>> On Sat, Jul 24, 2021 at 6:07 AM Yunsheng Lin <yunshenglin0825@gmail.com> wrote:
>>>>
>>>> On Fri, Jul 23, 2021 at 09:08:00AM -0700, Alexander Duyck wrote:
>>>>> On Fri, Jul 23, 2021 at 4:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>>>
>>>>>> On 2021/7/22 23:18, Alexander Duyck wrote:
>>>>>>>>>
> 
> <snip>
> 
>>>>>
>>>>> Rather than trying to reuse the devices page pool it might make more
>>>>> sense to see if you couldn't have TCP just use some sort of circular
>>>>> buffer of memory that is directly mapped for the device that it is
>>>>> going to be transmitting to. Essentially what you would be doing is
>>>>> creating a pre-mapped page and would need to communicate that the
>>>>> memory is already mapped for the device you want to send it to so that
>>>>> it could skip that step.
>>>>
>>>> IIUC sk_page_frag_refill() is already doing a similar reusing as the
>>>> rx reusing implemented in most driver except for the not pre-mapping
>>>> part.
>>>>
>>>> And it seems that even if we pre-map the page and communicate that the
>>>> memory is already mapped to the driver, it is likely that we will not
>>>> be able to reuse the page when the circular buffer is not big enough
>>>> or tx completion/tcp ack is not happening quickly enough, which might
>>>> means unmapping/deallocating old circular buffer and allocating/mapping
>>>> new circular buffer.
>>>>
>>>> Using page pool we might be able to alleviate the above problem as it
>>>> does for rx?
>>>
>>> I would say that instead of looking at going straight for the page
>>> pool it might make more sense to look at seeing if we can coalesce the
>>> DMA mapping of the pages first at the socket layer rather than trying
>>> to introduce the overhead for the page pool. In the case of sockets we
>>> already have the destructors that are called when the memory is freed,
>>> so instead of making sockets use page pool it might make more sense to
>>> extend the socket buffer allocation/freeing to incorporate bulk
>>> mapping and unmapping of pages to optimize the socket Tx path in the
>>> 32K page case.
>>
>> I was able to enable tx recycling prototyping based on page pool to
>> run some performance test, the performance improvement is about +20%
>> (30Gbit -> 38Gbit) for single thread iperf tcp flow when IOMMU is in
>> strict mode. And CPU usage decreases about 10% for four threads iperf
>> tcp flow for line speed of 100Gbit when IOMMU is in strict mode.
> 
> That isn't surprising given that for most devices the IOMMU will be
> called per frag which can add a fair bit of overhead.
> 
>> Looking at the prototyping code, I am agreed that it is a bit controversial
>> to use the page pool for tx as the page pool is assuming NAPI polling
>> protection for allocation side.
>>
>> So I will take a deeper look about your suggestion above to see how to
>> implement it.
>>
>> Also, I am assuming the "destructors" means tcp_wfree() for TCP, right?
>> It seems tcp_wfree() is mainly used to do memory accounting and free
>> "struct sock" if necessary.
> 
> Yes, that is what I was thinking. If we had some way to add something
> like an argument or way to push the information about where the skbs
> are being freed back to the socket the socket could then be looking at
> pre-mapping the pages for the device if we assume a 1:1 mapping from
> the socket to the device.
> 
>> I am not so familiar with socket layer to understand how the "destructors"
>> will be helpful here, any detailed idea how to use "destructors" here?
> 
> The basic idea is the destructors are called when the skb is orphaned
> or freed. So it might be a good spot to put in any logic to free pages
> from your special pool. The only thing you would need to sort out is
> making certain to bump reference counts appropriately if the skb is
> cloned and the destructor is copied.

It seems the destructor is not copied when a skb is cloned, see:
https://elixir.bootlin.com/linux/latest/source/net/core/skbuff.c#L1050

For IPv4 TCP, tcp_write_xmit() calls __tcp_transmit_skb() to send the
newly cloned skb using ip_queue_xmit(), and the original skb is kept in
sk->tcp_rtx_queue to wait for the ack packet. The destructor is assigned
to the new cloned skb in __tcp_transmit_skb(), and when the destructor is
called in the tx completion process, it seems the frag page might still
be used by the retransmitting process, which means it is better not to
unmap or recycle that frag page in skb->destructor?

Also, I tried to implement a frag pool to replace the page pool, but it
seems the features needed for a frag pool are already implemented by the
page pool, so implementing a new frag pool does not make sense.

For Rx, we have a 1:1 relation between the struct napi_struct instance
and the struct page_pool instance. It seems we have the below options if
the recycling pool makes sense for Tx too:
1. 1:1 relation between struct net_device instance and struct page_pool
   instance.
2. 1:1 relation between struct napi_struct instance and struct page_pool
   instance.
3. 1:1 relation between struct sock instance and struct page_pool
   instance.

For me, option 2 seems to make the most sense, if we can reuse the same
page pool for both Tx and Rx.

As for where or when to "bump reference counts appropriately", it seems
__skb_frag_ref() might be a good spot to increment the frag count when a
frag is copied, and bit 0 of the "struct page" ptr in frag->bv_page is
reserved, just as the lower 12 bits of the dma address are, so we can
use bit 0 in frag->bv_page to indicate whether the corresponding page is
from a page pool or not. If the page is from a page pool, then
__skb_frag_ref() can do an atomic_inc(pp_frag_count) instead of
get_page(). This also might mean that skb->pp_recycle is only used to
indicate whether the head data page is from a page pool, since bit 0 of
frag->bv_page can be used to indicate whether the corresponding frag
page is from a page pool.

And doing atomic_inc(pp_frag_count) in __skb_frag_ref() seems to match
the semantics of "recycle after all users of the page are done with the
page", at least for most users in the netstack?
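
In code, the above idea would be something like this (untested sketch;
assumes bit 0 of frag->bv_page is the page pool marker as described):

static inline void __skb_frag_ref(skb_frag_t *frag)
{
        unsigned long v = (unsigned long)frag->bv_page;

        if (v & 1UL) {
                /* bit 0 set: the frag page came from a page pool, so
                 * take a frag count instead of a full page reference.
                 */
                struct page *page = (struct page *)(v & ~1UL);

                atomic_long_inc(&page->pp_frag_count);
        } else {
                get_page(frag->bv_page);
        }
}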

> .
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-08-02  9:17 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-20  3:35 [PATCH rfc v6 0/4] add frag page support in page pool Yunsheng Lin
2021-07-20  3:35 ` [PATCH rfc v6 1/4] page_pool: keep pp info as long as page pool owns the page Yunsheng Lin
2021-07-20  3:35 ` [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool Yunsheng Lin
2021-07-20 15:43   ` Alexander Duyck
2021-07-21  8:15     ` Yunsheng Lin
2021-07-21 14:06       ` Alexander Duyck
2021-07-22  8:07         ` Yunsheng Lin
2021-07-22 15:18           ` Alexander Duyck
2021-07-23 11:12             ` Yunsheng Lin
2021-07-23 16:08               ` Alexander Duyck
2021-07-24 13:07                 ` Yunsheng Lin
2021-07-25 16:49                   ` Alexander Duyck
2021-07-27  7:54                     ` Yunsheng Lin
2021-07-27 18:38                       ` Alexander Duyck
2021-08-02  9:17                         ` Yunsheng Lin
2021-07-20  3:35 ` [PATCH rfc v6 3/4] page_pool: add frag page recycling support " Yunsheng Lin
2021-07-20  3:35 ` [PATCH rfc v6 4/4] net: hns3: support skb's frag page recycling based on " Yunsheng Lin
