* [PATCH net-next v4 0/4] page_pool: recycle buffers
@ 2021-05-11 13:31 Matteo Croce
  2021-05-11 13:31 ` [PATCH net-next v4 1/4] mm: add a signature in struct page Matteo Croce
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Matteo Croce @ 2021-05-11 13:31 UTC (permalink / raw)
  To: netdev, linux-mm
  Cc: Ayush Sawal, Vinay Kumar Yadav, Rohit Maheshwari,
	David S. Miller, Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas,
	Russell King, Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Boris Pismenny, Arnd Bergmann,
	Andrew Morton, Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, linux-kernel,
	linux-rdma, bpf, Matthew Wilcox, Eric Dumazet, David Ahern,
	Lorenzo Bianconi, Saeed Mahameed, Andrew Lunn, Paolo Abeni,
	Sven Auhagen

From: Matteo Croce <mcroce@microsoft.com>

This is a respin of [1].

This patchset shows the plans for allowing page_pool to handle and
maintain DMA map/unmap of the pages it serves to the driver. For this
to work a return hook in the network core is introduced.

The overall purpose is to simplify drivers, by providing a page
allocation API that does recycling, such that each driver doesn't have
to reinvent its own recycling scheme. Using page_pool in a driver
does not require implementing XDP support, but it makes it trivially
easy to do so. Instead of allocating buffers specifically for SKBs
we now allocate a generic buffer and either wrap it in an SKB
(via build_skb) or create an XDP frame.
The recycling code leverages the XDP recycle APIs.

The Marvell mvpp2 and mvneta drivers are used in this patchset to
demonstrate how to use the API, and were tested on MacchiatoBIN
and EspressoBIN boards respectively.
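
A rough sketch of the intended driver-side flow (the helpers are the
existing page_pool/SKB APIs plus skb_mark_for_recycle() from patch 2/4;
rxq, frag_size, headroom and rx_bytes are illustrative names):

	/* RX path: allocate from the pool, build an SKB, opt in to recycling */
	page = page_pool_dev_alloc_pages(rxq->page_pool);
	data = page_address(page);
	skb = build_skb(data, frag_size);
	skb_reserve(skb, headroom);
	skb_put(skb, rx_bytes);
	skb_mark_for_recycle(skb, page, rxq->page_pool);
	napi_gro_receive(napi, skb);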

Please let this go in on a future -rc1, so as to allow enough time
for wider testing.

Note that this series depends on the change "mm: fix struct page layout
on 32-bit systems"[2] which is not yet in master.

[1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/
[2] https://lore.kernel.org/linux-mm/20210510153211.1504886-1-willy@infradead.org/

Ilias Apalodimas (1):
  page_pool: Allow drivers to hint on SKB recycling

Matteo Croce (3):
  mm: add a signature in struct page
  mvpp2: recycle buffers
  mvneta: recycle buffers

 drivers/net/ethernet/marvell/mvneta.c         | 11 +++---
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++++-----
 drivers/net/ethernet/marvell/sky2.c           |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  2 +-
 include/linux/mm_types.h                      |  1 +
 include/linux/skbuff.h                        | 34 ++++++++++++++++---
 include/net/page_pool.h                       | 11 ++++++
 net/core/page_pool.c                          | 27 +++++++++++++++
 net/core/skbuff.c                             | 20 +++++++++--
 net/tls/tls_device.c                          |  2 +-
 10 files changed, 105 insertions(+), 22 deletions(-)

-- 
2.31.1



* [PATCH net-next v4 1/4] mm: add a signature in struct page
  2021-05-11 13:31 [PATCH net-next v4 0/4] page_pool: recycle buffers Matteo Croce
@ 2021-05-11 13:31 ` Matteo Croce
  2021-05-11 13:45   ` Matthew Wilcox
  2021-05-11 13:31 ` [PATCH net-next v4 2/4] page_pool: Allow drivers to hint on SKB recycling Matteo Croce
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Matteo Croce @ 2021-05-11 13:31 UTC (permalink / raw)
  To: netdev, linux-mm
  Cc: Ayush Sawal, Vinay Kumar Yadav, Rohit Maheshwari,
	David S. Miller, Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas,
	Russell King, Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Boris Pismenny, Arnd Bergmann,
	Andrew Morton, Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, linux-kernel,
	linux-rdma, bpf, Matthew Wilcox, Eric Dumazet, David Ahern,
	Lorenzo Bianconi, Saeed Mahameed, Andrew Lunn, Paolo Abeni,
	Sven Auhagen

From: Matteo Croce <mcroce@microsoft.com>

This is needed by the page_pool to avoid recycling a page not allocated
via page_pool.
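
For reference, this is the check the signature enables in patch 2/4's
page_pool_return_skb_page():

	if (unlikely(page->signature != PP_SIGNATURE))
		return false;	/* not a page_pool page, don't recycle */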

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 include/linux/mm_types.h | 1 +
 include/net/page_pool.h  | 2 ++
 net/core/page_pool.c     | 4 ++++
 3 files changed, 7 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c10a45..77c04210e474 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -101,6 +101,7 @@ struct page {
 			 * 32-bit architectures.
 			 */
 			unsigned long dma_addr[2];
+			unsigned long signature;
 		};
 		struct {	/* slab, slob and slub */
 			union {
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index b4b6de909c93..9814e36becc1 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -63,6 +63,8 @@
  */
 #define PP_ALLOC_CACHE_SIZE	128
 #define PP_ALLOC_CACHE_REFILL	64
+#define PP_SIGNATURE		0x20210303
+
 struct pp_alloc_cache {
 	u32 count;
 	struct page *cache[PP_ALLOC_CACHE_SIZE];
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 3c4c4c7a0402..2e5e2b8c3a02 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -221,6 +221,8 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 		return NULL;
 	}
 
+	page->signature = PP_SIGNATURE;
+
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
 	trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
@@ -341,6 +343,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
 			     DMA_ATTR_SKIP_CPU_SYNC);
 	page_pool_set_dma_addr(page, 0);
 skip_dma_unmap:
+	page->signature = 0;
+
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
 	 */
-- 
2.31.1



* [PATCH net-next v4 2/4] page_pool: Allow drivers to hint on SKB recycling
  2021-05-11 13:31 [PATCH net-next v4 0/4] page_pool: recycle buffers Matteo Croce
  2021-05-11 13:31 ` [PATCH net-next v4 1/4] mm: add a signature in struct page Matteo Croce
@ 2021-05-11 13:31 ` Matteo Croce
  2021-05-11 15:24   ` Eric Dumazet
  2021-05-11 13:31 ` [PATCH net-next v4 3/4] mvpp2: recycle buffers Matteo Croce
  2021-05-11 13:31 ` [PATCH net-next v4 4/4] mvneta: " Matteo Croce
  3 siblings, 1 reply; 20+ messages in thread
From: Matteo Croce @ 2021-05-11 13:31 UTC (permalink / raw)
  To: netdev, linux-mm
  Cc: Ayush Sawal, Vinay Kumar Yadav, Rohit Maheshwari,
	David S. Miller, Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas,
	Russell King, Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Boris Pismenny, Arnd Bergmann,
	Andrew Morton, Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, linux-kernel,
	linux-rdma, bpf, Matthew Wilcox, Eric Dumazet, David Ahern,
	Lorenzo Bianconi, Saeed Mahameed, Andrew Lunn, Paolo Abeni,
	Sven Auhagen

From: Ilias Apalodimas <ilias.apalodimas@linaro.org>

Up to now several high speed NICs have had custom mechanisms for recycling
the allocated memory they use for their payloads.
Our page_pool API already has recycling capabilities that are always
used when we are running in 'XDP mode'. So let's tweak the API and the
kernel network stack slightly and allow the recycling to happen even
during standard operation.
The API doesn't currently take into account 'split page' policies used by
those drivers, but it can be extended once we have users for that.

The idea is to be able to intercept the packet on skb_release_data().
If it's a buffer coming from our page_pool API, recycle it back to the
pool for further usage; otherwise just release the packet entirely.

To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and
store the page_pool pointer in page->private. Storing the information in
page->private allows us to recycle both SKBs and their fragments.
The SKB bit is needed for a couple of reasons. First of all, in an
effort to affect the free path as little as possible, reading a single bit
is better than trying to derive identical information from the data stored
in the page. Moreover, page->private is used by skb_copy_ubufs. We do have
a special mark in the page that won't allow this to happen, but again,
deciding without having to read the entire page is preferable.

The driver has to take care of the sync operations on its own
during buffer recycling, since the buffer is never unmapped after
opting in to recycling.

Since the gain on the drivers depends on the architecture, we are not
enabling recycling by default for drivers that use the page_pool API.
In order to enable recycling, the driver must call skb_mark_for_recycle()
to store the information we need for recycling in page->private and
enable the recycling bit, or page_pool_store_mem_info() for a fragment.
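
As a minimal sketch, mirroring what the mvneta and mvpp2 patches later
in this series do ('pool' is the driver's struct page_pool):

	skb_mark_for_recycle(skb, virt_to_page(xdp->data), pool);	/* head */
	page_pool_store_mem_info(skb_frag_page(frag), pool);		/* fragment */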

Since we added an extra argument to __skb_frag_unref() to handle
recycling, update the current users of the function accordingly.

Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 drivers/net/ethernet/marvell/sky2.c        |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |  2 +-
 include/linux/skbuff.h                     | 34 ++++++++++++++++++----
 include/net/page_pool.h                    |  9 ++++++
 net/core/page_pool.c                       | 23 +++++++++++++++
 net/core/skbuff.c                          | 20 +++++++++++--
 net/tls/tls_device.c                       |  2 +-
 7 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c
index 222c32367b2c..aa0cde1dc5c0 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -2503,7 +2503,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
 
 		if (length == 0) {
 			/* don't need this page */
-			__skb_frag_unref(frag);
+			__skb_frag_unref(frag, false);
 			--skb_shinfo(skb)->nr_frags;
 		} else {
 			size = min(length, (unsigned) PAGE_SIZE);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index e35e4d7ef4d1..cea62b8f554c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -526,7 +526,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 fail:
 	while (nr > 0) {
 		nr--;
-		__skb_frag_unref(skb_shinfo(skb)->frags + nr);
+		__skb_frag_unref(skb_shinfo(skb)->frags + nr, false);
 	}
 	return 0;
 }
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index dbf820a50a39..1a2ce52c29f9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -40,6 +40,9 @@
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
 #include <linux/netfilter/nf_conntrack_common.h>
 #endif
+#if IS_BUILTIN(CONFIG_PAGE_POOL)
+#include <net/page_pool.h>
+#endif
 
 /* The interface for checksum offload between the stack and networking drivers
  * is as follows...
@@ -667,6 +670,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@head_frag: skb was allocated from page fragments,
  *		not allocated by kmalloc() or vmalloc().
  *	@pfmemalloc: skbuff was allocated from PFMEMALLOC reserves
+ *	@pp_recycle: mark the packet for recycling instead of freeing (implies
+ *		page_pool support on driver)
  *	@active_extensions: active extensions (skb_ext_id types)
  *	@ndisc_nodetype: router type (from link layer)
  *	@ooo_okay: allow the mapping of a socket to a queue to be changed
@@ -791,10 +796,12 @@ struct sk_buff {
 				fclone:2,
 				peeked:1,
 				head_frag:1,
-				pfmemalloc:1;
+				pfmemalloc:1,
+				pp_recycle:1; /* page_pool recycle indicator */
 #ifdef CONFIG_SKB_EXTENSIONS
 	__u8			active_extensions;
 #endif
+
 	/* fields enclosed in headers_start/headers_end are copied
 	 * using a single memcpy() in __copy_skb_header()
 	 */
@@ -3081,12 +3088,20 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
 /**
  * __skb_frag_unref - release a reference on a paged fragment.
  * @frag: the paged fragment
+ * @recycle: recycle the page if allocated via page_pool
  *
- * Releases a reference on the paged fragment @frag.
+ * Releases a reference on the paged fragment @frag
+ * or recycles the page via the page_pool API.
  */
-static inline void __skb_frag_unref(skb_frag_t *frag)
+static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
 {
-	put_page(skb_frag_page(frag));
+	struct page *page = skb_frag_page(frag);
+
+#if IS_BUILTIN(CONFIG_PAGE_POOL)
+	if (recycle && page_pool_return_skb_page(page_address(page)))
+		return;
+#endif
+	put_page(page);
 }
 
 /**
@@ -3098,7 +3113,7 @@ static inline void __skb_frag_unref(skb_frag_t *frag)
  */
 static inline void skb_frag_unref(struct sk_buff *skb, int f)
 {
-	__skb_frag_unref(&skb_shinfo(skb)->frags[f]);
+	__skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle);
 }
 
 /**
@@ -4697,5 +4712,14 @@ static inline u64 skb_get_kcov_handle(struct sk_buff *skb)
 #endif
 }
 
+#if IS_BUILTIN(CONFIG_PAGE_POOL)
+static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page,
+					struct page_pool *pp)
+{
+	skb->pp_recycle = 1;
+	page_pool_store_mem_info(page, pp);
+}
+#endif
+
 #endif	/* __KERNEL__ */
 #endif	/* _LINUX_SKBUFF_H */
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 9814e36becc1..b34b8b128206 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -148,6 +148,8 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
 	return pool->p.dma_dir;
 }
 
+bool page_pool_return_skb_page(void *data);
+
 struct page_pool *page_pool_create(const struct page_pool_params *params);
 
 #ifdef CONFIG_PAGE_POOL
@@ -253,4 +255,11 @@ static inline void page_pool_ring_unlock(struct page_pool *pool)
 		spin_unlock_bh(&pool->ring.producer_lock);
 }
 
+/* Store mem_info on struct page and use it while recycling skb frags */
+static inline
+void page_pool_store_mem_info(struct page *page, struct page_pool *pp)
+{
+	set_page_private(page, (unsigned long)pp);
+}
+
 #endif /* _NET_PAGE_POOL_H */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 2e5e2b8c3a02..52e4f16b5e92 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -626,3 +626,26 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 	}
 }
 EXPORT_SYMBOL(page_pool_update_nid);
+
+bool page_pool_return_skb_page(void *data)
+{
+	struct page_pool *pp;
+	struct page *page;
+
+	page = virt_to_head_page(data);
+	if (unlikely(page->signature != PP_SIGNATURE))
+		return false;
+
+	pp = (struct page_pool *)page_private(page);
+
+	/* Driver set this to memory recycling info. Reset it on recycle.
+	 * This will *not* work for NIC using a split-page memory model.
+	 * The page will be returned to the pool here regardless of the
+	 * 'flipped' fragment being in use or not.
+	 */
+	set_page_private(page, 0);
+	page_pool_put_full_page(pp, virt_to_head_page(data), false);
+
+	return true;
+}
+EXPORT_SYMBOL(page_pool_return_skb_page);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3ad22870298c..dc4a5c56b8dc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -70,6 +70,9 @@
 #include <net/xfrm.h>
 #include <net/mpls.h>
 #include <net/mptcp.h>
+#if IS_BUILTIN(CONFIG_PAGE_POOL)
+#include <net/page_pool.h>
+#endif
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -645,6 +648,11 @@ static void skb_free_head(struct sk_buff *skb)
 {
 	unsigned char *head = skb->head;
 
+#if IS_BUILTIN(CONFIG_PAGE_POOL)
+	if (skb->pp_recycle && page_pool_return_skb_page(head))
+		return;
+#endif
+
 	if (skb->head_frag)
 		skb_free_frag(head);
 	else
@@ -664,7 +672,7 @@ static void skb_release_data(struct sk_buff *skb)
 	skb_zcopy_clear(skb, true);
 
 	for (i = 0; i < shinfo->nr_frags; i++)
-		__skb_frag_unref(&shinfo->frags[i]);
+		__skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);
 
 	if (shinfo->frag_list)
 		kfree_skb_list(shinfo->frag_list);
@@ -1046,6 +1054,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
 	n->nohdr = 0;
 	n->peeked = 0;
 	C(pfmemalloc);
+	C(pp_recycle);
 	n->destructor = NULL;
 	C(tail);
 	C(end);
@@ -3495,7 +3504,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 		fragto = &skb_shinfo(tgt)->frags[merge];
 
 		skb_frag_size_add(fragto, skb_frag_size(fragfrom));
-		__skb_frag_unref(fragfrom);
+		__skb_frag_unref(fragfrom, skb->pp_recycle);
 	}
 
 	/* Reposition in the original skb */
@@ -5285,6 +5294,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	if (skb_cloned(to))
 		return false;
 
+	/* We can't coalesce skb that are allocated from slab and page_pool
+	 * The recycle mark is on the skb, so that might end up trying to
+	 * recycle slab allocated skb->head
+	 */
+	if (to->pp_recycle != from->pp_recycle)
+		return false;
+
 	if (len <= skb_tailroom(to)) {
 		if (len)
 			BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 76a6f8c2eec4..ad11db2c4f63 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -127,7 +127,7 @@ static void destroy_record(struct tls_record_info *record)
 	int i;
 
 	for (i = 0; i < record->num_frags; i++)
-		__skb_frag_unref(&record->frags[i]);
+		__skb_frag_unref(&record->frags[i], false);
 	kfree(record);
 }
 
-- 
2.31.1



* [PATCH net-next v4 3/4] mvpp2: recycle buffers
  2021-05-11 13:31 [PATCH net-next v4 0/4] page_pool: recycle buffers Matteo Croce
  2021-05-11 13:31 ` [PATCH net-next v4 1/4] mm: add a signature in struct page Matteo Croce
  2021-05-11 13:31 ` [PATCH net-next v4 2/4] page_pool: Allow drivers to hint on SKB recycling Matteo Croce
@ 2021-05-11 13:31 ` Matteo Croce
  2021-05-11 13:31 ` [PATCH net-next v4 4/4] mvneta: " Matteo Croce
  3 siblings, 0 replies; 20+ messages in thread
From: Matteo Croce @ 2021-05-11 13:31 UTC (permalink / raw)
  To: netdev, linux-mm
  Cc: Ayush Sawal, Vinay Kumar Yadav, Rohit Maheshwari,
	David S. Miller, Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas,
	Russell King, Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Boris Pismenny, Arnd Bergmann,
	Andrew Morton, Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, linux-kernel,
	linux-rdma, bpf, Matthew Wilcox, Eric Dumazet, David Ahern,
	Lorenzo Bianconi, Saeed Mahameed, Andrew Lunn, Paolo Abeni,
	Sven Auhagen

From: Matteo Croce <mcroce@microsoft.com>

Use the new recycling API for page_pool.
In a drop rate test, the packet rate is more than doubled,
from 962 Kpps to 2047 Kpps.

perf top on a stock system shows:

Overhead  Shared Object     Symbol
  30.67%  [kernel]          [k] page_pool_release_page
   8.37%  [kernel]          [k] get_page_from_freelist
   7.34%  [kernel]          [k] free_unref_page
   6.47%  [mvpp2]           [k] mvpp2_rx
   4.69%  [kernel]          [k] eth_type_trans
   4.55%  [kernel]          [k] __netif_receive_skb_core
   4.40%  [kernel]          [k] build_skb
   4.29%  [kernel]          [k] kmem_cache_free
   4.00%  [kernel]          [k] kmem_cache_alloc
   3.81%  [kernel]          [k] dev_gro_receive

With packet rate stable at 962 Kpps:

tx: 0 bps 0 pps rx: 477.4 Mbps 962.6 Kpps
tx: 0 bps 0 pps rx: 477.6 Mbps 962.8 Kpps
tx: 0 bps 0 pps rx: 477.6 Mbps 962.9 Kpps
tx: 0 bps 0 pps rx: 477.2 Mbps 962.1 Kpps
tx: 0 bps 0 pps rx: 477.5 Mbps 962.7 Kpps

And this is the same output with recycling enabled:

Overhead  Shared Object     Symbol
  12.75%  [mvpp2]           [k] mvpp2_rx
   9.56%  [kernel]          [k] __netif_receive_skb_core
   9.29%  [kernel]          [k] build_skb
   9.27%  [kernel]          [k] eth_type_trans
   8.39%  [kernel]          [k] kmem_cache_alloc
   7.85%  [kernel]          [k] kmem_cache_free
   7.36%  [kernel]          [k] page_pool_put_page
   6.45%  [kernel]          [k] dev_gro_receive
   4.72%  [kernel]          [k] __xdp_return
   3.06%  [kernel]          [k] page_pool_refill_alloc_cache

With packet rate above 2000 Kpps:

tx: 0 bps 0 pps rx: 1015 Mbps 2046 Kpps
tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps
tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps
tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps
tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps

The major performance increase is explained by the fact that the most
CPU-consuming functions (page_pool_release_page, get_page_from_freelist
and free_unref_page) are no longer called on a per-packet basis.

The test was done by sending 64 byte ethernet frames with an invalid
ethertype to the MacchiatoBIN, so the packets are dropped early in the RX path.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index b2259bf1d299..9dceabece56c 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -3847,6 +3847,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi,
 	struct mvpp2_pcpu_stats ps = {};
 	enum dma_data_direction dma_dir;
 	struct bpf_prog *xdp_prog;
+	struct xdp_rxq_info *rxqi;
 	struct xdp_buff xdp;
 	int rx_received;
 	int rx_done = 0;
@@ -3912,15 +3913,15 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi,
 		else
 			frag_size = bm_pool->frag_size;
 
-		if (xdp_prog) {
-			struct xdp_rxq_info *xdp_rxq;
+		if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE)
+			rxqi = &rxq->xdp_rxq_short;
+		else
+			rxqi = &rxq->xdp_rxq_long;
 
-			if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE)
-				xdp_rxq = &rxq->xdp_rxq_short;
-			else
-				xdp_rxq = &rxq->xdp_rxq_long;
+		if (xdp_prog) {
+			xdp.rxq = rxqi;
 
-			xdp_init_buff(&xdp, PAGE_SIZE, xdp_rxq);
+			xdp_init_buff(&xdp, PAGE_SIZE, rxqi);
 			xdp_prepare_buff(&xdp, data,
 					 MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM,
 					 rx_bytes, false);
@@ -3964,7 +3965,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi,
 		}
 
 		if (pp)
-			page_pool_release_page(pp, virt_to_page(data));
+			skb_mark_for_recycle(skb, virt_to_page(data), pp);
 		else
 			dma_unmap_single_attrs(dev->dev.parent, dma_addr,
 					       bm_pool->buf_size, DMA_FROM_DEVICE,
-- 
2.31.1



* [PATCH net-next v4 4/4] mvneta: recycle buffers
  2021-05-11 13:31 [PATCH net-next v4 0/4] page_pool: recycle buffers Matteo Croce
                   ` (2 preceding siblings ...)
  2021-05-11 13:31 ` [PATCH net-next v4 3/4] mvpp2: recycle buffers Matteo Croce
@ 2021-05-11 13:31 ` Matteo Croce
  3 siblings, 0 replies; 20+ messages in thread
From: Matteo Croce @ 2021-05-11 13:31 UTC (permalink / raw)
  To: netdev, linux-mm
  Cc: Ayush Sawal, Vinay Kumar Yadav, Rohit Maheshwari,
	David S. Miller, Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas,
	Russell King, Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Boris Pismenny, Arnd Bergmann,
	Andrew Morton, Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, linux-kernel,
	linux-rdma, bpf, Matthew Wilcox, Eric Dumazet, David Ahern,
	Lorenzo Bianconi, Saeed Mahameed, Andrew Lunn, Paolo Abeni,
	Sven Auhagen

From: Matteo Croce <mcroce@microsoft.com>

Use the new recycling API for page_pool.
In a drop rate test, the packet rate increased by 10%,
from 269 Kpps to 296 Kpps.

perf top on a stock system shows:

Overhead  Shared Object     Symbol
  21.78%  [kernel]          [k] __pi___inval_dcache_area
  21.66%  [mvneta]          [k] mvneta_rx_swbm
   7.00%  [kernel]          [k] kmem_cache_alloc
   6.05%  [kernel]          [k] eth_type_trans
   4.44%  [kernel]          [k] kmem_cache_free.part.0
   3.80%  [kernel]          [k] __netif_receive_skb_core
   3.68%  [kernel]          [k] dev_gro_receive
   3.65%  [kernel]          [k] get_page_from_freelist
   3.43%  [kernel]          [k] page_pool_release_page
   3.35%  [kernel]          [k] free_unref_page

And this is the same output with recycling enabled:

Overhead  Shared Object     Symbol
  24.10%  [kernel]          [k] __pi___inval_dcache_area
  23.02%  [mvneta]          [k] mvneta_rx_swbm
   7.19%  [kernel]          [k] kmem_cache_alloc
   6.50%  [kernel]          [k] eth_type_trans
   4.93%  [kernel]          [k] __netif_receive_skb_core
   4.77%  [kernel]          [k] kmem_cache_free.part.0
   3.93%  [kernel]          [k] dev_gro_receive
   3.03%  [kernel]          [k] build_skb
   2.91%  [kernel]          [k] page_pool_put_page
   2.85%  [kernel]          [k] __xdp_return

The test was done with mausezahn on the TX side with 64 byte raw
ethernet frames.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 7d5cd9bc6c99..6d2f8dce4900 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2320,7 +2320,7 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port *pp,
 }
 
 static struct sk_buff *
-mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
+mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool,
 		      struct xdp_buff *xdp, u32 desc_status)
 {
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
@@ -2331,7 +2331,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 	if (!skb)
 		return ERR_PTR(-ENOMEM);
 
-	page_pool_release_page(rxq->page_pool, virt_to_page(xdp->data));
+	skb_mark_for_recycle(skb, virt_to_page(xdp->data), pool);
 
 	skb_reserve(skb, xdp->data - xdp->data_hard_start);
 	skb_put(skb, xdp->data_end - xdp->data);
@@ -2343,7 +2343,10 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
 				skb_frag_page(frag), skb_frag_off(frag),
 				skb_frag_size(frag), PAGE_SIZE);
-		page_pool_release_page(rxq->page_pool, skb_frag_page(frag));
+		/* We don't need to reset pp_recycle here. It's already set, so
+		 * just mark fragments for recycling.
+		 */
+		page_pool_store_mem_info(skb_frag_page(frag), pool);
 	}
 
 	return skb;
@@ -2425,7 +2428,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 		    mvneta_run_xdp(pp, rxq, xdp_prog, &xdp_buf, frame_sz, &ps))
 			goto next;
 
-		skb = mvneta_swbm_build_skb(pp, rxq, &xdp_buf, desc_status);
+		skb = mvneta_swbm_build_skb(pp, pp, &xdp_buf, desc_status);
 		if (IS_ERR(skb)) {
 			struct mvneta_pcpu_stats *stats = this_cpu_ptr(pp->stats);
 
-- 
2.31.1



* Re: [PATCH net-next v4 1/4] mm: add a signature in struct page
  2021-05-11 13:31 ` [PATCH net-next v4 1/4] mm: add a signature in struct page Matteo Croce
@ 2021-05-11 13:45   ` Matthew Wilcox
  2021-05-11 14:11     ` Ilias Apalodimas
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2021-05-11 13:45 UTC (permalink / raw)
  To: Matteo Croce
  Cc: netdev, linux-mm, Ayush Sawal, Vinay Kumar Yadav,
	Rohit Maheshwari, David S. Miller, Jakub Kicinski,
	Thomas Petazzoni, Marcin Wojtas, Russell King, Mirko Lindner,
	Stephen Hemminger, Tariq Toukan, Jesper Dangaard Brouer,
	Ilias Apalodimas, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, linux-kernel,
	linux-rdma, bpf, Eric Dumazet, David Ahern, Lorenzo Bianconi,
	Saeed Mahameed, Andrew Lunn, Paolo Abeni, Sven Auhagen

On Tue, May 11, 2021 at 03:31:15PM +0200, Matteo Croce wrote:
> @@ -101,6 +101,7 @@ struct page {
>  			 * 32-bit architectures.
>  			 */
>  			unsigned long dma_addr[2];
> +			unsigned long signature;
>  		};
>  		struct {	/* slab, slob and slub */
>  			union {

No.  Signature now aliases with page->mapping, which is going to go
badly wrong for drivers which map this page into userspace.

I had this as:

+                       unsigned long pp_magic;
+                       unsigned long xmi;
+                       unsigned long _pp_mapping_pad;
                        unsigned long dma_addr[2];

and pp_magic needs to be set to something with bits 0&1 clear and
clearly isn't a pointer.  I went with POISON_POINTER_DELTA + 0x40.
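
As a sketch, assuming POISON_POINTER_DELTA from <linux/poison.h>, that
would be something like:

	#define PP_SIGNATURE	(POISON_POINTER_DELTA + 0x40)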


* Re: [PATCH net-next v4 1/4] mm: add a signature in struct page
  2021-05-11 13:45   ` Matthew Wilcox
@ 2021-05-11 14:11     ` Ilias Apalodimas
  2021-05-11 14:18       ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Ilias Apalodimas @ 2021-05-11 14:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matteo Croce, netdev, linux-mm, Ayush Sawal, Vinay Kumar Yadav,
	Rohit Maheshwari, David S. Miller, Jakub Kicinski,
	Thomas Petazzoni, Marcin Wojtas, Russell King, Mirko Lindner,
	Stephen Hemminger, Tariq Toukan, Jesper Dangaard Brouer,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, linux-kernel,
	linux-rdma, bpf, Eric Dumazet, David Ahern, Lorenzo Bianconi,
	Saeed Mahameed, Andrew Lunn, Paolo Abeni, Sven Auhagen

Hi Matthew,

On Tue, May 11, 2021 at 02:45:32PM +0100, Matthew Wilcox wrote:
> On Tue, May 11, 2021 at 03:31:15PM +0200, Matteo Croce wrote:
> > @@ -101,6 +101,7 @@ struct page {
> >  			 * 32-bit architectures.
> >  			 */
> >  			unsigned long dma_addr[2];
> > +			unsigned long signature;
> >  		};
> >  		struct {	/* slab, slob and slub */
> >  			union {
> 
> No.  Signature now aliases with page->mapping, which is going to go
> badly wrong for drivers which map this page into userspace.
> 
> I had this as:
> 
> +                       unsigned long pp_magic;
> +                       unsigned long xmi;
> +                       unsigned long _pp_mapping_pad;
>                         unsigned long dma_addr[2];
> 
> and pp_magic needs to be set to something with bits 0&1 clear and
> clearly isn't a pointer.  I went with POISON_POINTER_DELTA + 0x40.

Regardless of the changes required, there's another thing we'd like your
opinion on.
There was a change wrt the previous patchset. We used to store the
struct xdp_mem_info in page->private.  In the new version we store the
page_pool ptr address in page->private (there's an explanation why on the
mail thread, but the tl;dr is that we can get some more speed and keeping
xdp_mem_info is not that crucial). So since we can just store the page_pool
address directly, should we keep using page->private or is it better to
do:

+                       unsigned long pp_magic;
+                       unsigned long pp_ptr;
+                       unsigned long _pp_mapping_pad;
                        unsigned long dma_addr[2];
and use pp_ptr?

Thanks
/Ilias


* Re: [PATCH net-next v4 1/4] mm: add a signature in struct page
  2021-05-11 14:11     ` Ilias Apalodimas
@ 2021-05-11 14:18       ` Matthew Wilcox
  2021-05-11 14:25         ` Ilias Apalodimas
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2021-05-11 14:18 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: Matteo Croce, netdev, linux-mm, Ayush Sawal, Vinay Kumar Yadav,
	Rohit Maheshwari, David S. Miller, Jakub Kicinski,
	Thomas Petazzoni, Marcin Wojtas, Russell King, Mirko Lindner,
	Stephen Hemminger, Tariq Toukan, Jesper Dangaard Brouer,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, linux-kernel,
	linux-rdma, bpf, Eric Dumazet, David Ahern, Lorenzo Bianconi,
	Saeed Mahameed, Andrew Lunn, Paolo Abeni, Sven Auhagen

On Tue, May 11, 2021 at 05:11:13PM +0300, Ilias Apalodimas wrote:
> Hi Matthew,
> 
> On Tue, May 11, 2021 at 02:45:32PM +0100, Matthew Wilcox wrote:
> > On Tue, May 11, 2021 at 03:31:15PM +0200, Matteo Croce wrote:
> > > @@ -101,6 +101,7 @@ struct page {
> > >  			 * 32-bit architectures.
> > >  			 */
> > >  			unsigned long dma_addr[2];
> > > +			unsigned long signature;
> > >  		};
> > >  		struct {	/* slab, slob and slub */
> > >  			union {
> > 
> > No.  Signature now aliases with page->mapping, which is going to go
> > badly wrong for drivers which map this page into userspace.
> > 
> > I had this as:
> > 
> > +                       unsigned long pp_magic;
> > +                       unsigned long xmi;
> > +                       unsigned long _pp_mapping_pad;
> >                         unsigned long dma_addr[2];
> > 
> > and pp_magic needs to be set to something with bits 0&1 clear and
> > clearly isn't a pointer.  I went with POISON_POINTER_DELTA + 0x40.
> 
> Regardless of the changes required, there's another thing we'd like your
> opinion on.
> There was a change wrt the previous patchset. We used to store the
> struct xdp_mem_info in page->private.  In the new version we store the
> page_pool ptr address in page->private (there's an explanation why on the
> mail thread, but the tl;dr is that we can get some more speed and keeping
> xdp_mem_info is not that crucial). So since we can just store the page_pool
> address directly, should we keep using page->private or is it better to
> do:
> 
> +                       unsigned long pp_magic;
> +                       unsigned long pp_ptr;
> +                       unsigned long _pp_mapping_pad;
>                         unsigned long dma_addr[2];
> and use pp_ptr?

I'd rather you didn't use page_private ... Any reason not to use:

			unsigned long pp_magic;
			struct page_pool *pp;
			unsigned long _pp_mapping_pad;
			unsigned long dma_addr[2];

?


* Re: [PATCH net-next v4 1/4] mm: add a signature in struct page
  2021-05-11 14:18       ` Matthew Wilcox
@ 2021-05-11 14:25         ` Ilias Apalodimas
  2021-05-12 15:57           ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Ilias Apalodimas @ 2021-05-11 14:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matteo Croce, Networking, Linux-MM, Ayush Sawal,
	Vinay Kumar Yadav, Rohit Maheshwari, David S. Miller,
	Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas, Russell King,
	Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, open list, linux-rdma,
	bpf, Eric Dumazet, David Ahern, Lorenzo Bianconi, Saeed Mahameed,
	Andrew Lunn, Paolo Abeni, Sven Auhagen

On Tue, 11 May 2021 at 17:19, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, May 11, 2021 at 05:11:13PM +0300, Ilias Apalodimas wrote:
> > Hi Matthew,
> >
> > On Tue, May 11, 2021 at 02:45:32PM +0100, Matthew Wilcox wrote:
> > > On Tue, May 11, 2021 at 03:31:15PM +0200, Matteo Croce wrote:
> > > > @@ -101,6 +101,7 @@ struct page {
> > > >                    * 32-bit architectures.
> > > >                    */
> > > >                   unsigned long dma_addr[2];
> > > > +                 unsigned long signature;
> > > >           };
> > > >           struct {        /* slab, slob and slub */
> > > >                   union {
> > >
> > > No.  Signature now aliases with page->mapping, which is going to go
> > > badly wrong for drivers which map this page into userspace.
> > >
> > > I had this as:
> > >
> > > +                       unsigned long pp_magic;
> > > +                       unsigned long xmi;
> > > +                       unsigned long _pp_mapping_pad;
> > >                         unsigned long dma_addr[2];
> > >
> > > and pp_magic needs to be set to something with bits 0&1 clear and
> > > clearly isn't a pointer.  I went with POISON_POINTER_DELTA + 0x40.
> >
> > Regardless of the changes required, there's another thing we'd like your
> > opinion on.
> > There was a change wrt the previous patchset. We used to store the
> > struct xdp_mem_info in page->private.  In the new version we store the
> > page_pool ptr address in page->private (there's an explanation why on the
> > mail thread, but the tl;dr is that we can get some more speed and keeping
> > xdp_mem_info is not that crucial). So since we can just store the page_pool
> > address directly, should we keep using page->private or is it better to
> > do:
> >
> > +                       unsigned long pp_magic;
> > +                       unsigned long pp_ptr;
> > +                       unsigned long _pp_mapping_pad;
> >                         unsigned long dma_addr[2];
> > and use pp_ptr?
>
> I'd rather you didn't use page_private ... Any reason not to use:
>
>                         unsigned long pp_magic;
>                         struct page_pool *pp;
>                         unsigned long _pp_mapping_pad;
>                         unsigned long dma_addr[2];
>
> ?

Nope, not at all; either would work. We'll switch to that.


* Re: [PATCH net-next v4 2/4] page_pool: Allow drivers to hint on SKB recycling
  2021-05-11 13:31 ` [PATCH net-next v4 2/4] page_pool: Allow drivers to hint on SKB recycling Matteo Croce
@ 2021-05-11 15:24   ` Eric Dumazet
  2021-05-12  9:50     ` Ilias Apalodimas
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2021-05-11 15:24 UTC (permalink / raw)
  To: Matteo Croce, netdev, linux-mm
  Cc: Ayush Sawal, Vinay Kumar Yadav, Rohit Maheshwari,
	David S. Miller, Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas,
	Russell King, Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Ilias Apalodimas, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Boris Pismenny, Arnd Bergmann,
	Andrew Morton, Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, linux-kernel,
	linux-rdma, bpf, Matthew Wilcox, Eric Dumazet, David Ahern,
	Lorenzo Bianconi, Saeed Mahameed, Andrew Lunn, Paolo Abeni,
	Sven Auhagen



On 5/11/21 3:31 PM, Matteo Croce wrote:
> From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> 
> Up to now several high speed NICs have had custom mechanisms for recycling
> the allocated memory they use for their payloads.
> Our page_pool API already has recycling capabilities that are always
> used when we are running in 'XDP mode'. So let's tweak the API and the
> kernel network stack slightly and allow the recycling to happen even
> during standard operation.
> The API doesn't currently take into account 'split page' policies used by
> those drivers, but it can be extended once we have users for that.
> 
> The idea is to be able to intercept the packet on skb_release_data().
> If it's a buffer coming from our page_pool API, recycle it back to the
> pool for further usage; otherwise just release the packet entirely.
> 
> To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and
> store the page_pool pointer in page->private. Storing the information in
> page->private allows us to recycle both SKBs and their fragments.
> The SKB bit is needed for a couple of reasons. First of all, in an
> effort to affect the free path as little as possible, reading a single bit
> is better than trying to derive identical information from the data stored
> in the page. Moreover, page->private is used by skb_copy_ubufs. We do have
> a special mark in the page that won't allow this to happen, but again,
> deciding without having to read the entire page is preferable.
> 
> The driver has to take care of the sync operations on its own
> during buffer recycling, since the buffer is never unmapped after
> opting in to recycling.
> 
> Since the gain on the drivers depends on the architecture, we are not
> enabling recycling by default for drivers that use the page_pool API.
> In order to enable recycling, the driver must call skb_mark_for_recycle()
> to store the information we need for recycling in page->private and
> enable the recycling bit, or page_pool_store_mem_info() for a fragment.
> 
> Since we added an extra argument to __skb_frag_unref() to handle
> recycling, update the current users of the function accordingly.

This part could be done with a preliminary patch, only adding this
extra boolean; this would keep the 'complex' patch smaller.

> 
> Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Co-developed-by: Matteo Croce <mcroce@microsoft.com>
> Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> ---
>  drivers/net/ethernet/marvell/sky2.c        |  2 +-
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |  2 +-
>  include/linux/skbuff.h                     | 34 ++++++++++++++++++----
>  include/net/page_pool.h                    |  9 ++++++
>  net/core/page_pool.c                       | 23 +++++++++++++++
>  net/core/skbuff.c                          | 20 +++++++++++--
>  net/tls/tls_device.c                       |  2 +-
>  7 files changed, 82 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c
> index 222c32367b2c..aa0cde1dc5c0 100644
> --- a/drivers/net/ethernet/marvell/sky2.c
> +++ b/drivers/net/ethernet/marvell/sky2.c
> @@ -2503,7 +2503,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
>  
>  		if (length == 0) {
>  			/* don't need this page */
> -			__skb_frag_unref(frag);
> +			__skb_frag_unref(frag, false);
>  			--skb_shinfo(skb)->nr_frags;
>  		} else {
>  			size = min(length, (unsigned) PAGE_SIZE);
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index e35e4d7ef4d1..cea62b8f554c 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -526,7 +526,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
>  fail:
>  	while (nr > 0) {
>  		nr--;
> -		__skb_frag_unref(skb_shinfo(skb)->frags + nr);
> +		__skb_frag_unref(skb_shinfo(skb)->frags + nr, false);
>  	}
>  	return 0;
>  }
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index dbf820a50a39..1a2ce52c29f9 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -40,6 +40,9 @@
>  #if IS_ENABLED(CONFIG_NF_CONNTRACK)
>  #include <linux/netfilter/nf_conntrack_common.h>
>  #endif
> +#if IS_BUILTIN(CONFIG_PAGE_POOL)
> +#include <net/page_pool.h>
> +#endif
>  
>  /* The interface for checksum offload between the stack and networking drivers
>   * is as follows...
> @@ -667,6 +670,8 @@ typedef unsigned char *sk_buff_data_t;
>   *	@head_frag: skb was allocated from page fragments,
>   *		not allocated by kmalloc() or vmalloc().
>   *	@pfmemalloc: skbuff was allocated from PFMEMALLOC reserves
> + *	@pp_recycle: mark the packet for recycling instead of freeing (implies
> + *		page_pool support on driver)
>   *	@active_extensions: active extensions (skb_ext_id types)
>   *	@ndisc_nodetype: router type (from link layer)
>   *	@ooo_okay: allow the mapping of a socket to a queue to be changed
> @@ -791,10 +796,12 @@ struct sk_buff {
>  				fclone:2,
>  				peeked:1,
>  				head_frag:1,
> -				pfmemalloc:1;
> +				pfmemalloc:1,
> +				pp_recycle:1; /* page_pool recycle indicator */
>  #ifdef CONFIG_SKB_EXTENSIONS
>  	__u8			active_extensions;
>  #endif
> +
>  	/* fields enclosed in headers_start/headers_end are copied
>  	 * using a single memcpy() in __copy_skb_header()
>  	 */
> @@ -3081,12 +3088,20 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
>  /**
>   * __skb_frag_unref - release a reference on a paged fragment.
>   * @frag: the paged fragment
> + * @recycle: recycle the page if allocated via page_pool
>   *
> - * Releases a reference on the paged fragment @frag.
> + * Releases a reference on the paged fragment @frag
> + * or recycles the page via the page_pool API.
>   */
> -static inline void __skb_frag_unref(skb_frag_t *frag)
> +static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
>  {
> -	put_page(skb_frag_page(frag));
> +	struct page *page = skb_frag_page(frag);
> +
> +#if IS_BUILTIN(CONFIG_PAGE_POOL)
> +	if (recycle && page_pool_return_skb_page(page_address(page)))
> +		return;
> +#endif
> +	put_page(page);
>  }
>  
>  /**
> @@ -3098,7 +3113,7 @@ static inline void __skb_frag_unref(skb_frag_t *frag)
>   */
>  static inline void skb_frag_unref(struct sk_buff *skb, int f)
>  {
> -	__skb_frag_unref(&skb_shinfo(skb)->frags[f]);
> +	__skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle);
>  }
>  
>  /**
> @@ -4697,5 +4712,14 @@ static inline u64 skb_get_kcov_handle(struct sk_buff *skb)
>  #endif
>  }
>  
> +#if IS_BUILTIN(CONFIG_PAGE_POOL)
> +static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page,
> +					struct page_pool *pp)
> +{
> +	skb->pp_recycle = 1;
> +	page_pool_store_mem_info(page, pp);
> +}
> +#endif
> +
>  #endif	/* __KERNEL__ */
>  #endif	/* _LINUX_SKBUFF_H */
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 9814e36becc1..b34b8b128206 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -148,6 +148,8 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
>  	return pool->p.dma_dir;
>  }
>  
> +bool page_pool_return_skb_page(void *data);
> +
>  struct page_pool *page_pool_create(const struct page_pool_params *params);
>  
>  #ifdef CONFIG_PAGE_POOL
> @@ -253,4 +255,11 @@ static inline void page_pool_ring_unlock(struct page_pool *pool)
>  		spin_unlock_bh(&pool->ring.producer_lock);
>  }
>  
> +/* Store mem_info on struct page and use it while recycling skb frags */
> +static inline
> +void page_pool_store_mem_info(struct page *page, struct page_pool *pp)
> +{
> +	set_page_private(page, (unsigned long)pp);
> +}
> +
>  #endif /* _NET_PAGE_POOL_H */
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 2e5e2b8c3a02..52e4f16b5e92 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -626,3 +626,26 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>  	}
>  }
>  EXPORT_SYMBOL(page_pool_update_nid);
> +
> +bool page_pool_return_skb_page(void *data)
> +{
> +	struct page_pool *pp;
> +	struct page *page;
> +
> +	page = virt_to_head_page(data);
> +	if (unlikely(page->signature != PP_SIGNATURE))
> +		return false;
> +
> +	pp = (struct page_pool *)page_private(page);
> +
> +	/* Driver set this to memory recycling info. Reset it on recycle.
> +	 * This will *not* work for NIC using a split-page memory model.
> +	 * The page will be returned to the pool here regardless of the
> +	 * 'flipped' fragment being in use or not.
> +	 */
> +	set_page_private(page, 0);
> +	page_pool_put_full_page(pp, virt_to_head_page(data), false);
> +
> +	return true;
> +}
> +EXPORT_SYMBOL(page_pool_return_skb_page);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3ad22870298c..dc4a5c56b8dc 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -70,6 +70,9 @@
>  #include <net/xfrm.h>
>  #include <net/mpls.h>
>  #include <net/mptcp.h>
> +#if IS_BUILTIN(CONFIG_PAGE_POOL)
> +#include <net/page_pool.h>
> +#endif
>  
>  #include <linux/uaccess.h>
>  #include <trace/events/skb.h>
> @@ -645,6 +648,11 @@ static void skb_free_head(struct sk_buff *skb)
>  {
>  	unsigned char *head = skb->head;
>  
> +#if IS_BUILTIN(CONFIG_PAGE_POOL)

Why IS_BUILTIN() ? 

PAGE_POOL is either y or n

IS_ENABLED() would look better, since we use IS_BUILTIN() for the cases where a module might be used.

Or simply #ifdef CONFIG_PAGE_POOL
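
For reference, the <linux/kconfig.h> semantics:

	IS_BUILTIN(CONFIG_FOO)	/* true only for CONFIG_FOO=y */
	IS_MODULE(CONFIG_FOO)	/* true only for CONFIG_FOO=m */
	IS_ENABLED(CONFIG_FOO)	/* true for CONFIG_FOO=y or =m */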

> +	if (skb->pp_recycle && page_pool_return_skb_page(head))

This probably should be attempted only in the (skb->head_frag) case ?

Also this patch misses pskb_expand_head()

> +		return;
> +#endif
> +
>  	if (skb->head_frag)
>  		skb_free_frag(head);
>  	else
> @@ -664,7 +672,7 @@ static void skb_release_data(struct sk_buff *skb)
>  	skb_zcopy_clear(skb, true);
>  
>  	for (i = 0; i < shinfo->nr_frags; i++)
> -		__skb_frag_unref(&shinfo->frags[i]);
> +		__skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);
>  
>  	if (shinfo->frag_list)
>  		kfree_skb_list(shinfo->frag_list);
> @@ -1046,6 +1054,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
>  	n->nohdr = 0;
>  	n->peeked = 0;
>  	C(pfmemalloc);
> +	C(pp_recycle);
>  	n->destructor = NULL;
>  	C(tail);
>  	C(end);
> @@ -3495,7 +3504,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
>  		fragto = &skb_shinfo(tgt)->frags[merge];
>  
>  		skb_frag_size_add(fragto, skb_frag_size(fragfrom));
> -		__skb_frag_unref(fragfrom);
> +		__skb_frag_unref(fragfrom, skb->pp_recycle);
>  	}
>  
>  	/* Reposition in the original skb */
> @@ -5285,6 +5294,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>  	if (skb_cloned(to))
>  		return false;
>  
> +	/* We can't coalesce skb that are allocated from slab and page_pool
> +	 * The recycle mark is on the skb, so that might end up trying to
> +	 * recycle slab allocated skb->head
> +	 */
> +	if (to->pp_recycle != from->pp_recycle)
> +		return false;
> +
>  	if (len <= skb_tailroom(to)) {
>  		if (len)
>  			BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> index 76a6f8c2eec4..ad11db2c4f63 100644
> --- a/net/tls/tls_device.c
> +++ b/net/tls/tls_device.c
> @@ -127,7 +127,7 @@ static void destroy_record(struct tls_record_info *record)
>  	int i;
>  
>  	for (i = 0; i < record->num_frags; i++)
> -		__skb_frag_unref(&record->frags[i]);
> +		__skb_frag_unref(&record->frags[i], false);
>  	kfree(record);
>  }
>  
> 


* Re: [PATCH net-next v4 2/4] page_pool: Allow drivers to hint on SKB recycling
  2021-05-11 15:24   ` Eric Dumazet
@ 2021-05-12  9:50     ` Ilias Apalodimas
  2021-05-12 14:09       ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Ilias Apalodimas @ 2021-05-12  9:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Matteo Croce, netdev, linux-mm, Ayush Sawal, Vinay Kumar Yadav,
	Rohit Maheshwari, David S. Miller, Jakub Kicinski,
	Thomas Petazzoni, Marcin Wojtas, Russell King, Mirko Lindner,
	Stephen Hemminger, Tariq Toukan, Jesper Dangaard Brouer,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, linux-kernel,
	linux-rdma, bpf, Matthew Wilcox, Eric Dumazet, David Ahern,
	Lorenzo Bianconi, Saeed Mahameed, Andrew Lunn, Paolo Abeni,
	Sven Auhagen

[...]
> > Since we added an extra argument to __skb_frag_unref() to handle
> > recycling, update the current users of the function accordingly.
> 
> This part could be done with a preliminary patch, only adding this
> extra boolean; this would keep the 'complex' patch smaller.

Sure 

[...]
> >  #include <linux/uaccess.h>
> >  #include <trace/events/skb.h>
> > @@ -645,6 +648,11 @@ static void skb_free_head(struct sk_buff *skb)
> >  {
> >  	unsigned char *head = skb->head;
> >  
> > +#if IS_BUILTIN(CONFIG_PAGE_POOL)
> 
> Why IS_BUILTIN() ? 

No reason, we'll replace it with an ifdef

> 
> PAGE_POOL is either y or n
> 
> IS_ENABLED() would look better, since we use IS_BUILTIN() for the cases where a module might be used.
> 
> Or simply #ifdef CONFIG_PAGE_POOL
> 
> > +	if (skb->pp_recycle && page_pool_return_skb_page(head))
> 
> This probably should be attempted only in the (skb->head_frag) case ?

I think the extra check makes sense.

> 
> Also this patch misses pskb_expand_head()

I am not sure I am following. Misses what? pskb_expand_head() will either
call skb_release_data() or skb_free_head(), which would either recycle or
unmap the buffer for us (depending on the page refcnt)

[...]

Thanks
/Ilias


* Re: [PATCH net-next v4 2/4] page_pool: Allow drivers to hint on SKB recycling
  2021-05-12  9:50     ` Ilias Apalodimas
@ 2021-05-12 14:09       ` Eric Dumazet
  2021-05-12 14:39         ` Ilias Apalodimas
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2021-05-12 14:09 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: Eric Dumazet, Matteo Croce, netdev, linux-mm, Ayush Sawal,
	Vinay Kumar Yadav, Rohit Maheshwari, David S. Miller,
	Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas, Russell King,
	Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, LKML, linux-rdma, bpf,
	Matthew Wilcox, David Ahern, Lorenzo Bianconi, Saeed Mahameed,
	Andrew Lunn, Paolo Abeni, Sven Auhagen

On Wed, May 12, 2021 at 11:50 AM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
> [...]
> > > Since we added an extra argument to __skb_frag_unref() to handle
> > > recycling, update the current users of the function accordingly.
> >
> > This part could be done with a preliminary patch, only adding this
> > extra boolean; this would keep the 'complex' patch smaller.
>
> Sure
>
> [...]
> > >  #include <linux/uaccess.h>
> > >  #include <trace/events/skb.h>
> > > @@ -645,6 +648,11 @@ static void skb_free_head(struct sk_buff *skb)
> > >  {
> > >     unsigned char *head = skb->head;
> > >
> > > +#if IS_BUILTIN(CONFIG_PAGE_POOL)
> >
> > Why IS_BUILTIN() ?
>
> No reason, we'll replace it with an ifdef
>
> >
> > PAGE_POOL is either y or n
> >
> > IS_ENABLED() would look better, since we use IS_BUILTIN() for the cases where a module might be used.
> >
> > Or simply #ifdef CONFIG_PAGE_POOL
> >
> > > +   if (skb->pp_recycle && page_pool_return_skb_page(head))
> >
> > This probably should be attempted only in the (skb->head_frag) case ?
>
> I think the extra check makes sense.

What do you mean here ?

>
> >
> > Also this patch misses pskb_expand_head()
>
> I am not sure I am following. Misses what? pskb_expand_head() will either
> call skb_release_data() or skb_free_head(), which would either recycle or
> unmap the buffer for us (depending on the page refcnt)

pskb_expand_head() allocates a new skb->head, from slab.

We should clear skb->pp_recycle there, for consistency with the
skb->head_frag clearing we already perform.

But then, I now realize you use the skb->pp_recycle bit for both skb->head
and the fragments, and rely on this PP_SIGNATURE thing (I note that the
patch 1 changelog does not describe why a random page will _not_ have
this signature by bad luck).

Please document/describe which struct page fields are aliased with
page->signature ?

Thanks !

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v4 2/4] page_pool: Allow drivers to hint on SKB recycling
  2021-05-12 14:09       ` Eric Dumazet
@ 2021-05-12 14:39         ` Ilias Apalodimas
  0 siblings, 0 replies; 20+ messages in thread
From: Ilias Apalodimas @ 2021-05-12 14:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Matteo Croce, netdev, linux-mm, Ayush Sawal,
	Vinay Kumar Yadav, Rohit Maheshwari, David S. Miller,
	Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas, Russell King,
	Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, LKML, linux-rdma, bpf,
	Matthew Wilcox, David Ahern, Lorenzo Bianconi, Saeed Mahameed,
	Andrew Lunn, Paolo Abeni, Sven Auhagen

Hi Eric,

[...]
> > > > +   if (skb->pp_recycle && page_pool_return_skb_page(head))
> > >
> > > This probably should be attempted only in the (skb->head_frag) case ?
> >
> > I think the extra check makes sense.
> 
> What do you mean here ?
> 

I thought you wanted an extra check in the if statement above, i.e. move the
recycling attempt under the existing head_frag branch. Something like:

	if (skb->head_frag) {
#ifdef CONFIG_PAGE_POOL
		if (skb->pp_recycle && page_pool_return_skb_page(head))
			return;
#endif
		skb_free_frag(head);
	} else {
	.....

> >
> > >
> > > Also this patch misses pskb_expand_head()
> >
> > I am not sure I am following. Misses what? pskb_expand_head() will either
> > call skb_release_data() or skb_free_head(), which would either recycle or
> > unmap the buffer for us (depending on the page refcnt)
> 
> pskb_expand_head() allocates a new skb->head, from slab.
> 
> We should clear skb->pp_recycle there, for consistency with the
> skb->head_frag clearing we already perform.

Ah right, good catch. I was mostly worried about us not freeing/unmapping
buffers and I completely missed that.  I think nothing bad will happen even
if we don't clear it, since the signature will eventually protect us, but
it's definitely the right thing to do.
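
In concrete terms that would be a one-liner next to the existing head_frag
update in pskb_expand_head() (untested sketch of the idea, not a final
patch):

	/* sketch: the new head comes from kmalloc, so it must neither be
	 * treated as a page frag nor handed back to page_pool on free
	 */
	skb->head      = data;
	skb->head_frag = 0;
	skb->pp_recycle = 0;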

> 
> But then, I now realize you use the skb->pp_recycle bit for both skb->head
> and the fragments, and rely on this PP_SIGNATURE thing (I note that the
> patch 1 changelog does not describe why a random page will _not_ have
> this signature by bad luck).

Correct.  I tried to explain this in the previous posting too, but that's
the big difference compared to the initial RFC we sent a few years ago (the
ability to recycle frags as well).

> 
> Please document/describe which struct page fields are aliased with
> page->signature ?
> 

Sure, any preference on this? Right above page_pool_return_skb_page() ?

Keep in mind the current [1/4] patch is wrong, since it will overlap
pp_signature with page->mapping.  So we'll have interesting results if a
page gets mapped to userspace :).
What Matthew proposed makes sense; we can add something along the lines of:

+ unsigned long pp_magic;
+ struct page_pool *pp;
+ unsigned long _pp_mapping_pad;
+ unsigned long dma_addr[2];

in struct page. In this case page->mapping aliases to page->_pp_mapping_pad.

The first word (which we'll now be using) holds either a pointer or the
compound_head value.  So as long as pp_magic doesn't resemble a pointer and
has bits 0/1 set to 0, we should be safe.
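
As a sketch of what the encoding could look like (the exact value and the
helper name here are made up, nothing is settled yet):

	/* bits 0/1 clear: cannot be mistaken for a tail page (bit 0) or a
	 * pfmemalloc page (bit 1); POISON_POINTER_DELTA keeps the value
	 * away from valid kernel pointers
	 */
	#define PP_SIGNATURE	(0x40UL + POISON_POINTER_DELTA)

	static inline bool page_is_from_page_pool(const struct page *page)
	{
		return page->pp_magic == PP_SIGNATURE;
	}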

Thanks!
/Ilias

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v4 1/4] mm: add a signature in struct page
  2021-05-11 14:25         ` Ilias Apalodimas
@ 2021-05-12 15:57           ` Matthew Wilcox
  2021-05-12 16:09             ` Eric Dumazet
                               ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Matthew Wilcox @ 2021-05-12 15:57 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: Matteo Croce, Networking, Linux-MM, Ayush Sawal,
	Vinay Kumar Yadav, Rohit Maheshwari, David S. Miller,
	Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas, Russell King,
	Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, open list, linux-rdma,
	bpf, Eric Dumazet, David Ahern, Lorenzo Bianconi, Saeed Mahameed,
	Andrew Lunn, Paolo Abeni, Sven Auhagen

On Tue, May 11, 2021 at 05:25:36PM +0300, Ilias Apalodimas wrote:
> Nope not at all, either would work. we'll switch to that

You'll need something like this because of the current use of
page->index to mean "pfmemalloc".

From ecd6d912056a21bbe55d997c01f96b0b8b9fbc31 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Fri, 16 Apr 2021 18:12:33 -0400
Subject: [PATCH] mm: Indicate pfmemalloc pages in compound_head

The net page_pool wants to use a magic value to identify page pool pages.
The best place to put it is in the first word where it can be clearly a
non-pointer value.  That means shifting dma_addr up to alias with ->index,
which means we need to find another way to indicate page_is_pfmemalloc().
Since page_pool doesn't want to set its magic value on pages which are
pfmemalloc, we can use bit 1 of compound_head to indicate that the page
came from the memory reserves.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h       | 12 +++++++-----
 include/linux/mm_types.h |  7 +++----
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bd21864449bf..4f9b2007efad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1670,10 +1670,12 @@ struct address_space *page_mapping(struct page *page);
 static inline bool page_is_pfmemalloc(const struct page *page)
 {
 	/*
-	 * Page index cannot be this large so this must be
-	 * a pfmemalloc page.
+	 * This is not a tail page; compound_head of a head page is unused
+	 * at return from the page allocator, and will be overwritten
+	 * by callers who do not care whether the page came from the
+	 * reserves.
 	 */
-	return page->index == -1UL;
+	return page->compound_head & 2;
 }
 
 /*
@@ -1682,12 +1684,12 @@ static inline bool page_is_pfmemalloc(const struct page *page)
  */
 static inline void set_page_pfmemalloc(struct page *page)
 {
-	page->index = -1UL;
+	page->compound_head = 2;
 }
 
 static inline void clear_page_pfmemalloc(struct page *page)
 {
-	page->index = 0;
+	page->compound_head = 0;
 }
 
 /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c10a45..1352e278939b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -96,10 +96,9 @@ struct page {
 			unsigned long private;
 		};
 		struct {	/* page_pool used by netstack */
-			/**
-			 * @dma_addr: might require a 64-bit value on
-			 * 32-bit architectures.
-			 */
+			unsigned long pp_magic;
+			struct page_pool *pp;
+			unsigned long _pp_mapping_pad;
 			unsigned long dma_addr[2];
 		};
 		struct {	/* slab, slob and slub */
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v4 1/4] mm: add a signature in struct page
  2021-05-12 15:57           ` Matthew Wilcox
@ 2021-05-12 16:09             ` Eric Dumazet
  2021-05-12 16:26               ` Matthew Wilcox
  2021-05-12 16:47             ` Ilias Apalodimas
  2021-05-13  2:15             ` Yunsheng Lin
  2 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2021-05-12 16:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ilias Apalodimas, Matteo Croce, Networking, Linux-MM,
	Ayush Sawal, Vinay Kumar Yadav, Rohit Maheshwari,
	David S. Miller, Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas,
	Russell King, Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, open list, linux-rdma,
	bpf, David Ahern, Lorenzo Bianconi, Saeed Mahameed, Andrew Lunn,
	Paolo Abeni, Sven Auhagen

On Wed, May 12, 2021 at 6:03 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, May 11, 2021 at 05:25:36PM +0300, Ilias Apalodimas wrote:
> > Nope not at all, either would work. we'll switch to that
>
> You'll need something like this because of the current use of
> page->index to mean "pfmemalloc".
>
> From ecd6d912056a21bbe55d997c01f96b0b8b9fbc31 Mon Sep 17 00:00:00 2001
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Date: Fri, 16 Apr 2021 18:12:33 -0400
> Subject: [PATCH] mm: Indicate pfmemalloc pages in compound_head
>
> The net page_pool wants to use a magic value to identify page pool pages.
> The best place to put it is in the first word where it can be clearly a
> non-pointer value.  That means shifting dma_addr up to alias with ->index,
> which means we need to find another way to indicate page_is_pfmemalloc().
> Since page_pool doesn't want to set its magic value on pages which are
> pfmemalloc, we can use bit 1 of compound_head to indicate that the page
> came from the memory reserves.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/mm.h       | 12 +++++++-----
>  include/linux/mm_types.h |  7 +++----
>  2 files changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bd21864449bf..4f9b2007efad 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1670,10 +1670,12 @@ struct address_space *page_mapping(struct page *page);
>  static inline bool page_is_pfmemalloc(const struct page *page)
>  {
>         /*
> -        * Page index cannot be this large so this must be
> -        * a pfmemalloc page.
> +        * This is not a tail page; compound_head of a head page is unused
> +        * at return from the page allocator, and will be overwritten
> +        * by callers who do not care whether the page came from the
> +        * reserves.
>          */
> -       return page->index == -1UL;
> +       return page->compound_head & 2;
>  }
>
>  /*
> @@ -1682,12 +1684,12 @@ static inline bool page_is_pfmemalloc(const struct page *page)
>   */
>  static inline void set_page_pfmemalloc(struct page *page)
>  {
> -       page->index = -1UL;
> +       page->compound_head = 2;
>  }
>
>  static inline void clear_page_pfmemalloc(struct page *page)
>  {
> -       page->index = 0;
> +       page->compound_head = 0;
>  }
>
>  /*
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5aacc1c10a45..1352e278939b 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -96,10 +96,9 @@ struct page {
>                         unsigned long private;
>                 };
>                 struct {        /* page_pool used by netstack */
> -                       /**
> -                        * @dma_addr: might require a 64-bit value on
> -                        * 32-bit architectures.
> -                        */
> +                       unsigned long pp_magic;
> +                       struct page_pool *pp;
> +                       unsigned long _pp_mapping_pad;
>                         unsigned long dma_addr[2];
>                 };
>                 struct {        /* slab, slob and slub */
> --
> 2.30.2
>

This would break compound_head() ?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v4 1/4] mm: add a signature in struct page
  2021-05-12 16:09             ` Eric Dumazet
@ 2021-05-12 16:26               ` Matthew Wilcox
  0 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox @ 2021-05-12 16:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ilias Apalodimas, Matteo Croce, Networking, Linux-MM,
	Ayush Sawal, Vinay Kumar Yadav, Rohit Maheshwari,
	David S. Miller, Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas,
	Russell King, Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, open list, linux-rdma,
	bpf, David Ahern, Lorenzo Bianconi, Saeed Mahameed, Andrew Lunn,
	Paolo Abeni, Sven Auhagen

On Wed, May 12, 2021 at 06:09:21PM +0200, Eric Dumazet wrote:
> On Wed, May 12, 2021 at 6:03 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, May 11, 2021 at 05:25:36PM +0300, Ilias Apalodimas wrote:
> > > Nope not at all, either would work. we'll switch to that
> >
> > You'll need something like this because of the current use of
> > page->index to mean "pfmemalloc".
> >
> > From ecd6d912056a21bbe55d997c01f96b0b8b9fbc31 Mon Sep 17 00:00:00 2001
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > Date: Fri, 16 Apr 2021 18:12:33 -0400
> > Subject: [PATCH] mm: Indicate pfmemalloc pages in compound_head
> >
> > The net page_pool wants to use a magic value to identify page pool pages.
> > The best place to put it is in the first word where it can be clearly a
> > non-pointer value.  That means shifting dma_addr up to alias with ->index,
> > which means we need to find another way to indicate page_is_pfmemalloc().
> > Since page_pool doesn't want to set its magic value on pages which are
> > pfmemalloc, we can use bit 1 of compound_head to indicate that the page
> > came from the memory reserves.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  include/linux/mm.h       | 12 +++++++-----
> >  include/linux/mm_types.h |  7 +++----
> >  2 files changed, 10 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index bd21864449bf..4f9b2007efad 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1670,10 +1670,12 @@ struct address_space *page_mapping(struct page *page);
> >  static inline bool page_is_pfmemalloc(const struct page *page)
> >  {
> >         /*
> > -        * Page index cannot be this large so this must be
> > -        * a pfmemalloc page.
> > +        * This is not a tail page; compound_head of a head page is unused
> > +        * at return from the page allocator, and will be overwritten
> > +        * by callers who do not care whether the page came from the
> > +        * reserves.
> >          */
> > -       return page->index == -1UL;
> > +       return page->compound_head & 2;
> >  }
> >
> >  /*
> > @@ -1682,12 +1684,12 @@ static inline bool page_is_pfmemalloc(const struct page *page)
> >   */
> >  static inline void set_page_pfmemalloc(struct page *page)
> >  {
> > -       page->index = -1UL;
> > +       page->compound_head = 2;
> >  }
> >
> >  static inline void clear_page_pfmemalloc(struct page *page)
> >  {
> > -       page->index = 0;
> > +       page->compound_head = 0;
> >  }
> >
> >  /*
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 5aacc1c10a45..1352e278939b 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -96,10 +96,9 @@ struct page {
> >                         unsigned long private;
> >                 };
> >                 struct {        /* page_pool used by netstack */
> > -                       /**
> > -                        * @dma_addr: might require a 64-bit value on
> > -                        * 32-bit architectures.
> > -                        */
> > +                       unsigned long pp_magic;
> > +                       struct page_pool *pp;
> > +                       unsigned long _pp_mapping_pad;
> >                         unsigned long dma_addr[2];
> >                 };
> >                 struct {        /* slab, slob and slub */
> 
> This would break compound_head() ?

No, compound_head() only checks bit 0.  If bit 0 is clear, then this is
not a tail page.
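
For reference, compound_head() is essentially:

	static inline struct page *compound_head(struct page *page)
	{
		unsigned long head = READ_ONCE(page->compound_head);

		if (unlikely(head & 1))	/* bit 0 set: this is a tail page */
			return (struct page *)(head - 1);
		return page;
	}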

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v4 1/4] mm: add a signature in struct page
  2021-05-12 15:57           ` Matthew Wilcox
  2021-05-12 16:09             ` Eric Dumazet
@ 2021-05-12 16:47             ` Ilias Apalodimas
  2021-05-13  2:15             ` Yunsheng Lin
  2 siblings, 0 replies; 20+ messages in thread
From: Ilias Apalodimas @ 2021-05-12 16:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matteo Croce, Networking, Linux-MM, Ayush Sawal,
	Vinay Kumar Yadav, Rohit Maheshwari, David S. Miller,
	Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas, Russell King,
	Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Yunsheng Lin, Guillaume Nault, open list, linux-rdma,
	bpf, Eric Dumazet, David Ahern, Lorenzo Bianconi, Saeed Mahameed,
	Andrew Lunn, Paolo Abeni, Sven Auhagen

On Wed, May 12, 2021 at 04:57:25PM +0100, Matthew Wilcox wrote:
> On Tue, May 11, 2021 at 05:25:36PM +0300, Ilias Apalodimas wrote:
> > Nope not at all, either would work. we'll switch to that
> 
> You'll need something like this because of the current use of
> page->index to mean "pfmemalloc".
> 

Yes, I was somehow under the impression that was already merged.
We'll include it in the series, with your Co-developed-by tag.

Thanks
/Ilias

> From ecd6d912056a21bbe55d997c01f96b0b8b9fbc31 Mon Sep 17 00:00:00 2001
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Date: Fri, 16 Apr 2021 18:12:33 -0400
> Subject: [PATCH] mm: Indicate pfmemalloc pages in compound_head
> 
> The net page_pool wants to use a magic value to identify page pool pages.
> The best place to put it is in the first word where it can be clearly a
> non-pointer value.  That means shifting dma_addr up to alias with ->index,
> which means we need to find another way to indicate page_is_pfmemalloc().
> Since page_pool doesn't want to set its magic value on pages which are
> pfmemalloc, we can use bit 1 of compound_head to indicate that the page
> came from the memory reserves.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/mm.h       | 12 +++++++-----
>  include/linux/mm_types.h |  7 +++----
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bd21864449bf..4f9b2007efad 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1670,10 +1670,12 @@ struct address_space *page_mapping(struct page *page);
>  static inline bool page_is_pfmemalloc(const struct page *page)
>  {
>  	/*
> -	 * Page index cannot be this large so this must be
> -	 * a pfmemalloc page.
> +	 * This is not a tail page; compound_head of a head page is unused
> +	 * at return from the page allocator, and will be overwritten
> +	 * by callers who do not care whether the page came from the
> +	 * reserves.
>  	 */
> -	return page->index == -1UL;
> +	return page->compound_head & 2;
>  }
>  
>  /*
> @@ -1682,12 +1684,12 @@ static inline bool page_is_pfmemalloc(const struct page *page)
>   */
>  static inline void set_page_pfmemalloc(struct page *page)
>  {
> -	page->index = -1UL;
> +	page->compound_head = 2;
>  }
>  
>  static inline void clear_page_pfmemalloc(struct page *page)
>  {
> -	page->index = 0;
> +	page->compound_head = 0;
>  }
>  
>  /*
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5aacc1c10a45..1352e278939b 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -96,10 +96,9 @@ struct page {
>  			unsigned long private;
>  		};
>  		struct {	/* page_pool used by netstack */
> -			/**
> -			 * @dma_addr: might require a 64-bit value on
> -			 * 32-bit architectures.
> -			 */
> +			unsigned long pp_magic;
> +			struct page_pool *pp;
> +			unsigned long _pp_mapping_pad;
>  			unsigned long dma_addr[2];
>  		};
>  		struct {	/* slab, slob and slub */
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v4 1/4] mm: add a signature in struct page
  2021-05-12 15:57           ` Matthew Wilcox
  2021-05-12 16:09             ` Eric Dumazet
  2021-05-12 16:47             ` Ilias Apalodimas
@ 2021-05-13  2:15             ` Yunsheng Lin
  2021-05-13  2:35               ` Matthew Wilcox
  2 siblings, 1 reply; 20+ messages in thread
From: Yunsheng Lin @ 2021-05-13  2:15 UTC (permalink / raw)
  To: Matthew Wilcox, Ilias Apalodimas
  Cc: Matteo Croce, Networking, Linux-MM, Ayush Sawal,
	Vinay Kumar Yadav, Rohit Maheshwari, David S. Miller,
	Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas, Russell King,
	Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Guillaume Nault, open list, linux-rdma, bpf,
	Eric Dumazet, David Ahern, Lorenzo Bianconi, Saeed Mahameed,
	Andrew Lunn, Paolo Abeni, Sven Auhagen

On 2021/5/12 23:57, Matthew Wilcox wrote:
> On Tue, May 11, 2021 at 05:25:36PM +0300, Ilias Apalodimas wrote:
>> Nope not at all, either would work. we'll switch to that
> 
> You'll need something like this because of the current use of
> page->index to mean "pfmemalloc".
> 
> From ecd6d912056a21bbe55d997c01f96b0b8b9fbc31 Mon Sep 17 00:00:00 2001
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Date: Fri, 16 Apr 2021 18:12:33 -0400
> Subject: [PATCH] mm: Indicate pfmemalloc pages in compound_head
> 
> The net page_pool wants to use a magic value to identify page pool pages.
> The best place to put it is in the first word where it can be clearly a
> non-pointer value.  That means shifting dma_addr up to alias with ->index,
> which means we need to find another way to indicate page_is_pfmemalloc().
> Since page_pool doesn't want to set its magic value on pages which are
> pfmemalloc, we can use bit 1 of compound_head to indicate that the page
> came from the memory reserves.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/mm.h       | 12 +++++++-----
>  include/linux/mm_types.h |  7 +++----
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bd21864449bf..4f9b2007efad 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1670,10 +1670,12 @@ struct address_space *page_mapping(struct page *page);
>  static inline bool page_is_pfmemalloc(const struct page *page)
>  {
>  	/*
> -	 * Page index cannot be this large so this must be
> -	 * a pfmemalloc page.
> +	 * This is not a tail page; compound_head of a head page is unused
> +	 * at return from the page allocator, and will be overwritten
> +	 * by callers who do not care whether the page came from the
> +	 * reserves.
>  	 */
> -	return page->index == -1UL;
> +	return page->compound_head & 2;
>  }
>  
>  /*
> @@ -1682,12 +1684,12 @@ static inline bool page_is_pfmemalloc(const struct page *page)
>   */
>  static inline void set_page_pfmemalloc(struct page *page)
>  {
> -	page->index = -1UL;
> +	page->compound_head = 2;

Is there any reason not to use "page->compound_head |= 2", to match the
"page->compound_head & 2" check in page_is_pfmemalloc() above?

Also, this may mean we need to make sure to pass a head or base page to
set_page_pfmemalloc() if using "page->compound_head = 2", because that
also clears bit 0 and the head page pointer of a tail page, right?

>  }
>  
>  static inline void clear_page_pfmemalloc(struct page *page)
>  {
> -	page->index = 0;
> +	page->compound_head = 0;
>  }
>  
>  /*
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5aacc1c10a45..1352e278939b 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -96,10 +96,9 @@ struct page {
>  			unsigned long private;
>  		};
>  		struct {	/* page_pool used by netstack */
> -			/**
> -			 * @dma_addr: might require a 64-bit value on
> -			 * 32-bit architectures.
> -			 */
> +			unsigned long pp_magic;
> +			struct page_pool *pp;
> +			unsigned long _pp_mapping_pad;
>  			unsigned long dma_addr[2];

It seems dma_addr[1] aliases with page->private, and page_private()
is used in skb_copy_ubufs()?

It seems we could avoid using page_private() in skb_copy_ubufs()
by using a dynamically allocated array to store the page pointers?

>  		};
>  		struct {	/* slab, slob and slub */
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v4 1/4] mm: add a signature in struct page
  2021-05-13  2:15             ` Yunsheng Lin
@ 2021-05-13  2:35               ` Matthew Wilcox
  2021-05-13  3:25                 ` Yunsheng Lin
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2021-05-13  2:35 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Ilias Apalodimas, Matteo Croce, Networking, Linux-MM,
	Ayush Sawal, Vinay Kumar Yadav, Rohit Maheshwari,
	David S. Miller, Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas,
	Russell King, Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Guillaume Nault, open list, linux-rdma, bpf,
	Eric Dumazet, David Ahern, Lorenzo Bianconi, Saeed Mahameed,
	Andrew Lunn, Paolo Abeni, Sven Auhagen

On Thu, May 13, 2021 at 10:15:26AM +0800, Yunsheng Lin wrote:
> On 2021/5/12 23:57, Matthew Wilcox wrote:
> > You'll need something like this because of the current use of
> > page->index to mean "pfmemalloc".
> >
> > @@ -1682,12 +1684,12 @@ static inline bool page_is_pfmemalloc(const struct page *page)
> >   */
> >  static inline void set_page_pfmemalloc(struct page *page)
> >  {
> > -	page->index = -1UL;
> > +	page->compound_head = 2;
> 
> Is there any reason not to use "page->compound_head |= 2", to match the
> "page->compound_head & 2" check in page_is_pfmemalloc() above?
> 
> Also, this may mean we need to make sure to pass a head or base page to
> set_page_pfmemalloc() if using "page->compound_head = 2", because that
> also clears bit 0 and the head page pointer of a tail page, right?

I think what you're missing here is that this page is freshly allocated.
This is information being passed from the page allocator to any user
who cares to look at it.  By definition, it's set on the head/base page, and
there is nothing else present in the page->compound_head.  Doing an OR
is more expensive than just setting it to 2.

I'm not really sure why set/clear page_pfmemalloc are defined in mm.h.
They should probably be in mm/page_alloc.c where nobody else would ever
think that they could or should be calling them.

> >  		struct {	/* page_pool used by netstack */
> > -			/**
> > -			 * @dma_addr: might require a 64-bit value on
> > -			 * 32-bit architectures.
> > -			 */
> > +			unsigned long pp_magic;
> > +			struct page_pool *pp;
> > +			unsigned long _pp_mapping_pad;
> >  			unsigned long dma_addr[2];
> 
> It seems dma_addr[1] aliases with page->private, and page_private()
> is used in skb_copy_ubufs()?
> 
> It seems we could avoid using page_private() in skb_copy_ubufs()
> by using a dynamically allocated array to store the page pointers?

This is why I hate it when people use page_private() instead of
documenting what they're doing in struct page.  There is no way to know
(as an outsider to networking) whether the page in skb_copy_ubufs()
comes from page_pool.  I looked at it, and thought it didn't:

                page = alloc_page(gfp_mask);

but if you say those pages can come from page_pool, I believe you.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v4 1/4] mm: add a signature in struct page
  2021-05-13  2:35               ` Matthew Wilcox
@ 2021-05-13  3:25                 ` Yunsheng Lin
  0 siblings, 0 replies; 20+ messages in thread
From: Yunsheng Lin @ 2021-05-13  3:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ilias Apalodimas, Matteo Croce, Networking, Linux-MM,
	Ayush Sawal, Vinay Kumar Yadav, Rohit Maheshwari,
	David S. Miller, Jakub Kicinski, Thomas Petazzoni, Marcin Wojtas,
	Russell King, Mirko Lindner, Stephen Hemminger, Tariq Toukan,
	Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Boris Pismenny, Arnd Bergmann, Andrew Morton,
	Peter Zijlstra (Intel),
	Vlastimil Babka, Yu Zhao, Will Deacon, Michel Lespinasse,
	Fenghua Yu, Roman Gushchin, Hugh Dickins, Peter Xu,
	Jason Gunthorpe, Jonathan Lemon, Alexander Lobakin, Cong Wang,
	wenxu, Kevin Hao, Jakub Sitnicki, Marco Elver, Willem de Bruijn,
	Miaohe Lin, Guillaume Nault, open list, linux-rdma, bpf,
	Eric Dumazet, David Ahern, Lorenzo Bianconi, Saeed Mahameed,
	Andrew Lunn, Paolo Abeni, Sven Auhagen

On 2021/5/13 10:35, Matthew Wilcox wrote:
> On Thu, May 13, 2021 at 10:15:26AM +0800, Yunsheng Lin wrote:
>> On 2021/5/12 23:57, Matthew Wilcox wrote:
>>> You'll need something like this because of the current use of
>>> page->index to mean "pfmemalloc".
>>>
>>> @@ -1682,12 +1684,12 @@ static inline bool page_is_pfmemalloc(const struct page *page)
>>>   */
>>>  static inline void set_page_pfmemalloc(struct page *page)
>>>  {
>>> -	page->index = -1UL;
>>> +	page->compound_head = 2;
>>
>> Is there any reason why not use "page->compound_head |= 2"? as
>> corresponding to the "page->compound_head & 2" in the above
>> page_is_pfmemalloc()?
>>
>> Also, this may mean we need to make sure to pass head page or
>> base page to set_page_pfmemalloc() if using
>> "page->compound_head = 2", because it clears the bit 0 and head
>> page ptr for tail page too, right?
> 
> I think what you're missing here is that this page is freshly allocated.
> This is information being passed from the page allocator to any user
> who cares to look at it.  By definition, it's set on the head/base page, and
> there is nothing else present in the page->compound_head.  Doing an OR
> is more expensive than just setting it to 2.

Thanks for clarifying.

> 
> I'm not really sure why set/clear page_pfmemalloc are defined in mm.h.
> They should probably be in mm/page_alloc.c where nobody else would ever
> think that they could or should be calling them.
>
>>>  		struct {	/* page_pool used by netstack */
>>> -			/**
>>> -			 * @dma_addr: might require a 64-bit value on
>>> -			 * 32-bit architectures.
>>> -			 */
>>> +			unsigned long pp_magic;
>>> +			struct page_pool *pp;
>>> +			unsigned long _pp_mapping_pad;
>>>  			unsigned long dma_addr[2];
>>
>> It seems the dma_addr[1] aliases with page->private, and
>> page_private() is used in skb_copy_ubufs()?
>>
>> It seems we can avoid using page_private() in skb_copy_ubufs()
>> by using a dynamic allocated array to store the page ptr?
> 
> This is why I hate it when people use page_private() instead of
> documenting what they're doing in struct page.  There is no way to know
> (as an outsider to networking) whether the page in skb_copy_ubufs()
> comes from page_pool.  I looked at it, and thought it didn't:
> 
>                 page = alloc_page(gfp_mask);
> 
> but if you say those pages can come from page_pool, I believe you.

The page_private() usage in skb_copy_ubufs() does indeed seem ok here:
page_private() is only used on pages freshly allocated from alloc_page(),
roughly as in the sketch below.
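
	/* rough sketch of what skb_copy_ubufs() does with those pages:
	 * they come straight from the page allocator and are chained
	 * through page->private, so no page_pool page is ever involved
	 * (error unwinding elided)
	 */
	for (i = 0; i < new_frags; i++) {
		page = alloc_page(gfp_mask);
		if (!page)
			return -ENOMEM;
		set_page_private(page, (unsigned long)head);
		head = page;
	}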

Sorry for the confusion.

> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-05-13  3:25 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-11 13:31 [PATCH net-next v4 0/4] page_pool: recycle buffers Matteo Croce
2021-05-11 13:31 ` [PATCH net-next v4 1/4] mm: add a signature in struct page Matteo Croce
2021-05-11 13:45   ` Matthew Wilcox
2021-05-11 14:11     ` Ilias Apalodimas
2021-05-11 14:18       ` Matthew Wilcox
2021-05-11 14:25         ` Ilias Apalodimas
2021-05-12 15:57           ` Matthew Wilcox
2021-05-12 16:09             ` Eric Dumazet
2021-05-12 16:26               ` Matthew Wilcox
2021-05-12 16:47             ` Ilias Apalodimas
2021-05-13  2:15             ` Yunsheng Lin
2021-05-13  2:35               ` Matthew Wilcox
2021-05-13  3:25                 ` Yunsheng Lin
2021-05-11 13:31 ` [PATCH net-next v4 2/4] page_pool: Allow drivers to hint on SKB recycling Matteo Croce
2021-05-11 15:24   ` Eric Dumazet
2021-05-12  9:50     ` Ilias Apalodimas
2021-05-12 14:09       ` Eric Dumazet
2021-05-12 14:39         ` Ilias Apalodimas
2021-05-11 13:31 ` [PATCH net-next v4 3/4] mvpp2: recycle buffers Matteo Croce
2021-05-11 13:31 ` [PATCH net-next v4 4/4] mvneta: " Matteo Croce

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).