* [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3
@ 2023-05-24 15:32 David Howells
  2023-05-24 15:33 ` [PATCH net-next 01/12] mm: Move the page fragment allocator from page_alloc.c into its own file David Howells
                   ` (11 more replies)
  0 siblings, 12 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:32 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel

Here's the third tranche of patches towards providing a MSG_SPLICE_PAGES
internal sendmsg flag that is intended to replace the ->sendpage() op with
calls to sendmsg().  MSG_SPLICE_PAGES is a hint that tells the protocol
that it should splice the pages supplied if it can and copy them if not.
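
For illustration, a kernel-side caller wraps the pages in a BVEC-class
iterator and sets the flag in the msghdr.  A minimal sketch (the function
and its parameters are invented for the example, not taken from the
patches):

	#include <linux/bvec.h>
	#include <linux/net.h>		/* struct socket, sock_sendmsg() */
	#include <linux/socket.h>	/* MSG_SPLICE_PAGES */
	#include <linux/uio.h>

	static int splice_page_to_sock(struct socket *sock, struct page *page,
				       unsigned int offset, size_t len)
	{
		struct bio_vec bvec;
		struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };

		/* Describe the page; the protocol may take a ref on it and
		 * splice it into an skb rather than copying the data.
		 */
		bvec_set_page(&bvec, page, len, offset);
		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
		return sock_sendmsg(sock, &msg);
	}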

The primary focus of this tranche is to allow slab-allocated data passed to
sendmsg() to be copied into page fragments (appending it to existing free
space within an sk_buff would also be possible), thereby allowing a single
sendmsg() to mix data held in the slab (such as higher-level protocol
pieces) and data held in pages (such as content for a network filesystem).
This puts the copying in (mostly) one place: skb_splice_from_iter().
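
With the slab-copy support added later in the series, a single sendmsg() can
then carry both a kmalloc'd protocol header and payload pages in one BVEC
iterator: skb_splice_from_iter() copies the slab-backed segment into a page
fragment and splices the page-backed one.  A rough sketch, with invented
names:

	#include <linux/bvec.h>
	#include <linux/net.h>
	#include <linux/socket.h>
	#include <linux/uio.h>

	static int send_hdr_and_data(struct socket *sock, void *hdr,
				     size_t hdr_len, struct page *data,
				     size_t data_len)
	{
		struct bio_vec bvec[2];
		struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };

		bvec_set_virt(&bvec[0], hdr, hdr_len);	    /* slab: copied */
		bvec_set_page(&bvec[1], data, data_len, 0); /* page: spliced */
		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, 2,
			      hdr_len + data_len);
		return sock_sendmsg(sock, &msg);
	}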

To make this work, some sort of locking is needed around the allocator.  I've
chosen to give the allocator a separate bucket per CPU internally, as the
netdev and napi allocators already do, and then share the allocated pages
amongst the services that were previously using their own allocators.  I'm
not sure that the existing usage of the allocator is completely thread safe.
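
For reference, the arrangement being replaced looks like the pattern in the
skbuff.c hunks below: each service keeps its own per-cpu page_frag_cache and
serialises by disabling bottom halves rather than by taking a lock.  A
sketch of that pattern (names invented; compare __netdev_alloc_frag_align()
in patch 3):

	#include <linux/bottom_half.h>
	#include <linux/gfp.h>
	#include <linux/mm_types.h>
	#include <linux/percpu.h>

	static DEFINE_PER_CPU(struct page_frag_cache, example_frag_cache);

	static void *example_alloc_frag(unsigned int fragsz)
	{
		void *data;

		/* No lock: only serialised against softirq users on this CPU. */
		local_bh_disable();
		data = page_frag_alloc(this_cpu_ptr(&example_frag_cache),
				       fragsz, GFP_ATOMIC);
		local_bh_enable();
		return data;
	}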

TLS is also converted here because it does things differently, using sk_msg
rather than sk_buff, and so can't use skb_splice_from_iter().

So, firstly the page_frag_alloc_align() allocator is overhauled:

 (1) Split it out from mm/page_alloc.c into its own file,
     mm/page_frag_alloc.c.

 (2) Add a common function to clear an allocator.

 (3) Make the alignment specification consistent with some of the wrapper
     functions, so the parameter is a power of 2 rather than a mask (see
     the usage sketch after (7)).

 (4) Make it use multipage folios rather than compound pages.

 (5) Make it handle __GFP_ZERO, rather than devolving this to the page
     allocator.

     Note that the current behaviour is potentially broken as the page may
     get reused if all refs have been dropped, but it doesn't then get
     cleared.  This might mean that the NVMe over TCP driver, for example,
     will malfunction under some circumstances.

 (6) Give it per-cpu buckets to allocate from to avoid the need for locking
     against users on other cpus.

 (7) The netdev_alloc_cache and the napi fragment cache are then recast
     in terms of this and some private allocators are removed.
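
Taken together, (2) and (3) give the allocator a usage pattern along these
lines - a minimal sketch with made-up sizes and flags:

	#include <linux/cache.h>
	#include <linux/gfp.h>
	#include <linux/mm_types.h>

	static void example_frag_user(struct page_frag_cache *cache)
	{
		void *p;

		/* The alignment is now a power of 2 rather than a mask. */
		p = page_frag_alloc_align(cache, 256, GFP_ATOMIC,
					  SMP_CACHE_BYTES);
		if (p) {
			/* ... fill the cacheline-aligned 256-byte fragment ... */
			page_frag_free(p);
		}

		/* At teardown, one call discards anything still cached. */
		page_frag_cache_clear(cache);
	}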

We can then make use of the page fragment allocator to copy data that is
resident in the slab rather than returning EIO:

 (8) Make skb_splice_from_iter() copy data provided in the slab to page
     fragments.

 (9) Implement MSG_SPLICE_PAGES support in the AF_TLS-sw sendmsg and make
     tls_sw_sendpage() just a wrapper around sendmsg().

(10) Implement MSG_SPLICE_PAGES support in AF_TLS-device and make
     tls_device_sendpage() just a wrapper around sendmsg() (the general
     shape of such a wrapper is sketched below).
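
The ->sendpage() wrappers in (9) and (10) then take roughly this shape.
This is a sketch only, with an invented name, assuming it sits alongside
the tls code where tls_sw_sendmsg() is declared; the actual conversions are
in the tls patches of this series:

	#include <linux/bvec.h>
	#include <linux/socket.h>
	#include <linux/uio.h>
	#include <net/sock.h>
	#include "tls.h"

	static int example_sendpage(struct sock *sk, struct page *page,
				    int offset, size_t size, int flags)
	{
		struct bio_vec bvec;
		struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES };

		bvec_set_page(&bvec, page, size, offset);
		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
		return tls_sw_sendmsg(sk, &msg, size);
	}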

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=sendpage-3

David

Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=51c78a4d532efe9543a4df019ff405f05c6157f6 # part 1

David Howells (12):
  mm: Move the page fragment allocator from page_alloc.c into its own
    file
  mm: Provide a page_frag_cache allocator cleanup function
  mm: Make the page_frag_cache allocator alignment param a pow-of-2
  mm: Make the page_frag_cache allocator use multipage folios
  mm: Make the page_frag_cache allocator handle __GFP_ZERO itself
  mm: Make the page_frag_cache allocator use per-cpu
  net: Clean up users of netdev_alloc_cache and napi_frag_cache
  net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)
  tls/sw: Support MSG_SPLICE_PAGES
  tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES
  tls/device: Support MSG_SPLICE_PAGES
  tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES

 drivers/net/ethernet/google/gve/gve.h      |   1 -
 drivers/net/ethernet/google/gve/gve_main.c |  16 --
 drivers/net/ethernet/google/gve/gve_rx.c   |   2 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.c |  19 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.h |   2 -
 drivers/nvme/host/tcp.c                    |  19 +-
 drivers/nvme/target/tcp.c                  |  22 +-
 include/linux/gfp.h                        |  17 +-
 include/linux/mm_types.h                   |  13 +-
 include/linux/skbuff.h                     |  28 +--
 mm/Makefile                                |   2 +-
 mm/page_alloc.c                            | 126 ------------
 mm/page_frag_alloc.c                       | 206 +++++++++++++++++++
 net/core/skbuff.c                          |  94 +++++----
 net/tls/tls_device.c                       |  93 ++++-----
 net/tls/tls_sw.c                           | 221 ++++++++-------------
 16 files changed, 418 insertions(+), 463 deletions(-)
 create mode 100644 mm/page_frag_alloc.c



* [PATCH net-next 01/12] mm: Move the page fragment allocator from page_alloc.c into its own file
  2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
@ 2023-05-24 15:33 ` David Howells
  2023-05-24 15:33   ` David Howells
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Andrew Morton

Move the page fragment allocator from page_alloc.c into its own file
preparatory to changing it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
---
 mm/Makefile          |   2 +-
 mm/page_alloc.c      | 126 -----------------------------------------
 mm/page_frag_alloc.c | 131 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 132 insertions(+), 127 deletions(-)
 create mode 100644 mm/page_frag_alloc.c

diff --git a/mm/Makefile b/mm/Makefile
index e29afc890cde..0daa4c6f4552 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -51,7 +51,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o percpu.o slab_common.o \
-			   compaction.o \
+			   compaction.o page_frag_alloc.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o gup.o mmap_lock.o $(mmu-y)
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..29dc79dbeb22 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4871,132 +4871,6 @@ void free_pages(unsigned long addr, unsigned int order)
 
 EXPORT_SYMBOL(free_pages);
 
-/*
- * Page Fragment:
- *  An arbitrary-length arbitrary-offset area of memory which resides
- *  within a 0 or higher order page.  Multiple fragments within that page
- *  are individually refcounted, in the page's reference counter.
- *
- * The page_frag functions below provide a simple allocation framework for
- * page fragments.  This is used by the network stack and network device
- * drivers to provide a backing region of memory for use as either an
- * sk_buff->head, or to be used in the "frags" portion of skb_shared_info.
- */
-static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
-					     gfp_t gfp_mask)
-{
-	struct page *page = NULL;
-	gfp_t gfp = gfp_mask;
-
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
-		    __GFP_NOMEMALLOC;
-	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
-				PAGE_FRAG_CACHE_MAX_ORDER);
-	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
-#endif
-	if (unlikely(!page))
-		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
-
-	nc->va = page ? page_address(page) : NULL;
-
-	return page;
-}
-
-void __page_frag_cache_drain(struct page *page, unsigned int count)
-{
-	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
-
-	if (page_ref_sub_and_test(page, count))
-		free_the_page(page, compound_order(page));
-}
-EXPORT_SYMBOL(__page_frag_cache_drain);
-
-void *page_frag_alloc_align(struct page_frag_cache *nc,
-		      unsigned int fragsz, gfp_t gfp_mask,
-		      unsigned int align_mask)
-{
-	unsigned int size = PAGE_SIZE;
-	struct page *page;
-	int offset;
-
-	if (unlikely(!nc->va)) {
-refill:
-		page = __page_frag_cache_refill(nc, gfp_mask);
-		if (!page)
-			return NULL;
-
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-		/* if size can vary use size else just use PAGE_SIZE */
-		size = nc->size;
-#endif
-		/* Even if we own the page, we do not use atomic_set().
-		 * This would break get_page_unless_zero() users.
-		 */
-		page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
-
-		/* reset page count bias and offset to start of new frag */
-		nc->pfmemalloc = page_is_pfmemalloc(page);
-		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		nc->offset = size;
-	}
-
-	offset = nc->offset - fragsz;
-	if (unlikely(offset < 0)) {
-		page = virt_to_page(nc->va);
-
-		if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
-			goto refill;
-
-		if (unlikely(nc->pfmemalloc)) {
-			free_the_page(page, compound_order(page));
-			goto refill;
-		}
-
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-		/* if size can vary use size else just use PAGE_SIZE */
-		size = nc->size;
-#endif
-		/* OK, page count is 0, we can safely set it */
-		set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
-
-		/* reset page count bias and offset to start of new frag */
-		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		offset = size - fragsz;
-		if (unlikely(offset < 0)) {
-			/*
-			 * The caller is trying to allocate a fragment
-			 * with fragsz > PAGE_SIZE but the cache isn't big
-			 * enough to satisfy the request, this may
-			 * happen in low memory conditions.
-			 * We don't release the cache page because
-			 * it could make memory pressure worse
-			 * so we simply return NULL here.
-			 */
-			return NULL;
-		}
-	}
-
-	nc->pagecnt_bias--;
-	offset &= align_mask;
-	nc->offset = offset;
-
-	return nc->va + offset;
-}
-EXPORT_SYMBOL(page_frag_alloc_align);
-
-/*
- * Frees a page fragment allocated out of either a compound or order 0 page.
- */
-void page_frag_free(void *addr)
-{
-	struct page *page = virt_to_head_page(addr);
-
-	if (unlikely(put_page_testzero(page)))
-		free_the_page(page, compound_order(page));
-}
-EXPORT_SYMBOL(page_frag_free);
-
 static void *make_alloc_exact(unsigned long addr, unsigned int order,
 		size_t size)
 {
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
new file mode 100644
index 000000000000..bee95824ef8f
--- /dev/null
+++ b/mm/page_frag_alloc.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Page fragment allocator
+ *
+ * Page Fragment:
+ *  An arbitrary-length arbitrary-offset area of memory which resides within a
+ *  0 or higher order page.  Multiple fragments within that page are
+ *  individually refcounted, in the page's reference counter.
+ *
+ * The page_frag functions provide a simple allocation framework for page
+ * fragments.  This is used by the network stack and network device drivers to
+ * provide a backing region of memory for use as either an sk_buff->head, or to
+ * be used in the "frags" portion of skb_shared_info.
+ */
+
+#include <linux/export.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+
+static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
+					     gfp_t gfp_mask)
+{
+	struct page *page = NULL;
+	gfp_t gfp = gfp_mask;
+
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
+		    __GFP_NOMEMALLOC;
+	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
+				PAGE_FRAG_CACHE_MAX_ORDER);
+	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
+#endif
+	if (unlikely(!page))
+		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+
+	nc->va = page ? page_address(page) : NULL;
+
+	return page;
+}
+
+void __page_frag_cache_drain(struct page *page, unsigned int count)
+{
+	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+
+	if (page_ref_sub_and_test(page, count - 1))
+		__free_pages(page, compound_order(page));
+}
+EXPORT_SYMBOL(__page_frag_cache_drain);
+
+void *page_frag_alloc_align(struct page_frag_cache *nc,
+		      unsigned int fragsz, gfp_t gfp_mask,
+		      unsigned int align_mask)
+{
+	unsigned int size = PAGE_SIZE;
+	struct page *page;
+	int offset;
+
+	if (unlikely(!nc->va)) {
+refill:
+		page = __page_frag_cache_refill(nc, gfp_mask);
+		if (!page)
+			return NULL;
+
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+		/* if size can vary use size else just use PAGE_SIZE */
+		size = nc->size;
+#endif
+		/* Even if we own the page, we do not use atomic_set().
+		 * This would break get_page_unless_zero() users.
+		 */
+		page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
+
+		/* reset page count bias and offset to start of new frag */
+		nc->pfmemalloc = page_is_pfmemalloc(page);
+		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+		nc->offset = size;
+	}
+
+	offset = nc->offset - fragsz;
+	if (unlikely(offset < 0)) {
+		page = virt_to_page(nc->va);
+
+		if (page_ref_count(page) != nc->pagecnt_bias)
+			goto refill;
+		if (unlikely(nc->pfmemalloc)) {
+			page_ref_sub(page, nc->pagecnt_bias - 1);
+			__free_pages(page, compound_order(page));
+			goto refill;
+		}
+
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+		/* if size can vary use size else just use PAGE_SIZE */
+		size = nc->size;
+#endif
+		/* OK, page count is 0, we can safely set it */
+		set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+
+		/* reset page count bias and offset to start of new frag */
+		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+		offset = size - fragsz;
+		if (unlikely(offset < 0)) {
+			/*
+			 * The caller is trying to allocate a fragment
+			 * with fragsz > PAGE_SIZE but the cache isn't big
+			 * enough to satisfy the request, this may
+			 * happen in low memory conditions.
+			 * We don't release the cache page because
+			 * it could make memory pressure worse
+			 * so we simply return NULL here.
+			 */
+			return NULL;
+		}
+	}
+
+	nc->pagecnt_bias--;
+	offset &= align_mask;
+	nc->offset = offset;
+
+	return nc->va + offset;
+}
+EXPORT_SYMBOL(page_frag_alloc_align);
+
+/*
+ * Frees a page fragment allocated out of either a compound or order 0 page.
+ */
+void page_frag_free(void *addr)
+{
+	struct page *page = virt_to_head_page(addr);
+
+	__free_pages(page, compound_order(page));
+}
+EXPORT_SYMBOL(page_frag_free);



* [PATCH net-next 02/12] mm: Provide a page_frag_cache allocator cleanup function
  2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
@ 2023-05-24 15:33   ` David Howells
  2023-05-24 15:33   ` David Howells
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Jeroen de Borst,
	Catherine Sullivan, Shailend Chand, Felix Fietkau, John Crispin,
	Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
	AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Andrew Morton, linux-arm-kernel, linux-mediatek, linux-nvme

Provide a function to clean up a page_frag_cache allocator rather than
doing it manually each time.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jeroen de Borst <jeroendb@google.com>
cc: Catherine Sullivan <csully@google.com>
cc: Shailend Chand <shailend@google.com>
cc: Felix Fietkau <nbd@nbd.name>
cc: John Crispin <john@phrozen.org>
cc: Sean Wang <sean.wang@mediatek.com>
cc: Mark Lee <Mark-MC.Lee@mediatek.com>
cc: Lorenzo Bianconi <lorenzo@kernel.org>
cc: Matthias Brugger <matthias.bgg@gmail.com>
cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
cc: Keith Busch <kbusch@kernel.org>
cc: Jens Axboe <axboe@fb.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Sagi Grimberg <sagi@grimberg.me>
cc: Chaitanya Kulkarni <kch@nvidia.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
cc: linux-arm-kernel@lists.infradead.org
cc: linux-mediatek@lists.infradead.org
cc: linux-nvme@lists.infradead.org
cc: linux-mm@kvack.org
---
 drivers/net/ethernet/google/gve/gve_main.c | 11 ++---------
 drivers/net/ethernet/mediatek/mtk_wed_wo.c | 17 ++---------------
 drivers/nvme/host/tcp.c                    |  8 +-------
 drivers/nvme/target/tcp.c                  |  5 +----
 include/linux/gfp.h                        |  2 ++
 mm/page_frag_alloc.c                       | 17 +++++++++++++++++
 6 files changed, 25 insertions(+), 35 deletions(-)

diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 8fb70db63b8b..55feab29bed9 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -1251,17 +1251,10 @@ static void gve_unreg_xdp_info(struct gve_priv *priv)
 
 static void gve_drain_page_cache(struct gve_priv *priv)
 {
-	struct page_frag_cache *nc;
 	int i;
 
-	for (i = 0; i < priv->rx_cfg.num_queues; i++) {
-		nc = &priv->rx[i].page_cache;
-		if (nc->va) {
-			__page_frag_cache_drain(virt_to_page(nc->va),
-						nc->pagecnt_bias);
-			nc->va = NULL;
-		}
-	}
+	for (i = 0; i < priv->rx_cfg.num_queues; i++)
+		page_frag_cache_clear(&priv->rx[i].page_cache);
 }
 
 static int gve_open(struct net_device *dev)
diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.c b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
index 69fba29055e9..d90fea2c7d04 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.c
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
@@ -286,7 +286,6 @@ mtk_wed_wo_queue_free(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 static void
 mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 {
-	struct page *page;
 	int i;
 
 	for (i = 0; i < q->n_desc; i++) {
@@ -298,19 +297,12 @@ mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 		entry->buf = NULL;
 	}
 
-	if (!q->cache.va)
-		return;
-
-	page = virt_to_page(q->cache.va);
-	__page_frag_cache_drain(page, q->cache.pagecnt_bias);
-	memset(&q->cache, 0, sizeof(q->cache));
+	page_frag_cache_clear(&q->cache);
 }
 
 static void
 mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 {
-	struct page *page;
-
 	for (;;) {
 		void *buf = mtk_wed_wo_dequeue(wo, q, NULL, true);
 
@@ -320,12 +312,7 @@ mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 		skb_free_frag(buf);
 	}
 
-	if (!q->cache.va)
-		return;
-
-	page = virt_to_page(q->cache.va);
-	__page_frag_cache_drain(page, q->cache.pagecnt_bias);
-	memset(&q->cache, 0, sizeof(q->cache));
+	page_frag_cache_clear(&q->cache);
 }
 
 static void
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index bf0230442d57..dcc35f6bff8c 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1315,7 +1315,6 @@ static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
 
 static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 {
-	struct page *page;
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
 	unsigned int noreclaim_flag;
@@ -1326,12 +1325,7 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 	if (queue->hdr_digest || queue->data_digest)
 		nvme_tcp_free_crypto(queue);
 
-	if (queue->pf_cache.va) {
-		page = virt_to_head_page(queue->pf_cache.va);
-		__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
-		queue->pf_cache.va = NULL;
-	}
-
+	page_frag_cache_clear(&queue->pf_cache);
 	noreclaim_flag = memalloc_noreclaim_save();
 	sock_release(queue->sock);
 	memalloc_noreclaim_restore(noreclaim_flag);
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index ed98df72c76b..984e6ce85dcd 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1464,7 +1464,6 @@ static void nvmet_tcp_free_cmd_data_in_buffers(struct nvmet_tcp_queue *queue)
 
 static void nvmet_tcp_release_queue_work(struct work_struct *w)
 {
-	struct page *page;
 	struct nvmet_tcp_queue *queue =
 		container_of(w, struct nvmet_tcp_queue, release_work);
 
@@ -1486,9 +1485,7 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
 	if (queue->hdr_digest || queue->data_digest)
 		nvmet_tcp_free_crypto(queue);
 	ida_free(&nvmet_tcp_queue_ida, queue->idx);
-
-	page = virt_to_head_page(queue->pf_cache.va);
-	__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
+	page_frag_cache_clear(&queue->pf_cache);
 	kfree(queue);
 }
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ed8cb537c6a7..03504beb51e4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -314,6 +314,8 @@ static inline void *page_frag_alloc(struct page_frag_cache *nc,
 	return page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u);
 }
 
+void page_frag_cache_clear(struct page_frag_cache *nc);
+
 extern void page_frag_free(void *addr);
 
 #define __free_page(page) __free_pages((page), 0)
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
index bee95824ef8f..e02b81d68dc4 100644
--- a/mm/page_frag_alloc.c
+++ b/mm/page_frag_alloc.c
@@ -46,6 +46,23 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
+/**
+ * page_frag_cache_clear - Clear out a page fragment cache
+ * @nc: The cache to clear
+ *
+ * Discard any pages still cached in a page fragment cache.
+ */
+void page_frag_cache_clear(struct page_frag_cache *nc)
+{
+	if (nc->va) {
+		struct page *page = virt_to_head_page(nc->va);
+
+		__page_frag_cache_drain(page, nc->pagecnt_bias);
+		nc->va = NULL;
+	}
+}
+EXPORT_SYMBOL(page_frag_cache_clear);
+
 void *page_frag_alloc_align(struct page_frag_cache *nc,
 		      unsigned int fragsz, gfp_t gfp_mask,
 		      unsigned int align_mask)




* [PATCH net-next 03/12] mm: Make the page_frag_cache allocator alignment param a pow-of-2
  2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
@ 2023-05-24 15:33   ` David Howells
  2023-05-24 15:33   ` David Howells
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Jeroen de Borst,
	Catherine Sullivan, Shailend Chand, Felix Fietkau, John Crispin,
	Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
	AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Andrew Morton, linux-arm-kernel, linux-mediatek, linux-nvme

Make the page_frag_cache allocator's alignment parameter a power of 2
rather than a mask and give a warning if it isn't.

This means that it's consistent with {napi,netdev}_alloc_frag_align() and
allows __{napi,netdev}_alloc_frag_align() to be removed.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jeroen de Borst <jeroendb@google.com>
cc: Catherine Sullivan <csully@google.com>
cc: Shailend Chand <shailend@google.com>
cc: Felix Fietkau <nbd@nbd.name>
cc: John Crispin <john@phrozen.org>
cc: Sean Wang <sean.wang@mediatek.com>
cc: Mark Lee <Mark-MC.Lee@mediatek.com>
cc: Lorenzo Bianconi <lorenzo@kernel.org>
cc: Matthias Brugger <matthias.bgg@gmail.com>
cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
cc: Keith Busch <kbusch@kernel.org>
cc: Jens Axboe <axboe@fb.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Sagi Grimberg <sagi@grimberg.me>
cc: Chaitanya Kulkarni <kch@nvidia.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
cc: linux-arm-kernel@lists.infradead.org
cc: linux-mediatek@lists.infradead.org
cc: linux-nvme@lists.infradead.org
cc: linux-mm@kvack.org
---
 include/linux/gfp.h    |  4 ++--
 include/linux/skbuff.h | 22 ++++------------------
 mm/page_frag_alloc.c   |  8 +++++---
 net/core/skbuff.c      | 14 +++++++-------
 4 files changed, 18 insertions(+), 30 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 03504beb51e4..fa30100f46ad 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -306,12 +306,12 @@ struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
 extern void *page_frag_alloc_align(struct page_frag_cache *nc,
 				   unsigned int fragsz, gfp_t gfp_mask,
-				   unsigned int align_mask);
+				   unsigned int align);
 
 static inline void *page_frag_alloc(struct page_frag_cache *nc,
 			     unsigned int fragsz, gfp_t gfp_mask)
 {
-	return page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u);
+	return page_frag_alloc_align(nc, fragsz, gfp_mask, 1);
 }
 
 void page_frag_cache_clear(struct page_frag_cache *nc);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1b2ebf6113e0..41b63e72c6c3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3158,7 +3158,7 @@ void skb_queue_purge(struct sk_buff_head *list);
 
 unsigned int skb_rbtree_purge(struct rb_root *root);
 
-void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask);
+void *netdev_alloc_frag_align(unsigned int fragsz, unsigned int align);
 
 /**
  * netdev_alloc_frag - allocate a page fragment
@@ -3169,14 +3169,7 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask);
  */
 static inline void *netdev_alloc_frag(unsigned int fragsz)
 {
-	return __netdev_alloc_frag_align(fragsz, ~0u);
-}
-
-static inline void *netdev_alloc_frag_align(unsigned int fragsz,
-					    unsigned int align)
-{
-	WARN_ON_ONCE(!is_power_of_2(align));
-	return __netdev_alloc_frag_align(fragsz, -align);
+	return netdev_alloc_frag_align(fragsz, 1);
 }
 
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int length,
@@ -3236,18 +3229,11 @@ static inline void skb_free_frag(void *addr)
 	page_frag_free(addr);
 }
 
-void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask);
+void *napi_alloc_frag_align(unsigned int fragsz, unsigned int align);
 
 static inline void *napi_alloc_frag(unsigned int fragsz)
 {
-	return __napi_alloc_frag_align(fragsz, ~0u);
-}
-
-static inline void *napi_alloc_frag_align(unsigned int fragsz,
-					  unsigned int align)
-{
-	WARN_ON_ONCE(!is_power_of_2(align));
-	return __napi_alloc_frag_align(fragsz, -align);
+	return napi_alloc_frag_align(fragsz, 1);
 }
 
 struct sk_buff *__napi_alloc_skb(struct napi_struct *napi,
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
index e02b81d68dc4..9d3f6fbd9a07 100644
--- a/mm/page_frag_alloc.c
+++ b/mm/page_frag_alloc.c
@@ -64,13 +64,15 @@ void page_frag_cache_clear(struct page_frag_cache *nc)
 EXPORT_SYMBOL(page_frag_cache_clear);
 
 void *page_frag_alloc_align(struct page_frag_cache *nc,
-		      unsigned int fragsz, gfp_t gfp_mask,
-		      unsigned int align_mask)
+			    unsigned int fragsz, gfp_t gfp_mask,
+			    unsigned int align)
 {
 	unsigned int size = PAGE_SIZE;
 	struct page *page;
 	int offset;
 
+	WARN_ON_ONCE(!is_power_of_2(align));
+
 	if (unlikely(!nc->va)) {
 refill:
 		page = __page_frag_cache_refill(nc, gfp_mask);
@@ -129,7 +131,7 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 	}
 
 	nc->pagecnt_bias--;
-	offset &= align_mask;
+	offset &= ~(align - 1);
 	nc->offset = offset;
 
 	return nc->va + offset;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f4a5b51aed22..cc507433b357 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -289,17 +289,17 @@ void napi_get_frags_check(struct napi_struct *napi)
 	local_bh_enable();
 }
 
-void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
+void *napi_alloc_frag_align(unsigned int fragsz, unsigned int align)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
 	fragsz = SKB_DATA_ALIGN(fragsz);
 
-	return page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
+	return page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align);
 }
-EXPORT_SYMBOL(__napi_alloc_frag_align);
+EXPORT_SYMBOL(napi_alloc_frag_align);
 
-void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
+void *netdev_alloc_frag_align(unsigned int fragsz, unsigned int align)
 {
 	void *data;
 
@@ -307,18 +307,18 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 	if (in_hardirq() || irqs_disabled()) {
 		struct page_frag_cache *nc = this_cpu_ptr(&netdev_alloc_cache);
 
-		data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
+		data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align);
 	} else {
 		struct napi_alloc_cache *nc;
 
 		local_bh_disable();
 		nc = this_cpu_ptr(&napi_alloc_cache);
-		data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
+		data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align);
 		local_bh_enable();
 	}
 	return data;
 }
-EXPORT_SYMBOL(__netdev_alloc_frag_align);
+EXPORT_SYMBOL(netdev_alloc_frag_align);
 
 static struct sk_buff *napi_skb_cache_get(void)
 {




* [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios
  2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
@ 2023-05-24 15:33   ` David Howells
  2023-05-24 15:33   ` David Howells
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Jeroen de Borst,
	Catherine Sullivan, Shailend Chand, Felix Fietkau, John Crispin,
	Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
	AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Andrew Morton, linux-arm-kernel, linux-mediatek, linux-nvme

Change the page_frag_cache allocator to use multipage folios rather than
groups of pages.  This reduces page_frag_free to just a folio_put() or
put_page().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jeroen de Borst <jeroendb@google.com>
cc: Catherine Sullivan <csully@google.com>
cc: Shailend Chand <shailend@google.com>
cc: Felix Fietkau <nbd@nbd.name>
cc: John Crispin <john@phrozen.org>
cc: Sean Wang <sean.wang@mediatek.com>
cc: Mark Lee <Mark-MC.Lee@mediatek.com>
cc: Lorenzo Bianconi <lorenzo@kernel.org>
cc: Matthias Brugger <matthias.bgg@gmail.com>
cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
cc: Keith Busch <kbusch@kernel.org>
cc: Jens Axboe <axboe@fb.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Sagi Grimberg <sagi@grimberg.me>
cc: Chaitanya Kulkarni <kch@nvidia.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
cc: linux-arm-kernel@lists.infradead.org
cc: linux-mediatek@lists.infradead.org
cc: linux-nvme@lists.infradead.org
cc: linux-mm@kvack.org
---
 include/linux/mm_types.h | 13 ++----
 mm/page_frag_alloc.c     | 99 +++++++++++++++++++---------------------
 2 files changed, 52 insertions(+), 60 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..d7c52a5979cc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -420,18 +420,13 @@ static inline void *folio_get_private(struct folio *folio)
 }
 
 struct page_frag_cache {
-	void * va;
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	__u16 offset;
-	__u16 size;
-#else
-	__u32 offset;
-#endif
+	struct folio	*folio;
+	unsigned int	offset;
 	/* we maintain a pagecount bias, so that we dont dirty cache line
 	 * containing page->_refcount every time we allocate a fragment.
 	 */
-	unsigned int		pagecnt_bias;
-	bool pfmemalloc;
+	unsigned int	pagecnt_bias;
+	bool		pfmemalloc;
 };
 
 typedef unsigned long vm_flags_t;
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
index 9d3f6fbd9a07..ffd68bfb677d 100644
--- a/mm/page_frag_alloc.c
+++ b/mm/page_frag_alloc.c
@@ -16,33 +16,34 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 
-static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
-					     gfp_t gfp_mask)
+/*
+ * Allocate a new folio for the frag cache.
+ */
+static struct folio *page_frag_cache_refill(struct page_frag_cache *nc,
+					    gfp_t gfp_mask)
 {
-	struct page *page = NULL;
+	struct folio *folio = NULL;
 	gfp_t gfp = gfp_mask;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
-		    __GFP_NOMEMALLOC;
-	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
-				PAGE_FRAG_CACHE_MAX_ORDER);
-	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
+	gfp_mask |= __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
+	folio = folio_alloc(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER);
 #endif
-	if (unlikely(!page))
-		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+	if (unlikely(!folio))
+		folio = folio_alloc(gfp, 0);
 
-	nc->va = page ? page_address(page) : NULL;
-
-	return page;
+	if (folio)
+		nc->folio = folio;
+	return folio;
 }
 
 void __page_frag_cache_drain(struct page *page, unsigned int count)
 {
-	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+	struct folio *folio = page_folio(page);
+
+	VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
 
-	if (page_ref_sub_and_test(page, count - 1))
-		__free_pages(page, compound_order(page));
+	folio_put_refs(folio, count);
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
@@ -54,11 +55,12 @@ EXPORT_SYMBOL(__page_frag_cache_drain);
  */
 void page_frag_cache_clear(struct page_frag_cache *nc)
 {
-	if (nc->va) {
-		struct page *page = virt_to_head_page(nc->va);
+	struct folio *folio = nc->folio;
 
-		__page_frag_cache_drain(page, nc->pagecnt_bias);
-		nc->va = NULL;
+	if (folio) {
+		VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
+		folio_put_refs(folio, nc->pagecnt_bias);
+		nc->folio = NULL;
 	}
 }
 EXPORT_SYMBOL(page_frag_cache_clear);
@@ -67,56 +69,51 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 			    unsigned int fragsz, gfp_t gfp_mask,
 			    unsigned int align)
 {
-	unsigned int size = PAGE_SIZE;
-	struct page *page;
-	int offset;
+	struct folio *folio = nc->folio;
+	size_t offset;
 
 	WARN_ON_ONCE(!is_power_of_2(align));
 
-	if (unlikely(!nc->va)) {
+	if (unlikely(!folio)) {
 refill:
-		page = __page_frag_cache_refill(nc, gfp_mask);
-		if (!page)
+		folio = page_frag_cache_refill(nc, gfp_mask);
+		if (!folio)
 			return NULL;
 
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-		/* if size can vary use size else just use PAGE_SIZE */
-		size = nc->size;
-#endif
 		/* Even if we own the page, we do not use atomic_set().
 		 * This would break get_page_unless_zero() users.
 		 */
-		page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
+		folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
 
 		/* reset page count bias and offset to start of new frag */
-		nc->pfmemalloc = page_is_pfmemalloc(page);
+		nc->pfmemalloc = folio_is_pfmemalloc(folio);
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		nc->offset = size;
+		nc->offset = folio_size(folio);
 	}
 
-	offset = nc->offset - fragsz;
-	if (unlikely(offset < 0)) {
-		page = virt_to_page(nc->va);
-
-		if (page_ref_count(page) != nc->pagecnt_bias)
+	offset = nc->offset;
+	if (unlikely(fragsz > offset)) {
+		/* Reuse the folio if everyone we gave it to has finished with
+		 * it.
+		 */
+		if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias)) {
+			nc->folio = NULL;
 			goto refill;
+		}
+
 		if (unlikely(nc->pfmemalloc)) {
-			page_ref_sub(page, nc->pagecnt_bias - 1);
-			__free_pages(page, compound_order(page));
+			__folio_put(folio);
+			nc->folio = NULL;
 			goto refill;
 		}
 
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-		/* if size can vary use size else just use PAGE_SIZE */
-		size = nc->size;
-#endif
 		/* OK, page count is 0, we can safely set it */
-		set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+		folio_set_count(folio, PAGE_FRAG_CACHE_MAX_SIZE + 1);
 
 		/* reset page count bias and offset to start of new frag */
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		offset = size - fragsz;
-		if (unlikely(offset < 0)) {
+		offset = folio_size(folio);
+		if (unlikely(fragsz > offset)) {
 			/*
 			 * The caller is trying to allocate a fragment
 			 * with fragsz > PAGE_SIZE but the cache isn't big
@@ -126,15 +123,17 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 			 * it could make memory pressure worse
 			 * so we simply return NULL here.
 			 */
+			nc->offset = offset;
 			return NULL;
 		}
 	}
 
 	nc->pagecnt_bias--;
+	offset -= fragsz;
 	offset &= ~(align - 1);
 	nc->offset = offset;
 
-	return nc->va + offset;
+	return folio_address(folio) + offset;
 }
 EXPORT_SYMBOL(page_frag_alloc_align);
 
@@ -143,8 +142,6 @@ EXPORT_SYMBOL(page_frag_alloc_align);
  */
 void page_frag_free(void *addr)
 {
-	struct page *page = virt_to_head_page(addr);
-
-	__free_pages(page, compound_order(page));
+	folio_put(virt_to_folio(addr));
 }
 EXPORT_SYMBOL(page_frag_free);



 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		nc->offset = size;
+		nc->offset = folio_size(folio);
 	}
 
-	offset = nc->offset - fragsz;
-	if (unlikely(offset < 0)) {
-		page = virt_to_page(nc->va);
-
-		if (page_ref_count(page) != nc->pagecnt_bias)
+	offset = nc->offset;
+	if (unlikely(fragsz > offset)) {
+		/* Reuse the folio if everyone we gave it to has finished with
+		 * it.
+		 */
+		if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias)) {
+			nc->folio = NULL;
 			goto refill;
+		}
+
 		if (unlikely(nc->pfmemalloc)) {
-			page_ref_sub(page, nc->pagecnt_bias - 1);
-			__free_pages(page, compound_order(page));
+			__folio_put(folio);
+			nc->folio = NULL;
 			goto refill;
 		}
 
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-		/* if size can vary use size else just use PAGE_SIZE */
-		size = nc->size;
-#endif
 		/* OK, page count is 0, we can safely set it */
-		set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+		folio_set_count(folio, PAGE_FRAG_CACHE_MAX_SIZE + 1);
 
 		/* reset page count bias and offset to start of new frag */
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		offset = size - fragsz;
-		if (unlikely(offset < 0)) {
+		offset = folio_size(folio);
+		if (unlikely(fragsz > offset)) {
 			/*
 			 * The caller is trying to allocate a fragment
 			 * with fragsz > PAGE_SIZE but the cache isn't big
@@ -126,15 +123,17 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 			 * it could make memory pressure worse
 			 * so we simply return NULL here.
 			 */
+			nc->offset = offset;
 			return NULL;
 		}
 	}
 
 	nc->pagecnt_bias--;
+	offset -= fragsz;
 	offset &= ~(align - 1);
 	nc->offset = offset;
 
-	return nc->va + offset;
+	return folio_address(folio) + offset;
 }
 EXPORT_SYMBOL(page_frag_alloc_align);
 
@@ -143,8 +142,6 @@ EXPORT_SYMBOL(page_frag_alloc_align);
  */
 void page_frag_free(void *addr)
 {
-	struct page *page = virt_to_head_page(addr);
-
-	__free_pages(page, compound_order(page));
+	folio_put(virt_to_folio(addr));
 }
 EXPORT_SYMBOL(page_frag_free);


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH net-next 05/12] mm: Make the page_frag_cache allocator handle __GFP_ZERO itself
  2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
@ 2023-05-24 15:33   ` David Howells
  2023-05-24 15:33   ` David Howells
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Jeroen de Borst,
	Catherine Sullivan, Shailend Chand, Felix Fietkau, John Crispin,
	Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
	AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Andrew Morton, linux-arm-kernel, linux-mediatek, linux-nvme

Make the page_frag_cache allocator handle __GFP_ZERO itself rather than
passing it off to the page allocator.  There may be a mix of callers, some
specifying __GFP_ZERO and some not - and even if all specify __GFP_ZERO, we
might refurbish the page, in which case the returned memory doesn't get
cleared.

This is a potential bug in the nvme over TCP driver.
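
For illustration (not part of the patch; the structure and function names
below are invented), the kind of caller this protects is one that requests
a zeroed fragment and then relies on the unset fields really being zero:

	#include <linux/gfp.h>
	#include <linux/types.h>

	struct example_pdu {
		u8	type;
		u8	flags;
		__le32	length;
	};

	static struct example_pdu *example_alloc_pdu(struct page_frag_cache *cache)
	{
		/* After this patch the clearing is done by the frag
		 * allocator itself, so the PDU is zeroed even when the
		 * fragment comes from a refurbished folio.
		 */
		return page_frag_alloc(cache, sizeof(struct example_pdu),
				       GFP_KERNEL | __GFP_ZERO);
	}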

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jeroen de Borst <jeroendb@google.com>
cc: Catherine Sullivan <csully@google.com>
cc: Shailend Chand <shailend@google.com>
cc: Felix Fietkau <nbd@nbd.name>
cc: John Crispin <john@phrozen.org>
cc: Sean Wang <sean.wang@mediatek.com>
cc: Mark Lee <Mark-MC.Lee@mediatek.com>
cc: Lorenzo Bianconi <lorenzo@kernel.org>
cc: Matthias Brugger <matthias.bgg@gmail.com>
cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
cc: Keith Busch <kbusch@kernel.org>
cc: Jens Axboe <axboe@fb.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Sagi Grimberg <sagi@grimberg.me>
cc: Chaitanya Kulkarni <kch@nvidia.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
cc: linux-arm-kernel@lists.infradead.org
cc: linux-mediatek@lists.infradead.org
cc: linux-nvme@lists.infradead.org
cc: linux-mm@kvack.org
---
 mm/page_frag_alloc.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
index ffd68bfb677d..2b73c7f5d9a9 100644
--- a/mm/page_frag_alloc.c
+++ b/mm/page_frag_alloc.c
@@ -23,7 +23,10 @@ static struct folio *page_frag_cache_refill(struct page_frag_cache *nc,
 					    gfp_t gfp_mask)
 {
 	struct folio *folio = NULL;
-	gfp_t gfp = gfp_mask;
+	gfp_t gfp;
+
+	gfp_mask &= ~__GFP_ZERO;
+	gfp = gfp_mask;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
 	gfp_mask |= __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
@@ -71,6 +74,7 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 {
 	struct folio *folio = nc->folio;
 	size_t offset;
+	void *p;
 
 	WARN_ON_ONCE(!is_power_of_2(align));
 
@@ -133,7 +137,10 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 	offset &= ~(align - 1);
 	nc->offset = offset;
 
-	return folio_address(folio) + offset;
+	p = folio_address(folio) + offset;
+	if (gfp_mask & __GFP_ZERO)
+		return memset(p, 0, fragsz);
+	return p;
 }
 EXPORT_SYMBOL(page_frag_alloc_align);
 


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH net-next 06/12] mm: Make the page_frag_cache allocator use per-cpu
  2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
@ 2023-05-24 15:33   ` David Howells
  2023-05-24 15:33   ` David Howells
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Jeroen de Borst,
	Catherine Sullivan, Shailend Chand, Felix Fietkau, John Crispin,
	Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
	AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Andrew Morton, linux-arm-kernel, linux-mediatek, linux-nvme

Make the page_frag_cache allocator have a separate allocation bucket for
each cpu to avoid racing.  This means that no lock is required, other than
preempt disablement, to allocate from it, though if a softirq wants to
access it, then softirq disablement will need to be added.

Make the NVMe, MediaTek and GVE drivers pass NULL to page_frag_alloc() and
use the default per-cpu allocation buckets rather than defining their own.
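
For illustration (not part of the patch; the function name is invented), a
driver that previously embedded its own page_frag_cache can now simply pass
NULL and share the default per-cpu buckets:

	#include <linux/gfp.h>

	static void *example_alloc_rx_buf(unsigned int buf_size)
	{
		/* NULL selects the default per-cpu fragment cache; the
		 * bucket is picked inside page_frag_alloc_align() with
		 * get_cpu_ptr(), so no driver-side locking is needed.
		 */
		return page_frag_alloc(NULL, buf_size, GFP_ATOMIC);
	}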

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jeroen de Borst <jeroendb@google.com>
cc: Catherine Sullivan <csully@google.com>
cc: Shailend Chand <shailend@google.com>
cc: Felix Fietkau <nbd@nbd.name>
cc: John Crispin <john@phrozen.org>
cc: Sean Wang <sean.wang@mediatek.com>
cc: Mark Lee <Mark-MC.Lee@mediatek.com>
cc: Lorenzo Bianconi <lorenzo@kernel.org>
cc: Matthias Brugger <matthias.bgg@gmail.com>
cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
cc: Keith Busch <kbusch@kernel.org>
cc: Jens Axboe <axboe@fb.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Sagi Grimberg <sagi@grimberg.me>
cc: Chaitanya Kulkarni <kch@nvidia.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
cc: linux-arm-kernel@lists.infradead.org
cc: linux-mediatek@lists.infradead.org
cc: linux-nvme@lists.infradead.org
cc: linux-mm@kvack.org
---
 drivers/net/ethernet/google/gve/gve.h      |   1 -
 drivers/net/ethernet/google/gve/gve_main.c |   9 -
 drivers/net/ethernet/google/gve/gve_rx.c   |   2 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.c |   6 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.h |   2 -
 drivers/nvme/host/tcp.c                    |  13 +-
 drivers/nvme/target/tcp.c                  |  19 +-
 include/linux/gfp.h                        |  19 +-
 mm/page_frag_alloc.c                       | 202 +++++++++++++--------
 net/core/skbuff.c                          |  32 ++--
 10 files changed, 163 insertions(+), 142 deletions(-)

diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index 98eb78d98e9f..87244ab911bd 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -250,7 +250,6 @@ struct gve_rx_ring {
 	struct xdp_rxq_info xdp_rxq;
 	struct xdp_rxq_info xsk_rxq;
 	struct xsk_buff_pool *xsk_pool;
-	struct page_frag_cache page_cache; /* Page cache to allocate XDP frames */
 };
 
 /* A TX desc ring entry */
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 55feab29bed9..9f0fb986d61e 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -1249,14 +1249,6 @@ static void gve_unreg_xdp_info(struct gve_priv *priv)
 	}
 }
 
-static void gve_drain_page_cache(struct gve_priv *priv)
-{
-	int i;
-
-	for (i = 0; i < priv->rx_cfg.num_queues; i++)
-		page_frag_cache_clear(&priv->rx[i].page_cache);
-}
-
 static int gve_open(struct net_device *dev)
 {
 	struct gve_priv *priv = netdev_priv(dev);
@@ -1340,7 +1332,6 @@ static int gve_close(struct net_device *dev)
 	netif_carrier_off(dev);
 	if (gve_get_device_rings_ok(priv)) {
 		gve_turndown(priv);
-		gve_drain_page_cache(priv);
 		err = gve_destroy_rings(priv);
 		if (err)
 			goto err;
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
index d1da7413dc4d..7ae8377c394f 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -634,7 +634,7 @@ static int gve_xdp_redirect(struct net_device *dev, struct gve_rx_ring *rx,
 
 	total_len = headroom + SKB_DATA_ALIGN(len) +
 		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-	frame = page_frag_alloc(&rx->page_cache, total_len, GFP_ATOMIC);
+	frame = page_frag_alloc(NULL, total_len, GFP_ATOMIC);
 	if (!frame) {
 		u64_stats_update_begin(&rx->statss);
 		rx->xdp_alloc_fails++;
diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.c b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
index d90fea2c7d04..859f34447f2f 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.c
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
@@ -143,7 +143,7 @@ mtk_wed_wo_queue_refill(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q,
 		dma_addr_t addr;
 		void *buf;
 
-		buf = page_frag_alloc(&q->cache, q->buf_size, GFP_ATOMIC);
+		buf = page_frag_alloc(NULL, q->buf_size, GFP_ATOMIC);
 		if (!buf)
 			break;
 
@@ -296,8 +296,6 @@ mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 		skb_free_frag(entry->buf);
 		entry->buf = NULL;
 	}
-
-	page_frag_cache_clear(&q->cache);
 }
 
 static void
@@ -311,8 +309,6 @@ mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 
 		skb_free_frag(buf);
 	}
-
-	page_frag_cache_clear(&q->cache);
 }
 
 static void
diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.h b/drivers/net/ethernet/mediatek/mtk_wed_wo.h
index 7a1a2a28f1ac..f69bd83dc486 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.h
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.h
@@ -211,8 +211,6 @@ struct mtk_wed_wo_queue_entry {
 struct mtk_wed_wo_queue {
 	struct mtk_wed_wo_queue_regs regs;
 
-	struct page_frag_cache cache;
-
 	struct mtk_wed_wo_queue_desc *desc;
 	dma_addr_t desc_dma;
 
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index dcc35f6bff8c..145cf6186509 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -147,8 +147,6 @@ struct nvme_tcp_queue {
 	__le32			exp_ddgst;
 	__le32			recv_ddgst;
 
-	struct page_frag_cache	pf_cache;
-
 	void (*state_change)(struct sock *);
 	void (*data_ready)(struct sock *);
 	void (*write_space)(struct sock *);
@@ -482,9 +480,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
 	struct nvme_tcp_queue *queue = &ctrl->queues[queue_idx];
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 
-	req->pdu = page_frag_alloc(&queue->pf_cache,
-		sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
-		GFP_KERNEL | __GFP_ZERO);
+	req->pdu = page_frag_alloc(NULL, sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+				   GFP_KERNEL | __GFP_ZERO);
 	if (!req->pdu)
 		return -ENOMEM;
 
@@ -1303,9 +1300,8 @@ static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
 	struct nvme_tcp_request *async = &ctrl->async_req;
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 
-	async->pdu = page_frag_alloc(&queue->pf_cache,
-		sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
-		GFP_KERNEL | __GFP_ZERO);
+	async->pdu = page_frag_alloc(NULL, sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!async->pdu)
 		return -ENOMEM;
 
@@ -1325,7 +1321,6 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 	if (queue->hdr_digest || queue->data_digest)
 		nvme_tcp_free_crypto(queue);
 
-	page_frag_cache_clear(&queue->pf_cache);
 	noreclaim_flag = memalloc_noreclaim_save();
 	sock_release(queue->sock);
 	memalloc_noreclaim_restore(noreclaim_flag);
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 984e6ce85dcd..cb352f5d2bbf 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -169,8 +169,6 @@ struct nvmet_tcp_queue {
 
 	struct nvmet_tcp_cmd	connect;
 
-	struct page_frag_cache	pf_cache;
-
 	void (*data_ready)(struct sock *);
 	void (*state_change)(struct sock *);
 	void (*write_space)(struct sock *);
@@ -1338,25 +1336,25 @@ static int nvmet_tcp_alloc_cmd(struct nvmet_tcp_queue *queue,
 	c->queue = queue;
 	c->req.port = queue->port->nport;
 
-	c->cmd_pdu = page_frag_alloc(&queue->pf_cache,
-			sizeof(*c->cmd_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	c->cmd_pdu = page_frag_alloc(NULL, sizeof(*c->cmd_pdu) + hdgst,
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!c->cmd_pdu)
 		return -ENOMEM;
 	c->req.cmd = &c->cmd_pdu->cmd;
 
-	c->rsp_pdu = page_frag_alloc(&queue->pf_cache,
-			sizeof(*c->rsp_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	c->rsp_pdu = page_frag_alloc(NULL, sizeof(*c->rsp_pdu) + hdgst,
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!c->rsp_pdu)
 		goto out_free_cmd;
 	c->req.cqe = &c->rsp_pdu->cqe;
 
-	c->data_pdu = page_frag_alloc(&queue->pf_cache,
-			sizeof(*c->data_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	c->data_pdu = page_frag_alloc(NULL, sizeof(*c->data_pdu) + hdgst,
+				      GFP_KERNEL | __GFP_ZERO);
 	if (!c->data_pdu)
 		goto out_free_rsp;
 
-	c->r2t_pdu = page_frag_alloc(&queue->pf_cache,
-			sizeof(*c->r2t_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	c->r2t_pdu = page_frag_alloc(NULL, sizeof(*c->r2t_pdu) + hdgst,
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!c->r2t_pdu)
 		goto out_free_data;
 
@@ -1485,7 +1483,6 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
 	if (queue->hdr_digest || queue->data_digest)
 		nvmet_tcp_free_crypto(queue);
 	ida_free(&nvmet_tcp_queue_ida, queue->idx);
-	page_frag_cache_clear(&queue->pf_cache);
 	kfree(queue);
 }
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fa30100f46ad..baa25a00d9e3 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -304,18 +304,19 @@ extern void free_pages(unsigned long addr, unsigned int order);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
-extern void *page_frag_alloc_align(struct page_frag_cache *nc,
-				   unsigned int fragsz, gfp_t gfp_mask,
-				   unsigned int align);
-
-static inline void *page_frag_alloc(struct page_frag_cache *nc,
-			     unsigned int fragsz, gfp_t gfp_mask)
+extern void *page_frag_alloc_align(struct page_frag_cache __percpu *frag_cache,
+				   size_t fragsz, gfp_t gfp,
+				   unsigned long align);
+extern void *page_frag_memdup(struct page_frag_cache __percpu *frag_cache,
+			      const void *p, size_t fragsz, gfp_t gfp,
+			      unsigned long align);
+
+static inline void *page_frag_alloc(struct page_frag_cache __percpu *frag_cache,
+				    size_t fragsz, gfp_t gfp)
 {
-	return page_frag_alloc_align(nc, fragsz, gfp_mask, 1);
+	return page_frag_alloc_align(frag_cache, fragsz, gfp, 1);
 }
 
-void page_frag_cache_clear(struct page_frag_cache *nc);
-
 extern void page_frag_free(void *addr);
 
 #define __free_page(page) __free_pages((page), 0)
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
index 2b73c7f5d9a9..b035bbb34fac 100644
--- a/mm/page_frag_alloc.c
+++ b/mm/page_frag_alloc.c
@@ -16,28 +16,25 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 
+static DEFINE_PER_CPU(struct page_frag_cache, page_frag_default_allocator);
+
 /*
  * Allocate a new folio for the frag cache.
  */
-static struct folio *page_frag_cache_refill(struct page_frag_cache *nc,
-					    gfp_t gfp_mask)
+static struct folio *page_frag_cache_refill(gfp_t gfp)
 {
-	struct folio *folio = NULL;
-	gfp_t gfp;
+	struct folio *folio;
 
-	gfp_mask &= ~__GFP_ZERO;
-	gfp = gfp_mask;
+	gfp &= ~__GFP_ZERO;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask |= __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
-	folio = folio_alloc(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER);
+	folio = folio_alloc(gfp | __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
+			    PAGE_FRAG_CACHE_MAX_ORDER);
+	if (folio)
+		return folio;
 #endif
-	if (unlikely(!folio))
-		folio = folio_alloc(gfp, 0);
 
-	if (folio)
-		nc->folio = folio;
-	return folio;
+	return folio_alloc(gfp, 0);
 }
 
 void __page_frag_cache_drain(struct page *page, unsigned int count)
@@ -51,63 +48,70 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
 /**
- * page_frag_cache_clear - Clear out a page fragment cache
- * @nc: The cache to clear
+ * page_frag_alloc_align - Allocate some memory for use in zerocopy
+ * @frag_cache: The frag cache to use (or NULL for the default)
+ * @fragsz: The size of the fragment desired
+ * @gfp: Allocation flags under which to make an allocation
+ * @align: The required alignment
+ *
+ * Allocate some memory for use with zerocopy where protocol bits have to be
+ * mixed in with spliced/zerocopied data.  Unlike memory allocated from the
+ * slab, this memory's lifetime is purely dependent on the folio's refcount.
+ *
+ * The way it works is that a folio is allocated and fragments are broken off
+ * sequentially and returned to the caller with a ref until the folio no longer
+ * has enough spare space - at which point the allocator's ref is dropped and a
+ * new folio is allocated.  The folio remains in existence until the last ref
+ * held by, say, an sk_buff is discarded and then the page is returned to the
+ * page allocator.
  *
- * Discard any pages still cached in a page fragment cache.
+ * Returns a pointer to the memory on success and NULL on allocation
+ * failure.
+ *
+ * The allocated memory should be disposed of with folio_put().
  */
-void page_frag_cache_clear(struct page_frag_cache *nc)
-{
-	struct folio *folio = nc->folio;
-
-	if (folio) {
-		VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
-		folio_put_refs(folio, nc->pagecnt_bias);
-		nc->folio = NULL;
-	}
-}
-EXPORT_SYMBOL(page_frag_cache_clear);
-
-void *page_frag_alloc_align(struct page_frag_cache *nc,
-			    unsigned int fragsz, gfp_t gfp_mask,
-			    unsigned int align)
+void *page_frag_alloc_align(struct page_frag_cache __percpu *frag_cache,
+			    size_t fragsz, gfp_t gfp, unsigned long align)
 {
-	struct folio *folio = nc->folio;
+	struct page_frag_cache *nc;
+	struct folio *folio, *spare = NULL;
 	size_t offset;
 	void *p;
 
 	WARN_ON_ONCE(!is_power_of_2(align));
 
-	if (unlikely(!folio)) {
-refill:
-		folio = page_frag_cache_refill(nc, gfp_mask);
-		if (!folio)
-			return NULL;
-
-		/* Even if we own the page, we do not use atomic_set().
-		 * This would break get_page_unless_zero() users.
-		 */
-		folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
+	if (!frag_cache)
+		frag_cache = &page_frag_default_allocator;
+	if (WARN_ON_ONCE(fragsz == 0))
+		fragsz = 1;
 
-		/* reset page count bias and offset to start of new frag */
-		nc->pfmemalloc = folio_is_pfmemalloc(folio);
-		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		nc->offset = folio_size(folio);
+	nc = get_cpu_ptr(frag_cache);
+reload:
+	folio = nc->folio;
+	offset = nc->offset;
+try_again:
+
+	/* Make the allocation if there's sufficient space. */
+	if (fragsz <= offset) {
+		nc->pagecnt_bias--;
+		offset = (offset - fragsz) & ~(align - 1);
+		nc->offset = offset;
+		p = folio_address(folio) + offset;
+		put_cpu_ptr(frag_cache);
+		if (spare)
+			folio_put(spare);
+		if (gfp & __GFP_ZERO)
+			return memset(p, 0, fragsz);
+		return p;
 	}
 
-	offset = nc->offset;
-	if (unlikely(fragsz > offset)) {
-		/* Reuse the folio if everyone we gave it to has finished with
-		 * it.
-		 */
-		if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias)) {
-			nc->folio = NULL;
+	/* Insufficient space - see if we can refurbish the current folio. */
+	if (folio) {
+		if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias))
 			goto refill;
-		}
 
 		if (unlikely(nc->pfmemalloc)) {
 			__folio_put(folio);
-			nc->folio = NULL;
 			goto refill;
 		}
 
@@ -117,30 +121,56 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 		/* reset page count bias and offset to start of new frag */
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
 		offset = folio_size(folio);
-		if (unlikely(fragsz > offset)) {
-			/*
-			 * The caller is trying to allocate a fragment
-			 * with fragsz > PAGE_SIZE but the cache isn't big
-			 * enough to satisfy the request, this may
-			 * happen in low memory conditions.
-			 * We don't release the cache page because
-			 * it could make memory pressure worse
-			 * so we simply return NULL here.
-			 */
-			nc->offset = offset;
+		if (unlikely(fragsz > offset))
+			goto frag_too_big;
+		goto try_again;
+	}
+
+refill:
+	if (!spare) {
+		nc->folio = NULL;
+		put_cpu_ptr(frag_cache);
+
+		spare = page_frag_cache_refill(gfp);
+		if (!spare)
 			return NULL;
-		}
+
+		nc = get_cpu_ptr(frag_cache);
+		/* We may now be on a different cpu and/or someone else may
+		 * have refilled it
+		 */
+		nc->pfmemalloc = folio_is_pfmemalloc(spare);
+		if (nc->folio)
+			goto reload;
 	}
 
-	nc->pagecnt_bias--;
-	offset -= fragsz;
-	offset &= ~(align - 1);
+	nc->folio = spare;
+	folio = spare;
+	spare = NULL;
+
+	/* Even if we own the page, we do not use atomic_set().  This would
+	 * break get_page_unless_zero() users.
+	 */
+	folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
+
+	/* Reset page count bias and offset to start of new frag */
+	nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+	offset = folio_size(folio);
+	goto try_again;
+
+frag_too_big:
+	/*
+	 * The caller is trying to allocate a fragment with fragsz > PAGE_SIZE
+	 * but the cache isn't big enough to satisfy the request, this may
+	 * happen in low memory conditions.  We don't release the cache page
+	 * because it could make memory pressure worse so we simply return NULL
+	 * here.
+	 */
 	nc->offset = offset;
-
-	p = folio_address(folio) + offset;
-	if (gfp_mask & __GFP_ZERO)
-		return memset(p, 0, fragsz);
-	return p;
+	put_cpu_ptr(frag_cache);
+	if (spare)
+		folio_put(spare);
+	return NULL;
 }
 EXPORT_SYMBOL(page_frag_alloc_align);
 
@@ -152,3 +182,25 @@ void page_frag_free(void *addr)
 	folio_put(virt_to_folio(addr));
 }
 EXPORT_SYMBOL(page_frag_free);
+
+/**
+ * page_frag_memdup - Allocate a page fragment and duplicate some data into it
+ * @frag_cache: The frag cache to use (or NULL for the default)
+ * @p: The source data to copy
+ * @fragsz: The amount of memory to copy (maximum 1/2 page).
+ * @gfp: Allocation flags under which to make an allocation
+ * @align_mask: The required alignment
+ */
+void *page_frag_memdup(struct page_frag_cache __percpu *frag_cache,
+		       const void *p, size_t fragsz, gfp_t gfp,
+		       unsigned long align_mask)
+{
+	void *q;
+
+	q = page_frag_alloc_align(frag_cache, fragsz, gfp, align_mask);
+	if (!q)
+		return q;
+
+	return memcpy(q, p, fragsz);
+}
+EXPORT_SYMBOL(page_frag_memdup);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index cc507433b357..225a16f3713f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -263,13 +263,13 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp_mask)
 #endif
 
 struct napi_alloc_cache {
-	struct page_frag_cache page;
 	struct page_frag_1k page_small;
 	unsigned int skb_count;
 	void *skb_cache[NAPI_SKB_CACHE_SIZE];
 };
 
 static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
+static DEFINE_PER_CPU(struct page_frag_cache, napi_frag_cache);
 static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
 
 /* Double check that napi_get_frags() allocates skbs with
@@ -291,11 +291,9 @@ void napi_get_frags_check(struct napi_struct *napi)
 
 void *napi_alloc_frag_align(unsigned int fragsz, unsigned int align)
 {
-	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
-
 	fragsz = SKB_DATA_ALIGN(fragsz);
 
-	return page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align);
+	return page_frag_alloc_align(&napi_frag_cache, fragsz, GFP_ATOMIC, align);
 }
 EXPORT_SYMBOL(napi_alloc_frag_align);
 
@@ -305,15 +303,12 @@ void *netdev_alloc_frag_align(unsigned int fragsz, unsigned int align)
 
 	fragsz = SKB_DATA_ALIGN(fragsz);
 	if (in_hardirq() || irqs_disabled()) {
-		struct page_frag_cache *nc = this_cpu_ptr(&netdev_alloc_cache);
-
-		data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align);
+		data = page_frag_alloc_align(&netdev_alloc_cache,
+					     fragsz, GFP_ATOMIC, align);
 	} else {
-		struct napi_alloc_cache *nc;
-
 		local_bh_disable();
-		nc = this_cpu_ptr(&napi_alloc_cache);
-		data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align);
+		data = page_frag_alloc_align(&napi_frag_cache,
+					     fragsz, GFP_ATOMIC, align);
 		local_bh_enable();
 	}
 	return data;
@@ -691,7 +686,6 @@ EXPORT_SYMBOL(__alloc_skb);
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 				   gfp_t gfp_mask)
 {
-	struct page_frag_cache *nc;
 	struct sk_buff *skb;
 	bool pfmemalloc;
 	void *data;
@@ -716,14 +710,12 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 		gfp_mask |= __GFP_MEMALLOC;
 
 	if (in_hardirq() || irqs_disabled()) {
-		nc = this_cpu_ptr(&netdev_alloc_cache);
-		data = page_frag_alloc(nc, len, gfp_mask);
-		pfmemalloc = nc->pfmemalloc;
+		data = page_frag_alloc(&netdev_alloc_cache, len, gfp_mask);
+		pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
 	} else {
 		local_bh_disable();
-		nc = this_cpu_ptr(&napi_alloc_cache.page);
-		data = page_frag_alloc(nc, len, gfp_mask);
-		pfmemalloc = nc->pfmemalloc;
+		data = page_frag_alloc(&napi_frag_cache, len, gfp_mask);
+		pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
 		local_bh_enable();
 	}
 
@@ -811,8 +803,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	} else {
 		len = SKB_HEAD_ALIGN(len);
 
-		data = page_frag_alloc(&nc->page, len, gfp_mask);
-		pfmemalloc = nc->page.pfmemalloc;
+		data = page_frag_alloc(&napi_frag_cache, len, gfp_mask);
+		pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
 	}
 
 	if (unlikely(!data))
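
Not part of the patch, but as a usage sketch (the caller below is
invented), the new page_frag_memdup() helper is aimed at copying a small
slab-resident header into a fragment whose lifetime is governed by the
folio refcount, so it can be attached to an sk_buff by reference:

	static void *example_copy_header(const void *hdr, size_t hdrlen)
	{
		return page_frag_memdup(NULL, hdr, hdrlen, GFP_KERNEL,
					SMP_CACHE_BYTES);
	}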


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH net-next 06/12] mm: Make the page_frag_cache allocator use per-cpu
@ 2023-05-24 15:33   ` David Howells
  0 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Jeroen de Borst,
	Catherine Sullivan, Shailend Chand, Felix Fietkau, John Crispin,
	Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
	AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Andrew Morton, linux-arm-kernel, linux-mediatek, linux-nvme

Make the page_frag_cache allocator have a separate allocation bucket for
each cpu to avoid racing.  This means that no lock is required, other than
preempt disablement, to allocate from it, though if a softirq wants to
access it, then softirq disablement will need to be added.

Make the NVMe, mediatek and GVE drivers pass in NULL to page_frag_cache()
and use the default allocation buckets rather than defining their own.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jeroen de Borst <jeroendb@google.com>
cc: Catherine Sullivan <csully@google.com>
cc: Shailend Chand <shailend@google.com>
cc: Felix Fietkau <nbd@nbd.name>
cc: John Crispin <john@phrozen.org>
cc: Sean Wang <sean.wang@mediatek.com>
cc: Mark Lee <Mark-MC.Lee@mediatek.com>
cc: Lorenzo Bianconi <lorenzo@kernel.org>
cc: Matthias Brugger <matthias.bgg@gmail.com>
cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
cc: Keith Busch <kbusch@kernel.org>
cc: Jens Axboe <axboe@fb.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Sagi Grimberg <sagi@grimberg.me>
cc: Chaitanya Kulkarni <kch@nvidia.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
cc: linux-arm-kernel@lists.infradead.org
cc: linux-mediatek@lists.infradead.org
cc: linux-nvme@lists.infradead.org
cc: linux-mm@kvack.org
---
 drivers/net/ethernet/google/gve/gve.h      |   1 -
 drivers/net/ethernet/google/gve/gve_main.c |   9 -
 drivers/net/ethernet/google/gve/gve_rx.c   |   2 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.c |   6 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.h |   2 -
 drivers/nvme/host/tcp.c                    |  13 +-
 drivers/nvme/target/tcp.c                  |  19 +-
 include/linux/gfp.h                        |  19 +-
 mm/page_frag_alloc.c                       | 202 +++++++++++++--------
 net/core/skbuff.c                          |  32 ++--
 10 files changed, 163 insertions(+), 142 deletions(-)

diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index 98eb78d98e9f..87244ab911bd 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -250,7 +250,6 @@ struct gve_rx_ring {
 	struct xdp_rxq_info xdp_rxq;
 	struct xdp_rxq_info xsk_rxq;
 	struct xsk_buff_pool *xsk_pool;
-	struct page_frag_cache page_cache; /* Page cache to allocate XDP frames */
 };
 
 /* A TX desc ring entry */
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 55feab29bed9..9f0fb986d61e 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -1249,14 +1249,6 @@ static void gve_unreg_xdp_info(struct gve_priv *priv)
 	}
 }
 
-static void gve_drain_page_cache(struct gve_priv *priv)
-{
-	int i;
-
-	for (i = 0; i < priv->rx_cfg.num_queues; i++)
-		page_frag_cache_clear(&priv->rx[i].page_cache);
-}
-
 static int gve_open(struct net_device *dev)
 {
 	struct gve_priv *priv = netdev_priv(dev);
@@ -1340,7 +1332,6 @@ static int gve_close(struct net_device *dev)
 	netif_carrier_off(dev);
 	if (gve_get_device_rings_ok(priv)) {
 		gve_turndown(priv);
-		gve_drain_page_cache(priv);
 		err = gve_destroy_rings(priv);
 		if (err)
 			goto err;
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
index d1da7413dc4d..7ae8377c394f 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -634,7 +634,7 @@ static int gve_xdp_redirect(struct net_device *dev, struct gve_rx_ring *rx,
 
 	total_len = headroom + SKB_DATA_ALIGN(len) +
 		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-	frame = page_frag_alloc(&rx->page_cache, total_len, GFP_ATOMIC);
+	frame = page_frag_alloc(NULL, total_len, GFP_ATOMIC);
 	if (!frame) {
 		u64_stats_update_begin(&rx->statss);
 		rx->xdp_alloc_fails++;
diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.c b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
index d90fea2c7d04..859f34447f2f 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.c
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
@@ -143,7 +143,7 @@ mtk_wed_wo_queue_refill(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q,
 		dma_addr_t addr;
 		void *buf;
 
-		buf = page_frag_alloc(&q->cache, q->buf_size, GFP_ATOMIC);
+		buf = page_frag_alloc(NULL, q->buf_size, GFP_ATOMIC);
 		if (!buf)
 			break;
 
@@ -296,8 +296,6 @@ mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 		skb_free_frag(entry->buf);
 		entry->buf = NULL;
 	}
-
-	page_frag_cache_clear(&q->cache);
 }
 
 static void
@@ -311,8 +309,6 @@ mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 
 		skb_free_frag(buf);
 	}
-
-	page_frag_cache_clear(&q->cache);
 }
 
 static void
diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.h b/drivers/net/ethernet/mediatek/mtk_wed_wo.h
index 7a1a2a28f1ac..f69bd83dc486 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.h
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.h
@@ -211,8 +211,6 @@ struct mtk_wed_wo_queue_entry {
 struct mtk_wed_wo_queue {
 	struct mtk_wed_wo_queue_regs regs;
 
-	struct page_frag_cache cache;
-
 	struct mtk_wed_wo_queue_desc *desc;
 	dma_addr_t desc_dma;
 
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index dcc35f6bff8c..145cf6186509 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -147,8 +147,6 @@ struct nvme_tcp_queue {
 	__le32			exp_ddgst;
 	__le32			recv_ddgst;
 
-	struct page_frag_cache	pf_cache;
-
 	void (*state_change)(struct sock *);
 	void (*data_ready)(struct sock *);
 	void (*write_space)(struct sock *);
@@ -482,9 +480,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
 	struct nvme_tcp_queue *queue = &ctrl->queues[queue_idx];
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 
-	req->pdu = page_frag_alloc(&queue->pf_cache,
-		sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
-		GFP_KERNEL | __GFP_ZERO);
+	req->pdu = page_frag_alloc(NULL, sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+				   GFP_KERNEL | __GFP_ZERO);
 	if (!req->pdu)
 		return -ENOMEM;
 
@@ -1303,9 +1300,8 @@ static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
 	struct nvme_tcp_request *async = &ctrl->async_req;
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 
-	async->pdu = page_frag_alloc(&queue->pf_cache,
-		sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
-		GFP_KERNEL | __GFP_ZERO);
+	async->pdu = page_frag_alloc(NULL, sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!async->pdu)
 		return -ENOMEM;
 
@@ -1325,7 +1321,6 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 	if (queue->hdr_digest || queue->data_digest)
 		nvme_tcp_free_crypto(queue);
 
-	page_frag_cache_clear(&queue->pf_cache);
 	noreclaim_flag = memalloc_noreclaim_save();
 	sock_release(queue->sock);
 	memalloc_noreclaim_restore(noreclaim_flag);
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 984e6ce85dcd..cb352f5d2bbf 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -169,8 +169,6 @@ struct nvmet_tcp_queue {
 
 	struct nvmet_tcp_cmd	connect;
 
-	struct page_frag_cache	pf_cache;
-
 	void (*data_ready)(struct sock *);
 	void (*state_change)(struct sock *);
 	void (*write_space)(struct sock *);
@@ -1338,25 +1336,25 @@ static int nvmet_tcp_alloc_cmd(struct nvmet_tcp_queue *queue,
 	c->queue = queue;
 	c->req.port = queue->port->nport;
 
-	c->cmd_pdu = page_frag_alloc(&queue->pf_cache,
-			sizeof(*c->cmd_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	c->cmd_pdu = page_frag_alloc(NULL, sizeof(*c->cmd_pdu) + hdgst,
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!c->cmd_pdu)
 		return -ENOMEM;
 	c->req.cmd = &c->cmd_pdu->cmd;
 
-	c->rsp_pdu = page_frag_alloc(&queue->pf_cache,
-			sizeof(*c->rsp_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	c->rsp_pdu = page_frag_alloc(NULL, sizeof(*c->rsp_pdu) + hdgst,
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!c->rsp_pdu)
 		goto out_free_cmd;
 	c->req.cqe = &c->rsp_pdu->cqe;
 
-	c->data_pdu = page_frag_alloc(&queue->pf_cache,
-			sizeof(*c->data_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	c->data_pdu = page_frag_alloc(NULL, sizeof(*c->data_pdu) + hdgst,
+				      GFP_KERNEL | __GFP_ZERO);
 	if (!c->data_pdu)
 		goto out_free_rsp;
 
-	c->r2t_pdu = page_frag_alloc(&queue->pf_cache,
-			sizeof(*c->r2t_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	c->r2t_pdu = page_frag_alloc(NULL, sizeof(*c->r2t_pdu) + hdgst,
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!c->r2t_pdu)
 		goto out_free_data;
 
@@ -1485,7 +1483,6 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
 	if (queue->hdr_digest || queue->data_digest)
 		nvmet_tcp_free_crypto(queue);
 	ida_free(&nvmet_tcp_queue_ida, queue->idx);
-	page_frag_cache_clear(&queue->pf_cache);
 	kfree(queue);
 }
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fa30100f46ad..baa25a00d9e3 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -304,18 +304,19 @@ extern void free_pages(unsigned long addr, unsigned int order);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
-extern void *page_frag_alloc_align(struct page_frag_cache *nc,
-				   unsigned int fragsz, gfp_t gfp_mask,
-				   unsigned int align);
-
-static inline void *page_frag_alloc(struct page_frag_cache *nc,
-			     unsigned int fragsz, gfp_t gfp_mask)
+extern void *page_frag_alloc_align(struct page_frag_cache __percpu *frag_cache,
+				   size_t fragsz, gfp_t gfp,
+				   unsigned long align);
+extern void *page_frag_memdup(struct page_frag_cache __percpu *frag_cache,
+			      const void *p, size_t fragsz, gfp_t gfp,
+			      unsigned long align);
+
+static inline void *page_frag_alloc(struct page_frag_cache __percpu *frag_cache,
+				    size_t fragsz, gfp_t gfp)
 {
-	return page_frag_alloc_align(nc, fragsz, gfp_mask, 1);
+	return page_frag_alloc_align(frag_cache, fragsz, gfp, 1);
 }
 
-void page_frag_cache_clear(struct page_frag_cache *nc);
-
 extern void page_frag_free(void *addr);
 
 #define __free_page(page) __free_pages((page), 0)
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
index 2b73c7f5d9a9..b035bbb34fac 100644
--- a/mm/page_frag_alloc.c
+++ b/mm/page_frag_alloc.c
@@ -16,28 +16,25 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 
+static DEFINE_PER_CPU(struct page_frag_cache, page_frag_default_allocator);
+
 /*
  * Allocate a new folio for the frag cache.
  */
-static struct folio *page_frag_cache_refill(struct page_frag_cache *nc,
-					    gfp_t gfp_mask)
+static struct folio *page_frag_cache_refill(gfp_t gfp)
 {
-	struct folio *folio = NULL;
-	gfp_t gfp;
+	struct folio *folio;
 
-	gfp_mask &= ~__GFP_ZERO;
-	gfp = gfp_mask;
+	gfp &= ~__GFP_ZERO;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask |= __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
-	folio = folio_alloc(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER);
+	folio = folio_alloc(gfp | __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
+			    PAGE_FRAG_CACHE_MAX_ORDER);
+	if (folio)
+		return folio;
 #endif
-	if (unlikely(!folio))
-		folio = folio_alloc(gfp, 0);
 
-	if (folio)
-		nc->folio = folio;
-	return folio;
+	return folio_alloc(gfp, 0);
 }
 
 void __page_frag_cache_drain(struct page *page, unsigned int count)
@@ -51,63 +48,70 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
 /**
- * page_frag_cache_clear - Clear out a page fragment cache
- * @nc: The cache to clear
+ * page_frag_alloc_align - Allocate some memory for use in zerocopy
+ * @frag_cache: The frag cache to use (or NULL for the default)
+ * @fragsz: The size of the fragment desired
+ * @gfp: Allocation flags under which to make an allocation
+ * @align: The required alignment
+ *
+ * Allocate some memory for use with zerocopy where protocol bits have to be
+ * mixed in with spliced/zerocopied data.  Unlike memory allocated from the
+ * slab, this memory's lifetime is purely dependent on the folio's refcount.
+ *
+ * The way it works is that a folio is allocated and fragments are broken off
+ * sequentially and returned to the caller with a ref until the folio no longer
+ * has enough spare space - at which point the allocator's ref is dropped and a
+ * new folio is allocated.  The folio remains in existence until the last ref
+ * held by, say, an sk_buff is discarded and then the page is returned to the
+ * page allocator.
  *
- * Discard any pages still cached in a page fragment cache.
+ * Returns a pointer to the memory on success and -ENOMEM on allocation
+ * failure.
+ *
+ * The allocated memory should be disposed of with folio_put().
  */
-void page_frag_cache_clear(struct page_frag_cache *nc)
-{
-	struct folio *folio = nc->folio;
-
-	if (folio) {
-		VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
-		folio_put_refs(folio, nc->pagecnt_bias);
-		nc->folio = NULL;
-	}
-}
-EXPORT_SYMBOL(page_frag_cache_clear);
-
-void *page_frag_alloc_align(struct page_frag_cache *nc,
-			    unsigned int fragsz, gfp_t gfp_mask,
-			    unsigned int align)
+void *page_frag_alloc_align(struct page_frag_cache __percpu *frag_cache,
+			    size_t fragsz, gfp_t gfp, unsigned long align)
 {
-	struct folio *folio = nc->folio;
+	struct page_frag_cache *nc;
+	struct folio *folio, *spare = NULL;
 	size_t offset;
 	void *p;
 
 	WARN_ON_ONCE(!is_power_of_2(align));
 
-	if (unlikely(!folio)) {
-refill:
-		folio = page_frag_cache_refill(nc, gfp_mask);
-		if (!folio)
-			return NULL;
-
-		/* Even if we own the page, we do not use atomic_set().
-		 * This would break get_page_unless_zero() users.
-		 */
-		folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
+	if (!frag_cache)
+		frag_cache = &page_frag_default_allocator;
+	if (WARN_ON_ONCE(fragsz == 0))
+		fragsz = 1;
 
-		/* reset page count bias and offset to start of new frag */
-		nc->pfmemalloc = folio_is_pfmemalloc(folio);
-		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		nc->offset = folio_size(folio);
+	nc = get_cpu_ptr(frag_cache);
+reload:
+	folio = nc->folio;
+	offset = nc->offset;
+try_again:
+
+	/* Make the allocation if there's sufficient space. */
+	if (fragsz <= offset) {
+		nc->pagecnt_bias--;
+		offset = (offset - fragsz) & ~(align - 1);
+		nc->offset = offset;
+		p = folio_address(folio) + offset;
+		put_cpu_ptr(frag_cache);
+		if (spare)
+			folio_put(spare);
+		if (gfp & __GFP_ZERO)
+			return memset(p, 0, fragsz);
+		return p;
 	}
 
-	offset = nc->offset;
-	if (unlikely(fragsz > offset)) {
-		/* Reuse the folio if everyone we gave it to has finished with
-		 * it.
-		 */
-		if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias)) {
-			nc->folio = NULL;
+	/* Insufficient space - see if we can refurbish the current folio. */
+	if (folio) {
+		if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias))
 			goto refill;
-		}
 
 		if (unlikely(nc->pfmemalloc)) {
 			__folio_put(folio);
-			nc->folio = NULL;
 			goto refill;
 		}
 
@@ -117,30 +121,56 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 		/* reset page count bias and offset to start of new frag */
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
 		offset = folio_size(folio);
-		if (unlikely(fragsz > offset)) {
-			/*
-			 * The caller is trying to allocate a fragment
-			 * with fragsz > PAGE_SIZE but the cache isn't big
-			 * enough to satisfy the request, this may
-			 * happen in low memory conditions.
-			 * We don't release the cache page because
-			 * it could make memory pressure worse
-			 * so we simply return NULL here.
-			 */
-			nc->offset = offset;
+		if (unlikely(fragsz > offset))
+			goto frag_too_big;
+		goto try_again;
+	}
+
+refill:
+	if (!spare) {
+		nc->folio = NULL;
+		put_cpu_ptr(frag_cache);
+
+		spare = page_frag_cache_refill(gfp);
+		if (!spare)
 			return NULL;
-		}
+
+		nc = get_cpu_ptr(frag_cache);
+		/* We may now be on a different cpu and/or someone else may
+		 * have refilled it
+		 */
+		nc->pfmemalloc = folio_is_pfmemalloc(spare);
+		if (nc->folio)
+			goto reload;
 	}
 
-	nc->pagecnt_bias--;
-	offset -= fragsz;
-	offset &= ~(align - 1);
+	nc->folio = spare;
+	folio = spare;
+	spare = NULL;
+
+	/* Even if we own the page, we do not use atomic_set().  This would
+	 * break get_page_unless_zero() users.
+	 */
+	folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
+
+	/* Reset page count bias and offset to start of new frag */
+	nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+	offset = folio_size(folio);
+	goto try_again;
+
+frag_too_big:
+	/*
+	 * The caller is trying to allocate a fragment with fragsz > PAGE_SIZE
+	 * but the cache isn't big enough to satisfy the request, this may
+	 * happen in low memory conditions.  We don't release the cache page
+	 * because it could make memory pressure worse so we simply return NULL
+	 * here.
+	 */
 	nc->offset = offset;
-
-	p = folio_address(folio) + offset;
-	if (gfp_mask & __GFP_ZERO)
-		return memset(p, 0, fragsz);
-	return p;
+	put_cpu_ptr(frag_cache);
+	if (spare)
+		folio_put(spare);
+	return NULL;
 }
 EXPORT_SYMBOL(page_frag_alloc_align);
 
@@ -152,3 +182,25 @@ void page_frag_free(void *addr)
 	folio_put(virt_to_folio(addr));
 }
 EXPORT_SYMBOL(page_frag_free);
+
+/**
+ * page_frag_memdup - Allocate a page fragment and duplicate some data into it
+ * @frag_cache: The frag cache to use (or NULL for the default)
+ * @fragsz: The amount of memory to copy (maximum 1/2 page).
+ * @p: The source data to copy
+ * @gfp: Allocation flags under which to make an allocation
+ * @align_mask: The required alignment
+ */
+void *page_frag_memdup(struct page_frag_cache __percpu *frag_cache,
+		       const void *p, size_t fragsz, gfp_t gfp,
+		       unsigned long align_mask)
+{
+	void *q;
+
+	q = page_frag_alloc_align(frag_cache, fragsz, gfp, align_mask);
+	if (!q)
+		return q;
+
+	return memcpy(q, p, fragsz);
+}
+EXPORT_SYMBOL(page_frag_memdup);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index cc507433b357..225a16f3713f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -263,13 +263,13 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp_mask)
 #endif
 
 struct napi_alloc_cache {
-	struct page_frag_cache page;
 	struct page_frag_1k page_small;
 	unsigned int skb_count;
 	void *skb_cache[NAPI_SKB_CACHE_SIZE];
 };
 
 static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
+static DEFINE_PER_CPU(struct page_frag_cache, napi_frag_cache);
 static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
 
 /* Double check that napi_get_frags() allocates skbs with
@@ -291,11 +291,9 @@ void napi_get_frags_check(struct napi_struct *napi)
 
 void *napi_alloc_frag_align(unsigned int fragsz, unsigned int align)
 {
-	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
-
 	fragsz = SKB_DATA_ALIGN(fragsz);
 
-	return page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align);
+	return page_frag_alloc_align(&napi_frag_cache, fragsz, GFP_ATOMIC, align);
 }
 EXPORT_SYMBOL(napi_alloc_frag_align);
 
@@ -305,15 +303,12 @@ void *netdev_alloc_frag_align(unsigned int fragsz, unsigned int align)
 
 	fragsz = SKB_DATA_ALIGN(fragsz);
 	if (in_hardirq() || irqs_disabled()) {
-		struct page_frag_cache *nc = this_cpu_ptr(&netdev_alloc_cache);
-
-		data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align);
+		data = page_frag_alloc_align(&netdev_alloc_cache,
+					     fragsz, GFP_ATOMIC, align);
 	} else {
-		struct napi_alloc_cache *nc;
-
 		local_bh_disable();
-		nc = this_cpu_ptr(&napi_alloc_cache);
-		data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align);
+		data = page_frag_alloc_align(&napi_frag_cache,
+					     fragsz, GFP_ATOMIC, align);
 		local_bh_enable();
 	}
 	return data;
@@ -691,7 +686,6 @@ EXPORT_SYMBOL(__alloc_skb);
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 				   gfp_t gfp_mask)
 {
-	struct page_frag_cache *nc;
 	struct sk_buff *skb;
 	bool pfmemalloc;
 	void *data;
@@ -716,14 +710,12 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 		gfp_mask |= __GFP_MEMALLOC;
 
 	if (in_hardirq() || irqs_disabled()) {
-		nc = this_cpu_ptr(&netdev_alloc_cache);
-		data = page_frag_alloc(nc, len, gfp_mask);
-		pfmemalloc = nc->pfmemalloc;
+		data = page_frag_alloc(&netdev_alloc_cache, len, gfp_mask);
+		pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
 	} else {
 		local_bh_disable();
-		nc = this_cpu_ptr(&napi_alloc_cache.page);
-		data = page_frag_alloc(nc, len, gfp_mask);
-		pfmemalloc = nc->pfmemalloc;
+		data = page_frag_alloc(&napi_frag_cache, len, gfp_mask);
+		pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
 		local_bh_enable();
 	}
 
@@ -811,8 +803,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	} else {
 		len = SKB_HEAD_ALIGN(len);
 
-		data = page_frag_alloc(&nc->page, len, gfp_mask);
-		pfmemalloc = nc->page.pfmemalloc;
+		data = page_frag_alloc(&napi_frag_cache, len, gfp_mask);
+		pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
 	}
 
 	if (unlikely(!data))


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH net-next 07/12] net: Clean up users of netdev_alloc_cache and napi_frag_cache
  2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
                   ` (5 preceding siblings ...)
  2023-05-24 15:33   ` David Howells
@ 2023-05-24 15:33 ` David Howells
  2023-05-24 15:33 ` [PATCH net-next 08/12] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel

The users of netdev_alloc_cache and napi_frag_cache no longer need to
disable bottom halves around access to these fragment caches as the
per-CPU handling is now done inside page_frag_alloc_align().
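
As a rough sketch of the idea only (the inner helper name below is
hypothetical, not the real implementation), the allocator now selects this
CPU's bucket itself, so the callers below no longer wrap the call in
local_bh_disable()/local_bh_enable():

	void *page_frag_alloc_align(struct page_frag_cache __percpu *frag_cache,
				    size_t fragsz, gfp_t gfp, unsigned int align)
	{
		/* get_cpu_ptr() pins us to this CPU's bucket (preemption off) */
		struct page_frag_cache *nc = get_cpu_ptr(frag_cache);
		void *p;

		p = __page_frag_cache_alloc(nc, fragsz, gfp, align); /* hypothetical */
		put_cpu_ptr(frag_cache);
		return p;
	}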

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-mm@kvack.org
---
 include/linux/skbuff.h |  3 ++-
 net/core/skbuff.c      | 29 +++++++++--------------------
 2 files changed, 11 insertions(+), 21 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 41b63e72c6c3..e11a765fe7fa 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -252,7 +252,8 @@
 /* Maximum value in skb->csum_level */
 #define SKB_MAX_CSUM_LEVEL	3
 
-#define SKB_DATA_ALIGN(X)	ALIGN(X, SMP_CACHE_BYTES)
+#define SKB_DATA_ALIGNMENT	SMP_CACHE_BYTES
+#define SKB_DATA_ALIGN(X)	ALIGN(X, SKB_DATA_ALIGNMENT)
 #define SKB_WITH_OVERHEAD(X)	\
 	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 225a16f3713f..c2840b0dcad9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -291,27 +291,20 @@ void napi_get_frags_check(struct napi_struct *napi)
 
 void *napi_alloc_frag_align(unsigned int fragsz, unsigned int align)
 {
-	fragsz = SKB_DATA_ALIGN(fragsz);
-
+	align = min_t(unsigned int, align, SKB_DATA_ALIGNMENT);
 	return page_frag_alloc_align(&napi_frag_cache, fragsz, GFP_ATOMIC, align);
 }
 EXPORT_SYMBOL(napi_alloc_frag_align);
 
 void *netdev_alloc_frag_align(unsigned int fragsz, unsigned int align)
 {
-	void *data;
-
-	fragsz = SKB_DATA_ALIGN(fragsz);
-	if (in_hardirq() || irqs_disabled()) {
-		data = page_frag_alloc_align(&netdev_alloc_cache,
+	align = min_t(unsigned int, align, SKB_DATA_ALIGNMENT);
+	if (in_hardirq() || irqs_disabled())
+		return page_frag_alloc_align(&netdev_alloc_cache,
 					     fragsz, GFP_ATOMIC, align);
-	} else {
-		local_bh_disable();
-		data = page_frag_alloc_align(&napi_frag_cache,
+	else
+		return page_frag_alloc_align(&napi_frag_cache,
 					     fragsz, GFP_ATOMIC, align);
-		local_bh_enable();
-	}
-	return data;
 }
 EXPORT_SYMBOL(netdev_alloc_frag_align);
 
@@ -709,15 +702,11 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
-	if (in_hardirq() || irqs_disabled()) {
+	if (in_hardirq() || irqs_disabled())
 		data = page_frag_alloc(&netdev_alloc_cache, len, gfp_mask);
-		pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
-	} else {
-		local_bh_disable();
+	else
 		data = page_frag_alloc(&napi_frag_cache, len, gfp_mask);
-		pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
-		local_bh_enable();
-	}
+	pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
 
 	if (unlikely(!data))
 		return NULL;


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH net-next 08/12] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)
  2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
                   ` (6 preceding siblings ...)
  2023-05-24 15:33 ` [PATCH net-next 07/12] net: Clean up users of netdev_alloc_cache and napi_frag_cache David Howells
@ 2023-05-24 15:33 ` David Howells
  2023-05-24 15:33 ` [PATCH net-next 09/12] tls/sw: Support MSG_SPLICE_PAGES David Howells
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel

If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
some data that's resident in the slab, copy/coalesce it rather than
returning EIO.
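
For illustration only (not part of the patch; all of the names here are made
up), this is the sort of mixed send that now works instead of failing:

	/* A kmalloc()'d header plus page-based payload in one sendmsg(). */
	static int xmit_hdr_and_page(struct socket *sock, void *hdr, size_t hdr_len,
				     struct page *payload, size_t len, size_t off)
	{
		struct bio_vec bv[2];
		struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };

		bvec_set_virt(&bv[0], hdr, hdr_len);	  /* slab-backed: gets copied */
		bvec_set_page(&bv[1], payload, len, off); /* spliced by reference */
		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, 2, hdr_len + len);
		return sock_sendmsg(sock, &msg);
	}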

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: David Ahern <dsahern@kernel.org>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h |  3 +++
 net/core/skbuff.c      | 33 ++++++++++++++++++++++++++++++---
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e11a765fe7fa..11d98990f5f1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -5084,6 +5084,9 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
 #endif
 }
 
+ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
+			     ssize_t maxsize, gfp_t gfp);
+
 ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
 			     ssize_t maxsize, gfp_t gfp);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c2840b0dcad9..a16499b9942b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6927,17 +6927,44 @@ ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
 			break;
 		}
 
+		if (space == 0 &&
+		    !skb_can_coalesce(skb, skb_shinfo(skb)->nr_frags,
+				      pages[0], off)) {
+			iov_iter_revert(iter, len);
+			break;
+		}
+
 		i = 0;
 		do {
 			struct page *page = pages[i++];
 			size_t part = min_t(size_t, PAGE_SIZE - off, len);
-
-			ret = -EIO;
-			if (WARN_ON_ONCE(!sendpage_ok(page)))
+			bool put = false;
+
+			if (PageSlab(page)) {
+				const void *p;
+				void *q;
+
+				p = kmap_local_page(page);
+				q = page_frag_memdup(NULL, p + off, part, gfp,
+						     ULONG_MAX);
+				kunmap_local(p);
+				if (!q) {
+					iov_iter_revert(iter, len);
+					ret = -ENOMEM;
+					goto out;
+				}
+				page = virt_to_page(q);
+				off = offset_in_page(q);
+				put = true;
+			} else if (WARN_ON_ONCE(!sendpage_ok(page))) {
+				ret = -EIO;
 				goto out;
+			}
 
 			ret = skb_append_pagefrags(skb, page, off, part,
 						   frag_limit);
+			if (put)
+				put_page(page);
 			if (ret < 0) {
 				iov_iter_revert(iter, len);
 				goto out;


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH net-next 09/12] tls/sw: Support MSG_SPLICE_PAGES
  2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
                   ` (7 preceding siblings ...)
  2023-05-24 15:33 ` [PATCH net-next 08/12] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES) David Howells
@ 2023-05-24 15:33 ` David Howells
  2023-05-27  1:08   ` Jakub Kicinski
  2023-05-30 22:26   ` Bug in short splice to socket? David Howells
  2023-05-24 15:33 ` [PATCH net-next 10/12] tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES David Howells
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend

Make TLS's sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
spliced from the source iterator if possible and the data to be copied if
not.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
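
As a minimal sketch of the kind of in-kernel call this enables (the function
and variable names are made up), a network filesystem could now do:

	static int tls_send_folio(struct socket *sock, struct folio *folio,
				  size_t len, size_t offset)
	{
		struct bio_vec bv;
		struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };

		bvec_set_folio(&bv, folio, len, offset);
		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, len);
		return sock_sendmsg(sock, &msg); /* ends up in tls_sw_sendmsg() */
	}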

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/tls/tls_sw.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 635b8bf6b937..0ccef8aa9951 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -929,6 +929,49 @@ static int tls_sw_push_pending_record(struct sock *sk, int flags)
 				   &copied, flags);
 }
 
+static int rls_sw_sendmsg_splice(struct sock *sk, struct msghdr *msg,
+				 struct sk_msg *msg_pl, size_t try_to_copy,
+				 ssize_t *copied)
+{
+	struct page *page = NULL, **pages = &page;
+
+	do {
+		ssize_t part;
+		size_t off;
+		bool put = false;
+
+		part = iov_iter_extract_pages(&msg->msg_iter, &pages,
+					      try_to_copy, 1, 0, &off);
+		if (part <= 0)
+			return part ?: -EIO;
+
+		if (!sendpage_ok(page)) {
+			const void *p = kmap_local_page(page);
+			void *q;
+
+			q = page_frag_memdup(NULL, p + off, part,
+					     sk->sk_allocation, ULONG_MAX);
+			kunmap_local(p);
+			if (!q) {
+				iov_iter_revert(&msg->msg_iter, part);
+				return -ENOMEM;
+			}
+			page = virt_to_page(q);
+			off = offset_in_page(q);
+			put = true;
+		}
+
+		sk_msg_page_add(msg_pl, page, part, off);
+		sk_mem_charge(sk, part);
+		if (put)
+			put_page(page);
+		*copied += part;
+		try_to_copy -= part;
+	} while (try_to_copy && !sk_msg_full(msg_pl));
+
+	return 0;
+}
+
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 {
 	long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
@@ -1018,6 +1061,17 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 			full_record = true;
 		}
 
+		if (try_to_copy && (msg->msg_flags & MSG_SPLICE_PAGES)) {
+			ret = rls_sw_sendmsg_splice(sk, msg, msg_pl,
+						    try_to_copy, &copied);
+			if (ret < 0)
+				goto send_end;
+			tls_ctx->pending_open_record_frags = true;
+			if (full_record || eor || sk_msg_full(msg_pl))
+				goto copied;
+			continue;
+		}
+
 		if (!is_kvec && (full_record || eor) && !async_capable) {
 			u32 first = msg_pl->sg.end;
 
@@ -1080,8 +1134,9 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 		/* Open records defined only if successfully copied, otherwise
 		 * we would trim the sg but not reset the open record frags.
 		 */
-		tls_ctx->pending_open_record_frags = true;
 		copied += try_to_copy;
+copied:
+		tls_ctx->pending_open_record_frags = true;
 		if (full_record || eor) {
 			ret = bpf_exec_tx_verdict(msg_pl, sk, full_record,
 						  record_type, &copied,


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH net-next 10/12] tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES
  2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
                   ` (8 preceding siblings ...)
  2023-05-24 15:33 ` [PATCH net-next 09/12] tls/sw: Support MSG_SPLICE_PAGES David Howells
@ 2023-05-24 15:33 ` David Howells
  2023-05-27  1:13   ` Jakub Kicinski
  2023-05-24 15:33 ` [PATCH net-next 11/12] tls/device: Support MSG_SPLICE_PAGES David Howells
  2023-05-24 15:33 ` [PATCH net-next 12/12] tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES David Howells
  11 siblings, 1 reply; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend

Convert tls_sw_sendpage() and tls_sw_sendpage_locked() to use sendmsg()
with MSG_SPLICE_PAGES rather than directly splicing in the pages itself.

[!] Note that tls_sw_sendpage_locked() appears to have the wrong locking
    upstream.  I think the caller will only hold the socket lock, but it
    should hold tls_ctx->tx_lock too.
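
    As a sketch of the locking that note expects a caller to arrange (not
    what current callers actually do), i.e.:

	mutex_lock(&tls_ctx->tx_lock);
	lock_sock(sk);
	ret = tls_sw_sendpage_locked(sk, page, offset, size, flags);
	release_sock(sk);
	mutex_unlock(&tls_ctx->tx_lock);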

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/tls/tls_sw.c | 164 +++++++++--------------------------------------
 1 file changed, 30 insertions(+), 134 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 0ccef8aa9951..1a5926cc3e84 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -972,7 +972,7 @@ static int rls_sw_sendmsg_splice(struct sock *sk, struct msghdr *msg,
 	return 0;
 }
 
-int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 {
 	long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
@@ -995,15 +995,6 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	int ret = 0;
 	int pending;
 
-	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
-			       MSG_CMSG_COMPAT))
-		return -EOPNOTSUPP;
-
-	ret = mutex_lock_interruptible(&tls_ctx->tx_lock);
-	if (ret)
-		return ret;
-	lock_sock(sk);
-
 	if (unlikely(msg->msg_controllen)) {
 		ret = tls_process_cmsg(sk, msg, &record_type);
 		if (ret) {
@@ -1204,157 +1195,62 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 
 send_end:
 	ret = sk_stream_error(sk, msg->msg_flags, ret);
-
-	release_sock(sk);
-	mutex_unlock(&tls_ctx->tx_lock);
 	return copied > 0 ? copied : ret;
 }
 
-static int tls_sw_do_sendpage(struct sock *sk, struct page *page,
-			      int offset, size_t size, int flags)
+int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 {
-	long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
-	struct tls_prot_info *prot = &tls_ctx->prot_info;
-	unsigned char record_type = TLS_RECORD_TYPE_DATA;
-	struct sk_msg *msg_pl;
-	struct tls_rec *rec;
-	int num_async = 0;
-	ssize_t copied = 0;
-	bool full_record;
-	int record_room;
-	int ret = 0;
-	bool eor;
-
-	eor = !(flags & MSG_SENDPAGE_NOTLAST);
-	sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
-
-	/* Call the sk_stream functions to manage the sndbuf mem. */
-	while (size > 0) {
-		size_t copy, required_size;
-
-		if (sk->sk_err) {
-			ret = -sk->sk_err;
-			goto sendpage_end;
-		}
-
-		if (ctx->open_rec)
-			rec = ctx->open_rec;
-		else
-			rec = ctx->open_rec = tls_get_rec(sk);
-		if (!rec) {
-			ret = -ENOMEM;
-			goto sendpage_end;
-		}
-
-		msg_pl = &rec->msg_plaintext;
-
-		full_record = false;
-		record_room = TLS_MAX_PAYLOAD_SIZE - msg_pl->sg.size;
-		copy = size;
-		if (copy >= record_room) {
-			copy = record_room;
-			full_record = true;
-		}
-
-		required_size = msg_pl->sg.size + copy + prot->overhead_size;
-
-		if (!sk_stream_memory_free(sk))
-			goto wait_for_sndbuf;
-alloc_payload:
-		ret = tls_alloc_encrypted_msg(sk, required_size);
-		if (ret) {
-			if (ret != -ENOSPC)
-				goto wait_for_memory;
-
-			/* Adjust copy according to the amount that was
-			 * actually allocated. The difference is due
-			 * to max sg elements limit
-			 */
-			copy -= required_size - msg_pl->sg.size;
-			full_record = true;
-		}
-
-		sk_msg_page_add(msg_pl, page, copy, offset);
-		sk_mem_charge(sk, copy);
-
-		offset += copy;
-		size -= copy;
-		copied += copy;
-
-		tls_ctx->pending_open_record_frags = true;
-		if (full_record || eor || sk_msg_full(msg_pl)) {
-			ret = bpf_exec_tx_verdict(msg_pl, sk, full_record,
-						  record_type, &copied, flags);
-			if (ret) {
-				if (ret == -EINPROGRESS)
-					num_async++;
-				else if (ret == -ENOMEM)
-					goto wait_for_memory;
-				else if (ret != -EAGAIN) {
-					if (ret == -ENOSPC)
-						ret = 0;
-					goto sendpage_end;
-				}
-			}
-		}
-		continue;
-wait_for_sndbuf:
-		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-wait_for_memory:
-		ret = sk_stream_wait_memory(sk, &timeo);
-		if (ret) {
-			if (ctx->open_rec)
-				tls_trim_both_msgs(sk, msg_pl->sg.size);
-			goto sendpage_end;
-		}
+	int ret;
 
-		if (ctx->open_rec)
-			goto alloc_payload;
-	}
+	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
+			       MSG_CMSG_COMPAT | MSG_SPLICE_PAGES |
+			       MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY))
+		return -EOPNOTSUPP;
 
-	if (num_async) {
-		/* Transmit if any encryptions have completed */
-		if (test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) {
-			cancel_delayed_work(&ctx->tx_work.work);
-			tls_tx_records(sk, flags);
-		}
-	}
-sendpage_end:
-	ret = sk_stream_error(sk, flags, ret);
-	return copied > 0 ? copied : ret;
+	ret = mutex_lock_interruptible(&tls_ctx->tx_lock);
+	if (ret)
+		return ret;
+	lock_sock(sk);
+	ret = tls_sw_sendmsg_locked(sk, msg, size);
+	release_sock(sk);
+	mutex_unlock(&tls_ctx->tx_lock);
+	return ret;
 }
 
 int tls_sw_sendpage_locked(struct sock *sk, struct page *page,
 			   int offset, size_t size, int flags)
 {
+	struct bio_vec bvec;
+	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
+
 	if (flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
 		      MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY |
 		      MSG_NO_SHARED_FRAGS))
 		return -EOPNOTSUPP;
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		msg.msg_flags |= MSG_MORE;
 
-	return tls_sw_do_sendpage(sk, page, offset, size, flags);
+	bvec_set_page(&bvec, page, size, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+	return tls_sw_sendmsg_locked(sk, &msg, size);
 }
 
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags)
 {
-	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	int ret;
+	struct bio_vec bvec;
+	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
 
 	if (flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
 		      MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY))
 		return -EOPNOTSUPP;
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		msg.msg_flags |= MSG_MORE;
 
-	ret = mutex_lock_interruptible(&tls_ctx->tx_lock);
-	if (ret)
-		return ret;
-	lock_sock(sk);
-	ret = tls_sw_do_sendpage(sk, page, offset, size, flags);
-	release_sock(sk);
-	mutex_unlock(&tls_ctx->tx_lock);
-	return ret;
+	bvec_set_page(&bvec, page, size, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+	return tls_sw_sendmsg(sk, &msg, size);
 }
 
 static int


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH net-next 11/12] tls/device: Support MSG_SPLICE_PAGES
  2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
                   ` (9 preceding siblings ...)
  2023-05-24 15:33 ` [PATCH net-next 10/12] tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES David Howells
@ 2023-05-24 15:33 ` David Howells
  2023-05-24 15:33 ` [PATCH net-next 12/12] tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES David Howells
  11 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend

Make TLS's device sendmsg() support MSG_SPLICE_PAGES.  This causes pages to
be spliced from the source iterator if possible and the data to be copied if
not.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
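
Note from the hunk below that the splice path is only honoured when
zerocopy_sendfile has been enabled on the TLS context; as a hedged
illustration (fd being an already-configured kTLS socket), userspace opts in
with:

	int enable = 1;

	/* Sets tls_ctx->zerocopy_sendfile; without it, MSG_SPLICE_PAGES is
	 * cleared and the data is copied as before.
	 */
	setsockopt(fd, SOL_TLS, TLS_TX_ZEROCOPY_RO, &enable, sizeof(enable));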

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/tls/tls_device.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index daeff54bdbfa..ee07f6e67d52 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -508,7 +508,30 @@ static int tls_push_data(struct sock *sk,
 			tls_append_frag(record, &zc_pfrag, copy);
 
 			iter_offset.offset += copy;
+		} else if (copy && (flags & MSG_SPLICE_PAGES)) {
+			struct page_frag zc_pfrag;
+			struct page **pages = &zc_pfrag.page;
+			size_t off;
+
+			rc = iov_iter_extract_pages(iter_offset.msg_iter, &pages,
+						    copy, 1, 0, &off);
+			if (rc <= 0) {
+				if (rc == 0)
+					rc = -EIO;
+				goto handle_error;
+			}
+			copy = rc;
+
+			if (!sendpage_ok(zc_pfrag.page)) {
+				iov_iter_revert(iter_offset.msg_iter, copy);
+				goto no_zcopy_this_page;
+			}
+
+			zc_pfrag.offset = off;
+			zc_pfrag.size = copy;
+			tls_append_frag(record, &zc_pfrag, copy);
 		} else if (copy) {
+no_zcopy_this_page:
 			copy = min_t(size_t, copy, pfrag->size - pfrag->offset);
 
 			rc = tls_device_copy_data(page_address(pfrag->page) +
@@ -571,6 +594,9 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	union tls_iter_offset iter;
 	int rc;
 
+	if (!tls_ctx->zerocopy_sendfile)
+		msg->msg_flags &= ~MSG_SPLICE_PAGES;
+
 	mutex_lock(&tls_ctx->tx_lock);
 	lock_sock(sk);
 


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH net-next 12/12] tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES
  2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
                   ` (10 preceding siblings ...)
  2023-05-24 15:33 ` [PATCH net-next 11/12] tls/device: Support MSG_SPLICE_PAGES David Howells
@ 2023-05-24 15:33 ` David Howells
  11 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-05-24 15:33 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend

Convert tls_device_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather
than directly splicing in the pages itself.  With that, the tls_iter_offset
union is no longer necessary and can be replaced with an iov_iter pointer,
and the zc_page argument to tls_push_data() can also be removed.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/tls/tls_device.c | 81 ++++++++++----------------------------------
 1 file changed, 18 insertions(+), 63 deletions(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index ee07f6e67d52..f2c895009314 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -422,16 +422,10 @@ static int tls_device_copy_data(void *addr, size_t bytes, struct iov_iter *i)
 	return 0;
 }
 
-union tls_iter_offset {
-	struct iov_iter *msg_iter;
-	int offset;
-};
-
 static int tls_push_data(struct sock *sk,
-			 union tls_iter_offset iter_offset,
+			 struct iov_iter *iter,
 			 size_t size, int flags,
-			 unsigned char record_type,
-			 struct page *zc_page)
+			 unsigned char record_type)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_prot_info *prot = &tls_ctx->prot_info;
@@ -499,21 +493,12 @@ static int tls_push_data(struct sock *sk,
 		record = ctx->open_record;
 
 		copy = min_t(size_t, size, max_open_record_len - record->len);
-		if (copy && zc_page) {
-			struct page_frag zc_pfrag;
-
-			zc_pfrag.page = zc_page;
-			zc_pfrag.offset = iter_offset.offset;
-			zc_pfrag.size = copy;
-			tls_append_frag(record, &zc_pfrag, copy);
-
-			iter_offset.offset += copy;
-		} else if (copy && (flags & MSG_SPLICE_PAGES)) {
+		if (copy && (flags & MSG_SPLICE_PAGES)) {
 			struct page_frag zc_pfrag;
 			struct page **pages = &zc_pfrag.page;
 			size_t off;
 
-			rc = iov_iter_extract_pages(iter_offset.msg_iter, &pages,
+			rc = iov_iter_extract_pages(iter, &pages,
 						    copy, 1, 0, &off);
 			if (rc <= 0) {
 				if (rc == 0)
@@ -523,7 +508,7 @@ static int tls_push_data(struct sock *sk,
 			copy = rc;
 
 			if (!sendpage_ok(zc_pfrag.page)) {
-				iov_iter_revert(iter_offset.msg_iter, copy);
+				iov_iter_revert(iter, copy);
 				goto no_zcopy_this_page;
 			}
 
@@ -536,7 +521,7 @@ static int tls_push_data(struct sock *sk,
 
 			rc = tls_device_copy_data(page_address(pfrag->page) +
 						  pfrag->offset, copy,
-						  iter_offset.msg_iter);
+						  iter);
 			if (rc)
 				goto handle_error;
 			tls_append_frag(record, pfrag, copy);
@@ -591,7 +576,6 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 {
 	unsigned char record_type = TLS_RECORD_TYPE_DATA;
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	union tls_iter_offset iter;
 	int rc;
 
 	if (!tls_ctx->zerocopy_sendfile)
@@ -606,8 +590,7 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 			goto out;
 	}
 
-	iter.msg_iter = &msg->msg_iter;
-	rc = tls_push_data(sk, iter, size, msg->msg_flags, record_type, NULL);
+	rc = tls_push_data(sk, &msg->msg_iter, size, msg->msg_flags, record_type);
 
 out:
 	release_sock(sk);
@@ -618,44 +601,18 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 int tls_device_sendpage(struct sock *sk, struct page *page,
 			int offset, size_t size, int flags)
 {
-	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	union tls_iter_offset iter_offset;
-	struct iov_iter msg_iter;
-	char *kaddr;
-	struct kvec iov;
-	int rc;
+	struct bio_vec bvec;
+	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
 
 	if (flags & MSG_SENDPAGE_NOTLAST)
-		flags |= MSG_MORE;
-
-	mutex_lock(&tls_ctx->tx_lock);
-	lock_sock(sk);
+		msg.msg_flags |= MSG_MORE;
 
-	if (flags & MSG_OOB) {
-		rc = -EOPNOTSUPP;
-		goto out;
-	}
-
-	if (tls_ctx->zerocopy_sendfile) {
-		iter_offset.offset = offset;
-		rc = tls_push_data(sk, iter_offset, size,
-				   flags, TLS_RECORD_TYPE_DATA, page);
-		goto out;
-	}
-
-	kaddr = kmap(page);
-	iov.iov_base = kaddr + offset;
-	iov.iov_len = size;
-	iov_iter_kvec(&msg_iter, ITER_SOURCE, &iov, 1, size);
-	iter_offset.msg_iter = &msg_iter;
-	rc = tls_push_data(sk, iter_offset, size, flags, TLS_RECORD_TYPE_DATA,
-			   NULL);
-	kunmap(page);
+	if (flags & MSG_OOB)
+		return -EOPNOTSUPP;
 
-out:
-	release_sock(sk);
-	mutex_unlock(&tls_ctx->tx_lock);
-	return rc;
+	bvec_set_page(&bvec, page, size, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+	return tls_device_sendmsg(sk, &msg, size);
 }
 
 struct tls_record_info *tls_get_record(struct tls_offload_context_tx *context,
@@ -720,12 +677,10 @@ EXPORT_SYMBOL(tls_get_record);
 
 static int tls_device_push_pending_record(struct sock *sk, int flags)
 {
-	union tls_iter_offset iter;
-	struct iov_iter msg_iter;
+	struct iov_iter iter;
 
-	iov_iter_kvec(&msg_iter, ITER_SOURCE, NULL, 0, 0);
-	iter.msg_iter = &msg_iter;
-	return tls_push_data(sk, iter, 0, flags, TLS_RECORD_TYPE_DATA, NULL);
+	iov_iter_kvec(&iter, ITER_SOURCE, NULL, 0, 0);
+	return tls_push_data(sk, &iter, 0, flags, TLS_RECORD_TYPE_DATA);
 }
 
 void tls_device_write_space(struct sock *sk, struct tls_context *ctx)


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios
  2023-05-24 15:33   ` David Howells
@ 2023-05-26 11:56     ` Yunsheng Lin
  -1 siblings, 0 replies; 69+ messages in thread
From: Yunsheng Lin @ 2023-05-26 11:56 UTC (permalink / raw)
  To: David Howells, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Jeroen de Borst, Catherine Sullivan,
	Shailend Chand, Felix Fietkau, John Crispin, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme

On 2023/5/24 23:33, David Howells wrote:
> Change the page_frag_cache allocator to use multipage folios rather than
> groups of pages.  This reduces page_frag_free to just a folio_put() or
> put_page().

Hi, David

put_page() is not used in this patch, perhaps remove it to avoid
the confusion?
Also, is there any significant difference between __free_pages()
and folio_put()? IOW, what does the 'reduces' part mean here?

I followed some discussion about folio before, but have not really
understood the real difference between 'multipage folios' and
'groups of pages' yet. Is folio mostly used to avoid the confusion
about whether a page is 'headpage of compound page', 'base page' or
'tailpage of compound page'? Or is there any obvious benefit about
folio that I missed?

> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 306a3d1a0fa6..d7c52a5979cc 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -420,18 +420,13 @@ static inline void *folio_get_private(struct folio *folio)
>  }
>  
>  struct page_frag_cache {
> -	void * va;
> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> -	__u16 offset;
> -	__u16 size;
> -#else
> -	__u32 offset;
> -#endif
> +	struct folio	*folio;
> +	unsigned int	offset;
>  	/* we maintain a pagecount bias, so that we dont dirty cache line
>  	 * containing page->_refcount every time we allocate a fragment.
>  	 */
> -	unsigned int		pagecnt_bias;
> -	bool pfmemalloc;
> +	unsigned int	pagecnt_bias;
> +	bool		pfmemalloc;
>  };

It seems the 'va' and 'size' fields are used to avoid touching 'struct page' to
avoid possible cache bouncing when more frags can be allocated
from the page while other frags are freed at the same time before this patch?
It might be worth calling that out in the commit log or splitting it into
another patch to make it clearer and easier to review?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios
  2023-05-24 15:33   ` David Howells
@ 2023-05-26 12:47     ` David Howells
  -1 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-05-26 12:47 UTC (permalink / raw)
  To: Yunsheng Lin, Matthew Wilcox
  Cc: David Howells, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Willem de Bruijn, David Ahern,
	Jens Axboe, linux-mm, linux-kernel, Jeroen de Borst,
	Catherine Sullivan, Shailend Chand, Felix Fietkau, John Crispin,
	Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
	AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Andrew Morton, linux-arm-kernel, linux-mediatek, linux-nvme

Yunsheng Lin <linyunsheng@huawei.com> wrote:

> > Change the page_frag_cache allocator to use multipage folios rather than
> > groups of pages.  This reduces page_frag_free to just a folio_put() or
> > put_page().
> 
> put_page() is not used in this patch, perhaps remove it to avoid
> the confusion?

Will do if I need to respin the patches.

> Also, is there any significant difference between __free_pages()
> and folio_put()? IOW, what does the 'reduces' part mean here?

I meant that the folio code handles page compounding for us and we don't need
to work out how big the page is for ourselves.

If you look at __free_pages(), you can see a PageHead() call.  folio_put()
doesn't need that.

> I followed some discussion about folio before, but have not really
> understood the real difference between 'multipage folios' and
> 'groups of pages' yet. Is folio mostly used to avoid the confusion
> about whether a page is 'headpage of compound page', 'base page' or
> 'tailpage of compound page'? Or is there any obvious benefit about
> folio that I missed?

There is a benefit: a folio pointer always points to the head page and so we
never need to do "is this compound? where's the head?" logic to find it.  When
going from a page pointer, we still have to find the head.
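
Roughly, as an illustration only (given some struct page *page that may be a
tail page and some struct folio *folio):

	void *a = page_address(compound_head(page));	/* page path: find the head first */
	void *b = folio_address(folio);			/* folio path: it already is the head */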

Ultimately, the aim is to reduce struct page to a typed pointer to massively
reduce the amount of space consumed by mem_map[].  A page struct will then
point at a folio or a slab struct or one of a number of different types.  But
to get to that point, we have to stop a whole lot of things from using page
structs, but rather use some other type, such as folio.

Eventually, there won't be a need for head pages and tail pages per se - just
memory objects of different sizes.

> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 306a3d1a0fa6..d7c52a5979cc 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -420,18 +420,13 @@ static inline void *folio_get_private(struct folio *folio)
> >  }
> >  
> >  struct page_frag_cache {
> > -	void * va;
> > -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > -	__u16 offset;
> > -	__u16 size;
> > -#else
> > -	__u32 offset;
> > -#endif
> > +	struct folio	*folio;
> > +	unsigned int	offset;
> >  	/* we maintain a pagecount bias, so that we dont dirty cache line
> >  	 * containing page->_refcount every time we allocate a fragment.
> >  	 */
> > -	unsigned int		pagecnt_bias;
> > -	bool pfmemalloc;
> > +	unsigned int	pagecnt_bias;
> > +	bool		pfmemalloc;
> >  };
> 
> It seems the 'va' and 'size' fields are used to avoid touching 'struct page' to
> avoid possible cache bouncing when more frags can be allocated
> from the page while other frags are freed at the same time before this patch?

Hmmm... fair point, though va is calculated from the page pointer on most
arches without the need to dereference struct page (only arc, m68k and sparc
define WANT_PAGE_VIRTUAL).
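
That is, something like the following never reads the struct page cacheline
on those configurations (a sketch; lowmem assumed):

	void *va = page_address(page);	/* effectively __va(PFN_PHYS(page_to_pfn(page))) */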

David


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios
  2023-05-26 12:47     ` David Howells
@ 2023-05-26 14:06       ` Mika Penttilä
  -1 siblings, 0 replies; 69+ messages in thread
From: Mika Penttilä @ 2023-05-26 14:06 UTC (permalink / raw)
  To: David Howells, Yunsheng Lin, Matthew Wilcox
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Jens Axboe, linux-mm,
	linux-kernel, Jeroen de Borst, Catherine Sullivan,
	Shailend Chand, Felix Fietkau, John Crispin, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme

Hi,

On 26.5.2023 15.47, David Howells wrote:
> Yunsheng Lin <linyunsheng@huawei.com> wrote:
> 
>>> Change the page_frag_cache allocator to use multipage folios rather than
>>> groups of pages.  This reduces page_frag_free to just a folio_put() or
>>> put_page().
>>
>> put_page() is not used in this patch, perhaps remove it to avoid
>> the confusion?
> 
> Will do if I need to respin the patches.
> 
>> Also, is there any significant difference between __free_pages()
>> and folio_put()? IOW, what does the 'reduces' part mean here?
> 
> I meant that the folio code handles page compounding for us and we don't need
> to work out how big the page is for ourselves.
> 
> If you look at __free_pages(), you can see a PageHead() call.  folio_put()
> doesn't need that.
> 
>> I followed some discussion about folio before, but have not really
>> understood the real difference between 'multipage folios' and
>> 'groups of pages' yet. Is folio mostly used to avoid the confusion
>> about whether a page is 'headpage of compound page', 'base page' or
>> 'tailpage of compound page'? Or is there any obvious benefit about
>> folio that I missed?
> 
> There is a benefit: a folio pointer always points to the head page and so we
> never need to do "is this compound? where's the head?" logic to find it.  When
> going from a page pointer, we still have to find the head.
> 


But page_frag_free() uses folio_put(virt_to_folio(addr)) and 
virt_to_folio() depends on the compound infrastructure to get the head 
page and folio.
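
For reference, per the patch description the free side becomes roughly:

	void page_frag_free(void *addr)
	{
		/* virt_to_folio() still has to find the head page */
		folio_put(virt_to_folio(addr));
	}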


> Ultimately, the aim is to reduce struct page to a typed pointer to massively
> reduce the amount of space consumed by mem_map[].  A page struct will then
> point at a folio or a slab struct or one of a number of different types.  But
> to get to that point, we have to stop a whole lot of things from using page
> structs, but rather use some other type, such as folio.
> 
> Eventually, there won't be a need for head pages and tail pages per se - just
> memory objects of different sizes.
> 
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 306a3d1a0fa6..d7c52a5979cc 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -420,18 +420,13 @@ static inline void *folio_get_private(struct folio *folio)
>>>   }
>>>   
>>>   struct page_frag_cache {
>>> -	void * va;
>>> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>> -	__u16 offset;
>>> -	__u16 size;
>>> -#else
>>> -	__u32 offset;
>>> -#endif
>>> +	struct folio	*folio;
>>> +	unsigned int	offset;
>>>   	/* we maintain a pagecount bias, so that we dont dirty cache line
>>>   	 * containing page->_refcount every time we allocate a fragment.
>>>   	 */
>>> -	unsigned int		pagecnt_bias;
>>> -	bool pfmemalloc;
>>> +	unsigned int	pagecnt_bias;
>>> +	bool		pfmemalloc;
>>>   };
>>
>> It seems the 'va' and 'size' fields are used to avoid touching 'struct page' to
>> avoid possible cache bouncing when more frags can be allocated
>> from the page while other frags are freed at the same time before this patch?
> 
> Hmmm... fair point, though va is calculated from the page pointer on most
> arches without the need to dereference struct page (only arc, m68k and sparc
> define WANT_PAGE_VIRTUAL).
> 
> David
> 

--Mika


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios
  2023-05-24 15:33   ` David Howells
@ 2023-05-27  0:50     ` Jakub Kicinski
  -1 siblings, 0 replies; 69+ messages in thread
From: Jakub Kicinski @ 2023-05-27  0:50 UTC (permalink / raw)
  To: David Howells
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Jeroen de Borst, Catherine Sullivan,
	Shailend Chand, Felix Fietkau, John Crispin, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme

On Wed, 24 May 2023 16:33:03 +0100 David Howells wrote:
> -	offset = nc->offset - fragsz;
> -	if (unlikely(offset < 0)) {
> -		page = virt_to_page(nc->va);
> -
> -		if (page_ref_count(page) != nc->pagecnt_bias)
> +	offset = nc->offset;
> +	if (unlikely(fragsz > offset)) {
> +		/* Reuse the folio if everyone we gave it to has finished with
> +		 * it.
> +		 */
> +		if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias)) {
> +			nc->folio = NULL;
>  			goto refill;
> +		}
> +
>  		if (unlikely(nc->pfmemalloc)) {
> -			page_ref_sub(page, nc->pagecnt_bias - 1);
> -			__free_pages(page, compound_order(page));
> +			__folio_put(folio);

This is not a pure 1:1 page -> folio conversion.
Why mix conversion with other code changes?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 05/12] mm: Make the page_frag_cache allocator handle __GFP_ZERO itself
  2023-05-24 15:33   ` David Howells
@ 2023-05-27  0:57     ` Jakub Kicinski
  -1 siblings, 0 replies; 69+ messages in thread
From: Jakub Kicinski @ 2023-05-27  0:57 UTC (permalink / raw)
  To: David Howells
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Jeroen de Borst, Catherine Sullivan,
	Shailend Chand, Felix Fietkau, John Crispin, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme, Alexander Duyck

On Wed, 24 May 2023 16:33:04 +0100 David Howells wrote:
> Make the page_frag_cache allocator handle __GFP_ZERO itself rather than
> passing it off to the page allocator.  There may be a mix of callers, some
> specifying __GFP_ZERO and some not - and even if all specify __GFP_ZERO, we
> might refurbish the page, in which case the returned memory doesn't get
> cleared.

I think it's pretty clear that the page frag allocator was never supposed
to support __GFP_ZERO, as we don't need it in networking.  So maybe
you're better off adding the memset() in nvme?
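
Something like this at the nvme-tcp allocation site, presumably (illustrative
only; the cache and length names here are made up):

	void *buf = page_frag_alloc(&queue->pf_cache, len, GFP_ATOMIC);

	if (buf)
		memset(buf, 0, len);	/* clear explicitly rather than passing __GFP_ZERO */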

CCing Alex, who I think would say something along those lines :)
IDK how much we still care given that most networking drivers are
migrating to page_pool these days.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 06/12] mm: Make the page_frag_cache allocator use per-cpu
  2023-05-24 15:33   ` David Howells
@ 2023-05-27  1:02     ` Jakub Kicinski
  -1 siblings, 0 replies; 69+ messages in thread
From: Jakub Kicinski @ 2023-05-27  1:02 UTC (permalink / raw)
  To: David Howells
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Jeroen de Borst, Catherine Sullivan,
	Shailend Chand, Felix Fietkau, John Crispin, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme

On Wed, 24 May 2023 16:33:05 +0100 David Howells wrote:
> though if a softirq wants to access it, then softirq disablement will
> need to be added.

Pretty sure GVE uses their allocator from softirq.
So this doesn't work, right?
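
Rough sketch of what the softirq disablement being asked about could look
like around a per-cpu cache (illustrative only; the cache and function names
are made up, and the existing page_frag_alloc() signature is assumed):

#include <linux/bottom_half.h>
#include <linux/gfp.h>
#include <linux/percpu.h>

/* Hypothetical per-cpu cache, named here only for illustration. */
static DEFINE_PER_CPU(struct page_frag_cache, demo_frag_cache);

static void *demo_frag_alloc(unsigned int size)
{
	void *frag;

	/*
	 * Keep softirq (e.g. NAPI poll) off this CPU while the per-cpu
	 * cache is in use so that process-context and softirq users
	 * don't race on it.
	 */
	local_bh_disable();
	frag = page_frag_alloc(this_cpu_ptr(&demo_frag_cache), size,
			       GFP_ATOMIC);
	local_bh_enable();
	return frag;
}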

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 09/12] tls/sw: Support MSG_SPLICE_PAGES
  2023-05-24 15:33 ` [PATCH net-next 09/12] tls/sw: Support MSG_SPLICE_PAGES David Howells
@ 2023-05-27  1:08   ` Jakub Kicinski
  2023-05-30 22:26   ` Bug in short splice to socket? David Howells
  1 sibling, 0 replies; 69+ messages in thread
From: Jakub Kicinski @ 2023-05-27  1:08 UTC (permalink / raw)
  To: David Howells
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend

On Wed, 24 May 2023 16:33:08 +0100 David Howells wrote:
> +static int rls_sw_sendmsg_splice(struct sock *sk, struct msghdr *msg,

s/rls/tls/ ?

Will the TLS selftests under tools/.../net/tls.c exercise this?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 10/12] tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES
  2023-05-24 15:33 ` [PATCH net-next 10/12] tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES David Howells
@ 2023-05-27  1:13   ` Jakub Kicinski
  0 siblings, 0 replies; 69+ messages in thread
From: Jakub Kicinski @ 2023-05-27  1:13 UTC (permalink / raw)
  To: David Howells
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend

On Wed, 24 May 2023 16:33:09 +0100 David Howells wrote:
> Convert tls_sw_sendpage() and tls_sw_sendpage_locked() to use sendmsg()
> with MSG_SPLICE_PAGES rather than directly splicing in the pages itself.
> 
> [!] Note that tls_sw_sendpage_locked() appears to have the wrong locking
>     upstream.  I think the caller will only hold the socket lock, but it
>     should hold tls_ctx->tx_lock too.

Lock ordering, as you probably discovered. It is what it is :|

> +	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
> +			       MSG_CMSG_COMPAT | MSG_SPLICE_PAGES |
> +			       MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY))
> +		return -EOPNOTSUPP;

Now MSG_SENDPAGE_* can leak in thru the sendmsg() call?
Letting MSG_SENDPAGE_NOPOLICY in seems pretty suspicious, no?
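
For illustration, one way of keeping those internal flags from leaking in
from userspace would be to reject them up front rather than allowing them
through the mask - a sketch with a made-up helper name, not part of the
patch:

#include <linux/errno.h>
#include <linux/socket.h>

/*
 * Sketch only, not the actual TLS code: refuse the internal sendpage
 * flags if they arrive from userspace via sendmsg().
 */
static int reject_internal_sendpage_flags(const struct msghdr *msg)
{
	if (msg->msg_flags & (MSG_SENDPAGE_NOTLAST |
			      MSG_SENDPAGE_NOPOLICY))
		return -EOPNOTSUPP;
	return 0;
}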

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios
  2023-05-26 11:56     ` Yunsheng Lin
@ 2023-05-27 15:47       ` Alexander H Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander H Duyck @ 2023-05-27 15:47 UTC (permalink / raw)
  To: Yunsheng Lin, David Howells, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Jeroen de Borst, Catherine Sullivan,
	Shailend Chand, Felix Fietkau, John Crispin, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme

On Fri, 2023-05-26 at 19:56 +0800, Yunsheng Lin wrote:
> On 2023/5/24 23:33, David Howells wrote:
> > Change the page_frag_cache allocator to use multipage folios rather than
> > groups of pages.  This reduces page_frag_free to just a folio_put() or
> > put_page().
> 
> Hi, David
> 
> put_page() is not used in this patch, perhaps remove it to avoid
> confusion?
> Also, is there any significant difference between __free_pages()
> and folio_put()? IOW, what does the 'reduces' part mean here?
> 
> I followed some discussion about folios before, but have not really
> understood the real difference between 'multipage folios' and
> 'groups of pages' yet. Is folio mostly used to avoid the confusion
> about whether a page is the 'headpage of a compound page', a 'base page'
> or a 'tailpage of a compound page'? Or is there any obvious benefit of
> folios that I missed?
> 
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 306a3d1a0fa6..d7c52a5979cc 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -420,18 +420,13 @@ static inline void *folio_get_private(struct folio *folio)
> >  }
> >  
> >  struct page_frag_cache {
> > -	void * va;
> > -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > -	__u16 offset;
> > -	__u16 size;
> > -#else
> > -	__u32 offset;
> > -#endif
> > +	struct folio	*folio;
> > +	unsigned int	offset;
> >  	/* we maintain a pagecount bias, so that we dont dirty cache line
> >  	 * containing page->_refcount every time we allocate a fragment.
> >  	 */
> > -	unsigned int		pagecnt_bias;
> > -	bool pfmemalloc;
> > +	unsigned int	pagecnt_bias;
> > +	bool		pfmemalloc;
> >  };
> 
> It seems the 'va' and 'size' fields were used to avoid touching 'struct page'
> to avoid possible cache bouncing when more frags can be allocated
> from the page while other frags are freed at the same time, before this patch?
> It might be worth calling that out in the commit log or splitting it into
> another patch to make it clearer and easier to review?

Yes, there is a cost for going from page to virtual address. That is
why we only use the page when we finally get to freeing or resetting
the pagecnt_bias.
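
For anyone following along, a minimal sketch (not the in-tree code) of the
pagecnt_bias trick referred to above: a large reference is taken once per
refill, and each fragment handed out afterwards only decrements a local
counter.

#include <linux/limits.h>
#include <linux/mm_types.h>
#include <linux/page_ref.h>

#define DEMO_FRAG_BIAS	USHRT_MAX	/* illustrative value only */

/*
 * Take one big reference when the page is (re)filled; handing out each
 * fragment then only decrements the local pagecnt_bias, so
 * page->_refcount and its cache line aren't dirtied per allocation.
 */
static void demo_frag_cache_refill(struct page_frag_cache *nc,
				   struct page *page)
{
	page_ref_add(page, DEMO_FRAG_BIAS - 1);	/* alloc holds one ref */
	nc->pagecnt_bias = DEMO_FRAG_BIAS;
}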

Also I have some concerns about going from page to folio as it seems
like folio_alloc() sets up the transparent hugepage destructor instead
of using the compound page destructor. I would think that would slow
down most users as it looks like there is a spinlock that is taken in
the hugepage destructor that isn't there in the compound page
destructor.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 03/12] mm: Make the page_frag_cache allocator alignment param a pow-of-2
  2023-05-24 15:33   ` David Howells
@ 2023-05-27 15:54     ` Alexander H Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander H Duyck @ 2023-05-27 15:54 UTC (permalink / raw)
  To: David Howells, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Jeroen de Borst, Catherine Sullivan,
	Shailend Chand, Felix Fietkau, John Crispin, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme

On Wed, 2023-05-24 at 16:33 +0100, David Howells wrote:
> Make the page_frag_cache allocator's alignment parameter a power of 2
> rather than a mask and give a warning if it isn't.
> 
> This means that it's consistent with {napi,netdev}_alloc_frag_align() and
> allows __{napi,netdev}_alloc_frag_align() to be removed.
> 

This goes against the original intention of these functions. One of the
reasons why this is being used is because when somebody enables
something like 2K jumbo frames they don't necessarily want to have to
allocate 4K SLABs. Instead they can just add a bit of overhead and get
almost twice the utilization out of an order 3 page.

The requirement should only be cache alignment, not power of 2
alignment. This isn't meant to be a slab allocator. We are just
sectioning up pages to handle mixed workloads. In the case of
networking we can end up getting everything from 60B packets to 1514B
in the standard cases. That was why we started sectioning up pages in
the first place, so putting a power of 2 requirement on it doesn't fit
our use case at all and is what we were trying to get away from with
the SLAB allocators.
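
To illustrate the cache-alignment point with a tiny sketch (generic helpers
only, not the allocator's own code; the function name is made up):

#include <linux/cache.h>
#include <linux/kernel.h>

/*
 * The frag users only want the start of each fragment cacheline-aligned.
 * The padded size (1536 bytes with 64-byte cachelines) is deliberately
 * not a power of 2, so an order-3 page can be carved into many such
 * buffers.
 */
static unsigned int demo_rx_frag_size(void)
{
	return ALIGN(1514, SMP_CACHE_BYTES);
}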

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 05/12] mm: Make the page_frag_cache allocator handle __GFP_ZERO itself
  2023-05-27  0:57     ` Jakub Kicinski
@ 2023-05-27 15:54       ` Alexander Duyck
  -1 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2023-05-27 15:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Howells, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Jeroen de Borst,
	Catherine Sullivan, Shailend Chand, Felix Fietkau, John Crispin,
	Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
	AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Andrew Morton, linux-arm-kernel, linux-mediatek, linux-nvme

On Fri, May 26, 2023 at 5:57 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 24 May 2023 16:33:04 +0100 David Howells wrote:
> > Make the page_frag_cache allocator handle __GFP_ZERO itself rather than
> > passing it off to the page allocator.  There may be a mix of callers, some
> > specifying __GFP_ZERO and some not - and even if all specify __GFP_ZERO, we
> > might refurbish the page, in which case the returned memory doesn't get
> > cleared.
>
> I think it's pretty clear that page frag allocator was never supposed
> to support GFP_ZERO, as we don't need it in networking.. So maybe
> you're better off adding the memset() in nvme?
>
> CCing Alex, who I think would say something along those lines :)
> IDK how much we still care given that most networking drivers are
> migrating to page_pool these days.

Yeah, the page frag allocator wasn't meant to handle things like this.
Generally the cache was meant to be used within one context so that
the GFP flags used were consistent between calls. Currently the only
thing passed appears to be GFP_ATOMIC.

Also I am not a big fan of pulling this out of page_alloc.c.  The fact
is that that is where the allocation functions live, so it makes sense to
just leave it there. There isn't, in my view, enough code being added to
justify creating yet another file and making it harder to track git
history as a result.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Bug in short splice to socket?
  2023-05-24 15:33 ` [PATCH net-next 09/12] tls/sw: Support MSG_SPLICE_PAGES David Howells
  2023-05-27  1:08   ` Jakub Kicinski
@ 2023-05-30 22:26   ` David Howells
  2023-05-31  0:32     ` Jakub Kicinski
                       ` (2 more replies)
  1 sibling, 3 replies; 69+ messages in thread
From: David Howells @ 2023-05-30 22:26 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: dhowells, netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig, Linus Torvalds

Jakub Kicinski <kuba@kernel.org> wrote:

> Will the TLS selftests under tools/.../net/tls.c exercise this?

Interesting.  Now that you've pointed me at it, I've tried running it.  Mostly
it passes, but I'm having some problems with the multi_chunk_sendfile tests
that time out.  I think that splice_direct_to_actor() has a bug.  The problem
is this bit of code:

		/*
		 * If more data is pending, set SPLICE_F_MORE
		 * If this is the last data and SPLICE_F_MORE was not set
		 * initially, clears it.
		 */
		if (read_len < len)
			sd->flags |= SPLICE_F_MORE;
		else if (!more)
			sd->flags &= ~SPLICE_F_MORE;

When used with sendfile(), it sets SPLICE_F_MORE (which causes MSG_MORE to be
passed to the network protocol) if we haven't yet read everything that the
user requested and clears it if we fulfilled what the user requested.

This has the weird effect that MSG_MORE gets kind of inverted.  It's never
seen by the actor if we can read the entire request into the pipe - except if
we hit the EOF first.  If we hit the EOF before we fulfil the entire request,
we get a short read and SPLICE_F_MORE and thus MSG_MORE *is* set.  The
upstream TLS code ignores it - but I'm changing this with my patches as
sendmsg() then uses it to mark the EOR.

I think we probably need to fix this in some way to check the size of the
source file - which may not be a regular file :-/  With the attached change, all tests
pass; without it, a bunch of tests fail with timeouts.

David
---
diff --git a/fs/splice.c b/fs/splice.c
index 3e06611d19ae..a7cf216c02a7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -982,10 +982,21 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		 * If this is the last data and SPLICE_F_MORE was not set
 		 * initially, clears it.
 		 */
-		if (read_len < len)
-			sd->flags |= SPLICE_F_MORE;
-		else if (!more)
+		if (read_len < len) {
+			struct inode *ii = in->f_mapping->host;
+
+			if (ii->i_fop->llseek != noop_llseek &&
+			    pos >= i_size_read(ii)) {
+				if (!more)
+					sd->flags &= ~SPLICE_F_MORE;
+			} else {
+				sd->flags |= SPLICE_F_MORE;
+			}
+
+		} else if (!more) {
 			sd->flags &= ~SPLICE_F_MORE;
+		}
+
 		/*
 		 * NOTE: nonblocking mode only applies to the input. We
 		 * must not do the output in nonblocking mode as then we


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-05-30 22:26   ` Bug in short splice to socket? David Howells
@ 2023-05-31  0:32     ` Jakub Kicinski
  2023-06-01 11:01     ` David Laight
  2023-06-01 13:09     ` Linus Torvalds
  2 siblings, 0 replies; 69+ messages in thread
From: Jakub Kicinski @ 2023-05-31  0:32 UTC (permalink / raw)
  To: David Howells
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig, Linus Torvalds

On Tue, 30 May 2023 23:26:43 +0100 David Howells wrote:
> Interesting.  Now that you've pointed me at it, I've tried running it.  Mostly
> it passes, but I'm having some problems with the multi_chunk_sendfile tests
> that time out.  I think that splice_direct_to_actor() has a bug.  The problem
> is this bit of code:
> 
> 		/*
> 		 * If more data is pending, set SPLICE_F_MORE
> 		 * If this is the last data and SPLICE_F_MORE was not set
> 		 * initially, clears it.
> 		 */
> 		if (read_len < len)
> 			sd->flags |= SPLICE_F_MORE;
> 		else if (!more)
> 			sd->flags &= ~SPLICE_F_MORE;
> 
> When used with sendfile(), it sets SPLICE_F_MORE (which causes MSG_MORE to be
> passed to the network protocol) if we haven't yet read everything that the
> user requested and clears it if we fulfilled what the user requested.
> 
> This has the weird effect that MSG_MORE gets kind of inverted.  It's never
> seen by the actor if we can read the entire request into the pipe - except if
> we hit the EOF first.  If we hit the EOF before we fulfil the entire request,
> we get a short read and SPLICE_F_MORE and thus MSG_MORE *is* set.  The
> upstream TLS code ignores it - but I'm changing this with my patches as
> sendmsg() then uses it to mark the EOR.
> 
> I think we probably need to fix this in some way to check the size of source
> file - which may not be a regular file:-/  With the attached change, all tests
> pass; without it, a bunch of tests fail with timeouts.

Yeah, it's one of those 'known warts' which we worked around in TLS
because I don't know enough about VFS to confidently fix it in fs/.
A proper fix would be pretty nice to have.

The original-original report of the problem is here, FWIW:
https://lore.kernel.org/netdev/1591392508-14592-1-git-send-email-pooja.trivedi@stackpath.com/

And my somewhat hacky fix was d452d48b9f8.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: Bug in short splice to socket?
  2023-05-30 22:26   ` Bug in short splice to socket? David Howells
  2023-05-31  0:32     ` Jakub Kicinski
@ 2023-06-01 11:01     ` David Laight
  2023-06-01 13:09     ` Linus Torvalds
  2 siblings, 0 replies; 69+ messages in thread
From: David Laight @ 2023-06-01 11:01 UTC (permalink / raw)
  To: 'David Howells', Jakub Kicinski
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig, Linus Torvalds

From: David Howells
> Sent: 30 May 2023 23:27
> 
> Jakub Kicinski <kuba@kernel.org> wrote:
> 
> > Will the TLS selftests under tools/.../net/tls.c exercise this?
> 
> Interesting.  Now that you've pointed me at it, I've tried running it.  Mostly
> it passes, but I'm having some problems with the multi_chunk_sendfile tests
> that time out.  I think that splice_direct_to_actor() has a bug.  The problem
> is this bit of code:
> 
> 		/*
> 		 * If more data is pending, set SPLICE_F_MORE
> 		 * If this is the last data and SPLICE_F_MORE was not set
> 		 * initially, clears it.
> 		 */
> 		if (read_len < len)
> 			sd->flags |= SPLICE_F_MORE;
> 		else if (!more)
> 			sd->flags &= ~SPLICE_F_MORE;
> 
> When used with sendfile(), it sets SPLICE_F_MORE (which causes MSG_MORE to be
> passed to the network protocol) if we haven't yet read everything that the
> user requested and clears it if we fulfilled what the user requested.
> 
> This has the weird effect that MSG_MORE gets kind of inverted.  It's never
> seen by the actor if we can read the entire request into the pipe - except if
> we hit the EOF first.  If we hit the EOF before we fulfil the entire request,
> we get a short read and SPLICE_F_MORE and thus MSG_MORE *is* set.  The
> upstream TLS code ignores it - but I'm changing this with my patches as
> sendmsg() then uses it to mark the EOR.

Isn't MSG_MORE supposed to be just a hint that more data will follow?
So you'd expect a final send with MSG_MORE to get sent, but possibly
after a short timeout.

Using it as a record marker seems wrong.
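
For reference, the hint semantics in question, as a plain userspace sketch
with placeholder arguments:

#include <sys/socket.h>
#include <sys/types.h>

/*
 * The header is held back by MSG_MORE as a hint; the payload send
 * without it lets the protocol complete the datagram/record (or a
 * timeout does, for UDP corking).
 */
static ssize_t send_record(int fd, const void *hdr, size_t hdr_len,
			   const void *payload, size_t payload_len)
{
	if (send(fd, hdr, hdr_len, MSG_MORE) < 0)
		return -1;
	return send(fd, payload, payload_len, 0);
}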

I'm not sure how to clear 'Oh bugger I set MSG_MORE but have no data'
to avoid the timeout.
A zero length sendmsg() won't DTRT with protocols like SCTP.
(Does splice even do anything sensible with SCTP?)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-05-30 22:26   ` Bug in short splice to socket? David Howells
  2023-05-31  0:32     ` Jakub Kicinski
  2023-06-01 11:01     ` David Laight
@ 2023-06-01 13:09     ` Linus Torvalds
  2023-06-01 13:19       ` Linus Torvalds
  2023-06-01 14:34       ` David Howells
  2 siblings, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2023-06-01 13:09 UTC (permalink / raw)
  To: David Howells
  Cc: Jakub Kicinski, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

On Tue, May 30, 2023 at 6:26 PM David Howells <dhowells@redhat.com> wrote:
>
> When used with sendfile(), it sets SPLICE_F_MORE (which causes MSG_MORE to be
> passed to the network protocol) if we haven't yet read everything that the
> user requested and clears it if we fulfilled what the user requested.
>
> This has the weird effect that MSG_MORE gets kind of inverted. [...]

I hate this patch.

The old code is unquestionably garbage, but this just ends up
resulting in more of the same.

The reason the old code is garbage is that it sets SPLICE_F_MORE
entirely in the wrong place. It sets it *after* it has done the
splice(). That's just crazy.

And that craziness is the direct cause of the bug.

You try to fix the bug by just extending on the craziness. No. The
crazy should be removed, not extended upon.

The "flag" setting should be done *before* the send, if we know that
more data is pending even after the "do_splice_to()". Not after.

And the networking code should do its own "oh, splice gave me X bytes,
but I only used Y, so I know more data is pending, so I'll set
MSG_MORE on the *current* packet". But that's entirely inside of
whatever networking code does the splice.

So no. I absolutely refuse to entertain this patch because it's
completely broken. The fact that the old code was also completely
broken is *not* an excuse to make for more nonsensical breakage that
may or may just hide the crazy.

               Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-01 13:09     ` Linus Torvalds
@ 2023-06-01 13:19       ` Linus Torvalds
  2023-06-01 14:34       ` David Howells
  1 sibling, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2023-06-01 13:19 UTC (permalink / raw)
  To: David Howells
  Cc: Jakub Kicinski, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

On Thu, Jun 1, 2023 at 9:09 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The reason the old code is garbage is that it sets SPLICE_F_MORE
> entirely in the wrong place. It sets it *after* it has done the
> splice(). That's just crazy.

Clarification, because there are two splice's (from and to): by "after
the splice" I mean after the splice source, of course. It's still set
before the splice _to_ the network.

(But it is still true that I hope the networking code itself then sets
MSG_MORE even if SPLICE_F_MORE wasn't set, if it gets a buffer that is
bigger than what it can handle right now - so there are two
*different* reasons for "more data" - either the source knows it has
more to give, or the destination knows it didn't use everything it
got).

The point is that the splice *source* knows whether there is more data
to be had, and that's where the "there is more" should be set.

But the generic code does *not* know. You add a completely horrendous
hack to kind of approximate that knowledge, but it's *wrong*.

The old code was wrong too, of course. No question about that.

Basically my argument is that the whole "there is more data" should be
set by "->splice_read()" not by some hack in some generic
splice_direct_to_actor() function that is absolutely not supposed to
know about the internals of the source or the destination.

Do we have a good interface for that? No. I get the feeling that to
actually fix this, we'd have to pass in the "struct splice_desc"
pointer to ->splice_read().

Then the reading side can say "yup, I have more data for you after
this", and it all makes sense at least conceptually.

So please, can we just fix the layering violation, rather than make it worse?

                Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-01 13:09     ` Linus Torvalds
  2023-06-01 13:19       ` Linus Torvalds
@ 2023-06-01 14:34       ` David Howells
  2023-06-01 15:12         ` Linus Torvalds
  2023-06-01 17:14         ` David Howells
  1 sibling, 2 replies; 69+ messages in thread
From: David Howells @ 2023-06-01 14:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Jakub Kicinski, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig


Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Jun 1, 2023 at 9:09 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > The reason the old code is garbage is that it sets SPLICE_F_MORE
> > entirely in the wrong place. It sets it *after* it has done the
> > splice(). That's just crazy.
> 
> Clarification, because there are two splice's (from and to): by "after
> the splice" I mean after the splice source, of course. It's still set
> before the splice _to_ the network.
> 
> (But it is still true that I hope the networking code itself then sets
> MSG_MORE even if SPLICE_F_MORE wasn't set, if it gets a buffer that is
> bigger than what it can handle right now - so there are two
> *different* reasons for "more data" - either the source knows it has
> more to give, or the destination knows it didn't use everything it
> got).

I'm in the process of changing things so that ->sendpage() is removed and
replaced with sendmsg() with MSG_SPLICE_PAGES.  The aim is to end up with a
splice_to_socket() function (see attached) that transcribes a chunk of the
pipe into a BVEC iterator and does a single sendmsg() that pushes a whole
chunk of data to the socket in one go.

In the network protocol, the loop inside sendmsg splices those pages into
socket buffers, sleeping as necessary to gain sufficient memory to transcribe
all of them.  It can return partly done and the fully consumed pages will be
consumed and then it'll return to gain more data.

At the moment, it transcribes 16 pages at a time.  I could make it set
MSG_MORE only if (a) SPLICE_F_MORE was passed into the splice() syscall or (b)
there's yet more data in the buffer.

However, this might well cause a malfunction in UDP, for example.  MSG_MORE
corks the current packet, so if I ask sendfile() to, say, shove 32K into a
packet and only 16K is read from the source and entirely transcribed into the
packet, then, if I understand what you're proposing, MSG_MORE wouldn't get set
and the packet would be transmitted early.  A similar issue might exist for
AF_TLS, where the lack of MSG_MORE triggers an end-of-record.

The problem isn't that the buffer supplied is bigger than the protocol could
handle in one go, it's that what we read was less than the amount that the
user requested - but we don't know if there will be more data.

One solution might be to pass MSG_MORE regardless, and then follow up with a
null sendmsg() if we then hit a zero-length read (ie. EOF) or -ENODATA (ie. a
hole).

> The point is that the splice *source* knows whether there is more data
> to be had, and that's where the "there is more" should be set.

Actually, that's not necessarily true.  If the source is a pipe or a socket,
there may not be more, but we don't know that until the far end is closed -
but for a file it probably is (we could be racing with a writer).  And then
there are seqfile-based things like the trace buffer or procfiles which are
really a class of their own.

> Basically my argument is that the whole "there is more data" should be
> set by "->splice_read()" not by some hack in some generic
> splice_direct_to_actor() function that is absolutely not supposed to
> know about the internals of the source or the destination.
> 
> Do we have a good interface for that? No. I get the feeling that to
> actually fix this, we'd have to pass in the 'struct splice_desc"
> pointer to ->splice_read().

That might become more feasible, actually.  In Jens's tree are the patches to
clean up splice_read a lot and get rid of ITER_PIPE.  Most of the
->splice_reads are going to be direct calls to copy_splice_read() and
filemap_splice_read() or wrappers around filemap_splice_read().  The latter,
at least, might be in a good position to note EOF, perhaps by marking a pipe
buffer as "no more".

copy_splice_read() is more problematic because ->read_iter() doesn't indicate
if it hit EOF (or hit a hole).

David
---
ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
			 loff_t *ppos, size_t len, unsigned int flags)
{
	struct socket *sock = sock_from_file(out);
	struct bio_vec bvec[16];
	struct msghdr msg = {};
	ssize_t ret;
	size_t spliced = 0;
	bool need_wakeup = false;

	pipe_lock(pipe);

	while (len > 0) {
		unsigned int head, tail, mask, bc = 0;
		size_t remain = len;

		/*
		 * Check for signal early to make process killable when there
		 * are always buffers available
		 */
		ret = -ERESTARTSYS;
		if (signal_pending(current))
			break;

		while (pipe_empty(pipe->head, pipe->tail)) {
			ret = 0;
			if (!pipe->writers)
				goto out;

			if (spliced)
				goto out;

			ret = -EAGAIN;
			if (flags & SPLICE_F_NONBLOCK)
				goto out;

			ret = -ERESTARTSYS;
			if (signal_pending(current))
				goto out;

			if (need_wakeup) {
				wakeup_pipe_writers(pipe);
				need_wakeup = false;
			}

			pipe_wait_readable(pipe);
		}

		head = pipe->head;
		tail = pipe->tail;
		mask = pipe->ring_size - 1;

		while (!pipe_empty(head, tail)) {
			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
			size_t seg;

			if (!buf->len) {
				tail++;
				continue;
			}

			seg = min_t(size_t, remain, buf->len);
			seg = min_t(size_t, seg, PAGE_SIZE);

			ret = pipe_buf_confirm(pipe, buf);
			if (unlikely(ret)) {
				if (ret == -ENODATA)
					ret = 0;
				break;
			}

			bvec_set_page(&bvec[bc++], buf->page, seg, buf->offset);
			remain -= seg;
			if (seg >= buf->len)
				tail++;
			if (bc >= ARRAY_SIZE(bvec))
				break;
		}

		if (!bc)
			break;

		msg.msg_flags = 0;
		if (flags & SPLICE_F_MORE)
			msg.msg_flags = MSG_MORE;
		if (remain && pipe_occupancy(pipe->head, tail) > 0)
			msg.msg_flags = MSG_MORE;
		msg.msg_flags |= MSG_SPLICE_PAGES;

		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, bc, len - remain);
		ret = sock_sendmsg(sock, &msg);
		if (ret <= 0)
			break;

		spliced += ret;
		len -= ret;
		tail = pipe->tail;
		while (ret > 0) {
			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
			size_t seg = min_t(size_t, ret, buf->len);

			buf->offset += seg;
			buf->len -= seg;
			ret -= seg;

			if (!buf->len) {
				pipe_buf_release(pipe, buf);
				tail++;
			}
		}

		if (tail != pipe->tail) {
			pipe->tail = tail;
			if (pipe->files)
				need_wakeup = true;
		}
	}

out:
	pipe_unlock(pipe);
	if (need_wakeup)
		wakeup_pipe_writers(pipe);
	return spliced ?: ret;
}


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-01 14:34       ` David Howells
@ 2023-06-01 15:12         ` Linus Torvalds
  2023-06-05 11:03           ` David Laight
  2023-06-05 15:52           ` David Howells
  2023-06-01 17:14         ` David Howells
  1 sibling, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2023-06-01 15:12 UTC (permalink / raw)
  To: David Howells
  Cc: Jakub Kicinski, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

On Thu, Jun 1, 2023 at 10:34 AM David Howells <dhowells@redhat.com> wrote:
>
> At the moment, it transcribes 16 pages at a time.  I could make it set
> MSG_MORE only if (a) SPLICE_F_MORE was passed into the splice() syscall or (b)
> there's yet more data in the buffer.

That would at least be a good first step.

> However, this might well cause a malfunction in UDP, for example.  MSG_MORE
> corks the current packet, so if I ask sendfile() say shove 32K into a packet,
> if, say, 16K is read from the source and entirely transcribed into the packet,

If you use splice() for UDP, I don't think you would normally expect
to get all that well-defined packet boundaries.

That said, I think *this* part of splice_direct_to_actor() is correct:

                if (read_len < len)
                        sd->flags |= SPLICE_F_MORE;    <- WRONG
                else if (!more)
                        sd->flags &= ~SPLICE_F_MORE; <- CORRECT

ie if we've used up all of the 'len' argument, *and* 'more' wasn't set
in the incoming flags, then at that point we should clear
SPLICE_F_MORE.

So that means that UDP packet boundaries will be honored at the
'splice()' system call's 'len' argument.

Obviously packet boundaries might happen before that - ie depending on
what the packet size limits are.

But the "set SPLICE_F_MORE" bit is just wrong. The generic code simply
does not know enough to make that determination.

> if I understand what you're proposing, MSG_MORE wouldn't get set and the
> packet would be transmitted early.

No, I'm saying that MSG_MORE should be set depending on what the
splice *input* says.

If the splice input knows that it has more to give but stopped early
for whatever reason (typically that the splice pipe buffers filled up,
but that's not necessarily the *only* reason), then it should set
SPLICE_F_MORE.

But this is literally only something that the input side *can* know.

And as you mention, some input sides cannot know even that. Regular
files typically know if there is more data. Other dynamic sources may
simply not know. And if they don't know, they just shouldn't set
SPLICE_F_MORE.

Of course, SPLICE_F_MORE may then be set because the *user* passed in
that flag, but that's a completely separate issue. The user may pass
in that flag because the user wants maximally sized packets, and knows
that other things will be fed into the destination (not even
necessarily through splice) after the splice.
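
A userspace sketch of that situation, with placeholder descriptor names:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * The caller knows it will feed more into the socket after the splice,
 * so it passes SPLICE_F_MORE for maximally sized packets and then
 * finishes the record itself.
 */
static void splice_then_trailer(int pipe_fd, int sock_fd,
				const void *trailer, size_t len)
{
	splice(pipe_fd, NULL, sock_fd, NULL, 65536,
	       SPLICE_F_MORE | SPLICE_F_MOVE);
	send(sock_fd, trailer, len, 0);	/* completes the record */
}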

So you really have multiple different reasons why SPLICE_F_MORE might
get set, but that

                if (read_len < len)

is *not* a valid reason. And no, extending that logic with more random
logic is still not a valid reason.

            Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-01 14:34       ` David Howells
  2023-06-01 15:12         ` Linus Torvalds
@ 2023-06-01 17:14         ` David Howells
  2023-06-02  4:20           ` Jakub Kicinski
  2023-06-02  8:23           ` David Howells
  1 sibling, 2 replies; 69+ messages in thread
From: David Howells @ 2023-06-01 17:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Jakub Kicinski, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > However, this might well cause a malfunction in UDP, for example.
> > MSG_MORE corks the current packet, so if I ask sendfile() say shove 32K
> > into a packet, if, say, 16K is read from the source and entirely
> > transcribed into the packet,
> 
> If you use splice() for UDP, I don't think you would normally expect
> to get all that well-defined packet boundaries.

Actually, it will.  Attempting to overfill a UDP packet with splice will get
you -EMSGSIZE.  It won't turn a splice into more than one UDP packet.

I wonder if the right solution actually is to declare that the problem is
userspace's.  If userspace asks us to splice Z amount of data and we can't
manage that because the source dries up prematurely, then behave as though
MSG_MORE was always passed and return a short splice to userspace.  Userspace
can retry the splice/sendfile or do an empty sendmsg() to cap the message (or
cancel it).  Perhaps flushing a short message is actually a *bad* idea.

The answer then might be to make TLS handle a zero-length send() and fix the
test cases.  Would the attached changes then work for you?

David
---
diff --git a/fs/splice.c b/fs/splice.c
index 3e06611d19ae..237688b0700b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -956,13 +956,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	 */
 	bytes = 0;
 	len = sd->total_len;
+
+	/* Don't block on output, we have to drain the direct pipe. */
 	flags = sd->flags;
+	sd->flags &= ~SPLICE_F_NONBLOCK;
 
 	/*
-	 * Don't block on output, we have to drain the direct pipe.
+	 * We signal MORE until we've read sufficient data to fulfill the
+	 * request and we keep signalling it if the caller set it.
 	 */
-	sd->flags &= ~SPLICE_F_NONBLOCK;
 	more = sd->flags & SPLICE_F_MORE;
+	sd->flags |= SPLICE_F_MORE;
 
 	WARN_ON_ONCE(!pipe_empty(pipe->head, pipe->tail));
 
@@ -978,14 +982,12 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		sd->total_len = read_len;
 
 		/*
-		 * If more data is pending, set SPLICE_F_MORE
-		 * If this is the last data and SPLICE_F_MORE was not set
-		 * initially, clears it.
+		 * If we now have sufficient data to fulfill the request then
+		 * we clear SPLICE_F_MORE if it was not set initially.
 		 */
-		if (read_len < len)
-			sd->flags |= SPLICE_F_MORE;
-		else if (!more)
+		if (read_len >= len && !more)
 			sd->flags &= ~SPLICE_F_MORE;
+
 		/*
 		 * NOTE: nonblocking mode only applies to the input. We
 		 * must not do the output in nonblocking mode as then we
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f63e4405cf34..5d48391da16c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -995,6 +995,9 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
 		}
 	}
 
+	if (!msg_data_left(msg) && eor)
+		goto copied;
+
 	while (msg_data_left(msg)) {
 		if (sk->sk_err) {
 			ret = -sk->sk_err;
diff --git a/tools/testing/selftests/net/tls.c b/tools/testing/selftests/net/tls.c
index e699548d4247..7df31583f2a4 100644
--- a/tools/testing/selftests/net/tls.c
+++ b/tools/testing/selftests/net/tls.c
@@ -377,7 +377,7 @@ static void chunked_sendfile(struct __test_metadata *_metadata,
 	char buf[TLS_PAYLOAD_MAX_LEN];
 	uint16_t test_payload_size;
 	int size = 0;
-	int ret;
+	int ret = 0;
 	char filename[] = "/tmp/mytemp.XXXXXX";
 	int fd = mkstemp(filename);
 	off_t offset = 0;
@@ -398,6 +398,9 @@ static void chunked_sendfile(struct __test_metadata *_metadata,
 		size -= ret;
 	}
 
+	if (ret < chunk_size)
+		EXPECT_EQ(send(self->fd, "", 0, 0), 0);
+
 	EXPECT_EQ(recv(self->cfd, buf, test_payload_size, MSG_WAITALL),
 		  test_payload_size);
 


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-01 17:14         ` David Howells
@ 2023-06-02  4:20           ` Jakub Kicinski
  2023-06-02  8:23           ` David Howells
  1 sibling, 0 replies; 69+ messages in thread
From: Jakub Kicinski @ 2023-06-02  4:20 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

On Thu, 01 Jun 2023 18:14:40 +0100 David Howells wrote:
> The answer then might be to make TLS handle a zero-length send()

IDK. Eric added MSG_SENDPAGE_NOTLAST 11 years ago, to work around 
this exact problem. Your refactoring happens to break it and what
you're saying sounds to me more or less like "MSG_SENDPAGE_NOTLAST 
is unnecessary, it's the user's fault".

A bit unconvincing. Maybe Eric would chime in, I'm not too familiar
with the deadly mess of the unchecked sendmsg()/sendpage() flags.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-01 17:14         ` David Howells
  2023-06-02  4:20           ` Jakub Kicinski
@ 2023-06-02  8:23           ` David Howells
  2023-06-02 11:28             ` Linus Torvalds
  2023-06-02 11:44             ` David Howells
  1 sibling, 2 replies; 69+ messages in thread
From: David Howells @ 2023-06-02  8:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: dhowells, Linus Torvalds, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

Jakub Kicinski <kuba@kernel.org> wrote:

> On Thu, 01 Jun 2023 18:14:40 +0100 David Howells wrote:
> > The answer then might be to make TLS handle a zero-length send()
> 
> IDK. Eric added MSG_SENDPAGE_NOTLAST 11 years ago, to work around 
> this exact problem. Your refactoring happens to break it and what
> you're saying sounds to me more or less like "MSG_SENDPAGE_NOTLAST 
> is unnecessary, it's user's fault".
> 
> A bit unconvincing. Maybe Eric would chime in, I'm not too familiar
> with the deadly mess of the unchecked sendmsg()/sendpage() flags.

Not so much the "user's fault" as we couldn't fulfill what the user asked - so
should we leave it to the user to work out how to clean it up rather than
automatically allowing the socket to flush (if cancellation might be an option
instead)?

The problem I have with NOTLAST is that it won't be asserted if the short read
only occupies a single pipe_buf.  We don't know that we won't get some more
data on the next pass.

An alternative way to maintain the current behaviour might be to have
splice_direct_to_actor() call the actor with sd->total_len == 0 if
do_splice_to() returned 0 and SPLICE_F_MORE wasn't set by the caller
(ie. !more).  Attached is a change to do that.  It becomes simpler if/once my
splice_to_socket() patches are applied - but I don't really want to push that
until all the protocols that support sendpage() support sendmsg() +
MSG_SPLICE_PAGES as well[*].

[*] Though if you're okay having a small window where TLS copies data rather
    than splicing, I could push the splice_to_socket() patches *first*.  TCP
    and AF_UNIX splice already support MSG_SPLICE_PAGES so that would bump
    their efficiency.

David
---
diff --git a/fs/splice.c b/fs/splice.c
index 3e06611d19ae..84e9ca06db47 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -643,6 +643,22 @@ static void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_des
 		wakeup_pipe_writers(pipe);
 }
 
+/*
+ * Pass a zero-length record to the splice-write actor with SPLICE_F_MORE
+ * turned off to allow the network to see MSG_MORE deasserted.
+ */
+static ssize_t splice_from_pipe_zero(struct pipe_inode_info *pipe,
+				     struct splice_desc *sd,
+				     splice_actor *actor)
+{
+	struct pipe_buffer buf = {
+		.page	= ZERO_PAGE(0),
+		.ops	= &nosteal_pipe_buf_ops,
+	};
+
+	return actor(pipe, &buf, sd);
+}
+
 /**
  * __splice_from_pipe - splice data from a pipe to given actor
  * @pipe:	pipe to splice from
@@ -662,6 +678,9 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 	int ret;
 
 	splice_from_pipe_begin(sd);
+	if (!sd->total_len)
+		return splice_from_pipe_zero(pipe, sd, actor);
+
 	do {
 		cond_resched();
 		ret = splice_from_pipe_next(pipe, sd);
@@ -956,13 +975,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	 */
 	bytes = 0;
 	len = sd->total_len;
+
+	/* Don't block on output, we have to drain the direct pipe. */
 	flags = sd->flags;
+	sd->flags &= ~SPLICE_F_NONBLOCK;
 
 	/*
-	 * Don't block on output, we have to drain the direct pipe.
+	 * We signal MORE until we've read sufficient data to fulfill the
+	 * request and we keep signalling it if the caller set it.
 	 */
-	sd->flags &= ~SPLICE_F_NONBLOCK;
 	more = sd->flags & SPLICE_F_MORE;
+	sd->flags |= SPLICE_F_MORE;
 
 	WARN_ON_ONCE(!pipe_empty(pipe->head, pipe->tail));
 
@@ -971,21 +994,19 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		loff_t pos = sd->pos, prev_pos = pos;
 
 		ret = do_splice_to(in, &pos, pipe, len, flags);
-		if (unlikely(ret <= 0))
+		if (unlikely(ret < 0))
 			goto out_release;
 
 		read_len = ret;
 		sd->total_len = read_len;
 
 		/*
-		 * If more data is pending, set SPLICE_F_MORE
-		 * If this is the last data and SPLICE_F_MORE was not set
-		 * initially, clears it.
+		 * If we now have sufficient data to fulfill the request then
+		 * we clear SPLICE_F_MORE if it was not set initially.
 		 */
-		if (read_len < len)
-			sd->flags |= SPLICE_F_MORE;
-		else if (!more)
+		if ((read_len == 0 || read_len >= len) && !more)
 			sd->flags &= ~SPLICE_F_MORE;
+
 		/*
 		 * NOTE: nonblocking mode only applies to the input. We
 		 * must not do the output in nonblocking mode as then we
@@ -1005,6 +1026,8 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 			sd->pos = prev_pos + ret;
 			goto out_release;
 		}
+		if (read_len < 0)
+			goto out_release;
 	}
 
 done:
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f63e4405cf34..5d48391da16c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -995,6 +995,9 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
 		}
 	}
 
+	if (!msg_data_left(msg) && eor)
+		goto copied;
+
 	while (msg_data_left(msg)) {
 		if (sk->sk_err) {
 			ret = -sk->sk_err;


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-02  8:23           ` David Howells
@ 2023-06-02 11:28             ` Linus Torvalds
  2023-06-02 11:44             ` David Howells
  1 sibling, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2023-06-02 11:28 UTC (permalink / raw)
  To: David Howells
  Cc: Jakub Kicinski, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

On Fri, Jun 2, 2023 at 4:23 AM David Howells <dhowells@redhat.com> wrote:
>
> +/*
> + * Pass a zero-length record to the splice-write actor with SPLICE_F_MORE
> + * turned off to allow the network to see MSG_MORE deasserted.
> + */
> +static ssize_t splice_from_pipe_zero(struct pipe_inode_info *pipe,

David, STOP.

Stop doing these crazy hacks to generic splice code. None of this makes sense.

Those zero-sized splices have never worked, don't try to make them work.

The splice() system call has always had that

        if (!len)
                return 0;

since being introduced. We're not adding any crazy stuff now.

Do what I already suggested: making SPLICE_F_MORE reflect reality.

Or don't do anything at all. Because this kind of hacky garbage is not
even amusing.

               Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-02  8:23           ` David Howells
  2023-06-02 11:28             ` Linus Torvalds
@ 2023-06-02 11:44             ` David Howells
  2023-06-02 12:11               ` Linus Torvalds
  1 sibling, 1 reply; 69+ messages in thread
From: David Howells @ 2023-06-02 11:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Jakub Kicinski, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Do what I already suggested: making SPLICE_F_MORE reflect reality.

I'm trying to.  I need MSG_MORE to behave sensibly for what I want.

What I have signals SPLICE_F_MORE (and thus MSG_MORE) as long as we haven't
yet read enough data to fulfill the request - and will break out of the loop
if we get a zero-length read.

But this causes a change in behaviour because we then leave the protocol
having seen MSG_MORE set where it didn't previously see that.

This causes "tls -r tls.12_aes_gcm.multi_chunk_sendfile" on the TLS kselftest
to fail.

Now, if we're fine with the change in behaviour, I can make the selftest
observe the short sendfile() and cancel MSG_MORE itself - but that's just a
test program.

So that's the question: Do I have to maintain the current behaviour for the
short-splice case?

David


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-02 11:44             ` David Howells
@ 2023-06-02 12:11               ` Linus Torvalds
  2023-06-02 16:39                 ` Jakub Kicinski
  0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2023-06-02 12:11 UTC (permalink / raw)
  To: David Howells
  Cc: Jakub Kicinski, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

On Fri, Jun 2, 2023 at 7:45 AM David Howells <dhowells@redhat.com> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Do what I already suggested: making SPLICE_F_MORE reflect reality.
>
> I'm trying to.  I need MSG_MORE to behave sensibly for what I want.

But you need to stop doing these random hacks to fs/splice.c

The point is, you *CANNOT* make SPLICE_F_MORE reflect reality by
hacking fs/splice.c. Really. The generic layer DOES NOT KNOW, AND
FUNDAMENTALLY CANNOT KNOW if there is more data to be had.

So any of these random patches that try to add heuristics to
fs/splice.c will be rejected out of hand. They simply cannot be
correct.

And no, on the whole I do not believe you have to maintain some
selftest. A selftest failure is worrisome in that it clearly shows
that some behavior changed, but the situation here is

 (a) the current behavior is arguably bad and buggy

 (b) if we want to fix that bug, then the current behavior *will* change

Now, the only question then is whether the self-test actually tests
anything that user space actually depends on, or if it just tests some
random corner case.

So the self-test is certainly a red flag, but not necessarily a very
meaningful one.

It shows that some user-visible change happened, which is always a big
danger flag, but after all that was the whole *point* of the whole
exercise.  The fact that the self-test caught the change is good,
because it means we had test coverage, but when the behavior is
something we *want* to change, the test failure is not a problem in
itself.

So what I think you should do is to fix the bug right, with a clean
patch, and no crazy hacks. That is something we can then apply and
test. All the while knowing full well that "uhhuh, this is a visible
change, we may have to revert it".

If then some *real* load ends up showing a regression, we may just be
screwed. Our current behavior may be buggy, but we have the rule that
once user space depends on kernel bugs, they become features pretty
much by definition, however much we might dislike it.

At that point, we'll have to see what we can do - if anything.

Basically, what I think the SPLICE_F_MORE rules *should* be (and hey,
I may be missing something) is

 1) if the user set that bit in the flags, then it's always true.

    The user basically told us "I will supply more data even after the
splice has finished", so it doesn't matter if the kernel runs out of
data in the middle.

 2) if the splice read side sees "I was asked for N bytes, but I could
only supply X bytes and I still have more to give", when we should set
SPLICE_F_MORE internally ("temporarily") for the next splice call.

    This is basically the "kernel independently knows that there will
be more data" case.

 3) In the end, this is all "best effort" and to some degree
inevitably a heuristic. We cannot see the future. We may hit that case
#2 and set the "there will be more data" bit, but then get a signal
and finish the splice system call before that more data actually
happens.

    Now, presumably the user will then continue the partial splice
after handling the signal, so (3) is still "right", but obviously we
can't _know_ that.

A corollary to (3) is that the reader side may not always know if
there will be more data to be read. For a file source, it's fairly
clear (modulo the obvious caveats - files can be truncated etc etc).
For other splice sources, the "I still have more to give" may not be
as unambiguous.  It is what it is.

Am I missing some important case? Considering that we clearly do *not*
do a great job at  SPLICE_F_MORE right now, I'd really want the
situation to be either that we just make the code "ClearlyCorrect(tm)"
and simple, or we just leave it alone as "that's our odd behavior,
deal with it".

None of this "let's change this all to be even more complex, and
handle some particular special case the way I want" crap.

Do it right, or don't do it at all.

             Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-02 12:11               ` Linus Torvalds
@ 2023-06-02 16:39                 ` Jakub Kicinski
  2023-06-02 16:53                   ` Linus Torvalds
  0 siblings, 1 reply; 69+ messages in thread
From: Jakub Kicinski @ 2023-06-02 16:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

On Fri, 2 Jun 2023 08:11:47 -0400 Linus Torvalds wrote:
> If then some *real* load ends up showing a regression, we may just be
> screwed. Our current behavior may be buggy, but we have the rule that
> once user space depends on kernel bugs, they become features pretty
> much by definition, however much we might dislike it.
> 
> At that point, we'll have to see what we can do - if anything.

Can we have a provisional plan of how we'll fix it if someone does
complain? We can't just revert David's work, and if none of the
solutions are appealing - socket implementations may be left holding
the bag.

I dislike the magic zero sends, and I think you do, too. In the case of TLS
it's unclear whether we should generate an empty record (like UDP would).

Can we add an optional splice_end / short_splice / splice_underflow /
splice_I_did_not_mean_to_set_more_on_the_previous_call_sorry callback
to struct file_operations?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-02 16:39                 ` Jakub Kicinski
@ 2023-06-02 16:53                   ` Linus Torvalds
  2023-06-02 17:05                     ` Linus Torvalds
                                       ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Linus Torvalds @ 2023-06-02 16:53 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Howells, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

On Fri, Jun 2, 2023 at 12:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Can we add an optional splice_end / short_splice / splice_underflow /
> splice_I_did_not_mean_to_set_more_on_the_previous_call_sorry callback
> to struct file_operations?

A splice_end() operation might well be the simplest model, but I think
it's broken.

It would certainly be easy to implement: file descriptors that don't
care about SPLICE_F_MORE - so most of them - would just leave it as
NULL, and the splice code could decide to call it *if* it had left the
last splice with SPLICE_F_MORE, _and_ the user hadn't set it, and the
file descriptor wants that information.

But I think one of the problems here is one of "what the hell is the
meaning of that bit"?

In particular, think about what happens if a signal is pending, and we
return with a partially completed write? There potentially *is* more
data to be sent, it's just not sent by *this* splice() call, as user
space has to handle the signal first.

What is the semantics of SPLICE_F_MORE in that kind of situation?

Which is why I really think that it would be *so* much better if we
really let the whole SPLICE_F_MORE bit be a signal from the *input*
side.

I know I've been harping on this, but just from a "sane semantics"
standpoint, I really think the only thing that *really* makes sense is
for the input side of a splice to say "I gave you X amount of data,
but I have more to give".

And that would *literally* be the semantic meaning of that SPLICE_F_MORE bit.

Wouldn't it be lovely to have some actual documented meaning to it,
which does *not* depend on things like ".. but what if a signal
happens" issues?

And yes, it's entirely possible that I'm missing something, and I'm
misunderstanding what people really want, but I do feel like this is a
somewhat subtle area, and if people really care about the exact
semantics of SPLICE_F_MORE, then we need to *have* exact semantics for
it.

And no, I don't think "splice_end()" can be that exact semantics -
even if it's simple - exactly because splice() is an interruptible
operation, so the "end" of a splice() is simply not a stable thing.

I also do wonder how much we care. What are the situations where the
packet boundaries can really matter in the actual real world? Exactly
because I'm not 100% convinced we've had super-stable behavior here.

The fact that a test-case never triggers signal handling in the middle
of a splice() call isn't exactly a huge surprise. The test case
probably doesn't *have* signals. But it just means that the test-case
isn't all that real-life.

              Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-02 16:53                   ` Linus Torvalds
@ 2023-06-02 17:05                     ` Linus Torvalds
  2023-06-02 17:38                       ` Jakub Kicinski
  2023-06-02 20:38                     ` David Howells
  2023-06-02 20:50                     ` David Howells
  2 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2023-06-02 17:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Howells, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

On Fri, Jun 2, 2023 at 12:53 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And no, I don't think "splice_end()" can be that exact semantics -
> even if it's simple - exactly because splice() is an interruptible
> operation, so the "end" of a splice() is simply not a stable thing.

Just to harp some more on this - if SPLICE_F_MORE is seen as purely a
performance hint, with no real semantic value, and will still set
random packet boundaries but we want big packets for all the _usual_
cases, then I think something like "splice_end()" can be a fine
solution regardless of exact semantics.

Alternatively, if we make it the rule that "splice_end()" is only
called on EOF situations - so signals etc do not matter - then the
semantics would be stable and sound fine to me too.

In that second case, I'd like to literally name it that way, and
actually call it "splice_eof()". Because I'd like to really make it
very clear what the semantics would be.

So a "splice_eof()" sounds fine to me, and we'd make the semantics be
the current behavior:

 - splice() sets SPLICE_F_MORE if 'len > read_len'

 - splice() _clears_ SPLICE_F_MORE if we have hit 'len'

 - splice always sets SPLICE_F_MORE if it was passed by the user

BUT with the small new 'splice_eof()' rule that:

 - if the user did *not* set SPLICE_F_MORE *and* we didn't hit that
"use all of len" case that cleared SPLICE_F_MORE, *and* we did a
"->splice_in()" that returned EOF (ie zero), *then* we will also do
that ->splice_eof() call.

The above sounds like "stable and possibly useful semantics" to me. It
would just have to be documented.
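
Expressed as a standalone model - not the actual fs/splice.c code, and
with a made-up helper name purely to pin the rules down - it would be
something like:

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SPLICE_F_MORE 0x04	/* value as in <linux/splice.h> */

/* Work out the flags to pass to the output actor for one chunk of a
 * larger splice request, and whether a ->splice_eof() call is due. */
unsigned int more_flags(unsigned int user_flags, size_t requested,
			size_t total_read, size_t read_len, bool *call_eof)
{
	unsigned int flags = user_flags;

	*call_eof = false;

	if (read_len == 0) {
		/* Input hit EOF early: if the user didn't ask for
		 * SPLICE_F_MORE and we hadn't yet consumed all of 'len',
		 * tell the output side the record is complete. */
		if (!(user_flags & SPLICE_F_MORE) && total_read < requested)
			*call_eof = true;
		return flags;
	}

	if (total_read + read_len < requested)
		flags |= SPLICE_F_MORE;		/* more is still expected */
	else if (!(user_flags & SPLICE_F_MORE))
		flags &= ~SPLICE_F_MORE;	/* hit 'len': clear it */

	return flags;
}

int main(void)
{
	bool eof;

	/* 64K asked for, 16K consumed, another 16K arrives: MORE is set. */
	assert(more_flags(0, 65536, 16384, 16384, &eof) & SPLICE_F_MORE);
	/* A zero-length read before 'len' is reached: ->splice_eof() is due. */
	more_flags(0, 65536, 32768, 0, &eof);
	assert(eof);
	return 0;
}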

Is that what people want?

I don't think it's conceptually any different from my suggestion of
saying "->splice_read() can set SPLICE_F_MORE if it has more to give",
just a different implementation that doesn't require changes on the
splice_read() side.

                 Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-02 17:05                     ` Linus Torvalds
@ 2023-06-02 17:38                       ` Jakub Kicinski
  0 siblings, 0 replies; 69+ messages in thread
From: Jakub Kicinski @ 2023-06-02 17:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

On Fri, 2 Jun 2023 13:05:14 -0400 Linus Torvalds wrote:
> Just to harp some more on this - if SPLICE_F_MORE is seen as purely a
> performance hint, with no real semantic value, and will still set
> random packet boundaries but we want big packets for all the _usual_
> cases, then I think something like "splice_end()" can be a fine
> solution regardless of exact semantics.
> 
> Alternatively, if we make it the rule that "splice_end()" is only
> called on EOF situations - so signals etc do not matter - then the
> semantics would be stable and sound fine to me too.
> 
> In that second case, I'd like to literally name it that way, and
> actually call it "splice_eof()". Because I'd like to really make it
> very clear what the semantics would be.
> 
> So a "splice_eof()" sounds fine to me, and we'd make the semantics be
> the current behavior:
> 
>  - splice() sets SPLICE_F_MORE if 'len > read_len'
> 
>  - splice() _clears_ SPLICE_F_MORE if we have hit 'len'
> 
>  - splice always sets SPLICE_F_MORE if it was passed by the user
> 
> BUT with the small new 'splice_eof()' rule that:
> 
>  - if the user did *not* set SPLICE_F_MORE *and* we didn't hit that
> "use all of len" case that cleared SPLICE_F_MORE, *and* we did a
> "->splice_in()" that returned EOF (ie zero), *then* we will also do
> that ->splice_eof() call.
> 
> The above sounds like "stable and possibly useful semantics" to me. It
> would just have to be documented.
> 
> Is that what people want?

->splice_eof() with the proposed semantics sounds perfect for the cases
testers complained about in the past, IMHO. We can pencil that in as the
contingency plan. Actually I like these semantics so much I'm tempted to
ask David to implement it already and save users potential debugging :D

> I don't think it's conceptually any different from my suggestion of
> saying "->splice_read() can set SPLICE_F_MORE if it has more to give",
> just a different implementation that doesn't require changes on the
> splice_read() side.

Setting SPLICE_F_MORE from the input side does feel much cleaner than
guessing in splice.c. But we may end up needing the eof() callback for 
the corner cases, anyway :(

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-02 16:53                   ` Linus Torvalds
  2023-06-02 17:05                     ` Linus Torvalds
@ 2023-06-02 20:38                     ` David Howells
  2023-06-02 20:50                     ` David Howells
  2 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-06-02 20:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Jakub Kicinski, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So a "splice_eof()" sounds fine to me, and we'd make the semantics be
> the current behavior:
> 
>  - splice() sets SPLICE_F_MORE if 'len > read_len'
> 
>  - splice() _clears_ SPLICE_F_MORE if we have hit 'len'
> 
>  - splice always sets SPLICE_F_MORE if it was passed by the user
> 
> BUT with the small new 'splice_eof()' rule that:
> 
>  - if the user did *not* set SPLICE_F_MORE *and* we didn't hit that
> "use all of len" case that cleared SPLICE_F_MORE, *and* we did a
> "->splice_in()" that returned EOF (ie zero), *then* we will also do
> that ->splice_eof() call.
> 
> The above sounds like "stable and possibly useful semantics" to me. It
> would just have to be documented.
> 
> Is that what people want?

That's easier to implement, I think.  That's basically what I was trying to
achieve by sending a zero-length actor call, but this is a cleaner way of
doing it, particularly if it's added as a socket op next to ->sendmsg().

Otherwise I have to build up the input side to try and tell me in advance
whether it thinks we hit an EOF/hole/whatever condition.  The problem is that,
as previously mentioned, it doesn't work for all circumstances - seqfile,
pipes, sockets for instance.

Take the following scenario for example: I could read from a TCP socket,
filling up the pipe-buffer, but not with sufficient data to fulfill the
operation.  Say I drain the TCP socket, but it's still open, so might produce
more data.  I then call the actor, which passes all the data to sendmsg() with
MSG_SPLICE_PAGES and MSG_MORE and clears the buffer.  I then go round again,
but in the meantime, the source socket got shut down with no further data
available and do_splice_to() returns 0.

There's no way to predict this, so having a ->splice_eof() call would handle
this situation.

David


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-02 16:53                   ` Linus Torvalds
  2023-06-02 17:05                     ` Linus Torvalds
  2023-06-02 20:38                     ` David Howells
@ 2023-06-02 20:50                     ` David Howells
  2 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-06-02 20:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Jakub Kicinski, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> BUT with the small new 'splice_eof()' rule that:
> 
>  - if the user did *not* set SPLICE_F_MORE *and* we didn't hit that
> "use all of len" case that cleared SPLICE_F_MORE, *and* we did a
> "->splice_in()" that returned EOF (ie zero), *then* we will also do
> that ->splice_eof() call.

But not if we didn't splice anything at all yet?

David


^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: Bug in short splice to socket?
  2023-06-01 15:12         ` Linus Torvalds
@ 2023-06-05 11:03           ` David Laight
  2023-06-05 15:52           ` David Howells
  1 sibling, 0 replies; 69+ messages in thread
From: David Laight @ 2023-06-05 11:03 UTC (permalink / raw)
  To: 'Linus Torvalds', David Howells
  Cc: Jakub Kicinski, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

From: Linus Torvalds
> Sent: 01 June 2023 16:12
> 
> On Thu, Jun 1, 2023 at 10:34 AM David Howells <dhowells@redhat.com> wrote:
> >
> > At the moment, it transcribes 16 pages at a time.  I could make it set
> > MSG_MORE only if (a) SPLICE_F_MORE was passed into the splice() syscall or (b)
> > there's yet more data in the buffer.
> 
> That would at least be a good first step.
> 
> > However, this might well cause a malfunction in UDP, for example.  MSG_MORE
> > corks the current packet, so if I ask sendfile() say shove 32K into a packet,
> > if, say, 16K is read from the source and entirely transcribed into the packet,
> 
> If you use splice() for UDP, I don't think you would normally expect
> to get all that well-defined packet boundaries.

Especially since (assuming I've understood other bits of this thread)
the splice() can get split into multiple sendmsg() calls for other
reasons.

What semantics are you trying to implement for AF_TLS?
MSG_MORE has different effects on different protocols.

For UDP the next data is appended to the datagram being built.
(This is really pretty pointless, doing it in the caller will be faster!)

For TCP it stops the pending data being sent immediately.
And more data is appended.
I'm pretty sure it gets sent on timeout.

For SCTP the data chunk created for the sendmsg() isn't sent immediately.
Any more sendmsg(MSG_MORE) get queued until a full ethernet packet
is buffered.
The pending data is sent on timeout.
This is pretty much the only way to get two (or more) DATA chunks
into an ethernet frame when Nagle is disabled.

But I get the impression AF_TLS is deciding not to encode/send
the data because 'there isn't enough'.
That seems wrong.

Note that you can't use a zero length sendmsg() to flush pending
data - if there is no pending data some protocols will send a 
zero length data message.
A socket option/ioctl (eg UNCORK) could be (ab)used to force
queued data be sent.
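
As a minimal user-space illustration of the TCP case (send_record() is
just a made-up helper, nothing kernel-side):

#include <stddef.h>
#include <sys/socket.h>

/* The header is held back by MSG_MORE and coalesced with the payload;
 * the second send(), without MSG_MORE, lets the data go out. */
int send_record(int sock, const void *hdr, size_t hdr_len,
		const void *body, size_t body_len)
{
	if (send(sock, hdr, hdr_len, MSG_MORE) < 0)
		return -1;
	return send(sock, body, body_len, 0) < 0 ? -1 : 0;
}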

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Bug in short splice to socket?
  2023-06-01 15:12         ` Linus Torvalds
  2023-06-05 11:03           ` David Laight
@ 2023-06-05 15:52           ` David Howells
  1 sibling, 0 replies; 69+ messages in thread
From: David Howells @ 2023-06-05 15:52 UTC (permalink / raw)
  To: David Laight
  Cc: dhowells, 'Linus Torvalds',
	Jakub Kicinski, netdev, David S. Miller, Eric Dumazet,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Chuck Lever, Boris Pismenny,
	John Fastabend, Christoph Hellwig

David Laight <David.Laight@ACULAB.COM> wrote:

> > > However, this might well cause a malfunction in UDP, for example.
> > > MSG_MORE corks the current packet, so if I ask sendfile() say shove 32K
> > > into a packet, if, say, 16K is read from the source and entirely
> > > transcribed into the packet,
> > 
> > If you use splice() for UDP, I don't think you would normally expect
> > to get all that well-defined packet boundaries.
> 
> Especially since (assuming I've understood other bits of this thread)
> the splice() can get split into multiple sendmsg() calls for other
> reasons.

Yes - with SPLICE_F_MORE/MSG_MORE set on all but the last piece.  The issue is
what happens if the input side gets a premature EOF after we've passed a chunk
with MSG_MORE set when the caller didn't indicate SPLICE_F_MORE?

> What semantics are you trying to implement for AF_TLS?

As I understand it, deasserting MSG_MORE causes a record boundary to be
interposed on TLS.

> MSG_MORE has different effects on different protocols.

Do you mean "different protocols" in relation to TLS specifically? Software vs
device vs device-specific like Chelsio-TLS?

> For UDP the next data is appended to the datagram being built.
> (This is really pretty pointless, doing it in the caller will be faster!)

Splice with SPLICE_F_MORE seems to work the same as sendmsg with MSG_MORE
here.  You get an error if you try to append with splice or sendmsg more than
a single packet will hold.
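
For instance, something like this from user space (append_chunk() is just
a made-up helper; the error shows up as EMSGSIZE on whichever send would
overflow the datagram):

#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* With MSG_MORE each send appends to the same pending UDP datagram; once
 * the appended data would exceed the 64K datagram limit, the send fails
 * with EMSGSIZE rather than starting a new packet. */
int append_chunk(int udp_sock, const void *buf, size_t len, int more)
{
	ssize_t n = send(udp_sock, buf, len, more ? MSG_MORE : 0);

	if (n < 0 && errno == EMSGSIZE)
		return -1;	/* datagram is full */
	return n < 0 ? -1 : 0;
}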

> For TCP it stops the pending data being sent immediately.
> And more data is appended.
> I'm pretty sure it gets sent on timeout.

Yeah - corking is used by some network filesystem protocols, presumably to
better place RPC messages into TCP packets.

> For SCTP the data chunk created for the sendmsg() isn't sent immediately.
> Any more sendmsg(MSG_MORE) get queued until a full ethernet packet
> is buffered.
> The pending data is sent on timeout.
> This is pretty much the only way to get two (or more) DATA chunks
> into an ethernet frame when Nagle is disabled.

SCTP doesn't support sendpage, so that's not an issue.

> But I get the impression AF_TLS is deciding not to encode/send
> the data because 'there isn't enough'.
> That seems wrong.
> 
> Note that you can't use a zero length sendmsg() to flush pending
> data - if there is no pending data some protocols will send a 
> zero length data message.
> A socket option/ioctl (eg UNCORK) could be (ab)used to force
> queued data be sent.

Yeah - I've changed that, see v4.  I've implemented Linus's ->splice_eof()
idea.

David


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios
  2023-05-26 11:56     ` Yunsheng Lin
@ 2023-06-06  8:25       ` David Howells
  0 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-06-06  8:25 UTC (permalink / raw)
  To: Alexander H Duyck
  Cc: Yunsheng Lin, David Howells, netdev, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Willem de Bruijn,
	David Ahern, Matthew Wilcox, Jens Axboe, linux-mm, linux-kernel,
	Jeroen de Borst, Catherine Sullivan, Shailend Chand,
	Felix Fietkau, John Crispin, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme

Alexander H Duyck <alexander.duyck@gmail.com> wrote:

> Also I have some concerns about going from page to folio as it seems
> like the folio_alloc setups the transparent hugepage destructor instead
> of using the compound page destructor. I would think that would slow
> down most users as it looks like there is a spinlock that is taken in
> the hugepage destructor that isn't there in the compound page
> destructor.

Note that this code is going to have to move to folios[*] at some point.
"Old-style" compound pages are going to go away, I believe.  Matthew Wilcox
and the mm folks are on a drive towards simplifying memory management,
formalising chunks larger than a single page - with the ultimate aim of
reducing the page struct to a single, typed pointer.

So, take, for example, a folio: As I understand it, this will no longer
overlay struct page, but rather will become a single, dynamically-allocated
struct that covers a pow-of-2 number of pages.  A contiguous subset of page
structs will point at it.

However, rather than using a folio, we could define a "page fragment" memory
type.  Rather than having all the flags and fields to be found in struct
folio, it could have just the set to be found in page_frag_cache.

David

[*] It will be possible to have some other type than "folio".  See "struct
slab" in mm/slab.h for example.  struct slab corresponds to a set of pages
and, in the future, a number of struct pages will point at it.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios
  2023-06-06  8:25       ` David Howells
@ 2023-06-06 14:59         ` Alexander Duyck
  0 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2023-06-06 14:59 UTC (permalink / raw)
  To: David Howells
  Cc: Yunsheng Lin, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Willem de Bruijn, David Ahern,
	Matthew Wilcox, Jens Axboe, linux-mm, linux-kernel,
	Jeroen de Borst, Catherine Sullivan, Shailend Chand,
	Felix Fietkau, John Crispin, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme

On Tue, Jun 6, 2023 at 1:25 AM David Howells <dhowells@redhat.com> wrote:
>
> Alexander H Duyck <alexander.duyck@gmail.com> wrote:
>
> > Also I have some concerns about going from page to folio as it seems
> > like the folio_alloc setups the transparent hugepage destructor instead
> > of using the compound page destructor. I would think that would slow
> > down most users as it looks like there is a spinlock that is taken in
> > the hugepage destructor that isn't there in the compound page
> > destructor.
>
> Note that this code is going to have to move to folios[*] at some point.
> "Old-style" compound pages are going to go away, I believe.  Matthew Wilcox
> and the mm folks are on a drive towards simplifying memory management,
> formalising chunks larger than a single page - with the ultimate aim of
> reducing the page struct to a single, typed pointer.

I'm not against making the move, but as others have pointed out this
is getting into unrelated things. One of those is the fact that, to
transition to using folios, we don't need to get rid of the use of the
virtual address. The idea behind using the virtual address here is
that we can avoid a bunch of address translation overhead since we
only need to use the folio if we are going to allocate, retire, or
recycle a page/folio. If we are using an order 3 page that shouldn't
be very often.

> So, take, for example, a folio: As I understand it, this will no longer
> overlay struct page, but rather will become a single, dynamically-allocated
> struct that covers a pow-of-2 number of pages.  A contiguous subset of page
> structs will point at it.
>
> However, rather than using a folio, we could define a "page fragment" memory
> type.  Rather than having all the flags and fields to be found in struct
> folio, it could have just the set to be found in page_frag_cache.

I don't think we need a new memory type. For the most part the page
fragment code is really more a subset of something like
__get_free_pages: the requester provides the size and is just given
a virtual address, and we shouldn't need to allocate a new page very
often since ideally the allocations are 2K or less in size.

Also one thing I would want to avoid is adding complexity to the
freeing path. The general idea with page frags is that they are meant
to be lightweight in terms of freeing as well. So just as they are
similar to __get_free_pages in terms of allocation the freeing is
meant to be similar to free_pages.
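
Roughly the usage pattern I mean - just a sketch with made-up driver-side
names, kernel context assumed, built on the existing page_frag_alloc()
and page_frag_free() calls:

#include <linux/gfp.h>
#include <linux/mm_types.h>

static struct page_frag_cache rx_frag_cache;

/* Allocation hands back a virtual address inside the cache's current
 * page; a fresh (order-3 if available) page is only taken when the
 * current one is exhausted. */
static void *alloc_rx_buf(unsigned int size)
{
	return page_frag_alloc(&rx_frag_cache, size, GFP_ATOMIC);
}

/* Freeing just drops a reference; the backing page goes back to the page
 * allocator when the last fragment carved from it is freed. */
static void free_rx_buf(void *buf)
{
	page_frag_free(buf);
}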

> David
>
> [*] It will be possible to have some other type than "folio".  See "struct
> slab" in mm/slab.h for example.  struct slab corresponds to a set of pages
> and, in the future, a number of struct pages will point at it.

I want to avoid getting anywhere near the complexity of a slab
allocator. The whole point of this was to keep it simple so that
drivers could use it and get decent performance. When I had
implemented it in the Intel drivers back in the day this approach was
essentially just a reference count/page offset hack that allowed us to
split a page in 2 and use the pages as a sort of mobius strip within
the ring buffer.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 03/12] mm: Make the page_frag_cache allocator alignment param a pow-of-2
  2023-05-24 15:33   ` David Howells
@ 2023-06-16 15:28     ` David Howells
  0 siblings, 0 replies; 69+ messages in thread
From: David Howells @ 2023-06-16 15:28 UTC (permalink / raw)
  To: Alexander H Duyck
  Cc: dhowells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, David Ahern, Matthew Wilcox,
	Jens Axboe, linux-mm, linux-kernel, Jeroen de Borst,
	Catherine Sullivan, Shailend Chand, Felix Fietkau, John Crispin,
	Sean Wang, Mark Lee, Lorenzo Bianconi, Matthias Brugger,
	AngeloGioacchino Del Regno, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Andrew Morton, linux-arm-kernel, linux-mediatek, linux-nvme

Alexander H Duyck <alexander.duyck@gmail.com> wrote:

> The requirement should only be cache alignment, not power of 2
> alignment.

Sure, but, upstream, page_frag_alloc_align() allows the specification of an
alignment and applies that alignment by:

	offset &= align_mask;

which doesn't really make sense unless the alignment boils down to being a
power of two.  Now, it might make sense to kill off the align_mask parameter
and just go with SMP_CACHE_BYTES (which will be a power of two).

Note, though, that most users seem to use an align_mask of ~0u which feels a
bit dodgy (it's not an unsigned long), but is probably fine.
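
As a quick standalone illustration of why the mask trick needs a power
of two (align_down() is just a toy helper here, not kernel code):

#include <assert.h>

/* offset & ~(align - 1) only rounds down to a multiple of 'align' when
 * 'align' is a power of two (where it equals offset & -align). */
static unsigned int align_down(unsigned int offset, unsigned int align)
{
	return offset & ~(align - 1u);
}

int main(void)
{
	assert(align_down(1000, 64) == 960);	/* 64 is 2^6: rounds correctly */
	assert(align_down(1000, 24) == 1000);	/* 24 isn't: 1000 & ~23 is still
						 * 1000, not a multiple of 24 */
	return 0;
}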

David


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 03/12] mm: Make the page_frag_cache allocator alignment param a pow-of-2
  2023-06-16 15:28     ` David Howells
@ 2023-06-16 16:06       ` Alexander Duyck
  0 siblings, 0 replies; 69+ messages in thread
From: Alexander Duyck @ 2023-06-16 16:06 UTC (permalink / raw)
  To: David Howells
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Jeroen de Borst, Catherine Sullivan,
	Shailend Chand, Felix Fietkau, John Crispin, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme

On Fri, Jun 16, 2023 at 8:28 AM David Howells <dhowells@redhat.com> wrote:
>
> Alexander H Duyck <alexander.duyck@gmail.com> wrote:
>
> > The requirement should only be cache alignment, not power of 2
> > alignment.
>
> Sure, but, upstream, page_frag_alloc_align() allows the specification of an
> alignment and applies that alignment by:
>
>         offset &= align_mask;
>
> which doesn't really make sense unless the alignment boils down to being a
> power of two.  Now, it might make sense to kill off the align_mask parameter
> and just go with SMP_CACHE_BYTES (which will be a power of two).
>
> Note, though, that most users seem to use an align_mask of ~0u which feels a
> bit dodgy (it's not an unsigned long), but is probably fine.

Yeah, now that I think about it this should be fine. I must have been
responding to this without enough sleep. When I originally read it I
saw it as limiting it to a power of 2 allocation, not limiting the
alignment mask to a power of 2.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH net-next 03/12] mm: Make the page_frag_cache allocator alignment param a pow-of-2
  2023-05-27 15:54     ` Alexander H Duyck
@ 2023-11-30  9:00       ` Yunsheng Lin
  0 siblings, 0 replies; 69+ messages in thread
From: Yunsheng Lin @ 2023-11-30  9:00 UTC (permalink / raw)
  To: Alexander H Duyck, David Howells, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Willem de Bruijn, David Ahern, Matthew Wilcox, Jens Axboe,
	linux-mm, linux-kernel, Jeroen de Borst, Catherine Sullivan,
	Shailend Chand, Felix Fietkau, John Crispin, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme

On 2023/5/27 23:54, Alexander H Duyck wrote:
> On Wed, 2023-05-24 at 16:33 +0100, David Howells wrote:
>> Make the page_frag_cache allocator's alignment parameter a power of 2
>> rather than a mask and give a warning if it isn't.
>>
>> This means that it's consistent with {napi,netdev}_alloc_frag_align() and
>> allows __{napi,netdev}_alloc_frag_align() to be removed.

I am trying to remove the page frag implementation in
vhost_net_page_frag_refill() by using page_frag_alloc_align(), and
I ended up with a similar patch to this one.

>>
> 
> This goes against the original intention of these functions. One of the
> reasons why this is being used is because when somebody enables
> something like 2K jumbo frames they don't necessarily want to have to
> allocate 4K SLABs. Instead they can just add a bit of overhead and get
> almost twice the utilization out of an order 3 page.
> 
> The requirement should only be cache alignment, not power of 2
> alignment. This isn't meant to be a slab allocator. We are just
> sectioning up pages to handle mixed workloads. In the case of
> networking we can end up getting everything from 60B packets, to 1514B
> in the standard cases. That was why we started sectioning up pages in
> the first place so putting a power of 2 requirement on it doens't fit
> our use case at all and is what we were trying to get away from with
> the SLAB allocators.

It seems that the is_power_of_2() check in this patch does not exclude
the non-aligned case if we pass 'align' as 1, which means we can still
support the 'everything from 60B packets, to 1514B' case.
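
A trivial standalone check of that (mirroring the kernel's
is_power_of_2() test locally, just to illustrate):

#include <stdbool.h>
#include <stdio.h>

/* Same test as the kernel's is_power_of_2() helper. */
static bool is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

int main(void)
{
	/* align == 1 is accepted... */
	printf("is_power_of_2(1) = %d\n", is_power_of_2(1));	/* prints 1 */
	/* ...and imposes no alignment at all: the mask ~(1 - 1) is ~0. */
	printf("1234 & ~(1 - 1) = %u\n", 1234u & ~(1u - 1u));	/* prints 1234 */
	return 0;
}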

> 
> .
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2023-11-30  9:01 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
2023-05-24 15:33 ` [PATCH net-next 01/12] mm: Move the page fragment allocator from page_alloc.c into its own file David Howells
2023-05-24 15:33 ` [PATCH net-next 02/12] mm: Provide a page_frag_cache allocator cleanup function David Howells
2023-05-24 15:33   ` David Howells
2023-05-24 15:33 ` [PATCH net-next 03/12] mm: Make the page_frag_cache allocator alignment param a pow-of-2 David Howells
2023-05-24 15:33   ` David Howells
2023-05-27 15:54   ` Alexander H Duyck
2023-05-27 15:54     ` Alexander H Duyck
2023-11-30  9:00     ` Yunsheng Lin
2023-11-30  9:00       ` Yunsheng Lin
2023-06-16 15:28   ` David Howells
2023-06-16 15:28     ` David Howells
2023-06-16 16:06     ` Alexander Duyck
2023-06-16 16:06       ` Alexander Duyck
2023-05-24 15:33 ` [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios David Howells
2023-05-24 15:33   ` David Howells
2023-05-26 11:56   ` Yunsheng Lin
2023-05-26 11:56     ` Yunsheng Lin
2023-05-27 15:47     ` Alexander H Duyck
2023-05-27 15:47       ` Alexander H Duyck
2023-06-06  8:25     ` David Howells
2023-06-06  8:25       ` David Howells
2023-06-06 14:59       ` Alexander Duyck
2023-06-06 14:59         ` Alexander Duyck
2023-05-26 12:47   ` David Howells
2023-05-26 12:47     ` David Howells
2023-05-26 14:06     ` Mika Penttilä
2023-05-26 14:06       ` Mika Penttilä
2023-05-27  0:50   ` Jakub Kicinski
2023-05-27  0:50     ` Jakub Kicinski
2023-05-24 15:33 ` [PATCH net-next 05/12] mm: Make the page_frag_cache allocator handle __GFP_ZERO itself David Howells
2023-05-24 15:33   ` David Howells
2023-05-27  0:57   ` Jakub Kicinski
2023-05-27  0:57     ` Jakub Kicinski
2023-05-27 15:54     ` Alexander Duyck
2023-05-27 15:54       ` Alexander Duyck
2023-05-24 15:33 ` [PATCH net-next 06/12] mm: Make the page_frag_cache allocator use per-cpu David Howells
2023-05-24 15:33   ` David Howells
2023-05-27  1:02   ` Jakub Kicinski
2023-05-27  1:02     ` Jakub Kicinski
2023-05-24 15:33 ` [PATCH net-next 07/12] net: Clean up users of netdev_alloc_cache and napi_frag_cache David Howells
2023-05-24 15:33 ` [PATCH net-next 08/12] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES) David Howells
2023-05-24 15:33 ` [PATCH net-next 09/12] tls/sw: Support MSG_SPLICE_PAGES David Howells
2023-05-27  1:08   ` Jakub Kicinski
2023-05-30 22:26   ` Bug in short splice to socket? David Howells
2023-05-31  0:32     ` Jakub Kicinski
2023-06-01 11:01     ` David Laight
2023-06-01 13:09     ` Linus Torvalds
2023-06-01 13:19       ` Linus Torvalds
2023-06-01 14:34       ` David Howells
2023-06-01 15:12         ` Linus Torvalds
2023-06-05 11:03           ` David Laight
2023-06-05 15:52           ` David Howells
2023-06-01 17:14         ` David Howells
2023-06-02  4:20           ` Jakub Kicinski
2023-06-02  8:23           ` David Howells
2023-06-02 11:28             ` Linus Torvalds
2023-06-02 11:44             ` David Howells
2023-06-02 12:11               ` Linus Torvalds
2023-06-02 16:39                 ` Jakub Kicinski
2023-06-02 16:53                   ` Linus Torvalds
2023-06-02 17:05                     ` Linus Torvalds
2023-06-02 17:38                       ` Jakub Kicinski
2023-06-02 20:38                     ` David Howells
2023-06-02 20:50                     ` David Howells
2023-05-24 15:33 ` [PATCH net-next 10/12] tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES David Howells
2023-05-27  1:13   ` Jakub Kicinski
2023-05-24 15:33 ` [PATCH net-next 11/12] tls/device: Support MSG_SPLICE_PAGES David Howells
2023-05-24 15:33 ` [PATCH net-next 12/12] tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES David Howells
