netdev.vger.kernel.org archive mirror
* [RFC 00/12] net: huge page backed page_pool
@ 2023-07-07 18:39 Jakub Kicinski
  2023-07-07 18:39 ` [RFC 01/12] net: hack together some page sharing Jakub Kicinski
                   ` (13 more replies)
  0 siblings, 14 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

Hi!

This is an "early PoC" at best. It seems to work for a basic
traffic test but there's no uAPI and a lot more general polish
is needed.

The problem we're seeing is that performance of some older NICs
degrades quite a bit when IOMMU is used (in non-passthru mode).
There is a long tail of old NICs deployed, especially in PoPs /
on the edge. From a conversation I had with Eric a few months
ago it sounded like others may have similar issues. So I thought
I'd take a swing at getting the page pool to feed drivers huge pages.
1G pages require hooking into early init via CMA but it works
just fine.

I haven't tested this with a real workload, because I'm still
waiting to get my hands on the right machine. But the experiment
with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).

In terms of the missing parts - uAPI is definitely needed.
The rough plan would be to add memory config via the netdev
genl family. Should fit nicely there. Have the config stored
in struct net_device. When a page pool is created, get to the netdev
and automatically select the provider without the driver even
knowing. Two problems with that are - 1) if the driver follows
the recommended flow of allocating new queues before freeing
old ones we will have page pools created before the old ones
are gone, which means we'd need to reserve 2x the number of
1G pages; 2) there's no callback to the driver to say "I did
something behind your back, don't worry about it, but recreate
your queues, please" so the change will not take effect until
some unrelated change like installing XDP. Which may be fine
in practice but is a bit odd.
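
To make this a bit more concrete, here is a sketch of what "config
stored on the netdev, applied when the pool is created" could look
like. This is purely illustrative - the mp_cfg field, the
netdev_pp_provider_get() helper and the exact hook point are
assumptions, not part of this series; only the page_pool_params
memory provider knob comes from the patches below:

	/* hypothetical per-netdev memory provider config */
	struct netdev_mp_cfg {
		u8 memory_provider;	/* one of PP_MP_* */
	};

	/* hypothetical helper reading the genl-configured value */
	static u8 netdev_pp_provider_get(struct net_device *dev)
	{
		return dev->mp_cfg.memory_provider;
	}

	/* core would fill this in at pool creation time,
	 * without the driver knowing:
	 */
	pp_params.memory_provider = netdev_pp_provider_get(netdev);
	pool = page_pool_create(&pp_params);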

Then we get into hand-wavy stuff like - if we can link page
pools to netdevs, we should also be able to export the page pool
stats via the netdev family instead of doing it the ethtool -S.. ekhm..
"way". And if we start storing configs behind the driver's back, why
don't we also store other params, like ring size and queue count...
A lot of potential improvements as we iron out a new API...

Live tree: https://github.com/kuba-moo/linux/tree/pp-providers

Jakub Kicinski (12):
  net: hack together some page sharing
  net: create a 1G-huge-page-backed allocator
  net: page_pool: hide page_pool_release_page()
  net: page_pool: merge page_pool_release_page() with
    page_pool_return_page()
  net: page_pool: factor out releasing DMA from releasing the page
  net: page_pool: create hooks for custom page providers
  net: page_pool: add huge page backed memory providers
  eth: bnxt: let the page pool manage the DMA mapping
  eth: bnxt: use the page pool for data pages
  eth: bnxt: make sure we mark skbs for recycle before freeing them
  eth: bnxt: wrap coherent allocations into helpers
  eth: bnxt: hack in the use of MEP

 Documentation/networking/page_pool.rst        |  10 +-
 arch/x86/kernel/setup.c                       |   6 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 154 +++--
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |   5 +
 drivers/net/ethernet/engleder/tsnep_main.c    |   2 +-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c |   4 +-
 include/net/dcalloc.h                         |  28 +
 include/net/page_pool.h                       |  36 +-
 net/core/Makefile                             |   2 +-
 net/core/dcalloc.c                            | 615 +++++++++++++++++
 net/core/dcalloc.h                            |  96 +++
 net/core/page_pool.c                          | 625 +++++++++++++++++-
 12 files changed, 1478 insertions(+), 105 deletions(-)
 create mode 100644 include/net/dcalloc.h
 create mode 100644 net/core/dcalloc.c
 create mode 100644 net/core/dcalloc.h

-- 
2.41.0



* [RFC 01/12] net: hack together some page sharing
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
@ 2023-07-07 18:39 ` Jakub Kicinski
  2023-07-07 18:39 ` [RFC 02/12] net: create a 1G-huge-page-backed allocator Jakub Kicinski
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

Implement a simple buddy allocator with a fallback. It will be
used to split huge pages into smaller pools, and will fall back
to alloc_pages() if huge pages are exhausted.

This code will be used exclusively on slow paths and is generally
"not great" but it doesn't seem to immediately crash which is
good enough for now?

This patch contains a basic "coherent allocator" which splits 2M
coherently mapped pages into smaller chunks. Certain drivers
appear to allocate a few MB in single coherent pages which is not
great for IOTLB pressure (simple iperf test on bnxt with Rx backed
by huge pages goes from 170k IOTLB misses to 60k when using this).
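
A minimal usage sketch of the API added below - the device pointer
and the sizes are made up for illustration, the real bnxt hook-up
comes later in the series:

	#include <net/dcalloc.h>

	struct dma_cocoa *cocoa;
	dma_addr_t dma;
	void *buf;

	/* one allocator per DMA device, on the config path */
	cocoa = dma_cocoa_create(&pdev->dev, GFP_KERNEL);
	if (!cocoa)
		return -ENOMEM;

	/* sub-allocates from a shared 2M coherent page when possible */
	buf = dma_cocoa_alloc(cocoa, 4096, &dma, GFP_KERNEL);
	if (!buf) {
		dma_cocoa_destroy(cocoa);
		return -ENOMEM;
	}

	/* ... use buf / dma for a descriptor ring etc ... */

	dma_cocoa_free(cocoa, 4096, buf, dma);
	dma_cocoa_destroy(cocoa);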

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/dcalloc.h |  18 ++
 net/core/Makefile     |   2 +-
 net/core/dcalloc.c    | 390 ++++++++++++++++++++++++++++++++++++++++++
 net/core/dcalloc.h    |  93 ++++++++++
 4 files changed, 502 insertions(+), 1 deletion(-)
 create mode 100644 include/net/dcalloc.h
 create mode 100644 net/core/dcalloc.c
 create mode 100644 net/core/dcalloc.h

diff --git a/include/net/dcalloc.h b/include/net/dcalloc.h
new file mode 100644
index 000000000000..a85c59d7f844
--- /dev/null
+++ b/include/net/dcalloc.h
@@ -0,0 +1,18 @@
+#ifndef __NET_DCALLOC_H
+#define __NET_DCALLOC_H
+
+#include <linux/types.h>
+
+struct device;
+
+struct dma_cocoa;
+
+struct dma_cocoa *dma_cocoa_create(struct device *dev, gfp_t gfp);
+void dma_cocoa_destroy(struct dma_cocoa *cocoa);
+
+void *dma_cocoa_alloc(struct dma_cocoa *cocoa, unsigned long size,
+		      dma_addr_t *dma, gfp_t gfp);
+void dma_cocoa_free(struct dma_cocoa *cocoa, unsigned long size, void *addr,
+		    dma_addr_t dma);
+
+#endif
diff --git a/net/core/Makefile b/net/core/Makefile
index 731db2eaa610..3a98ad5d2b49 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -13,7 +13,7 @@ obj-y		     += dev.o dev_addr_lists.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
 			sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
 			fib_notifier.o xdp.o flow_offload.o gro.o \
-			netdev-genl.o netdev-genl-gen.o gso.o
+			netdev-genl.o netdev-genl-gen.o gso.o dcalloc.o
 
 obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
 
diff --git a/net/core/dcalloc.c b/net/core/dcalloc.c
new file mode 100644
index 000000000000..af9029018353
--- /dev/null
+++ b/net/core/dcalloc.c
@@ -0,0 +1,390 @@
+#include "dcalloc.h"
+
+#include <linux/dma-mapping.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+
+static bool dma_sal_in_use(struct dma_slow_allocator *sal)
+{
+	return refcount_read(&sal->user_cnt);
+}
+
+int dma_slow_huge_init(struct dma_slow_huge *shu, void *addr,
+		       unsigned int size, dma_addr_t dma, gfp_t gfp)
+{
+	struct dma_slow_buddy *bud;
+
+	bud = kzalloc(sizeof(*bud), gfp);
+	if (!bud)
+		return -ENOMEM;
+
+	shu->addr = addr;
+	shu->size = size;
+	shu->dma = dma;
+
+	INIT_LIST_HEAD(&shu->buddy_list);
+
+	bud->size = size;
+	bud->free = true;
+	list_add(&bud->list, &shu->buddy_list);
+
+	return 0;
+}
+
+static struct dma_slow_buddy *
+dma_slow_bud_split(struct dma_slow_buddy *bud, gfp_t gfp)
+{
+	struct dma_slow_buddy *right;
+
+	right = kzalloc(sizeof(*bud), gfp);
+	if (!right)
+		return NULL;
+
+	bud->size /= 2;
+
+	right->offset = bud->offset + bud->size;
+	right->size = bud->size;
+	right->free = true;
+
+	list_add(&right->list, &bud->list);
+
+	return bud;
+}
+
+static bool dma_slow_bud_coalesce(struct dma_slow_huge *shu)
+{
+	struct dma_slow_buddy *bud, *left = NULL, *right = NULL;
+
+	list_for_each_entry(bud, &shu->buddy_list, list) {
+		if (left && bud &&
+		    left->free && bud->free &&
+		    left->size == bud->size &&
+		    (left->offset & bud->offset) == left->offset) {
+			right = bud;
+			break;
+		}
+		left = bud;
+	}
+
+	if (!right)
+		return false;
+
+	left->size *= 2;
+	list_del(&right->list);
+	kfree(right);
+	return true;
+}
+
+static void *
+__dma_sal_alloc_buddy(struct dma_slow_allocator *sal, struct dma_slow_huge *shu,
+		      unsigned int size, dma_addr_t *dma, gfp_t gfp)
+{
+	struct dma_slow_buddy *small_fit = NULL;
+	struct dma_slow_buddy *bud;
+
+	if (shu->size < size)
+		return NULL;
+
+	list_for_each_entry(bud, &shu->buddy_list, list) {
+		if (!bud->free || bud->size < size)
+			continue;
+
+		if (!small_fit || small_fit->size > bud->size)
+			small_fit = bud;
+		if (bud->size == size)
+			break;
+	}
+	if (!small_fit)
+		return NULL;
+	bud = small_fit;
+
+	while (bud->size >= size * 2) {
+		bud = dma_slow_bud_split(bud, gfp);
+		if (!bud)
+			return NULL;
+	}
+
+	bud->free = false;
+	*dma = shu->dma + bud->offset;
+	return shu->addr + (bud->offset >> sal->ops->ptr_shf);
+}
+
+static void *
+dma_sal_alloc_buddy(struct dma_slow_allocator *sal, unsigned int size,
+		    dma_addr_t *dma, gfp_t gfp)
+{
+	struct dma_slow_huge *shu;
+	void *addr;
+
+	list_for_each_entry(shu, &sal->huge, huge) {
+		addr = __dma_sal_alloc_buddy(sal, shu, size, dma, gfp);
+		if (addr)
+			return addr;
+	}
+
+	if (!sal->ops->alloc_huge)
+		return NULL;
+
+	shu = kzalloc(sizeof(*shu), gfp);
+	if (!shu)
+		return NULL;
+	if (sal->ops->alloc_huge(sal, shu, size, gfp)) {
+		kfree(shu);
+		return NULL;
+	}
+	list_add(&shu->huge, &sal->huge);
+
+	return __dma_sal_alloc_buddy(sal, shu, size, dma, gfp);
+}
+
+static bool
+__dma_sal_free_buddy(struct dma_slow_allocator *sal, struct dma_slow_huge *shu,
+		     void *addr, unsigned int size, dma_addr_t dma)
+{
+	struct dma_slow_buddy *bud;
+	dma_addr_t exp_dma;
+	void *exp_addr;
+
+	list_for_each_entry(bud, &shu->buddy_list, list) {
+		exp_dma = shu->dma + bud->offset;
+		exp_addr = shu->addr + (bud->offset >> sal->ops->ptr_shf);
+
+		if (exp_addr != addr)
+			continue;
+
+		if (exp_dma != dma || bud->size != size)
+			pr_warn("mep param mismatch: %u %u, %lu %lu\n",
+				bud->size, size, (ulong)exp_dma, (ulong)dma);
+		if (bud->free)
+			pr_warn("double free: %d %lu\n", size, (ulong)dma);
+		bud->free = true;
+		return true;
+	}
+
+	return false;
+}
+
+static void
+dma_slow_maybe_free_huge(struct dma_slow_allocator *sal,
+			 struct dma_slow_huge *shu)
+{
+	struct dma_slow_buddy *bud;
+
+	bud = list_first_entry(&shu->buddy_list, typeof(*bud), list);
+	if (!bud->free || bud->size != shu->size)
+		return;
+
+	if (!sal->ops->alloc_huge)
+		return;
+
+	kfree(bud);
+
+	sal->ops->free_huge(sal, shu);
+	list_del(&shu->huge);
+	kfree(shu);
+}
+
+static bool
+dma_sal_free_buddy(struct dma_slow_allocator *sal, void *addr,
+		   unsigned int order, dma_addr_t dma)
+{
+	struct dma_slow_huge *shu;
+	bool freed = false;
+
+	list_for_each_entry(shu, &sal->huge, huge) {
+		freed = __dma_sal_free_buddy(sal, shu, addr, order, dma);
+		if (freed)
+			break;
+	}
+	if (freed) {
+		while (dma_slow_bud_coalesce(shu))
+			/* I know, it's not efficient.
+			 * But all of SAL is on the config path.
+			 */;
+		dma_slow_maybe_free_huge(sal, shu);
+	}
+	return freed;
+}
+
+static void *
+dma_sal_alloc_fb(struct dma_slow_allocator *sal, unsigned int size,
+		 dma_addr_t *dma, gfp_t gfp)
+{
+	struct dma_slow_fall *fb;
+
+	fb = kzalloc(sizeof(*fb), gfp);
+	if (!fb)
+		return NULL;
+	fb->size = size;
+
+	if (sal->ops->alloc_fall(sal, fb, size, gfp)) {
+		kfree(fb);
+		return NULL;
+	}
+	list_add(&fb->fb, &sal->fallback);
+
+	*dma = fb->dma;
+	return fb->addr;
+}
+
+static bool dma_sal_free_fb(struct dma_slow_allocator *sal, void *addr,
+			    unsigned int size, dma_addr_t dma)
+{
+	struct dma_slow_fall *fb, *pos;
+
+	fb = NULL;
+	list_for_each_entry(pos, &sal->fallback, fb)
+		if (pos->addr == addr) {
+			fb = pos;
+			break;
+		}
+
+	if (!fb) {
+		pr_warn("free: address %px not found\n", addr);
+		return false;
+	}
+
+	if (fb->size != size || fb->dma != dma)
+		pr_warn("free: param mismatch: %u %u, %lu %lu\n",
+			fb->size, size, (ulong)fb->dma, (ulong)dma);
+
+	list_del(&fb->fb);
+	sal->ops->free_fall(sal, fb);
+	kfree(fb);
+	return true;
+}
+
+void *dma_sal_alloc(struct dma_slow_allocator *sal, unsigned int size,
+		    dma_addr_t *dma, gfp_t gfp)
+{
+	void *ret;
+
+	ret = dma_sal_alloc_buddy(sal, size, dma, gfp);
+	if (!ret)
+		ret = dma_sal_alloc_fb(sal, size, dma, gfp);
+	if (!ret)
+		return NULL;
+
+	dma_slow_get(sal);
+	return ret;
+}
+
+void dma_sal_free(struct dma_slow_allocator *sal, void *addr,
+		  unsigned int size, dma_addr_t dma)
+{
+	if (!dma_sal_free_buddy(sal, addr, size, dma) &&
+	    !dma_sal_free_fb(sal, addr, size, dma))
+		return;
+
+	dma_slow_put(sal);
+}
+
+void dma_sal_init(struct dma_slow_allocator *sal,
+		  const struct dma_slow_allocator_ops *ops,
+		  struct device *dev)
+{
+	sal->ops = ops;
+	sal->dev = dev;
+
+	INIT_LIST_HEAD(&sal->huge);
+	INIT_LIST_HEAD(&sal->fallback);
+
+	refcount_set(&sal->user_cnt, 1);
+}
+
+/*****************************
+ ***  DMA COCOA allocator  ***
+ *****************************/
+static int
+dma_cocoa_alloc_huge(struct dma_slow_allocator *sal, struct dma_slow_huge *shu,
+		     unsigned int size, gfp_t gfp)
+{
+	if (size >= SZ_2M)
+		return -ENOMEM;
+
+	shu->addr = dma_alloc_coherent(sal->dev, SZ_2M, &shu->dma, gfp);
+	if (!shu->addr)
+		return -ENOMEM;
+
+	if (dma_slow_huge_init(shu, shu->addr, SZ_2M, shu->dma, gfp))
+		goto err_free_dma;
+
+	return 0;
+
+err_free_dma:
+	dma_free_coherent(sal->dev, SZ_2M, shu->addr, shu->dma);
+	return -ENOMEM;
+}
+
+static void
+dma_cocoa_free_huge(struct dma_slow_allocator *sal, struct dma_slow_huge *shu)
+{
+	dma_free_coherent(sal->dev, SZ_2M, shu->addr, shu->dma);
+}
+
+static int
+dma_cocoa_alloc_fall(struct dma_slow_allocator *sal, struct dma_slow_fall *fb,
+		     unsigned int size, gfp_t gfp)
+{
+	fb->addr = dma_alloc_coherent(sal->dev, size, &fb->dma, gfp);
+	if (!fb->addr)
+		return -ENOMEM;
+	return 0;
+}
+
+static void
+dma_cocoa_free_fall(struct dma_slow_allocator *sal, struct dma_slow_fall *fb)
+{
+	dma_free_coherent(sal->dev, fb->size, fb->addr, fb->dma);
+}
+
+struct dma_slow_allocator_ops dma_cocoa_ops = {
+	.alloc_huge	= dma_cocoa_alloc_huge,
+	.free_huge	= dma_cocoa_free_huge,
+	.alloc_fall	= dma_cocoa_alloc_fall,
+	.free_fall	= dma_cocoa_free_fall,
+};
+
+struct dma_cocoa {
+	struct dma_slow_allocator sal;
+};
+
+struct dma_cocoa *dma_cocoa_create(struct device *dev, gfp_t gfp)
+{
+	struct dma_cocoa *cocoa;
+
+	cocoa = kzalloc(sizeof(*cocoa), gfp);
+	if (!cocoa)
+		return NULL;
+
+	dma_sal_init(&cocoa->sal, &dma_cocoa_ops, dev);
+
+	return cocoa;
+}
+
+void dma_cocoa_destroy(struct dma_cocoa *cocoa)
+{
+	dma_slow_put(&cocoa->sal);
+	WARN_ON(dma_sal_in_use(&cocoa->sal));
+	kfree(cocoa);
+}
+
+void *dma_cocoa_alloc(struct dma_cocoa *cocoa, unsigned long size,
+		      dma_addr_t *dma, gfp_t gfp)
+{
+	void *addr;
+
+	size = roundup_pow_of_two(size);
+	addr = dma_sal_alloc(&cocoa->sal, size, dma, gfp);
+	if (!addr)
+		return NULL;
+	memset(addr, 0, size);
+	return addr;
+}
+
+void dma_cocoa_free(struct dma_cocoa *cocoa, unsigned long size, void *addr,
+		    dma_addr_t dma)
+{
+	size = roundup_pow_of_two(size);
+	return dma_sal_free(&cocoa->sal, addr, size, dma);
+}
diff --git a/net/core/dcalloc.h b/net/core/dcalloc.h
new file mode 100644
index 000000000000..c7e75ef0cb81
--- /dev/null
+++ b/net/core/dcalloc.h
@@ -0,0 +1,93 @@
+#ifndef __DCALLOC_H
+#define __DCALLOC_H
+
+#include <linux/dma-mapping.h>
+#include <net/dcalloc.h>
+
+struct device;
+
+/* struct dma_slow_huge - AKA @shu, large block which will get chopped up */
+struct dma_slow_huge {
+	void *addr;
+	unsigned int size;
+	dma_addr_t dma;
+
+	struct list_head huge;
+	struct list_head buddy_list;	/* struct dma_slow_buddy */
+};
+
+/* Single allocation piece */
+struct dma_slow_buddy {
+	unsigned int offset;
+	unsigned int size;
+
+	bool free;
+
+	struct list_head list;
+};
+
+/* struct dma_slow_fall - AKA @fb, fallback when huge can't be allocated */
+struct dma_slow_fall {
+	void *addr;
+	unsigned int size;
+	dma_addr_t dma;
+
+	struct list_head fb;
+};
+
+/* struct dma_slow_allocator - AKA @sal, per device allocator */
+struct dma_slow_allocator {
+	const struct dma_slow_allocator_ops *ops;
+	struct device *dev;
+
+	unsigned int ptr_shf;
+	refcount_t user_cnt;
+
+	struct list_head huge;		/* struct dma_slow_huge */
+	struct list_head fallback;	/* struct dma_slow_fall */
+};
+
+struct dma_slow_allocator_ops {
+	u8	ptr_shf;
+
+	int (*alloc_huge)(struct dma_slow_allocator *sal,
+			  struct dma_slow_huge *shu,
+			  unsigned int size, gfp_t gfp);
+	void (*free_huge)(struct dma_slow_allocator *sal,
+			  struct dma_slow_huge *fb);
+	int (*alloc_fall)(struct dma_slow_allocator *sal,
+			  struct dma_slow_fall *fb,
+			  unsigned int size, gfp_t gfp);
+	void (*free_fall)(struct dma_slow_allocator *sal,
+			  struct dma_slow_fall *fb);
+
+	void (*release)(struct dma_slow_allocator *sal);
+};
+
+int dma_slow_huge_init(struct dma_slow_huge *shu, void *addr,
+		       unsigned int size, dma_addr_t dma, gfp_t gfp);
+
+void dma_sal_init(struct dma_slow_allocator *sal,
+		  const struct dma_slow_allocator_ops *ops,
+		  struct device *dev);
+
+void *dma_sal_alloc(struct dma_slow_allocator *sal, unsigned int size,
+		    dma_addr_t *dma, gfp_t gfp);
+void dma_sal_free(struct dma_slow_allocator *sal, void *addr,
+		  unsigned int size, dma_addr_t dma);
+
+static inline void dma_slow_get(struct dma_slow_allocator *sal)
+{
+	refcount_inc(&sal->user_cnt);
+}
+
+static inline void dma_slow_put(struct dma_slow_allocator *sal)
+{
+	if (!refcount_dec_and_test(&sal->user_cnt))
+		return;
+
+	if (sal->ops->release)
+		sal->ops->release(sal);
+}
+
+#endif
-- 
2.41.0



* [RFC 02/12] net: create a 1G-huge-page-backed allocator
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
  2023-07-07 18:39 ` [RFC 01/12] net: hack together some page sharing Jakub Kicinski
@ 2023-07-07 18:39 ` Jakub Kicinski
  2023-07-07 18:39 ` [RFC 03/12] net: page_pool: hide page_pool_release_page() Jakub Kicinski
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

Get 1G pages from CMA; the driver will be able to sub-allocate from
those for individual queues. Each Rx queue (taking recycling into
account) needs 32MB - 128MB of memory. With 32 active queues even
with 2MB pages we'll end up using a lot of IOTLB entries.

There are some workarounds for 2MB pages like trying to sort
the buffers (i.e. making sure that the buffers used by the NIC
at any time belong to as few 2MB pages as possible). But 1G pages
seem so much simpler.

Grab 4 pages for now, the real thing will probably need some
Kconfigs and command line params. And a lot more uAPI in general.

Also IDK how to hook this properly into early init :(
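
A minimal usage sketch of the MEP API added below - the device
pointer and the chunk size are made up for illustration, the real
consumer is the page pool memory provider added later in the series:

	#include <net/dcalloc.h>

	struct mem_provider *mep;
	struct page *pages;
	dma_addr_t dma;

	/* grabs pre-reserved 1G CMA pages and maps them for the device */
	mep = mep_create(&pdev->dev);
	if (!mep)
		return -ENOMEM;

	/* carve a 128MB chunk (order 27 - PAGE_SHIFT) out of a 1G page */
	pages = mep_alloc(mep, 27 - PAGE_SHIFT, &dma, GFP_KERNEL);
	if (!pages) {
		mep_destroy(mep);
		return -ENOMEM;
	}

	/* ... hand pages / dma to the Rx path ... */

	mep_free(mep, pages, 27 - PAGE_SHIFT, dma);
	mep_destroy(mep);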

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 arch/x86/kernel/setup.c |   6 +-
 include/net/dcalloc.h   |  10 ++
 net/core/dcalloc.c      | 225 ++++++++++++++++++++++++++++++++++++++++
 net/core/dcalloc.h      |   3 +
 4 files changed, 243 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fd975a4a5200..cc6acd1fa67a 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -843,6 +843,8 @@ static void __init x86_report_nx(void)
 	}
 }
 
+int __init mep_cma_init(void);
+
 /*
  * Determine if we were loaded by an EFI loader.  If so, then we have also been
  * passed the efi memmap, systab, etc., so we should use these data structures
@@ -1223,8 +1225,10 @@ void __init setup_arch(char **cmdline_p)
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
-	if (boot_cpu_has(X86_FEATURE_GBPAGES))
+	if (boot_cpu_has(X86_FEATURE_GBPAGES)) {
 		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+		mep_cma_init();
+	}
 
 	/*
 	 * Reserve memory for crash kernel after SRAT is parsed so that it
diff --git a/include/net/dcalloc.h b/include/net/dcalloc.h
index a85c59d7f844..21c0fcaaa163 100644
--- a/include/net/dcalloc.h
+++ b/include/net/dcalloc.h
@@ -15,4 +15,14 @@ void *dma_cocoa_alloc(struct dma_cocoa *cocoa, unsigned long size,
 void dma_cocoa_free(struct dma_cocoa *cocoa, unsigned long size, void *addr,
 		    dma_addr_t dma);
 
+struct mem_provider;
+
+struct mem_provider *mep_create(struct device *dev);
+void mep_destroy(struct mem_provider *mep);
+
+struct page *mep_alloc(struct mem_provider *mep, unsigned int order,
+		       dma_addr_t *dma, gfp_t gfp);
+void mep_free(struct mem_provider *mep, struct page *page,
+	      unsigned int order, dma_addr_t dma);
+
 #endif
diff --git a/net/core/dcalloc.c b/net/core/dcalloc.c
index af9029018353..821b9dbfb655 100644
--- a/net/core/dcalloc.c
+++ b/net/core/dcalloc.c
@@ -388,3 +388,228 @@ void dma_cocoa_free(struct dma_cocoa *cocoa, unsigned long size, void *addr,
 	size = roundup_pow_of_two(size);
 	return dma_sal_free(&cocoa->sal, addr, size, dma);
 }
+
+/*****************************
+ ***   DMA MEP allocator   ***
+ *****************************/
+
+#include <linux/cma.h>
+
+static struct cma *mep_cma;
+static int mep_err;
+
+int __init mep_cma_init(void);
+int __init mep_cma_init(void)
+{
+	int order_per_bit;
+
+	order_per_bit = min(30 - PAGE_SHIFT, MAX_ORDER - 1);
+	order_per_bit = min(order_per_bit, HUGETLB_PAGE_ORDER);
+
+	mep_err = cma_declare_contiguous_nid(0,		/* base */
+					     SZ_4G,	/* size */
+					     0,		/* limit */
+					     SZ_1G,	/* alignment */
+					     order_per_bit,  /* order_per_bit */
+					     false,	/* fixed */
+					     "net_mep",	/* name */
+					     &mep_cma,	/* res_cma */
+					     NUMA_NO_NODE);  /* nid */
+	if (mep_err)
+		pr_warn("Net MEP init failed: %d\n", mep_err);
+	else
+		pr_info("Net MEP reserved 4G of memory\n");
+
+	return 0;
+}
+
+/** ----- MEP (slow / ctrl) allocator ----- */
+
+void mp_huge_split(struct page *page, unsigned int order)
+{
+	int i;
+
+	split_page(page, order);
+	/* The subsequent pages have a poisoned next, and since we only
+	 * OR in the PP_SIGNATURE this will mess up PP detection.
+	 */
+	for (i = 0; i < (1 << order); i++)
+		page[i].pp_magic &= 3UL;
+}
+
+struct mem_provider {
+	struct dma_slow_allocator sal;
+
+	struct work_struct work;
+};
+
+static int
+dma_mep_alloc_fall(struct dma_slow_allocator *sal, struct dma_slow_fall *fb,
+		   unsigned int size, gfp_t gfp)
+{
+	int order = get_order(size);
+
+	fb->addr = alloc_pages(gfp, order);
+	if (!fb->addr)
+		return -ENOMEM;
+
+	fb->dma = dma_map_page_attrs(sal->dev, fb->addr, 0, size,
+				     DMA_BIDIRECTIONAL, DMA_ATTR_SKIP_CPU_SYNC);
+	if (dma_mapping_error(sal->dev, fb->dma)) {
+		put_page(fb->addr);
+		return -ENOMEM;
+	}
+
+	mp_huge_split(fb->addr, order);
+	return 0;
+}
+
+static void
+dma_mep_free_fall(struct dma_slow_allocator *sal, struct dma_slow_fall *fb)
+{
+	int order = get_order(fb->size);
+	struct page *page;
+	int i;
+
+	page = fb->addr;
+	dma_unmap_page_attrs(sal->dev, fb->dma, fb->size,
+			     DMA_BIDIRECTIONAL, DMA_ATTR_SKIP_CPU_SYNC);
+	for (i = 0; i < (1 << order); i++)
+		put_page(page + i);
+}
+
+static void mep_release_work(struct work_struct *work)
+{
+	struct mem_provider *mep;
+
+	mep = container_of(work, struct mem_provider, work);
+
+	while (!list_empty(&mep->sal.huge)) {
+		struct dma_slow_buddy *bud;
+		struct dma_slow_huge *shu;
+
+		shu = list_first_entry(&mep->sal.huge, typeof(*shu), huge);
+
+		dma_unmap_page_attrs(mep->sal.dev, shu->dma, SZ_1G,
+				     DMA_BIDIRECTIONAL, DMA_ATTR_SKIP_CPU_SYNC);
+		cma_release(mep_cma, shu->addr, SZ_1G / PAGE_SIZE);
+
+		bud = list_first_entry_or_null(&shu->buddy_list,
+					       typeof(*bud), list);
+		if (WARN_ON(!bud || bud->size != SZ_1G))
+			continue;
+		kfree(bud);
+
+		list_del(&shu->huge);
+		kfree(shu);
+	}
+	put_device(mep->sal.dev);
+	kfree(mep);
+}
+
+static void dma_mep_release(struct dma_slow_allocator *sal)
+{
+	struct mem_provider *mep;
+
+	mep = container_of(sal, struct mem_provider, sal);
+
+	INIT_WORK(&mep->work, mep_release_work);
+	schedule_work(&mep->work);
+}
+
+struct dma_slow_allocator_ops dma_mep_ops = {
+	.ptr_shf	= PAGE_SHIFT - order_base_2(sizeof(struct page)),
+
+	.alloc_fall	= dma_mep_alloc_fall,
+	.free_fall	= dma_mep_free_fall,
+
+	.release	= dma_mep_release,
+};
+
+struct mem_provider *mep_create(struct device *dev)
+{
+	struct mem_provider *mep;
+	int i;
+
+	mep = kzalloc(sizeof(*mep), GFP_KERNEL);
+	if (!mep)
+		return NULL;
+
+	dma_sal_init(&mep->sal, &dma_mep_ops, dev);
+	get_device(mep->sal.dev);
+
+	if (mep_err)
+		goto done;
+
+	/* Hardcoded for now */
+	for (i = 0; i < 2; i++) {
+		const unsigned int order = 30 - PAGE_SHIFT; /* 1G */
+		struct dma_slow_huge *shu;
+		struct page *page;
+
+		shu = kzalloc(sizeof(*shu), GFP_KERNEL);
+		if (!shu)
+			break;
+
+		page = cma_alloc(mep_cma, SZ_1G / PAGE_SIZE, order, false);
+		if (!page) {
+			pr_err("mep: CMA alloc failed\n");
+			goto err_free_shu;
+		}
+
+		shu->dma = dma_map_page_attrs(mep->sal.dev, page, 0,
+					      PAGE_SIZE << order,
+					      DMA_BIDIRECTIONAL,
+					      DMA_ATTR_SKIP_CPU_SYNC);
+		if (dma_mapping_error(mep->sal.dev, shu->dma)) {
+			pr_err("mep: DMA map failed\n");
+			goto err_free_page;
+		}
+
+		if (dma_slow_huge_init(shu, page, SZ_1G, shu->dma,
+				       GFP_KERNEL)) {
+			pr_err("mep: shu init failed\n");
+			goto err_unmap;
+		}
+
+		mp_huge_split(page, 30 - PAGE_SHIFT);
+
+		list_add(&shu->huge, &mep->sal.huge);
+		continue;
+
+err_unmap:
+		dma_unmap_page_attrs(mep->sal.dev, shu->dma, SZ_1G,
+				     DMA_BIDIRECTIONAL, DMA_ATTR_SKIP_CPU_SYNC);
+err_free_page:
+		put_page(page);
+err_free_shu:
+		kfree(shu);
+		break;
+	}
+done:
+	if (list_empty(&mep->sal.huge))
+		pr_warn("mep: no huge pages acquired\n");
+
+	return mep;
+}
+EXPORT_SYMBOL_GPL(mep_create);
+
+void mep_destroy(struct mem_provider *mep)
+{
+	dma_slow_put(&mep->sal);
+}
+EXPORT_SYMBOL_GPL(mep_destroy);
+
+struct page *mep_alloc(struct mem_provider *mep, unsigned int order,
+		       dma_addr_t *dma, gfp_t gfp)
+{
+	return dma_sal_alloc(&mep->sal, PAGE_SIZE << order, dma, gfp);
+}
+EXPORT_SYMBOL_GPL(mep_alloc);
+
+void mep_free(struct mem_provider *mep, struct page *page,
+	      unsigned int order, dma_addr_t dma)
+{
+	dma_sal_free(&mep->sal, page, PAGE_SIZE << order, dma);
+}
+EXPORT_SYMBOL_GPL(mep_free);
diff --git a/net/core/dcalloc.h b/net/core/dcalloc.h
index c7e75ef0cb81..2664f933c8e1 100644
--- a/net/core/dcalloc.h
+++ b/net/core/dcalloc.h
@@ -90,4 +90,7 @@ static inline void dma_slow_put(struct dma_slow_allocator *sal)
 		sal->ops->release(sal);
 }
 
+/* misc */
+void mp_huge_split(struct page *page, unsigned int order);
+
 #endif
-- 
2.41.0



* [RFC 03/12] net: page_pool: hide page_pool_release_page()
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
  2023-07-07 18:39 ` [RFC 01/12] net: hack together some page sharing Jakub Kicinski
  2023-07-07 18:39 ` [RFC 02/12] net: create a 1G-huge-page-backed allocator Jakub Kicinski
@ 2023-07-07 18:39 ` Jakub Kicinski
  2023-07-07 18:39 ` [RFC 04/12] net: page_pool: merge page_pool_release_page() with page_pool_return_page() Jakub Kicinski
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

There seems to be no user calling page_pool_release_page()
for legit reasons; all the users simply haven't been converted
to skb-based recycling yet. Convert them, update the docs,
and unexport the function.
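
For reference, the conversion is mechanical - instead of unmapping
the page and dropping it from the pool's accounting, the driver marks
the skb so the page gets recycled when the skb is freed. A generic
sketch (driver names and buffer sizes are made up):

	skb = build_skb(page_address(page), buf_size);
	if (!skb)
		goto drop;
	skb_reserve(skb, headroom);
	skb_put(skb, len);

	/* before: page_pool_release_page(rx->page_pool, page); */
	skb_mark_for_recycle(skb);

	napi_gro_receive(napi, skb);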

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 Documentation/networking/page_pool.rst            | 10 +++-------
 drivers/net/ethernet/engleder/tsnep_main.c        |  2 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |  4 ++--
 include/net/page_pool.h                           | 10 ++--------
 net/core/page_pool.c                              |  3 +--
 5 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst
index 873efd97f822..e4506c1aeac4 100644
--- a/Documentation/networking/page_pool.rst
+++ b/Documentation/networking/page_pool.rst
@@ -13,9 +13,8 @@ replacing dev_alloc_pages().
 
 API keeps track of in-flight pages, in order to let API user know
 when it is safe to free a page_pool object.  Thus, API users
-must run page_pool_release_page() when a page is leaving the page_pool or
-call page_pool_put_page() where appropriate in order to maintain correct
-accounting.
+must call page_pool_put_page() where appropriate and only attach
+the page to page_pool-aware objects, like skbs.
 
 API user must call page_pool_put_page() once on a page, as it
 will either recycle the page, or in case of refcnt > 1, it will
@@ -87,9 +86,6 @@ a page will cause no race conditions is enough.
   must guarantee safe context (e.g NAPI), since it will recycle the page
   directly into the pool fast cache.
 
-* page_pool_release_page(): Unmap the page (if mapped) and account for it on
-  in-flight counters.
-
 * page_pool_dev_alloc_pages(): Get a page from the page allocator or page_pool
   caches.
 
@@ -194,7 +190,7 @@ NAPI poller
             if XDP_DROP:
                 page_pool_recycle_direct(page_pool, page);
         } else (packet_is_skb) {
-            page_pool_release_page(page_pool, page);
+            skb_mark_for_recycle(skb);
             new_page = page_pool_dev_alloc_pages(page_pool);
         }
     }
diff --git a/drivers/net/ethernet/engleder/tsnep_main.c b/drivers/net/ethernet/engleder/tsnep_main.c
index 84751bb303a6..079f9f6ae21a 100644
--- a/drivers/net/ethernet/engleder/tsnep_main.c
+++ b/drivers/net/ethernet/engleder/tsnep_main.c
@@ -1333,7 +1333,7 @@ static void tsnep_rx_page(struct tsnep_rx *rx, struct napi_struct *napi,
 
 	skb = tsnep_build_skb(rx, page, length);
 	if (skb) {
-		page_pool_release_page(rx->page_pool, page);
+		skb_mark_for_recycle(skb);
 
 		rx->packets++;
 		rx->bytes += length;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 4727f7be4f86..3a6cd2b73aea 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5413,7 +5413,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 					priv->dma_conf.dma_buf_sz);
 
 			/* Data payload appended into SKB */
-			page_pool_release_page(rx_q->page_pool, buf->page);
+			skb_mark_for_recycle(skb);
 			buf->page = NULL;
 		}
 
@@ -5425,7 +5425,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 					priv->dma_conf.dma_buf_sz);
 
 			/* Data payload appended into SKB */
-			page_pool_release_page(rx_q->page_pool, buf->sec_page);
+			skb_mark_for_recycle(skb);
 			buf->sec_page = NULL;
 		}
 
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 126f9e294389..b082c9118f05 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -18,9 +18,8 @@
  *
  * API keeps track of in-flight pages, in-order to let API user know
  * when it is safe to dealloactor page_pool object.  Thus, API users
- * must make sure to call page_pool_release_page() when a page is
- * "leaving" the page_pool.  Or call page_pool_put_page() where
- * appropiate.  For maintaining correct accounting.
+ * must call page_pool_put_page() where appropriate and only attach
+ * the page to page_pool-aware objects, like skbs.
  *
  * API user must only call page_pool_put_page() once on a page, as it
  * will either recycle the page, or in case of elevated refcnt, it
@@ -251,7 +250,6 @@ void page_pool_unlink_napi(struct page_pool *pool);
 void page_pool_destroy(struct page_pool *pool);
 void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
 			   struct xdp_mem_info *mem);
-void page_pool_release_page(struct page_pool *pool, struct page *page);
 void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 			     int count);
 #else
@@ -268,10 +266,6 @@ static inline void page_pool_use_xdp_mem(struct page_pool *pool,
 					 struct xdp_mem_info *mem)
 {
 }
-static inline void page_pool_release_page(struct page_pool *pool,
-					  struct page *page)
-{
-}
 
 static inline void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 					   int count)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a3e12a61d456..2c7cf5f2bcb8 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -492,7 +492,7 @@ static s32 page_pool_inflight(struct page_pool *pool)
  * a regular page (that will eventually be returned to the normal
  * page-allocator via put_page).
  */
-void page_pool_release_page(struct page_pool *pool, struct page *page)
+static void page_pool_release_page(struct page_pool *pool, struct page *page)
 {
 	dma_addr_t dma;
 	int count;
@@ -519,7 +519,6 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, page, count);
 }
-EXPORT_SYMBOL(page_pool_release_page);
 
 /* Return a page to the page allocator, cleaning up our state */
 static void page_pool_return_page(struct page_pool *pool, struct page *page)
-- 
2.41.0



* [RFC 04/12] net: page_pool: merge page_pool_release_page() with page_pool_return_page()
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
                   ` (2 preceding siblings ...)
  2023-07-07 18:39 ` [RFC 03/12] net: page_pool: hide page_pool_release_page() Jakub Kicinski
@ 2023-07-07 18:39 ` Jakub Kicinski
  2023-07-10 16:07   ` Jesper Dangaard Brouer
  2023-07-07 18:39 ` [RFC 05/12] net: page_pool: factor out releasing DMA from releasing the page Jakub Kicinski
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

Now that page_pool_release_page() is not exported we can
merge it with page_pool_return_page(). I believe that
the "Do not replace this with page_pool_return_page()"
comment was there in case page_pool_return_page() was
not inlined, to avoid two function calls.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/core/page_pool.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 2c7cf5f2bcb8..7ca456bfab71 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -492,7 +492,7 @@ static s32 page_pool_inflight(struct page_pool *pool)
  * a regular page (that will eventually be returned to the normal
  * page-allocator via put_page).
  */
-static void page_pool_release_page(struct page_pool *pool, struct page *page)
+static void page_pool_return_page(struct page_pool *pool, struct page *page)
 {
 	dma_addr_t dma;
 	int count;
@@ -518,12 +518,6 @@ static void page_pool_release_page(struct page_pool *pool, struct page *page)
 	 */
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, page, count);
-}
-
-/* Return a page to the page allocator, cleaning up our state */
-static void page_pool_return_page(struct page_pool *pool, struct page *page)
-{
-	page_pool_release_page(pool, page);
 
 	put_page(page);
 	/* An optimization would be to call __free_pages(page, pool->p.order)
@@ -615,9 +609,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 	 * will be invoking put_page.
 	 */
 	recycle_stat_inc(pool, released_refcnt);
-	/* Do not replace this with page_pool_return_page() */
-	page_pool_release_page(pool, page);
-	put_page(page);
+	page_pool_return_page(pool, page);
 
 	return NULL;
 }
-- 
2.41.0



* [RFC 05/12] net: page_pool: factor out releasing DMA from releasing the page
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
                   ` (3 preceding siblings ...)
  2023-07-07 18:39 ` [RFC 04/12] net: page_pool: merge page_pool_release_page() with page_pool_return_page() Jakub Kicinski
@ 2023-07-07 18:39 ` Jakub Kicinski
  2023-07-07 18:39 ` [RFC 06/12] net: page_pool: create hooks for custom page providers Jakub Kicinski
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

Releasing the DMA mapping will be useful for other types
of pages, so factor it out. Make sure compiler inlines it,
to avoid any regressions.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/core/page_pool.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 7ca456bfab71..09f8c34ad4a7 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -487,21 +487,16 @@ static s32 page_pool_inflight(struct page_pool *pool)
 	return inflight;
 }
 
-/* Disconnects a page (from a page_pool).  API users can have a need
- * to disconnect a page (from a page_pool), to allow it to be used as
- * a regular page (that will eventually be returned to the normal
- * page-allocator via put_page).
- */
-static void page_pool_return_page(struct page_pool *pool, struct page *page)
+static __always_inline
+void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
 {
 	dma_addr_t dma;
-	int count;
 
 	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
 		/* Always account for inflight pages, even if we didn't
 		 * map them
 		 */
-		goto skip_dma_unmap;
+		return;
 
 	dma = page_pool_get_dma_addr(page);
 
@@ -510,7 +505,19 @@ static void page_pool_return_page(struct page_pool *pool, struct page *page)
 			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
 	page_pool_set_dma_addr(page, 0);
-skip_dma_unmap:
+}
+
+/* Disconnects a page (from a page_pool).  API users can have a need
+ * to disconnect a page (from a page_pool), to allow it to be used as
+ * a regular page (that will eventually be returned to the normal
+ * page-allocator via put_page).
+ */
+void page_pool_return_page(struct page_pool *pool, struct page *page)
+{
+	int count;
+
+	__page_pool_release_page_dma(pool, page);
+
 	page_pool_clear_pp_info(page);
 
 	/* This may be the last page returned, releasing the pool, so
-- 
2.41.0



* [RFC 06/12] net: page_pool: create hooks for custom page providers
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
                   ` (4 preceding siblings ...)
  2023-07-07 18:39 ` [RFC 05/12] net: page_pool: factor out releasing DMA from releasing the page Jakub Kicinski
@ 2023-07-07 18:39 ` Jakub Kicinski
  2023-07-07 19:50   ` Mina Almasry
  2023-07-07 18:39 ` [RFC 07/12] net: page_pool: add huge page backed memory providers Jakub Kicinski
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

The page providers which try to reuse the same pages will
need to hold onto the ref, even if the page gets released from
the pool - releasing the page from the pp just transfers the
"ownership" reference from the pp to the provider, and the
provider will wait for other references to be gone before
feeding this page back into the pool.

The rest is pretty obvious.

Add a test provider which should behave identically to
a normal page pool.
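
With this in place, opting into a provider is a single extra field in
the pool params - a minimal sketch, assuming the usual driver-specific
values for everything except memory_provider:

	struct page_pool_params pp = {
		.flags		= PP_FLAG_DMA_MAP,
		.order		= 0,
		.pool_size	= 1024,
		.nid		= NUMA_NO_NODE,
		.dev		= &pdev->dev,
		.dma_dir	= DMA_FROM_DEVICE,
		.memory_provider = PP_MP_BASIC,	/* behaves like a normal pool */
	};
	struct page_pool *pool;

	pool = page_pool_create(&pp);
	if (IS_ERR(pool))
		return PTR_ERR(pool);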

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/page_pool.h | 20 +++++++++++
 net/core/page_pool.c    | 80 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 97 insertions(+), 3 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index b082c9118f05..5859ab838ed2 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -77,6 +77,7 @@ struct page_pool_params {
 	int		nid;  /* Numa node id to allocate from pages from */
 	struct device	*dev; /* device, for DMA pre-mapping purposes */
 	struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */
+	u8		memory_provider; /* haaacks! should be user-facing */
 	enum dma_data_direction dma_dir; /* DMA mapping direction */
 	unsigned int	max_len; /* max DMA sync memory size */
 	unsigned int	offset;  /* DMA addr offset */
@@ -147,6 +148,22 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
 
 #endif
 
+struct mem_provider;
+
+enum pp_memory_provider_type {
+	__PP_MP_NONE, /* Use system allocator directly */
+	PP_MP_BASIC, /* Test purposes only, Hacky McHackface */
+};
+
+struct pp_memory_provider_ops {
+	int (*init)(struct page_pool *pool);
+	void (*destroy)(struct page_pool *pool);
+	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
+	bool (*release_page)(struct page_pool *pool, struct page *page);
+};
+
+extern const struct pp_memory_provider_ops basic_ops;
+
 struct page_pool {
 	struct page_pool_params p;
 
@@ -194,6 +211,9 @@ struct page_pool {
 	 */
 	struct ptr_ring ring;
 
+	const struct pp_memory_provider_ops *mp_ops;
+	void *mp_priv;
+
 #ifdef CONFIG_PAGE_POOL_STATS
 	/* recycle stats are per-cpu to avoid locking */
 	struct page_pool_recycle_stats __percpu *recycle_stats;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 09f8c34ad4a7..e886a439f9bb 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -23,6 +23,8 @@
 
 #include <trace/events/page_pool.h>
 
+static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
+
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
@@ -161,6 +163,7 @@ static int page_pool_init(struct page_pool *pool,
 			  const struct page_pool_params *params)
 {
 	unsigned int ring_qsize = 1024; /* Default */
+	int err;
 
 	memcpy(&pool->p, params, sizeof(pool->p));
 
@@ -218,10 +221,36 @@ static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
+	switch (pool->p.memory_provider) {
+	case __PP_MP_NONE:
+		break;
+	case PP_MP_BASIC:
+		pool->mp_ops = &basic_ops;
+		break;
+	default:
+		err = -EINVAL;
+		goto free_ptr_ring;
+	}
+
+	if (pool->mp_ops) {
+		err = pool->mp_ops->init(pool);
+		if (err) {
+			pr_warn("%s() mem-provider init failed %d\n",
+				__func__, err);
+			goto free_ptr_ring;
+		}
+
+		static_branch_inc(&page_pool_mem_providers);
+	}
+
 	if (pool->p.flags & PP_FLAG_DMA_MAP)
 		get_device(pool->p.dev);
 
 	return 0;
+
+free_ptr_ring:
+	ptr_ring_cleanup(&pool->ring, NULL);
+	return err;
 }
 
 struct page_pool *page_pool_create(const struct page_pool_params *params)
@@ -463,7 +492,10 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
 		return page;
 
 	/* Slow-path: cache empty, do real allocation */
-	page = __page_pool_alloc_pages_slow(pool, gfp);
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		page = pool->mp_ops->alloc_pages(pool, gfp);
+	else
+		page = __page_pool_alloc_pages_slow(pool, gfp);
 	return page;
 }
 EXPORT_SYMBOL(page_pool_alloc_pages);
@@ -515,8 +547,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
 void page_pool_return_page(struct page_pool *pool, struct page *page)
 {
 	int count;
+	bool put;
 
-	__page_pool_release_page_dma(pool, page);
+	put = true;
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		put = pool->mp_ops->release_page(pool, page);
+	else
+		__page_pool_release_page_dma(pool, page);
 
 	page_pool_clear_pp_info(page);
 
@@ -526,7 +563,8 @@ void page_pool_return_page(struct page_pool *pool, struct page *page)
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, page, count);
 
-	put_page(page);
+	if (put)
+		put_page(page);
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
@@ -779,6 +817,11 @@ static void page_pool_free(struct page_pool *pool)
 	if (pool->disconnect)
 		pool->disconnect(pool);
 
+	if (pool->mp_ops) {
+		pool->mp_ops->destroy(pool);
+		static_branch_dec(&page_pool_mem_providers);
+	}
+
 	ptr_ring_cleanup(&pool->ring, NULL);
 
 	if (pool->p.flags & PP_FLAG_DMA_MAP)
@@ -952,3 +995,34 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
 	return true;
 }
 EXPORT_SYMBOL(page_pool_return_skb_page);
+
+/***********************
+ *  Mem provider hack  *
+ ***********************/
+
+static int mp_basic_init(struct page_pool *pool)
+{
+	return 0;
+}
+
+static void mp_basic_destroy(struct page_pool *pool)
+{
+}
+
+static struct page *mp_basic_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+	return __page_pool_alloc_pages_slow(pool, gfp);
+}
+
+static bool mp_basic_release(struct page_pool *pool, struct page *page)
+{
+	__page_pool_release_page_dma(pool, page);
+	return true;
+}
+
+const struct pp_memory_provider_ops basic_ops = {
+	.init			= mp_basic_init,
+	.destroy		= mp_basic_destroy,
+	.alloc_pages		= mp_basic_alloc_pages,
+	.release_page		= mp_basic_release,
+};
-- 
2.41.0



* [RFC 07/12] net: page_pool: add huge page backed memory providers
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
                   ` (5 preceding siblings ...)
  2023-07-07 18:39 ` [RFC 06/12] net: page_pool: create hooks for custom page providers Jakub Kicinski
@ 2023-07-07 18:39 ` Jakub Kicinski
  2023-07-07 18:39 ` [RFC 08/12] eth: bnxt: let the page pool manage the DMA mapping Jakub Kicinski
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

Add 3 huge page backed memory provider examples.
1. using 2MB pages, which are allocated as needed,
   and not directly reused (based on code I got from Eric)
2. like 1, but pages are preallocated and allocator tries
   to re-use them in order, if we run out of space there
   (or rather can't find a free page quickly) we allocate
   4k pages, the same exact way as normal page pool would
3. like 2, but using MEP to get 1G pages.
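
Roughly how a driver would wire provider 3 up - a sketch only, with
error handling trimmed; the MEP handle is passed via the existing
init_arg field, which is what mp_huge_1g_init() below expects:

	struct mem_provider *mep;
	struct page_pool_params pp = { 0 };
	struct page_pool *pool;

	mep = mep_create(&pdev->dev);		/* 1G pages from CMA */
	if (!mep)
		return -ENOMEM;

	pp.flags	= PP_FLAG_DMA_MAP;
	pp.pool_size	= 1024;
	pp.nid		= NUMA_NO_NODE;
	pp.dev		= &pdev->dev;
	pp.dma_dir	= DMA_FROM_DEVICE;
	pp.memory_provider = PP_MP_HUGE_1G;
	pp.init_arg	= mep;			/* consumed as the MEP handle */

	pool = page_pool_create(&pp);
	if (IS_ERR(pool)) {
		mep_destroy(mep);
		return PTR_ERR(pool);
	}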

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/page_pool.h |   6 +
 net/core/page_pool.c    | 511 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 517 insertions(+)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 5859ab838ed2..364fe6924258 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -153,6 +153,9 @@ struct mem_provider;
 enum pp_memory_provider_type {
 	__PP_MP_NONE, /* Use system allocator directly */
 	PP_MP_BASIC, /* Test purposes only, Hacky McHackface */
+	PP_MP_HUGE_SPLIT, /* 2MB, online page alloc */
+	PP_MP_HUGE, /* 2MB, all memory pre-allocated */
+	PP_MP_HUGE_1G, /* 1G pages, MEP, pre-allocated */
 };
 
 struct pp_memory_provider_ops {
@@ -163,6 +166,9 @@ struct pp_memory_provider_ops {
 };
 
 extern const struct pp_memory_provider_ops basic_ops;
+extern const struct pp_memory_provider_ops hugesp_ops;
+extern const struct pp_memory_provider_ops huge_ops;
+extern const struct pp_memory_provider_ops huge_1g_ops;
 
 struct page_pool {
 	struct page_pool_params p;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index e886a439f9bb..d50f6728e4f6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -227,6 +227,15 @@ static int page_pool_init(struct page_pool *pool,
 	case PP_MP_BASIC:
 		pool->mp_ops = &basic_ops;
 		break;
+	case PP_MP_HUGE_SPLIT:
+		pool->mp_ops = &hugesp_ops;
+		break;
+	case PP_MP_HUGE:
+		pool->mp_ops = &huge_ops;
+		break;
+	case PP_MP_HUGE_1G:
+		pool->mp_ops = &huge_1g_ops;
+		break;
 	default:
 		err = -EINVAL;
 		goto free_ptr_ring;
@@ -1000,6 +1009,8 @@ EXPORT_SYMBOL(page_pool_return_skb_page);
  *  Mem provider hack  *
  ***********************/
 
+#include "dcalloc.h"
+
 static int mp_basic_init(struct page_pool *pool)
 {
 	return 0;
@@ -1026,3 +1037,503 @@ const struct pp_memory_provider_ops basic_ops = {
 	.alloc_pages		= mp_basic_alloc_pages,
 	.release_page		= mp_basic_release,
 };
+
+/*** "Huge page" ***/
+struct mp_hugesp {
+	struct page *page;
+	unsigned int pre_allocated;
+	unsigned char order;
+	struct timer_list timer;
+};
+
+static void mp_hugesp_timer(struct timer_list *t)
+{
+	struct mp_hugesp *hu = from_timer(hu, t, timer);
+
+	/* Retry large page alloc every 2 minutes */
+	mod_timer(&hu->timer, jiffies + 2 * 60 * HZ);
+	WRITE_ONCE(hu->order, MAX_ORDER - 1);
+}
+
+static int mp_hugesp_init(struct page_pool *pool)
+{
+	struct mp_hugesp *hu = pool->mp_priv;
+
+	if (pool->p.order)
+		return -EINVAL;
+
+	hu = kzalloc_node(sizeof(struct mp_hugesp), GFP_KERNEL, pool->p.nid);
+	if (!hu)
+		return -ENOMEM;
+
+	hu->order = MAX_ORDER - 1;
+	pool->mp_priv = hu;
+	timer_setup(&hu->timer, mp_hugesp_timer, TIMER_DEFERRABLE);
+	mod_timer(&hu->timer, jiffies + 2 * 60 * HZ);
+	return 0;
+}
+
+static void mp_hugesp_destroy(struct page_pool *pool)
+{
+	struct mp_hugesp *hu = pool->mp_priv;
+
+	while (hu->pre_allocated--)
+		put_page(hu->page++);
+
+	del_timer_sync(&hu->timer);
+	kfree(hu);
+}
+
+static int mp_huge_nid(struct page_pool *pool)
+{
+	int nid;
+
+#ifdef CONFIG_NUMA
+	nid = pool->p.nid;
+	nid = (nid == NUMA_NO_NODE) ? numa_mem_id() : nid;
+	nid = (nid < 0) ? numa_mem_id() : nid;
+#else
+	/* Ignore pool->p.nid setting if !CONFIG_NUMA, helps compiler */
+	nid = numa_mem_id(); /* will be zero like page_to_nid() */
+#endif
+	return nid;
+}
+
+static struct page *mp_hugesp_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+	unsigned int pp_flags = pool->p.flags;
+	struct mp_hugesp *hu = pool->mp_priv;
+	int order = READ_ONCE(hu->order);
+	struct page *page;
+
+	/* Unnecessary as alloc cache is empty, but guarantees zero count */
+	if (unlikely(pool->alloc.count > 0))
+		return pool->alloc.cache[--pool->alloc.count];
+
+	/* For small allocations we're probably better off using bulk API */
+	if (order < 3)
+		goto use_bulk;
+
+	if (!hu->pre_allocated) {
+		int nid = mp_huge_nid(pool);
+
+		page = __alloc_pages_node(nid, (gfp & ~__GFP_DIRECT_RECLAIM) |
+						__GFP_NOMEMALLOC |
+						__GFP_NOWARN |
+						__GFP_NORETRY,
+					  order);
+		if (!page) {
+			WRITE_ONCE(hu->order, hu->order - 1);
+			goto use_bulk;
+		}
+
+		hu->page = page;
+		split_page(hu->page, order);
+		hu->pre_allocated = 1U << order;
+	}
+
+	/* We have some pages, feed the cache */
+	while (pool->alloc.count < PP_ALLOC_CACHE_REFILL && hu->pre_allocated) {
+		page = hu->page++;
+		hu->pre_allocated--;
+
+		if ((pp_flags & PP_FLAG_DMA_MAP) &&
+		    unlikely(!page_pool_dma_map(pool, page))) {
+			put_page(page);
+			continue;
+		}
+
+		page_pool_set_pp_info(pool, page);
+		pool->alloc.cache[pool->alloc.count++] = page;
+		/* Track how many pages are held 'in-flight' */
+		pool->pages_state_hold_cnt++;
+		trace_page_pool_state_hold(pool, page,
+					   pool->pages_state_hold_cnt);
+	}
+
+	/* Return last page */
+	if (likely(pool->alloc.count > 0)) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		alloc_stat_inc(pool, slow);
+	} else {
+		page = NULL;
+	}
+
+	/* When page is just alloc'ed it should/must have refcnt 1. */
+	return page;
+
+use_bulk:
+	return __page_pool_alloc_pages_slow(pool, gfp);
+}
+
+static bool mp_hugesp_release(struct page_pool *pool, struct page *page)
+{
+	__page_pool_release_page_dma(pool, page);
+	return true;
+}
+
+const struct pp_memory_provider_ops hugesp_ops = {
+	.init			= mp_hugesp_init,
+	.destroy		= mp_hugesp_destroy,
+	.alloc_pages		= mp_hugesp_alloc_pages,
+	.release_page		= mp_hugesp_release,
+};
+
+/*** "Huge page" ***/
+
+/* Huge page memory provider allocates huge pages and splits them up into
+ * 4k pages. Whenever a page is outside of the page pool MP holds an extra
+ * reference to it, so that it doesn't get returned back into the allocator.
+ * On allocation request MP scans its banks of pages for pages with a single
+ * ref count held.
+ */
+#define MP_HUGE_ORDER	min(MAX_ORDER - 1, 21 - PAGE_SHIFT)
+#define MP_HUGE_CNT	32
+
+struct mp_huge {
+	struct page *page[MP_HUGE_CNT];
+	dma_addr_t dma[MP_HUGE_CNT];
+
+	unsigned int cur_idx ____cacheline_aligned_in_smp;
+	unsigned int cur_page;
+};
+
+static int mp_huge_init(struct page_pool *pool)
+{
+	struct mp_huge *hu = pool->mp_priv;
+	struct page *page;
+	int i;
+
+	if (pool->p.order)
+		return -EINVAL;
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		return -EINVAL;
+
+	hu = kzalloc_node(sizeof(struct mp_huge), GFP_KERNEL, pool->p.nid);
+	if (!hu)
+		return -ENOMEM;
+
+	pool->mp_priv = hu;
+
+	for (i = 0; i < MP_HUGE_CNT; i++) {
+		int nid = mp_huge_nid(pool);
+		dma_addr_t dma;
+
+		page = __alloc_pages_node(nid, GFP_KERNEL | __GFP_NOWARN,
+					  MP_HUGE_ORDER);
+		if (!page)
+			goto err_free;
+
+		dma = dma_map_page_attrs(pool->p.dev, page, 0,
+					 PAGE_SIZE << MP_HUGE_ORDER,
+					 pool->p.dma_dir,
+					 DMA_ATTR_SKIP_CPU_SYNC);
+		if (dma_mapping_error(pool->p.dev, dma))
+			goto err_free_page;
+
+		hu->page[i] = page;
+		hu->dma[i] = dma;
+	}
+
+	for (i = 0; i < MP_HUGE_CNT; i++)
+		mp_huge_split(hu->page[i], MP_HUGE_ORDER);
+
+	return 0;
+
+err_free:
+	while (i--) {
+		dma_unmap_page_attrs(pool->p.dev, hu->dma[i],
+				     PAGE_SIZE << MP_HUGE_ORDER, pool->p.dma_dir,
+				     DMA_ATTR_SKIP_CPU_SYNC);
+err_free_page:
+		put_page(hu->page[i]);
+	}
+	kfree(pool->mp_priv);
+	return -ENOMEM;
+}
+
+static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
+{
+	struct page *page;
+	int j;
+
+	for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
+		page = hu->page[idx] + j;
+		if (page_ref_count(page) != 1) {
+			pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
+				page_ref_count(page), idx, j);
+			return true;
+		}
+	}
+	return false;
+}
+
+static void mp_huge_destroy(struct page_pool *pool)
+{
+	struct mp_huge *hu = pool->mp_priv;
+	int i, j;
+
+	for (i = 0; i < MP_HUGE_CNT; i++) {
+		if (mp_huge_busy(hu, i))
+			continue;
+
+		dma_unmap_page_attrs(pool->p.dev, hu->dma[i],
+				     PAGE_SIZE << MP_HUGE_ORDER, pool->p.dma_dir,
+				     DMA_ATTR_SKIP_CPU_SYNC);
+
+		for (j = 0; j < (1 << MP_HUGE_ORDER); j++)
+			put_page(hu->page[i] + j);
+	}
+
+	kfree(hu);
+}
+
+static atomic_t mp_huge_ins_a = ATOMIC_INIT(0); // alloc
+static atomic_t mp_huge_ins_r = ATOMIC_INIT(0); // release
+static atomic_t mp_huge_ins_b = ATOMIC_INIT(0); // busy
+static atomic_t mp_huge_out_a = ATOMIC_INIT(0);
+static atomic_t mp_huge_out_r = ATOMIC_INIT(0);
+
+static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+	struct mp_huge *hu = pool->mp_priv;
+	unsigned int i, page_i, huge_i;
+	struct page *page;
+
+	/* Try to find pages which we are the sole owner of */
+	for (i = 0; i < PP_ALLOC_CACHE_REFILL * 2; i++) {
+		page_i = hu->cur_idx + i;
+		huge_i = hu->cur_page + (page_i >> MP_HUGE_ORDER);
+		huge_i %= MP_HUGE_CNT;
+		page_i %= 1 << MP_HUGE_ORDER;
+
+		if (pool->alloc.count >= PP_ALLOC_CACHE_REFILL)
+			break;
+
+		page = hu->page[huge_i] + page_i;
+
+		if (page != compound_head(page))
+			continue;
+
+		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
+		    page_ref_count(page) != 1) {
+			atomic_inc(&mp_huge_ins_b);
+			continue;
+		}
+
+		atomic_inc(&mp_huge_ins_a);
+
+		page_pool_set_pp_info(pool, page);
+		page_pool_set_dma_addr(page,
+				       hu->dma[huge_i] + page_i * PAGE_SIZE);
+
+		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+			page_pool_dma_sync_for_device(pool, page,
+						      pool->p.max_len);
+
+		pool->alloc.cache[pool->alloc.count++] = page;
+		/* Track how many pages are held 'in-flight' */
+		pool->pages_state_hold_cnt++;
+		trace_page_pool_state_hold(pool, page,
+					   pool->pages_state_hold_cnt);
+	}
+
+	hu->cur_idx = page_i + 1; /* start from next, "going over" is okay */
+	hu->cur_page = huge_i;
+
+	/* Return last page */
+	if (likely(pool->alloc.count > 0)) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		alloc_stat_inc(pool, slow);
+	} else {
+		atomic_inc(&mp_huge_out_a);
+		page = __page_pool_alloc_pages_slow(pool, gfp);
+	}
+
+	/* When a page is just alloc'ed it should/must have refcnt 1. */
+	return page;
+}
+
+static bool mp_huge_release(struct page_pool *pool, struct page *page)
+{
+	struct mp_huge *hu = pool->mp_priv;
+	bool ours = false;
+	int i;
+
+	/* Check if the page comes from one of huge pages */
+	for (i = 0; i < MP_HUGE_CNT; i++) {
+		if (page - hu->page[i] < (1UL << MP_HUGE_ORDER)) {
+			ours = true;
+			break;
+		}
+	}
+
+	if (ours) {
+		atomic_inc(&mp_huge_ins_r);
+		/* Do not actually unmap this page, we have one "huge" mapping */
+		page_pool_set_dma_addr(page, 0);
+	} else {
+		atomic_inc(&mp_huge_out_r);
+		/* Give it up */
+		__page_pool_release_page_dma(pool, page);
+	}
+
+	return !ours;
+}
+
+const struct pp_memory_provider_ops huge_ops = {
+	.init			= mp_huge_init,
+	.destroy		= mp_huge_destroy,
+	.alloc_pages		= mp_huge_alloc_pages,
+	.release_page		= mp_huge_release,
+};
+
+/*** 1G "Huge page" ***/
+
+/* Huge page memory provider allocates huge pages and splits them up into
+ * 4k pages. Whenever a page is outside of the page pool MP holds an extra
+ * reference to it, so that it doesn't get returned back into the allocator.
+ * On allocation request MP scans its banks of pages for pages with a single
+ * ref count held.
+ */
+#define MP_HUGE_1G_CNT		(SZ_128M / PAGE_SIZE)
+#define MP_HUGE_1G_ORDER	(27 - PAGE_SHIFT)
+
+struct mp_huge_1g {
+	struct mem_provider *mep;
+	struct page *page;
+	dma_addr_t dma;
+
+	unsigned int cur_idx ____cacheline_aligned_in_smp;
+};
+
+static int mp_huge_1g_init(struct page_pool *pool)
+{
+	struct mp_huge_1g *hu;
+
+	if (pool->p.order)
+		return -EINVAL;
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		return -EINVAL;
+
+	hu = kzalloc_node(sizeof(struct mp_huge_1g), GFP_KERNEL, pool->p.nid);
+	if (!hu)
+		return -ENOMEM;
+	hu->mep = pool->p.init_arg;
+
+	pool->mp_priv = hu;
+
+	hu->page = mep_alloc(hu->mep, MP_HUGE_1G_ORDER, &hu->dma, GFP_KERNEL);
+	if (!hu->page)
+		goto err_free_priv;
+
+	return 0;
+
+err_free_priv:
+	kfree(pool->mp_priv);
+	return -ENOMEM;
+}
+
+static void mp_huge_1g_destroy(struct page_pool *pool)
+{
+	struct mp_huge_1g *hu = pool->mp_priv;
+	struct page *page;
+	bool free;
+	int i;
+
+	free = true;
+	for (i = 0; i < MP_HUGE_1G_CNT; i++) {
+		page = hu->page + i;
+		if (page_ref_count(page) != 1) {
+			pr_warn("Page with ref count %d at %u. Can't safely destroy, leaking memory!\n",
+				page_ref_count(page), i);
+			free = false;
+			break;
+		}
+	}
+
+	if (free)
+		mep_free(hu->mep, hu->page, MP_HUGE_1G_ORDER, hu->dma);
+
+	kfree(hu);
+}
+
+static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+	struct mp_huge_1g *hu = pool->mp_priv;
+	unsigned int i, page_i;
+	struct page *page;
+
+	/* Try to find pages which we are the sole owner of */
+	for (i = 0; i < PP_ALLOC_CACHE_REFILL * 2; i++) {
+		page_i = hu->cur_idx + i;
+		page_i %= MP_HUGE_1G_CNT;
+
+		if (pool->alloc.count >= PP_ALLOC_CACHE_REFILL)
+			break;
+
+		page = hu->page + page_i;
+
+		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
+		    page_ref_count(page) != 1) {
+			atomic_inc(&mp_huge_ins_b);
+			continue;
+		}
+
+		atomic_inc(&mp_huge_ins_a);
+
+		page_pool_set_pp_info(pool, page);
+		page_pool_set_dma_addr(page, hu->dma + page_i * PAGE_SIZE);
+
+		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+			page_pool_dma_sync_for_device(pool, page,
+						      pool->p.max_len);
+
+		pool->alloc.cache[pool->alloc.count++] = page;
+		/* Track how many pages are held 'in-flight' */
+		pool->pages_state_hold_cnt++;
+		trace_page_pool_state_hold(pool, page,
+					   pool->pages_state_hold_cnt);
+	}
+
+	hu->cur_idx = page_i + 1; /* start from next, "going over" is okay */
+
+	/* Return last page */
+	if (likely(pool->alloc.count > 0)) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		alloc_stat_inc(pool, slow);
+	} else {
+		atomic_inc(&mp_huge_out_a);
+		page = __page_pool_alloc_pages_slow(pool, gfp);
+	}
+
+	/* When a page is just alloc'ed it should/must have refcnt 1. */
+	return page;
+}
+
+static bool mp_huge_1g_release(struct page_pool *pool, struct page *page)
+{
+	struct mp_huge_1g *hu = pool->mp_priv;
+	bool ours;
+
+	/* Check if the page comes from one of huge pages */
+	ours = page - hu->page < (unsigned long)MP_HUGE_1G_CNT;
+	if (ours) {
+		atomic_inc(&mp_huge_ins_r);
+		/* Do not actually unmap this page, we have one "huge" mapping */
+		page_pool_set_dma_addr(page, 0);
+	} else {
+		atomic_inc(&mp_huge_out_r);
+		/* Give it up */
+		__page_pool_release_page_dma(pool, page);
+	}
+
+	return !ours;
+}
+
+const struct pp_memory_provider_ops huge_1g_ops = {
+	.init			= mp_huge_1g_init,
+	.destroy		= mp_huge_1g_destroy,
+	.alloc_pages		= mp_huge_1g_alloc_pages,
+	.release_page		= mp_huge_1g_release,
+};
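
A minimal driver-side sketch, not part of the series, of how a pool would
be pointed at the 1G provider once a selector for it exists.  PP_MP_HUGE_1G
and the xyz_ prefix are made-up names; the mem_provider comes from
mep_create() as in patch 12, and order 0 plus PP_FLAG_DMA_MAP match what
mp_huge_1g_init() above insists on.

static struct page_pool *
xyz_create_rx_page_pool(struct device *dev, struct mem_provider *mep,
			unsigned int ring_size)
{
	struct page_pool_params pp = { 0 };

	pp.order	= 0;			/* providers hand out 4k pages */
	pp.flags	= PP_FLAG_DMA_MAP;	/* required by mp_huge_1g_init() */
	pp.pool_size	= ring_size;
	pp.nid		= dev_to_node(dev);
	pp.dev		= dev;
	pp.dma_dir	= DMA_BIDIRECTIONAL;
	pp.memory_provider = PP_MP_HUGE_1G;	/* assumed enum value */
	pp.init_arg	= mep;			/* consumed by mp_huge_1g_init() */

	return page_pool_create(&pp);
}
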
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 08/12] eth: bnxt: let the page pool manage the DMA mapping
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
                   ` (6 preceding siblings ...)
  2023-07-07 18:39 ` [RFC 07/12] net: page_pool: add huge page backed memory providers Jakub Kicinski
@ 2023-07-07 18:39 ` Jakub Kicinski
  2023-07-10 10:12   ` Jesper Dangaard Brouer
  2023-07-07 18:39 ` [RFC 09/12] eth: bnxt: use the page pool for data pages Jakub Kicinski
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

Use the page pool's ability to maintain DMA mappings for us.
This avoids re-mapping recycled pages.

Note that pages in the pool are always mapped DMA_BIDIRECTIONAL,
so we should use that instead of looking at bp->rx_dir.

The syncing is probably wrong, TBH, I haven't studied the page
pool rules, they always confused me. But for a hack, who cares,
x86 :D

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 24 ++++++++---------------
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index e5b54e6025be..6512514cd498 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -706,12 +706,9 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
 	if (!page)
 		return NULL;
 
-	*mapping = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, bp->rx_dir,
-				      DMA_ATTR_WEAK_ORDERING);
-	if (dma_mapping_error(dev, *mapping)) {
-		page_pool_recycle_direct(rxr->page_pool, page);
-		return NULL;
-	}
+	*mapping = page_pool_get_dma_addr(page);
+	dma_sync_single_for_device(dev, *mapping, PAGE_SIZE, DMA_BIDIRECTIONAL);
+
 	return page;
 }
 
@@ -951,6 +948,7 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
 					      unsigned int offset_and_len)
 {
 	unsigned int len = offset_and_len & 0xffff;
+	struct device *dev = &bp->pdev->dev;
 	struct page *page = data;
 	u16 prod = rxr->rx_prod;
 	struct sk_buff *skb;
@@ -962,8 +960,7 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
 		return NULL;
 	}
 	dma_addr -= bp->rx_dma_offset;
-	dma_unmap_page_attrs(&bp->pdev->dev, dma_addr, PAGE_SIZE, bp->rx_dir,
-			     DMA_ATTR_WEAK_ORDERING);
+	dma_sync_single_for_cpu(dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
 	skb = build_skb(page_address(page), PAGE_SIZE);
 	if (!skb) {
 		page_pool_recycle_direct(rxr->page_pool, page);
@@ -984,6 +981,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 {
 	unsigned int payload = offset_and_len >> 16;
 	unsigned int len = offset_and_len & 0xffff;
+	struct device *dev = &bp->pdev->dev;
 	skb_frag_t *frag;
 	struct page *page = data;
 	u16 prod = rxr->rx_prod;
@@ -996,8 +994,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 		return NULL;
 	}
 	dma_addr -= bp->rx_dma_offset;
-	dma_unmap_page_attrs(&bp->pdev->dev, dma_addr, PAGE_SIZE, bp->rx_dir,
-			     DMA_ATTR_WEAK_ORDERING);
+	dma_sync_single_for_cpu(dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
 
 	if (unlikely(!payload))
 		payload = eth_get_headlen(bp->dev, data_ptr, len);
@@ -2943,9 +2940,6 @@ static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr)
 		rx_buf->data = NULL;
 		if (BNXT_RX_PAGE_MODE(bp)) {
 			mapping -= bp->rx_dma_offset;
-			dma_unmap_page_attrs(&pdev->dev, mapping, PAGE_SIZE,
-					     bp->rx_dir,
-					     DMA_ATTR_WEAK_ORDERING);
 			page_pool_recycle_direct(rxr->page_pool, data);
 		} else {
 			dma_unmap_single_attrs(&pdev->dev, mapping,
@@ -2967,9 +2961,6 @@ static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr)
 			continue;
 
 		if (BNXT_RX_PAGE_MODE(bp)) {
-			dma_unmap_page_attrs(&pdev->dev, rx_agg_buf->mapping,
-					     BNXT_RX_PAGE_SIZE, bp->rx_dir,
-					     DMA_ATTR_WEAK_ORDERING);
 			rx_agg_buf->page = NULL;
 			__clear_bit(i, rxr->rx_agg_bmap);
 
@@ -3208,6 +3199,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
 {
 	struct page_pool_params pp = { 0 };
 
+	pp.flags = PP_FLAG_DMA_MAP;
 	pp.pool_size = bp->rx_ring_size;
 	pp.nid = dev_to_node(&bp->pdev->dev);
 	pp.napi = &rxr->bnapi->napi;
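
As a footnote on the "syncing is probably wrong" remark in the commit
message, a hedged sketch of the usual ownership hand-off when the pool
keeps long-lived DMA_BIDIRECTIONAL mappings.  dev/addr/len stand in for
the driver's own variables; only the shape and direction of the calls
matter here.

/* NIC -> CPU: the CPU is about to read data the device just wrote */
static void xyz_rx_sync_for_cpu(struct device *dev, dma_addr_t addr,
				unsigned int len)
{
	dma_sync_single_for_cpu(dev, addr, len, DMA_BIDIRECTIONAL);
}

/* CPU -> NIC: the buffer is about to be handed (back) to the device.
 * With PP_FLAG_DMA_SYNC_DEV and pp.max_len set, the pool can issue
 * this sync itself when pages are allocated or recycled.
 */
static void xyz_rx_sync_for_device(struct device *dev, dma_addr_t addr,
				   unsigned int len)
{
	dma_sync_single_for_device(dev, addr, len, DMA_BIDIRECTIONAL);
}
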
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 09/12] eth: bnxt: use the page pool for data pages
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
                   ` (7 preceding siblings ...)
  2023-07-07 18:39 ` [RFC 08/12] eth: bnxt: let the page pool manage the DMA mapping Jakub Kicinski
@ 2023-07-07 18:39 ` Jakub Kicinski
  2023-07-10  4:22   ` Michael Chan
  2023-07-07 18:39 ` [RFC 10/12] eth: bnxt: make sure we make for recycle skbs before freeing them Jakub Kicinski
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

To benefit from page recycling allocate the agg pages (used by HW-GRO
and jumbo) from the page pool.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 43 ++++++++++++-----------
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6512514cd498..734c2c6cad69 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -811,33 +811,27 @@ static inline int bnxt_alloc_rx_page(struct bnxt *bp,
 	u16 sw_prod = rxr->rx_sw_agg_prod;
 	unsigned int offset = 0;
 
-	if (BNXT_RX_PAGE_MODE(bp)) {
+	if (PAGE_SIZE <= BNXT_RX_PAGE_SIZE || BNXT_RX_PAGE_MODE(bp)) {
 		page = __bnxt_alloc_rx_page(bp, &mapping, rxr, gfp);
 
 		if (!page)
 			return -ENOMEM;
 
 	} else {
-		if (PAGE_SIZE > BNXT_RX_PAGE_SIZE) {
-			page = rxr->rx_page;
-			if (!page) {
-				page = alloc_page(gfp);
-				if (!page)
-					return -ENOMEM;
-				rxr->rx_page = page;
-				rxr->rx_page_offset = 0;
-			}
-			offset = rxr->rx_page_offset;
-			rxr->rx_page_offset += BNXT_RX_PAGE_SIZE;
-			if (rxr->rx_page_offset == PAGE_SIZE)
-				rxr->rx_page = NULL;
-			else
-				get_page(page);
-		} else {
+		page = rxr->rx_page;
+		if (!page) {
 			page = alloc_page(gfp);
 			if (!page)
 				return -ENOMEM;
+			rxr->rx_page = page;
+			rxr->rx_page_offset = 0;
 		}
+		offset = rxr->rx_page_offset;
+		rxr->rx_page_offset += BNXT_RX_PAGE_SIZE;
+		if (rxr->rx_page_offset == PAGE_SIZE)
+			rxr->rx_page = NULL;
+		else
+			get_page(page);
 
 		mapping = dma_map_page_attrs(&pdev->dev, page, offset,
 					     BNXT_RX_PAGE_SIZE, DMA_FROM_DEVICE,
@@ -1046,6 +1040,8 @@ static struct sk_buff *bnxt_rx_skb(struct bnxt *bp,
 
 	skb_reserve(skb, bp->rx_offset);
 	skb_put(skb, offset_and_len & 0xffff);
+	skb_mark_for_recycle(skb);
+
 	return skb;
 }
 
@@ -1110,9 +1106,13 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
 			return 0;
 		}
 
-		dma_unmap_page_attrs(&pdev->dev, mapping, BNXT_RX_PAGE_SIZE,
-				     bp->rx_dir,
-				     DMA_ATTR_WEAK_ORDERING);
+		if (PAGE_SIZE > BNXT_RX_PAGE_SIZE)
+			dma_unmap_page_attrs(&pdev->dev, mapping,
+					     BNXT_RX_PAGE_SIZE, bp->rx_dir,
+					     DMA_ATTR_WEAK_ORDERING);
+		else
+			dma_sync_single_for_cpu(&pdev->dev, mapping,
+						PAGE_SIZE, DMA_BIDIRECTIONAL);
 
 		total_frag_len += frag_len;
 		prod = NEXT_RX_AGG(prod);
@@ -1754,6 +1754,7 @@ static void bnxt_deliver_skb(struct bnxt *bp, struct bnxt_napi *bnapi,
 		return;
 	}
 	skb_record_rx_queue(skb, bnapi->index);
+	skb_mark_for_recycle(skb);
 	napi_gro_receive(&bnapi->napi, skb);
 }
 
@@ -2960,7 +2961,7 @@ static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr)
 		if (!page)
 			continue;
 
-		if (BNXT_RX_PAGE_MODE(bp)) {
+		if (PAGE_SIZE <= BNXT_RX_PAGE_SIZE || BNXT_RX_PAGE_MODE(bp)) {
 			rx_agg_buf->page = NULL;
 			__clear_bit(i, rxr->rx_agg_bmap);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 10/12] eth: bnxt: make sure we make for recycle skbs before freeing them
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
                   ` (8 preceding siblings ...)
  2023-07-07 18:39 ` [RFC 09/12] eth: bnxt: use the page pool for data pages Jakub Kicinski
@ 2023-07-07 18:39 ` Jakub Kicinski
  2023-07-07 18:39 ` [RFC 11/12] eth: bnxt: wrap coherent allocations into helpers Jakub Kicinski
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

Just in case the skbs we allocated have any PP pages attached
or their head is PP backed - make sure we mark them for recycle
before dropping.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 734c2c6cad69..679a28c038a2 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1132,6 +1132,7 @@ static struct sk_buff *bnxt_rx_agg_pages_skb(struct bnxt *bp,
 	total_frag_len = __bnxt_rx_agg_pages(bp, cpr, shinfo, idx,
 					     agg_bufs, tpa, NULL);
 	if (!total_frag_len) {
+		skb_mark_for_recycle(skb);
 		dev_kfree_skb(skb);
 		return NULL;
 	}
@@ -1535,6 +1536,7 @@ static struct sk_buff *bnxt_gro_func_5730x(struct bnxt_tpa_info *tpa_info,
 		th = tcp_hdr(skb);
 		th->check = ~tcp_v6_check(len, &iph->saddr, &iph->daddr, 0);
 	} else {
+		skb_mark_for_recycle(skb);
 		dev_kfree_skb_any(skb);
 		return NULL;
 	}
@@ -1715,6 +1717,7 @@ static inline struct sk_buff *bnxt_tpa_end(struct bnxt *bp,
 		if (eth_type_vlan(vlan_proto)) {
 			__vlan_hwaccel_put_tag(skb, vlan_proto, vtag);
 		} else {
+			skb_mark_for_recycle(skb);
 			dev_kfree_skb(skb);
 			return NULL;
 		}
@@ -1987,6 +1990,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
 		if (eth_type_vlan(vlan_proto)) {
 			__vlan_hwaccel_put_tag(skb, vlan_proto, vtag);
 		} else {
+			skb_mark_for_recycle(skb);
 			dev_kfree_skb(skb);
 			goto next_rx;
 		}
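
Spelled out once as a hedged sketch, the pattern all four hunks above
repeat (xyz_drop_skb is a made-up helper name):

static void xyz_drop_skb(struct sk_buff *skb)
{
	/* The skb may have a page-pool backed head or frags.  Marking it
	 * for recycle lets kfree_skb() hand those pages back to the pool
	 * instead of dropping them with a plain put_page() behind the
	 * pool's back.
	 */
	skb_mark_for_recycle(skb);
	dev_kfree_skb_any(skb);
}
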
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 11/12] eth: bnxt: wrap coherent allocations into helpers
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
                   ` (9 preceding siblings ...)
  2023-07-07 18:39 ` [RFC 10/12] eth: bnxt: make sure we make for recycle skbs before freeing them Jakub Kicinski
@ 2023-07-07 18:39 ` Jakub Kicinski
  2023-07-07 18:39 ` [RFC 12/12] eth: bnxt: hack in the use of MEP Jakub Kicinski
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

Prep for using a huge-page backed allocator.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 58 +++++++++++++----------
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 679a28c038a2..b36c42d37a38 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -2831,6 +2831,18 @@ static int bnxt_poll_p5(struct napi_struct *napi, int budget)
 	return work_done;
 }
 
+static void *bnxt_alloc_coherent(struct bnxt *bp, unsigned long size,
+				 dma_addr_t *dma, gfp_t gfp)
+{
+	return dma_alloc_coherent(&bp->pdev->dev, size, dma, gfp);
+}
+
+static void bnxt_free_coherent(struct bnxt *bp, unsigned long size,
+			       void *addr, dma_addr_t dma)
+{
+	return dma_free_coherent(&bp->pdev->dev, size, addr, dma);
+}
+
 static void bnxt_free_tx_skbs(struct bnxt *bp)
 {
 	int i, max_idx;
@@ -3027,7 +3039,6 @@ static void bnxt_init_ctx_mem(struct bnxt_mem_init *mem_init, void *p, int len)
 
 static void bnxt_free_ring(struct bnxt *bp, struct bnxt_ring_mem_info *rmem)
 {
-	struct pci_dev *pdev = bp->pdev;
 	int i;
 
 	if (!rmem->pg_arr)
@@ -3037,8 +3048,8 @@ static void bnxt_free_ring(struct bnxt *bp, struct bnxt_ring_mem_info *rmem)
 		if (!rmem->pg_arr[i])
 			continue;
 
-		dma_free_coherent(&pdev->dev, rmem->page_size,
-				  rmem->pg_arr[i], rmem->dma_arr[i]);
+		bnxt_free_coherent(bp, rmem->page_size,
+				   rmem->pg_arr[i], rmem->dma_arr[i]);
 
 		rmem->pg_arr[i] = NULL;
 	}
@@ -3048,8 +3059,8 @@ static void bnxt_free_ring(struct bnxt *bp, struct bnxt_ring_mem_info *rmem)
 
 		if (rmem->flags & BNXT_RMEM_USE_FULL_PAGE_FLAG)
 			pg_tbl_size = rmem->page_size;
-		dma_free_coherent(&pdev->dev, pg_tbl_size,
-				  rmem->pg_tbl, rmem->pg_tbl_map);
+		bnxt_free_coherent(bp, pg_tbl_size,
+				   rmem->pg_tbl, rmem->pg_tbl_map);
 		rmem->pg_tbl = NULL;
 	}
 	if (rmem->vmem_size && *rmem->vmem) {
@@ -3060,7 +3071,6 @@ static void bnxt_free_ring(struct bnxt *bp, struct bnxt_ring_mem_info *rmem)
 
 static int bnxt_alloc_ring(struct bnxt *bp, struct bnxt_ring_mem_info *rmem)
 {
-	struct pci_dev *pdev = bp->pdev;
 	u64 valid_bit = 0;
 	int i;
 
@@ -3071,7 +3081,7 @@ static int bnxt_alloc_ring(struct bnxt *bp, struct bnxt_ring_mem_info *rmem)
 
 		if (rmem->flags & BNXT_RMEM_USE_FULL_PAGE_FLAG)
 			pg_tbl_size = rmem->page_size;
-		rmem->pg_tbl = dma_alloc_coherent(&pdev->dev, pg_tbl_size,
+		rmem->pg_tbl = bnxt_alloc_coherent(bp, pg_tbl_size,
 						  &rmem->pg_tbl_map,
 						  GFP_KERNEL);
 		if (!rmem->pg_tbl)
@@ -3081,7 +3091,7 @@ static int bnxt_alloc_ring(struct bnxt *bp, struct bnxt_ring_mem_info *rmem)
 	for (i = 0; i < rmem->nr_pages; i++) {
 		u64 extra_bits = valid_bit;
 
-		rmem->pg_arr[i] = dma_alloc_coherent(&pdev->dev,
+		rmem->pg_arr[i] = bnxt_alloc_coherent(bp,
 						     rmem->page_size,
 						     &rmem->dma_arr[i],
 						     GFP_KERNEL);
@@ -3282,7 +3292,6 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
 static void bnxt_free_tx_rings(struct bnxt *bp)
 {
 	int i;
-	struct pci_dev *pdev = bp->pdev;
 
 	if (!bp->tx_ring)
 		return;
@@ -3292,8 +3301,8 @@ static void bnxt_free_tx_rings(struct bnxt *bp)
 		struct bnxt_ring_struct *ring;
 
 		if (txr->tx_push) {
-			dma_free_coherent(&pdev->dev, bp->tx_push_size,
-					  txr->tx_push, txr->tx_push_mapping);
+			bnxt_free_coherent(bp, bp->tx_push_size,
+					   txr->tx_push, txr->tx_push_mapping);
 			txr->tx_push = NULL;
 		}
 
@@ -3306,7 +3315,6 @@ static void bnxt_free_tx_rings(struct bnxt *bp)
 static int bnxt_alloc_tx_rings(struct bnxt *bp)
 {
 	int i, j, rc;
-	struct pci_dev *pdev = bp->pdev;
 
 	bp->tx_push_size = 0;
 	if (bp->tx_push_thresh) {
@@ -3341,7 +3349,7 @@ static int bnxt_alloc_tx_rings(struct bnxt *bp)
 			/* One pre-allocated DMA buffer to backup
 			 * TX push operation
 			 */
-			txr->tx_push = dma_alloc_coherent(&pdev->dev,
+			txr->tx_push = bnxt_alloc_coherent(bp,
 						bp->tx_push_size,
 						&txr->tx_push_mapping,
 						GFP_KERNEL);
@@ -4017,7 +4025,6 @@ static void bnxt_free_vnic_attributes(struct bnxt *bp)
 {
 	int i;
 	struct bnxt_vnic_info *vnic;
-	struct pci_dev *pdev = bp->pdev;
 
 	if (!bp->vnic_info)
 		return;
@@ -4032,15 +4039,15 @@ static void bnxt_free_vnic_attributes(struct bnxt *bp)
 		vnic->uc_list = NULL;
 
 		if (vnic->mc_list) {
-			dma_free_coherent(&pdev->dev, vnic->mc_list_size,
-					  vnic->mc_list, vnic->mc_list_mapping);
+			bnxt_free_coherent(bp, vnic->mc_list_size,
+					   vnic->mc_list, vnic->mc_list_mapping);
 			vnic->mc_list = NULL;
 		}
 
 		if (vnic->rss_table) {
-			dma_free_coherent(&pdev->dev, vnic->rss_table_size,
-					  vnic->rss_table,
-					  vnic->rss_table_dma_addr);
+			bnxt_free_coherent(bp, vnic->rss_table_size,
+					   vnic->rss_table,
+					   vnic->rss_table_dma_addr);
 			vnic->rss_table = NULL;
 		}
 
@@ -4053,7 +4060,6 @@ static int bnxt_alloc_vnic_attributes(struct bnxt *bp)
 {
 	int i, rc = 0, size;
 	struct bnxt_vnic_info *vnic;
-	struct pci_dev *pdev = bp->pdev;
 	int max_rings;
 
 	for (i = 0; i < bp->nr_vnics; i++) {
@@ -4074,7 +4080,7 @@ static int bnxt_alloc_vnic_attributes(struct bnxt *bp)
 		if (vnic->flags & BNXT_VNIC_MCAST_FLAG) {
 			vnic->mc_list_size = BNXT_MAX_MC_ADDRS * ETH_ALEN;
 			vnic->mc_list =
-				dma_alloc_coherent(&pdev->dev,
+				bnxt_alloc_coherent(bp,
 						   vnic->mc_list_size,
 						   &vnic->mc_list_mapping,
 						   GFP_KERNEL);
@@ -4108,7 +4114,7 @@ static int bnxt_alloc_vnic_attributes(struct bnxt *bp)
 			size = L1_CACHE_ALIGN(BNXT_MAX_RSS_TABLE_SIZE_P5);
 
 		vnic->rss_table_size = size + HW_HASH_KEY_SIZE;
-		vnic->rss_table = dma_alloc_coherent(&pdev->dev,
+		vnic->rss_table = bnxt_alloc_coherent(bp,
 						     vnic->rss_table_size,
 						     &vnic->rss_table_dma_addr,
 						     GFP_KERNEL);
@@ -4159,8 +4165,8 @@ static void bnxt_free_stats_mem(struct bnxt *bp, struct bnxt_stats_mem *stats)
 	kfree(stats->sw_stats);
 	stats->sw_stats = NULL;
 	if (stats->hw_stats) {
-		dma_free_coherent(&bp->pdev->dev, stats->len, stats->hw_stats,
-				  stats->hw_stats_map);
+		bnxt_free_coherent(bp, stats->len, stats->hw_stats,
+				   stats->hw_stats_map);
 		stats->hw_stats = NULL;
 	}
 }
@@ -4168,8 +4174,8 @@ static void bnxt_free_stats_mem(struct bnxt *bp, struct bnxt_stats_mem *stats)
 static int bnxt_alloc_stats_mem(struct bnxt *bp, struct bnxt_stats_mem *stats,
 				bool alloc_masks)
 {
-	stats->hw_stats = dma_alloc_coherent(&bp->pdev->dev, stats->len,
-					     &stats->hw_stats_map, GFP_KERNEL);
+	stats->hw_stats = bnxt_alloc_coherent(bp, stats->len,
+					      &stats->hw_stats_map, GFP_KERNEL);
 	if (!stats->hw_stats)
 		return -ENOMEM;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 12/12] eth: bnxt: hack in the use of MEP
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
                   ` (10 preceding siblings ...)
  2023-07-07 18:39 ` [RFC 11/12] eth: bnxt: wrap coherent allocations into helpers Jakub Kicinski
@ 2023-07-07 18:39 ` Jakub Kicinski
  2023-07-07 19:45 ` [RFC 00/12] net: huge page backed page_pool Mina Almasry
  2023-07-11 15:49 ` Jesper Dangaard Brouer
  13 siblings, 0 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

Well, the uAPI is lacking so... module params?

No datapath changes needed.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 29 ++++++++++++++++++++---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  5 ++++
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index b36c42d37a38..e745ce1f50d7 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -58,6 +58,8 @@
 #include <linux/align.h>
 #include <net/netdev_queues.h>
 
+#include <net/dcalloc.h>
+
 #include "bnxt_hsi.h"
 #include "bnxt.h"
 #include "bnxt_hwrm.h"
@@ -76,6 +78,9 @@
 #define BNXT_DEF_MSG_ENABLE	(NETIF_MSG_DRV | NETIF_MSG_HW | \
 				 NETIF_MSG_TX_ERR)
 
+static int pp_mode;
+module_param(pp_mode, int, 0644);
+
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Broadcom BCM573xx network driver");
 
@@ -2834,13 +2839,14 @@ static int bnxt_poll_p5(struct napi_struct *napi, int budget)
 static void *bnxt_alloc_coherent(struct bnxt *bp, unsigned long size,
 				 dma_addr_t *dma, gfp_t gfp)
 {
-	return dma_alloc_coherent(&bp->pdev->dev, size, dma, gfp);
+	ASSERT_RTNL();
+	return dma_cocoa_alloc(bp->mp.dco, size, dma, gfp);
 }
 
 static void bnxt_free_coherent(struct bnxt *bp, unsigned long size,
 			       void *addr, dma_addr_t dma)
 {
-	return dma_free_coherent(&bp->pdev->dev, size, addr, dma);
+	dma_cocoa_free(bp->mp.dco, size, addr, dma);
 }
 
 static void bnxt_free_tx_skbs(struct bnxt *bp)
@@ -3220,6 +3226,8 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
 	pp.napi = &rxr->bnapi->napi;
 	pp.dev = &bp->pdev->dev;
 	pp.dma_dir = DMA_BIDIRECTIONAL;
+	pp.memory_provider = pp_mode;
+	pp.init_arg = bp->mp.mp;
 
 	rxr->page_pool = page_pool_create(&pp);
 	if (IS_ERR(rxr->page_pool)) {
@@ -13607,6 +13615,14 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (rc < 0)
 		goto init_err_free;
 
+	bp->mp.mp = mep_create(&pdev->dev);
+	if (!bp->mp.mp)
+		goto init_err_pci_clean;
+
+	bp->mp.dco = dma_cocoa_create(&bp->pdev->dev, GFP_KERNEL);
+	if (!bp->mp.dco)
+		goto init_err_mep_destroy;
+
 	dev->netdev_ops = &bnxt_netdev_ops;
 	dev->watchdog_timeo = BNXT_TX_TIMEOUT;
 	dev->ethtool_ops = &bnxt_ethtool_ops;
@@ -13614,7 +13630,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	rc = bnxt_alloc_hwrm_resources(bp);
 	if (rc)
-		goto init_err_pci_clean;
+		goto init_err_dco_destroy;
 
 	mutex_init(&bp->hwrm_cmd_lock);
 	mutex_init(&bp->link_lock);
@@ -13788,6 +13804,11 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	bnxt_shutdown_tc(bp);
 	bnxt_clear_int_mode(bp);
 
+init_err_dco_destroy:
+	dma_cocoa_destroy(bp->mp.dco);
+init_err_mep_destroy:
+	mep_destroy(bp->mp.mp);
+
 init_err_pci_clean:
 	bnxt_hwrm_func_drv_unrgtr(bp);
 	bnxt_free_hwrm_resources(bp);
@@ -13826,6 +13847,8 @@ static void bnxt_shutdown(struct pci_dev *pdev)
 		dev_close(dev);
 
 	bnxt_clear_int_mode(bp);
+	dma_cocoa_destroy(bp->mp.dco);
+	mep_destroy(bp->mp.mp);
 	pci_disable_device(pdev);
 
 	if (system_state == SYSTEM_POWER_OFF) {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 080e73496066..9b323b27075f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -2170,6 +2170,11 @@ struct bnxt {
 	struct dentry		*debugfs_pdev;
 	struct device		*hwmon_dev;
 	enum board_idx		board_idx;
+
+	struct {
+		struct mem_provider *mp;
+		struct dma_cocoa *dco;
+	} mp;
 };
 
 #define BNXT_NUM_RX_RING_STATS			8
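
For completeness: pp_mode presumably maps straight onto the
pp_memory_provider_type enum from patch 6.  Only the first two entries
are visible in this thread, so the huge-provider names below are guesses:

enum pp_memory_provider_type {
	__PP_MP_NONE,	/* 0 - default, plain system page allocator */
	PP_MP_BASIC,	/* 1 - pass-through test provider */
	PP_MP_HUGE,	/* 2 - 2MB-backed provider (assumed name) */
	PP_MP_HUGE_1G,	/* 3 - 1G/CMA-backed provider (assumed name) */
};

Under that assumption, loading the module with pp_mode=3 would put every
bnxt RX page pool on the 1G provider, while the default pp_mode=0 keeps
today's behaviour.
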
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
                   ` (11 preceding siblings ...)
  2023-07-07 18:39 ` [RFC 12/12] eth: bnxt: hack in the use of MEP Jakub Kicinski
@ 2023-07-07 19:45 ` Mina Almasry
  2023-07-07 22:45   ` Jakub Kicinski
  2023-07-11 15:49 ` Jesper Dangaard Brouer
  13 siblings, 1 reply; 33+ messages in thread
From: Mina Almasry @ 2023-07-07 19:45 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, hawk, ilias.apalodimas, edumazet, dsahern, michael.chan, willemb

On Fri, Jul 7, 2023 at 11:39 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Hi!
>
> This is an "early PoC" at best. It seems to work for a basic
> traffic test but there's no uAPI and a lot more general polish
> is needed.
>
> The problem we're seeing is that performance of some older NICs
> degrades quite a bit when IOMMU is used (in non-passthru mode).
> There is a long tail of old NICs deployed, especially in PoPs/
> /on edge. From a conversation I had with Eric a few months
> ago it sounded like others may have similar issues. So I thought
> I'd take a swing at getting page pool to feed drivers huge pages.
> 1G pages require hooking into early init via CMA but it works
> just fine.
>
> I haven't tested this with a real workload, because I'm still
> waiting to get my hands on the right machine. But the experiment
> with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).
>

Thanks for CCing me Jakub. I'm working on a proposal for device memory
TCP, and I recently migrated it to be on top of your pp-provider idea
and I think I can share my test results as well. I had my code working
on top of your slightly older API I found here a few days ago:
https://github.com/kuba-moo/linux/tree/pp-providers

On top of the old API I had something with all my functionality tests
passing and performance benchmarking hitting ~96.5% line rate (with
all data going straight to the device - GPU - memory, which is the
point of the proposal). Of course, when you look at the code you may
not like the approach and I may need to try something else, which is
perfectly fine, but my current implementation is pp-provider based.

I'll look into rebasing my changes on top of this RFC and retesting,
but I should share my RFC either way sometime next week maybe. I took
a quick look at the changes you made here, and I don't think you
changed anything that would break my use case.

> In terms of the missing parts - uAPI is definitely needed.
> The rough plan would be to add memory config via the netdev
> genl family. Should fit nicely there. Have the config stored
> in struct netdevice. When page pool is created get to the netdev
> and automatically select the provider without the driver even
> knowing.

I guess I misunderstood the intent behind the original patches. I
thought you wanted the user to tell the driver what memory provider to
use, and the driver to recreate the page pool with that provider. What
you're saying here sounds much better, and means fewer changes to the driver.

>  Two problems with that are - 1) if the driver follows
> the recommended flow of allocating new queues before freeing
> old ones we will have page pools created before the old ones
> are gone, which means we'd need to reserve 2x the number of
> 1G pages; 2) there's no callback to the driver to say "I did
> something behind your back, don't worry about it, but recreate
> your queues, please" so the change will not take effect until
> some unrelated change like installing XDP. Which may be fine
> in practice but is a bit odd.
>

I have the same problem with device memory TCP. I solved it in a
similar way, doing something else in the driver that triggers
gve_close() & gve_open(). I wonder if the cleanest way to do this is
calling ethtool_ops->reset() or something like that? That was my idea
at least. I haven't tested it, but from reading the code it should do
a gve_close() + gve_open() like I want.
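
A rough sketch of that idea, only to show the shape of the call; the
reset flags semantics are driver specific and netdev_force_requeue is a
made-up name, so treat this as an assumption rather than a recipe:

static int netdev_force_requeue(struct net_device *dev)
{
	u32 flags = ETH_RESET_ALL;

	if (!dev->ethtool_ops || !dev->ethtool_ops->reset)
		return -EOPNOTSUPP;
	/* per the reading above, for gve this should end up doing
	 * gve_close() + gve_open()
	 */
	return dev->ethtool_ops->reset(dev, &flags);
}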

> Then we get into hand-wavy stuff like - if we can link page
> pools to netdevs, we should also be able to export the page pool
> stats via the netdev family instead doing it the ethtool -S.. ekhm..
> "way". And if we start storing configs behind driver's back why
> don't we also store other params, like ring size and queue count...
> A lot of potential improvements as we iron out a new API...
>
> Live tree: https://github.com/kuba-moo/linux/tree/pp-providers
>
> Jakub Kicinski (12):
>   net: hack together some page sharing
>   net: create a 1G-huge-page-backed allocator
>   net: page_pool: hide page_pool_release_page()
>   net: page_pool: merge page_pool_release_page() with
>     page_pool_return_page()
>   net: page_pool: factor out releasing DMA from releasing the page
>   net: page_pool: create hooks for custom page providers
>   net: page_pool: add huge page backed memory providers
>   eth: bnxt: let the page pool manage the DMA mapping
>   eth: bnxt: use the page pool for data pages
>   eth: bnxt: make sure we make for recycle skbs before freeing them
>   eth: bnxt: wrap coherent allocations into helpers
>   eth: bnxt: hack in the use of MEP
>
>  Documentation/networking/page_pool.rst        |  10 +-
>  arch/x86/kernel/setup.c                       |   6 +-
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 154 +++--
>  drivers/net/ethernet/broadcom/bnxt/bnxt.h     |   5 +
>  drivers/net/ethernet/engleder/tsnep_main.c    |   2 +-
>  .../net/ethernet/stmicro/stmmac/stmmac_main.c |   4 +-
>  include/net/dcalloc.h                         |  28 +
>  include/net/page_pool.h                       |  36 +-
>  net/core/Makefile                             |   2 +-
>  net/core/dcalloc.c                            | 615 +++++++++++++++++
>  net/core/dcalloc.h                            |  96 +++
>  net/core/page_pool.c                          | 625 +++++++++++++++++-
>  12 files changed, 1478 insertions(+), 105 deletions(-)
>  create mode 100644 include/net/dcalloc.h
>  create mode 100644 net/core/dcalloc.c
>  create mode 100644 net/core/dcalloc.h
>
> --
> 2.41.0
>


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 06/12] net: page_pool: create hooks for custom page providers
  2023-07-07 18:39 ` [RFC 06/12] net: page_pool: create hooks for custom page providers Jakub Kicinski
@ 2023-07-07 19:50   ` Mina Almasry
  2023-07-07 22:28     ` Jakub Kicinski
  0 siblings, 1 reply; 33+ messages in thread
From: Mina Almasry @ 2023-07-07 19:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, hawk, ilias.apalodimas, edumazet, dsahern, michael.chan, willemb

On Fri, Jul 7, 2023 at 11:39 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> The page providers which try to reuse the same pages will
> need to hold onto the ref, even if page gets released from
> the pool - as in releasing the page from the pp just transfers
> the "ownership" reference from pp to the provider, and provider
> will wait for other references to be gone before feeding this
> page back into the pool.
>
> The rest is pretty obvious.
>
> Add a test provider which should behave identically to
> a normal page pool.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  include/net/page_pool.h | 20 +++++++++++
>  net/core/page_pool.c    | 80 +++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 97 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index b082c9118f05..5859ab838ed2 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -77,6 +77,7 @@ struct page_pool_params {
>         int             nid;  /* Numa node id to allocate from pages from */
>         struct device   *dev; /* device, for DMA pre-mapping purposes */
>         struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */
> +       u8              memory_provider; /* haaacks! should be user-facing */
>         enum dma_data_direction dma_dir; /* DMA mapping direction */
>         unsigned int    max_len; /* max DMA sync memory size */
>         unsigned int    offset;  /* DMA addr offset */
> @@ -147,6 +148,22 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
>
>  #endif
>
> +struct mem_provider;
> +
> +enum pp_memory_provider_type {
> +       __PP_MP_NONE, /* Use system allocator directly */
> +       PP_MP_BASIC, /* Test purposes only, Hacky McHackface */
> +};
> +
> +struct pp_memory_provider_ops {
> +       int (*init)(struct page_pool *pool);
> +       void (*destroy)(struct page_pool *pool);
> +       struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
> +       bool (*release_page)(struct page_pool *pool, struct page *page);
> +};
> +
> +extern const struct pp_memory_provider_ops basic_ops;
> +
>  struct page_pool {
>         struct page_pool_params p;
>
> @@ -194,6 +211,9 @@ struct page_pool {
>          */
>         struct ptr_ring ring;
>
> +       const struct pp_memory_provider_ops *mp_ops;
> +       void *mp_priv;
> +
>  #ifdef CONFIG_PAGE_POOL_STATS
>         /* recycle stats are per-cpu to avoid locking */
>         struct page_pool_recycle_stats __percpu *recycle_stats;
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 09f8c34ad4a7..e886a439f9bb 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -23,6 +23,8 @@
>
>  #include <trace/events/page_pool.h>
>
> +static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
> +
>  #define DEFER_TIME (msecs_to_jiffies(1000))
>  #define DEFER_WARN_INTERVAL (60 * HZ)
>
> @@ -161,6 +163,7 @@ static int page_pool_init(struct page_pool *pool,
>                           const struct page_pool_params *params)
>  {
>         unsigned int ring_qsize = 1024; /* Default */
> +       int err;
>
>         memcpy(&pool->p, params, sizeof(pool->p));
>
> @@ -218,10 +221,36 @@ static int page_pool_init(struct page_pool *pool,
>         /* Driver calling page_pool_create() also call page_pool_destroy() */
>         refcount_set(&pool->user_cnt, 1);
>
> +       switch (pool->p.memory_provider) {
> +       case __PP_MP_NONE:
> +               break;
> +       case PP_MP_BASIC:
> +               pool->mp_ops = &basic_ops;
> +               break;
> +       default:
> +               err = -EINVAL;
> +               goto free_ptr_ring;
> +       }
> +
> +       if (pool->mp_ops) {
> +               err = pool->mp_ops->init(pool);
> +               if (err) {
> +                       pr_warn("%s() mem-provider init failed %d\n",
> +                               __func__, err);
> +                       goto free_ptr_ring;
> +               }
> +
> +               static_branch_inc(&page_pool_mem_providers);
> +       }
> +
>         if (pool->p.flags & PP_FLAG_DMA_MAP)
>                 get_device(pool->p.dev);
>
>         return 0;
> +
> +free_ptr_ring:
> +       ptr_ring_cleanup(&pool->ring, NULL);
> +       return err;
>  }
>
>  struct page_pool *page_pool_create(const struct page_pool_params *params)
> @@ -463,7 +492,10 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
>                 return page;
>
>         /* Slow-path: cache empty, do real allocation */
> -       page = __page_pool_alloc_pages_slow(pool, gfp);
> +       if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
> +               page = pool->mp_ops->alloc_pages(pool, gfp);
> +       else
> +               page = __page_pool_alloc_pages_slow(pool, gfp);
>         return page;
>  }
>  EXPORT_SYMBOL(page_pool_alloc_pages);
> @@ -515,8 +547,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
>  void page_pool_return_page(struct page_pool *pool, struct page *page)
>  {
>         int count;
> +       bool put;
>
> -       __page_pool_release_page_dma(pool, page);
> +       put = true;
> +       if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
> +               put = pool->mp_ops->release_page(pool, page);
> +       else
> +               __page_pool_release_page_dma(pool, page);
>
>         page_pool_clear_pp_info(page);
>
> @@ -526,7 +563,8 @@ void page_pool_return_page(struct page_pool *pool, struct page *page)
>         count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
>         trace_page_pool_state_release(pool, page, count);
>
> -       put_page(page);
> +       if (put)
> +               put_page(page);

+1 to giving memory providers the option to replace put_page() with a
custom release function. In your original proposal, the put_page() was
intact, and I thought it was some requirement from you that pages must
be freed with put_page(). I made my code with/around that, but I think
it's nice to give future memory providers the option to replace this.
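
For comparison with mp_basic_release() above: a provider that wants to
keep its pages only has to clear the pool's DMA address and return false,
which is roughly the "ours" branch of mp_huge_release() from patch 7
(my_mp_release is a made-up name):

static bool my_mp_release(struct page_pool *pool, struct page *page)
{
	/* Keep our reference and the single huge DMA mapping; returning
	 * false makes the pool skip put_page(), so the provider remains
	 * the owner and can hand the page out again later.
	 */
	page_pool_set_dma_addr(page, 0);
	return false;
}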

>         /* An optimization would be to call __free_pages(page, pool->p.order)
>          * knowing page is not part of page-cache (thus avoiding a
>          * __page_cache_release() call).
> @@ -779,6 +817,11 @@ static void page_pool_free(struct page_pool *pool)
>         if (pool->disconnect)
>                 pool->disconnect(pool);
>
> +       if (pool->mp_ops) {
> +               pool->mp_ops->destroy(pool);
> +               static_branch_dec(&page_pool_mem_providers);
> +       }
> +
>         ptr_ring_cleanup(&pool->ring, NULL);
>
>         if (pool->p.flags & PP_FLAG_DMA_MAP)
> @@ -952,3 +995,34 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>         return true;
>  }
>  EXPORT_SYMBOL(page_pool_return_skb_page);
> +
> +/***********************
> + *  Mem provider hack  *
> + ***********************/
> +
> +static int mp_basic_init(struct page_pool *pool)
> +{
> +       return 0;
> +}
> +
> +static void mp_basic_destroy(struct page_pool *pool)
> +{
> +}
> +
> +static struct page *mp_basic_alloc_pages(struct page_pool *pool, gfp_t gfp)
> +{
> +       return __page_pool_alloc_pages_slow(pool, gfp);
> +}
> +
> +static bool mp_basic_release(struct page_pool *pool, struct page *page)
> +{
> +       __page_pool_release_page_dma(pool, page);
> +       return true;
> +}
> +
> +const struct pp_memory_provider_ops basic_ops = {
> +       .init                   = mp_basic_init,
> +       .destroy                = mp_basic_destroy,
> +       .alloc_pages            = mp_basic_alloc_pages,
> +       .release_page           = mp_basic_release,
> +};
> --
> 2.41.0
>


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 06/12] net: page_pool: create hooks for custom page providers
  2023-07-07 19:50   ` Mina Almasry
@ 2023-07-07 22:28     ` Jakub Kicinski
  0 siblings, 0 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 22:28 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, hawk, ilias.apalodimas, edumazet, dsahern, michael.chan, willemb

On Fri, 7 Jul 2023 12:50:51 -0700 Mina Almasry wrote:
> > -       put_page(page);
> > +       if (put)
> > +               put_page(page);  
> 
> +1 to giving memory providers the option to replace put_page() with a
> custom release function. In your original proposal, the put_page() was
> intact, and I thought it was some requirement from you that pages must
> be freed with put_page(). I made my code with/around that, but I think
> it's nice to give future memory providers the option to replace this.

I was kinda trying to pretend there is a reason, so that I could
justify the second callback - hoping it could be useful for the
device / user memory cases. But the way I was using it was racy
so I dropped it for now.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-07 19:45 ` [RFC 00/12] net: huge page backed page_pool Mina Almasry
@ 2023-07-07 22:45   ` Jakub Kicinski
  2023-07-10 17:31     ` Mina Almasry
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 22:45 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, hawk, ilias.apalodimas, edumazet, dsahern, michael.chan, willemb

On Fri, 7 Jul 2023 12:45:26 -0700 Mina Almasry wrote:
> > This is an "early PoC" at best. It seems to work for a basic
> > traffic test but there's no uAPI and a lot more general polish
> > is needed.
> >
> > The problem we're seeing is that performance of some older NICs
> > degrades quite a bit when IOMMU is used (in non-passthru mode).
> > There is a long tail of old NICs deployed, especially in PoPs/
> > /on edge. From a conversation I had with Eric a few months
> > ago it sounded like others may have similar issues. So I thought
> > I'd take a swing at getting page pool to feed drivers huge pages.
> > 1G pages require hooking into early init via CMA but it works
> > just fine.
> >
> > I haven't tested this with a real workload, because I'm still
> > waiting to get my hands on the right machine. But the experiment
> > with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).
> 
> Thanks for CCing me Jakub. I'm working on a proposal for device memory
> TCP, and I recently migrated it to be on top of your pp-provider idea
> and I think I can share my test results as well. I had my code working
> on top of your slightly older API I found here a few days ago:
> https://github.com/kuba-moo/linux/tree/pp-providers
> 
> On top of the old API I had something with all my functionality tests
> passing and performance benchmarking hitting ~96.5% line rate (with
> all data going straight to the device - GPU - memory, which is the
> point of the proposal). Of course, when you look at the code you may
> not like the approach and I may need to try something else, which is
> perfectly fine, but my current implementation is pp-provider based.
> 
> I'll look into rebasing my changes on top of this RFC and retesting,
> but I should share my RFC either way sometime next week maybe. I took
> a quick look at the changes you made here, and I don't think you
> changed anything that would break my use case.

Oh, sorry I didn't realize you were working on top of my changes
already. Yes, the memory provider API should not have changed much.
I mostly reshuffled the MEP code to have both a coherent and
non-coherent buddy allocator since then.

> > In terms of the missing parts - uAPI is definitely needed.
> > The rough plan would be to add memory config via the netdev
> > genl family. Should fit nicely there. Have the config stored
> > in struct netdevice. When page pool is created get to the netdev
> > and automatically select the provider without the driver even
> > knowing.  
> 
> I guess I misunderstood the intent behind the original patches. I
> thought you wanted the user to tell the driver what memory provider to
> use, and the driver to recreate the page pool with that provider. What
> you're saying here sounds much better, and less changes to the driver.
> 
> >  Two problems with that are - 1) if the driver follows
> > the recommended flow of allocating new queues before freeing
> > old ones we will have page pools created before the old ones
> > are gone, which means we'd need to reserve 2x the number of
> > 1G pages; 2) there's no callback to the driver to say "I did
> > something behind your back, don't worry about it, but recreate
> > your queues, please" so the change will not take effect until
> > some unrelated change like installing XDP. Which may be fine
> > in practice but is a bit odd.
> 
> I have the same problem with device memory TCP. I solved it in a
> similar way, doing something else in the driver that triggers
> gve_close() & gve_open(). I wonder if the cleanest way to do this is
> calling ethtool_ops->reset() or something like that? That was my idea
> at least. I haven't tested it, but from reading the code it should do
> a gve_close() + gve_open() like I want.

The prevailing wisdom so far was that close() + open() is not a good
idea. Some NICs will require large contiguous allocations for rings 
and context memory and there's no guarantee that open() will succeed
in prod when memory is fragmented. So you may end up with a close()d
NIC and a failure to open(), and the machine dropping off the net.

But if we don't close() before we open() and the memory provider is
single consumer we'll have problem #1 :(

BTW are you planning to use individual queues in prod? I anticipated
that for ZC we'll need to tie multiple queues into an RSS context, 
and then configure at the level of an RSS context.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 09/12] eth: bnxt: use the page pool for data pages
  2023-07-07 18:39 ` [RFC 09/12] eth: bnxt: use the page pool for data pages Jakub Kicinski
@ 2023-07-10  4:22   ` Michael Chan
  2023-07-10 17:04     ` Jakub Kicinski
  0 siblings, 1 reply; 33+ messages in thread
From: Michael Chan @ 2023-07-10  4:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	willemb, Andrew Gospodarek, Somnath Kotur

On Fri, Jul 7, 2023 at 11:39 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> To benefit from page recycling allocate the agg pages (used by HW-GRO
> and jumbo) from the page pool.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 43 ++++++++++++-----------
>  1 file changed, 22 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 6512514cd498..734c2c6cad69 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -811,33 +811,27 @@ static inline int bnxt_alloc_rx_page(struct bnxt *bp,
>         u16 sw_prod = rxr->rx_sw_agg_prod;
>         unsigned int offset = 0;
>
> -       if (BNXT_RX_PAGE_MODE(bp)) {
> +       if (PAGE_SIZE <= BNXT_RX_PAGE_SIZE || BNXT_RX_PAGE_MODE(bp)) {

We have a very similar set of patches from my colleague Somnath to
support the page pool, and they handle PAGE_SIZE >= BNXT_RX_PAGE_SIZE in
a more unified way.  So here, we don't have to deal with the if/else
condition. I should be able to post the patches later in the week
after some more QA.

>                 page = __bnxt_alloc_rx_page(bp, &mapping, rxr, gfp);
>
>                 if (!page)
>                         return -ENOMEM;
>
>         } else {
> -               if (PAGE_SIZE > BNXT_RX_PAGE_SIZE) {
> -                       page = rxr->rx_page;
> -                       if (!page) {
> -                               page = alloc_page(gfp);
> -                               if (!page)
> -                                       return -ENOMEM;
> -                               rxr->rx_page = page;
> -                               rxr->rx_page_offset = 0;
> -                       }
> -                       offset = rxr->rx_page_offset;
> -                       rxr->rx_page_offset += BNXT_RX_PAGE_SIZE;
> -                       if (rxr->rx_page_offset == PAGE_SIZE)
> -                               rxr->rx_page = NULL;
> -                       else
> -                               get_page(page);
> -               } else {
> +               page = rxr->rx_page;
> +               if (!page) {
>                         page = alloc_page(gfp);
>                         if (!page)
>                                 return -ENOMEM;
> +                       rxr->rx_page = page;
> +                       rxr->rx_page_offset = 0;
>                 }
> +               offset = rxr->rx_page_offset;
> +               rxr->rx_page_offset += BNXT_RX_PAGE_SIZE;
> +               if (rxr->rx_page_offset == PAGE_SIZE)
> +                       rxr->rx_page = NULL;
> +               else
> +                       get_page(page);
>
>                 mapping = dma_map_page_attrs(&pdev->dev, page, offset,
>                                              BNXT_RX_PAGE_SIZE, DMA_FROM_DEVICE,

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 08/12] eth: bnxt: let the page pool manage the DMA mapping
  2023-07-07 18:39 ` [RFC 08/12] eth: bnxt: let the page pool manage the DMA mapping Jakub Kicinski
@ 2023-07-10 10:12   ` Jesper Dangaard Brouer
  2023-07-26  6:56     ` Ilias Apalodimas
  0 siblings, 1 reply; 33+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-10 10:12 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: brouer, almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb



On 07/07/2023 20.39, Jakub Kicinski wrote:
> Use the page pool's ability to maintain DMA mappings for us.
> This avoid re-mapping recycled pages.
> 

For DMA with IOMMU mappings, using page_pool like this patch does solves
the main bottleneck.  Thus, I suspect this patch will give the biggest
performance boost on its own.

As you have already discovered, the next bottleneck then becomes the
IOMMU's address resolution, which the IOTLB (I/O Translation Lookaside
Buffer) hardware helps speed up.

There are a number of techniques for reducing IOTLB misses.
I recommend reading:
  IOMMU: Strategies for Mitigating the IOTLB Bottleneck
  - https://inria.hal.science/inria-00493752/document


> Note that pages in the pool are always mapped DMA_BIDIRECTIONAL,
> so we should use that instead of looking at bp->rx_dir.
> 
> The syncing is probably wrong, TBH, I haven't studied the page
> pool rules, they always confused me. But for a hack, who cares,
> x86 :D
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>   drivers/net/ethernet/broadcom/bnxt/bnxt.c | 24 ++++++++---------------
>   1 file changed, 8 insertions(+), 16 deletions(-)

Love seeing these stats, where page_pool reduces lines in drivers.

> 
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index e5b54e6025be..6512514cd498 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -706,12 +706,9 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
>   	if (!page)
>   		return NULL;
>   
> -	*mapping = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, bp->rx_dir,
> -				      DMA_ATTR_WEAK_ORDERING);
> -	if (dma_mapping_error(dev, *mapping)) {
> -		page_pool_recycle_direct(rxr->page_pool, page);
> -		return NULL;
> -	}
> +	*mapping = page_pool_get_dma_addr(page);
> +	dma_sync_single_for_device(dev, *mapping, PAGE_SIZE, DMA_BIDIRECTIONAL);
> +

You can keep this as-is, but I just wanted to mention that page_pool
supports doing the "dma_sync_for_device" via PP_FLAG_DMA_SYNC_DEV.
Thus, removing more lines from driver code.
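
i.e. roughly this in bnxt_alloc_rx_page_pool(), with the max_len/offset
values only a guess for illustration:

	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
	pp.max_len = PAGE_SIZE;		/* sync the whole page for the device */
	pp.offset = 0;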

>   	return page;
>   }
>   
> @@ -951,6 +948,7 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
>   					      unsigned int offset_and_len)
>   {
>   	unsigned int len = offset_and_len & 0xffff;
> +	struct device *dev = &bp->pdev->dev;
>   	struct page *page = data;
>   	u16 prod = rxr->rx_prod;
>   	struct sk_buff *skb;
> @@ -962,8 +960,7 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
>   		return NULL;
>   	}
>   	dma_addr -= bp->rx_dma_offset;
> -	dma_unmap_page_attrs(&bp->pdev->dev, dma_addr, PAGE_SIZE, bp->rx_dir,
> -			     DMA_ATTR_WEAK_ORDERING);
> +	dma_sync_single_for_cpu(dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
>   	skb = build_skb(page_address(page), PAGE_SIZE);
>   	if (!skb) {
>   		page_pool_recycle_direct(rxr->page_pool, page);
> @@ -984,6 +981,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
>   {
>   	unsigned int payload = offset_and_len >> 16;
>   	unsigned int len = offset_and_len & 0xffff;
> +	struct device *dev = &bp->pdev->dev;
>   	skb_frag_t *frag;
>   	struct page *page = data;
>   	u16 prod = rxr->rx_prod;
> @@ -996,8 +994,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
>   		return NULL;
>   	}
>   	dma_addr -= bp->rx_dma_offset;
> -	dma_unmap_page_attrs(&bp->pdev->dev, dma_addr, PAGE_SIZE, bp->rx_dir,
> -			     DMA_ATTR_WEAK_ORDERING);
> +	dma_sync_single_for_cpu(dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
>   
>   	if (unlikely(!payload))
>   		payload = eth_get_headlen(bp->dev, data_ptr, len);
> @@ -2943,9 +2940,6 @@ static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr)
>   		rx_buf->data = NULL;
>   		if (BNXT_RX_PAGE_MODE(bp)) {
>   			mapping -= bp->rx_dma_offset;
> -			dma_unmap_page_attrs(&pdev->dev, mapping, PAGE_SIZE,
> -					     bp->rx_dir,
> -					     DMA_ATTR_WEAK_ORDERING);
>   			page_pool_recycle_direct(rxr->page_pool, data);
>   		} else {
>   			dma_unmap_single_attrs(&pdev->dev, mapping,
> @@ -2967,9 +2961,6 @@ static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr)
>   			continue;
>   
>   		if (BNXT_RX_PAGE_MODE(bp)) {
> -			dma_unmap_page_attrs(&pdev->dev, rx_agg_buf->mapping,
> -					     BNXT_RX_PAGE_SIZE, bp->rx_dir,
> -					     DMA_ATTR_WEAK_ORDERING);
>   			rx_agg_buf->page = NULL;
>   			__clear_bit(i, rxr->rx_agg_bmap);
>   
> @@ -3208,6 +3199,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
>   {
>   	struct page_pool_params pp = { 0 };
>   
> +	pp.flags = PP_FLAG_DMA_MAP;
>   	pp.pool_size = bp->rx_ring_size;
>   	pp.nid = dev_to_node(&bp->pdev->dev);
>   	pp.napi = &rxr->bnapi->napi;


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 04/12] net: page_pool: merge page_pool_release_page() with page_pool_return_page()
  2023-07-07 18:39 ` [RFC 04/12] net: page_pool: merge page_pool_release_page() with page_pool_return_page() Jakub Kicinski
@ 2023-07-10 16:07   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 33+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-10 16:07 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: brouer, almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb



On 07/07/2023 20.39, Jakub Kicinski wrote:
> Now that page_pool_release_page() is not exported we can
> merge it with page_pool_return_page(). I believe that
> the "Do not replace this with page_pool_return_page()"
> comment was there in case page_pool_return_page() was
> not inlined, to avoid two function calls.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

I forgot the exact reason, but the "avoid two function calls" argument
makes sense.  As this is no longer an issue, I'm okay with this change.

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>

> ---
>   net/core/page_pool.c | 12 ++----------
>   1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 2c7cf5f2bcb8..7ca456bfab71 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -492,7 +492,7 @@ static s32 page_pool_inflight(struct page_pool *pool)
>    * a regular page (that will eventually be returned to the normal
>    * page-allocator via put_page).
>    */
> -static void page_pool_release_page(struct page_pool *pool, struct page *page)
> +static void page_pool_return_page(struct page_pool *pool, struct page *page)
>   {
>   	dma_addr_t dma;
>   	int count;
> @@ -518,12 +518,6 @@ static void page_pool_release_page(struct page_pool *pool, struct page *page)
>   	 */
>   	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
>   	trace_page_pool_state_release(pool, page, count);
> -}
> -
> -/* Return a page to the page allocator, cleaning up our state */
> -static void page_pool_return_page(struct page_pool *pool, struct page *page)
> -{
> -	page_pool_release_page(pool, page);
>   
>   	put_page(page);
>   	/* An optimization would be to call __free_pages(page, pool->p.order)
> @@ -615,9 +609,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>   	 * will be invoking put_page.
>   	 */
>   	recycle_stat_inc(pool, released_refcnt);
> -	/* Do not replace this with page_pool_return_page() */
> -	page_pool_release_page(pool, page);
> -	put_page(page);
> +	page_pool_return_page(pool, page);
>   
>   	return NULL;
>   }


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 09/12] eth: bnxt: use the page pool for data pages
  2023-07-10  4:22   ` Michael Chan
@ 2023-07-10 17:04     ` Jakub Kicinski
  0 siblings, 0 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-10 17:04 UTC (permalink / raw)
  To: Michael Chan
  Cc: netdev, almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	willemb, Andrew Gospodarek, Somnath Kotur

On Sun, 9 Jul 2023 21:22:50 -0700 Michael Chan wrote:
> > -       if (BNXT_RX_PAGE_MODE(bp)) {
> > +       if (PAGE_SIZE <= BNXT_RX_PAGE_SIZE || BNXT_RX_PAGE_MODE(bp)) {  
> 
> We have a very similar set of patches from my colleague Somnath to
> support page pool and it supports PAGE_SIZE >= BNXT_RX_PAGE_SIZE in a
> more unified way.  So here, we don't have to deal with the if/else
> condition. I should be able to post the patches later in the week
> after some more QA.

I should have made it clearer: I'm testing with bnxt because of HW
availability, but I have no strong need to get the bnxt bits merged
if it conflicts with your work.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-07 22:45   ` Jakub Kicinski
@ 2023-07-10 17:31     ` Mina Almasry
  0 siblings, 0 replies; 33+ messages in thread
From: Mina Almasry @ 2023-07-10 17:31 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, hawk, ilias.apalodimas, edumazet, dsahern, michael.chan, willemb

On Fri, Jul 7, 2023 at 3:45 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 7 Jul 2023 12:45:26 -0700 Mina Almasry wrote:
> > > This is an "early PoC" at best. It seems to work for a basic
> > > traffic test but there's no uAPI and a lot more general polish
> > > is needed.
> > >
> > > The problem we're seeing is that performance of some older NICs
> > > degrades quite a bit when IOMMU is used (in non-passthru mode).
> > > There is a long tail of old NICs deployed, especially in PoPs/
> > > /on edge. From a conversation I had with Eric a few months
> > > ago it sounded like others may have similar issues. So I thought
> > > I'd take a swing at getting page pool to feed drivers huge pages.
> > > 1G pages require hooking into early init via CMA but it works
> > > just fine.
> > >
> > > I haven't tested this with a real workload, because I'm still
> > > waiting to get my hands on the right machine. But the experiment
> > > with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).
> >
> > Thanks for CCing me Jakub. I'm working on a proposal for device memory
> > TCP, and I recently migrated it to be on top of your pp-provider idea
> > and I think I can share my test results as well. I had my code working
> > on top of your slightly older API I found here a few days ago:
> > https://github.com/kuba-moo/linux/tree/pp-providers
> >
> > On top of the old API I had something with all my functionality tests
> > passing and performance benchmarking hitting ~96.5% line rate (with
> > all data going straight to the device - GPU - memory, which is the
> > point of the proposal). Of course, when you look at the code you may
> > not like the approach and I may need to try something else, which is
> > perfectly fine, but my current implementation is pp-provider based.
> >
> > I'll look into rebasing my changes on top of this RFC and retesting,
> > but I should share my RFC either way sometime next week maybe. I took
> > a quick look at the changes you made here, and I don't think you
> > changed anything that would break my use case.
>
> Oh, sorry I didn't realize you were working on top of my changes
> already. Yes, the memory provider API should not have changed much.
> I mostly reshuffled the MEP code to have both a coherent and
> non-coherent buddy allocator since then.
>

No worries at all. I don't mind rebasing to new versions (and finding
out if they work for me).

> > > In terms of the missing parts - uAPI is definitely needed.
> > > The rough plan would be to add memory config via the netdev
> > > genl family. Should fit nicely there. Have the config stored
> > > in struct netdevice. When page pool is created get to the netdev
> > > and automatically select the provider without the driver even
> > > knowing.
> >
> > I guess I misunderstood the intent behind the original patches. I
> > thought you wanted the user to tell the driver what memory provider to
> > use, and the driver to recreate the page pool with that provider. What
> > you're saying here sounds much better, and less changes to the driver.
> >
> > >  Two problems with that are - 1) if the driver follows
> > > the recommended flow of allocating new queues before freeing
> > > old ones we will have page pools created before the old ones
> > > are gone, which means we'd need to reserve 2x the number of
> > > 1G pages; 2) there's no callback to the driver to say "I did
> > > something behind your back, don't worry about it, but recreate
> > > your queues, please" so the change will not take effect until
> > > some unrelated change like installing XDP. Which may be fine
> > > in practice but is a bit odd.
> >
> > I have the same problem with device memory TCP. I solved it in a
> > similar way, doing something else in the driver that triggers
> > gve_close() & gve_open(). I wonder if the cleanest way to do this is
> > calling ethtool_ops->reset() or something like that? That was my idea
> > at least. I haven't tested it, but from reading the code it should do
> > a gve_close() + gve_open() like I want.
>
> The prevailing wisdom so far was that close() + open() is not a good
> idea. Some NICs will require large contiguous allocations for rings
> and context memory and there's no guarantee that open() will succeed
> in prod when memory is fragmented. So you may end up with a close()d
> NIC and a failure to open(), and the machine dropping off the net.
>
> But if we don't close() before we open() and the memory provider is
> single consumer we'll have problem #1 :(
>
> BTW are you planning to use individual queues in prod? I anticipated
> that for ZC we'll need to tie multiple queues into an RSS context,
> and then configure at the level of an RSS context.

Our configuration:

- We designate a number of RX queues as devmem TCP queues.
- We designate the rest as regular TCP queues.
- We use RSS to steer all incoming traffic to the regular TCP queues.
- We use flow steering to steer specific TCP flows to the devmem TCP queues.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
                   ` (12 preceding siblings ...)
  2023-07-07 19:45 ` [RFC 00/12] net: huge page backed page_pool Mina Almasry
@ 2023-07-11 15:49 ` Jesper Dangaard Brouer
  2023-07-12  0:08   ` Jakub Kicinski
  13 siblings, 1 reply; 33+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-11 15:49 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: brouer, almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb



On 07/07/2023 20.39, Jakub Kicinski wrote:
> Hi!
> 
> This is an "early PoC" at best. It seems to work for a basic
> traffic test but there's no uAPI and a lot more general polish
> is needed.
> 
> The problem we're seeing is that performance of some older NICs
> degrades quite a bit when IOMMU is used (in non-passthru mode).
> There is a long tail of old NICs deployed, especially in PoPs/
> /on edge. From a conversation I had with Eric a few months
> ago it sounded like others may have similar issues. So I thought

Using page_pool on systems with an IOMMU is a big performance gain in
itself, as it removes the DMA map+unmap calls that need to change the
IOMMU setup/tables.

> I'd take a swing at getting page pool to feed drivers huge pages.
> 1G pages require hooking into early init via CMA but it works
> just fine.
> 
> I haven't tested this with a real workload, because I'm still
> waiting to get my hands on the right machine. But the experiment
> with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).
> 

I see you have discovered that the next bottleneck is IOTLB misses.
One of the techniques for reducing IOTLB misses is using huge pages,
called "super-pages" in the article (below), which reports that this trick
doesn't work on AMD (Pacifica arch).

I think you have convinced me that the pp_provider idea makes sense for
*this* use-case, because it feels natural to extend PP with
mitigations for IOTLB misses. (But I'm not 100% sure it fits Mina's
use-case).

What is your page refcnt strategy for these huge-pages? I assume this
relies on the PP frags scheme, e.g. using page->pp_frag_count.
Is this correctly understood?

Generally the pp_providers will have to use the refcnt schemes
supported by page_pool.  (Which is why I'm not 100% sure this fits
Mina's use-case).
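
For reference, the frags scheme mentioned above is the page_pool fragment
API, which splits one pool page among several buffers and tracks them via
page->pp_frag_count.  A rough sketch, with illustrative names only:

	unsigned int offset;
	struct page *page;

	/* pool created with PP_FLAG_PAGE_FRAG set in pp.flags */
	page = page_pool_dev_alloc_frag(pool, &offset, rx_buf_size);
	/* page + offset is one fragment; the page is recycled once all
	 * of its fragments have been returned. */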


[IOTLB details]:

As mentioned on [RFC 08/12] there are other techniques for reducing 
IOTLB misses, described in:
  IOMMU: Strategies for Mitigating the IOTLB Bottleneck
   - https://inria.hal.science/inria-00493752/document

I took a deeper look and also discovered Intel's documentation:
  - Intel virtualization technology for directed I/O, arch spec
    https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html

One problem that is interesting to notice is how NICs access the packets
via the ring-queue, which is likely larger than the number of IOTLB entries.
Thus, a high chance of IOTLB misses.  They suggest marking pages with
Eviction Hints (EH) that cause pages to be marked as Transient Mappings
(TM), which allows the IOMMU to evict these faster (making room for others),
and then combining this with prefetching.

In this context of how fast a page is reused by the NIC and spatial
locality, it is worth remembering that PP has two schemes: (1) the fast
alloc cache that in certain cases can recycle pages (and is based on a
stack approach), (2) normal recycling via the ptr_ring, where it takes
longer before a page gets reused.
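
For reference, a rough sketch of how a page ends up in one scheme or the
other from the driver's point of view (simplified):

	/* From the pool's own NAPI context: may go straight into the
	 * fast (stack-like) alloc cache. */
	page_pool_put_full_page(pool, page, true);

	/* From any other context: goes through the ptr_ring, so the
	 * page sits there longer before it is reused. */
	page_pool_put_full_page(pool, page, false);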


--Jesper

[RFC 08/12] https://lore.kernel.org/all/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.com/


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-11 15:49 ` Jesper Dangaard Brouer
@ 2023-07-12  0:08   ` Jakub Kicinski
  2023-07-12 11:47     ` Yunsheng Lin
  2023-07-12 14:00     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-12  0:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, brouer, almasrymina, hawk, ilias.apalodimas, edumazet,
	dsahern, michael.chan, willemb

On Tue, 11 Jul 2023 17:49:19 +0200 Jesper Dangaard Brouer wrote:
> I see you have discovered that the next bottleneck are the IOTLB misses.
> One of the techniques for reducing IOTLB misses is using huge pages.
> Called "super-pages" in article (below), and they report that this trick
> doesn't work on AMD (Pacifica arch).
> 
> I think you have convinced me that the pp_provider idea makes sense for
> *this* use-case, because it feels like natural to extend PP with
> mitigations for IOTLB misses. (But I'm not 100% sure it fits Mina's
> use-case).

We're on the same page then (no pun intended).

> What is your page refcnt strategy for these huge-pages. I assume this
> rely on PP frags-scheme, e.g. using page->pp_frag_count.
> Is this correctly understood?

Oh, I split the page into individual 4k pages after DMA mapping.
There's no need for the host memory to be a huge page. I mean, 
the actual kernel identity mapping is a huge page AFAIU, and the 
struct pages are allocated, anyway. We just need it to be a huge 
page at DMA mapping time.

So the pages from the huge page provider only differ from normal
alloc_page() pages by the fact that they are a part of a 1G DMA
mapping.

I'm talking mostly about the 1G provider; 2M providers can be
implemented using various strategies because 2M is smaller than 
MAX_ORDER.
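
A minimal sketch of the 1G case described above, with hypothetical names
(the real provider code in the RFC differs, and the dma_mapping_error()
check is omitted):

	dma_addr_t base_dma;
	unsigned long i;

	/* Map the whole 1G region once... */
	base_dma = dma_map_page_attrs(dev, huge_page, 0, SZ_1G,
				      DMA_BIDIRECTIONAL, DMA_ATTR_WEAK_ORDERING);

	/* ...then hand out ordinary 4k struct pages whose DMA address is
	 * just an offset into that single mapping. */
	for (i = 0; i < SZ_1G / PAGE_SIZE; i++)
		page_pool_set_dma_addr(huge_page + i, base_dma + i * PAGE_SIZE);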

> Generally the pp_provider's will have to use the refcnt schemes
> supported by page_pool.  (Which is why I'm not 100% sure this fits
> Mina's use-case).
>
> [IOTLB details]:
> 
> As mentioned on [RFC 08/12] there are other techniques for reducing 
> IOTLB misses, described in:
>   IOMMU: Strategies for Mitigating the IOTLB Bottleneck
>    - https://inria.hal.science/inria-00493752/document
> 
> I took a deeper look at also discovered Intel's documentation:
>   - Intel virtualization technology for directed I/O, arch spec
>   - 
> https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html
> 
> One problem that is interesting to notice is how NICs access the packets
> via ring-queue, which is likely larger that number of IOTLB entries.
> Thus, a high change of IOTLB misses.  They suggest marking pages with
> Eviction Hints (EH) that cause pages to be marked as Transient Mappings
> (TM) which allows IOMMU to evict these faster (making room for others).
> And then combine this with prefetching.

Interesting, didn't know about EH.

> In this context of how fast a page is reused by NIC and spatial
> locality, it is worth remembering that PP have two schemes, (1) the fast
> alloc cache that in certain cases can recycle pages (and it based on a
> stack approach), (2) normal recycling via the ptr_ring that will have a
> longer time before page gets reused.

I read somewhere that Intel IOTLB can be as small as 256 entries. 
So it seems pretty much impossible for it to cache accesses to 4k 
pages thru recycling. I thought that even 2M pages will start to 
be problematic for multi queue devices (1k entries on each ring x 
32 rings == 128MB just sitting on the ring, let alone circulation).

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-12  0:08   ` Jakub Kicinski
@ 2023-07-12 11:47     ` Yunsheng Lin
  2023-07-12 12:43       ` Jesper Dangaard Brouer
  2023-07-12 14:00     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 33+ messages in thread
From: Yunsheng Lin @ 2023-07-12 11:47 UTC (permalink / raw)
  To: Jakub Kicinski, Jesper Dangaard Brouer
  Cc: netdev, brouer, almasrymina, hawk, ilias.apalodimas, edumazet,
	dsahern, michael.chan, willemb

On 2023/7/12 8:08, Jakub Kicinski wrote:
> On Tue, 11 Jul 2023 17:49:19 +0200 Jesper Dangaard Brouer wrote:
>> I see you have discovered that the next bottleneck are the IOTLB misses.
>> One of the techniques for reducing IOTLB misses is using huge pages.
>> Called "super-pages" in article (below), and they report that this trick
>> doesn't work on AMD (Pacifica arch).
>>
>> I think you have convinced me that the pp_provider idea makes sense for
>> *this* use-case, because it feels like natural to extend PP with
>> mitigations for IOTLB misses. (But I'm not 100% sure it fits Mina's
>> use-case).
> 
> We're on the same page then (no pun intended).
> 
>> What is your page refcnt strategy for these huge-pages. I assume this
>> rely on PP frags-scheme, e.g. using page->pp_frag_count.
>> Is this correctly understood?
> 
> Oh, I split the page into individual 4k pages after DMA mapping.
> There's no need for the host memory to be a huge page. I mean, 
> the actual kernel identity mapping is a huge page AFAIU, and the 
> struct pages are allocated, anyway. We just need it to be a huge 
> page at DMA mapping time.
> 
> So the pages from the huge page provider only differ from normal
> alloc_page() pages by the fact that they are a part of a 1G DMA
> mapping.

If it is about DMA mapping, is it possible to use dma_map_sg()
to enable one big contiguous DMA mapping for a lot of discontiguous
4k pages, to avoid allocating a big huge page?

As the comment:
"The scatter gather list elements are merged together (if possible)
and tagged with the appropriate dma address and length."

https://elixir.free-electrons.com/linux/v4.16.18/source/arch/arm/mm/dma-mapping.c#L1805

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-12 11:47     ` Yunsheng Lin
@ 2023-07-12 12:43       ` Jesper Dangaard Brouer
  2023-07-12 17:01         ` Jakub Kicinski
  0 siblings, 1 reply; 33+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-12 12:43 UTC (permalink / raw)
  To: Yunsheng Lin, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: brouer, netdev, almasrymina, hawk, ilias.apalodimas, edumazet,
	dsahern, michael.chan, willemb



On 12/07/2023 13.47, Yunsheng Lin wrote:
> On 2023/7/12 8:08, Jakub Kicinski wrote:
>> On Tue, 11 Jul 2023 17:49:19 +0200 Jesper Dangaard Brouer wrote:
>>> I see you have discovered that the next bottleneck are the IOTLB misses.
>>> One of the techniques for reducing IOTLB misses is using huge pages.
>>> Called "super-pages" in article (below), and they report that this trick
>>> doesn't work on AMD (Pacifica arch).
>>>
>>> I think you have convinced me that the pp_provider idea makes sense for
>>> *this* use-case, because it feels like natural to extend PP with
>>> mitigations for IOTLB misses. (But I'm not 100% sure it fits Mina's
>>> use-case).
>>
>> We're on the same page then (no pun intended).
>>
>>> What is your page refcnt strategy for these huge-pages. I assume this
>>> rely on PP frags-scheme, e.g. using page->pp_frag_count.
>>> Is this correctly understood?
>>
>> Oh, I split the page into individual 4k pages after DMA mapping.
>> There's no need for the host memory to be a huge page. I mean,
>> the actual kernel identity mapping is a huge page AFAIU, and the
>> struct pages are allocated, anyway. We just need it to be a huge
>> page at DMA mapping time.
>>
>> So the pages from the huge page provider only differ from normal
>> alloc_page() pages by the fact that they are a part of a 1G DMA
>> mapping.

So, Jakub, you are saying the PP refcnts are still done "as usual" on 
individual pages.

> 
> If it is about DMA mapping, is it possible to use dma_map_sg()
> to enable a big continuous dma map for a lot of discontinuous
> 4k pages to avoid allocating big huge page?
> 
> As the comment:
> "The scatter gather list elements are merged together (if possible)
> and tagged with the appropriate dma address and length."
> 
> https://elixir.free-electrons.com/linux/v4.16.18/source/arch/arm/mm/dma-mapping.c#L1805
> 

This is interesting for two reasons.

(1) if this DMA merging helps IOTLB misses (?)

(2) PP could use dma_map_sg() to amortize dma_map call cost.

For case (2) __page_pool_alloc_pages_slow() already does bulk allocation
of pages (alloc_pages_bulk_array_node()), and then loops over the pages
to DMA map them individually.  It seems like an obvious win to use
dma_map_sg() here?
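
A rough sketch of what such a bulk-mapping helper could look like (names
and the GFP flag are illustrative; recording the per-page DMA addresses
and the later dma_unmap_sgtable() are left out):

	#include <linux/dma-mapping.h>
	#include <linux/scatterlist.h>

	static int pp_dma_map_bulk(struct device *dev, struct page **pages,
				   unsigned int nr, struct sg_table *sgt)
	{
		int err;

		err = sg_alloc_table_from_pages(sgt, pages, nr, 0,
						(size_t)nr * PAGE_SIZE, GFP_KERNEL);
		if (err)
			return err;

		/* One call maps (and, where the IOMMU allows, merges) all
		 * nr pages; keep sgt around for dma_unmap_sgtable() later. */
		err = dma_map_sgtable(dev, sgt, DMA_BIDIRECTIONAL, 0);
		if (err)
			sg_free_table(sgt);
		return err;
	}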

--Jesper




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-12  0:08   ` Jakub Kicinski
  2023-07-12 11:47     ` Yunsheng Lin
@ 2023-07-12 14:00     ` Jesper Dangaard Brouer
  2023-07-12 17:19       ` Jakub Kicinski
  1 sibling, 1 reply; 33+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-12 14:00 UTC (permalink / raw)
  To: Jakub Kicinski, Jesper Dangaard Brouer
  Cc: brouer, netdev, almasrymina, hawk, ilias.apalodimas, edumazet,
	dsahern, michael.chan, willemb, Ulrich Drepper



On 12/07/2023 02.08, Jakub Kicinski wrote:
> On Tue, 11 Jul 2023 17:49:19 +0200 Jesper Dangaard Brouer wrote:
>> I see you have discovered that the next bottleneck are the IOTLB misses.
>> One of the techniques for reducing IOTLB misses is using huge pages.
>> Called "super-pages" in article (below), and they report that this trick
>> doesn't work on AMD (Pacifica arch).
>>
>> I think you have convinced me that the pp_provider idea makes sense for
>> *this* use-case, because it feels like natural to extend PP with
>> mitigations for IOTLB misses. (But I'm not 100% sure it fits Mina's
>> use-case).
> 
> We're on the same page then (no pun intended).
> 
>> What is your page refcnt strategy for these huge-pages. I assume this
>> rely on PP frags-scheme, e.g. using page->pp_frag_count.
>> Is this correctly understood?
> 
> Oh, I split the page into individual 4k pages after DMA mapping.
> There's no need for the host memory to be a huge page. I mean,
> the actual kernel identity mapping is a huge page AFAIU, and the
> struct pages are allocated, anyway. We just need it to be a huge
> page at DMA mapping time.
> 
> So the pages from the huge page provider only differ from normal
> alloc_page() pages by the fact that they are a part of a 1G DMA
> mapping.
> 
> I'm talking mostly about the 1G provider, 2M providers can be
> implemented using various strategies cause 2M is smaller than
> MAX_ORDER.
> 
>> Generally the pp_provider's will have to use the refcnt schemes
>> supported by page_pool.  (Which is why I'm not 100% sure this fits
>> Mina's use-case).
>>
>> [IOTLB details]:
>>
>> As mentioned on [RFC 08/12] there are other techniques for reducing
>> IOTLB misses, described in:
>>    IOMMU: Strategies for Mitigating the IOTLB Bottleneck
>>     - https://inria.hal.science/inria-00493752/document
>>
>> I took a deeper look at also discovered Intel's documentation:
>>    - Intel virtualization technology for directed I/O, arch spec
>>    -
>> https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html
>>
>> One problem that is interesting to notice is how NICs access the packets
>> via ring-queue, which is likely larger that number of IOTLB entries.
>> Thus, a high change of IOTLB misses.  They suggest marking pages with
>> Eviction Hints (EH) that cause pages to be marked as Transient Mappings
>> (TM) which allows IOMMU to evict these faster (making room for others).
>> And then combine this with prefetching.
> 
> Interesting, didn't know about EH.
> 

I was looking for a way to set this Eviction Hint (EH) the article
talked about, but I'm at a loss.


>> In this context of how fast a page is reused by NIC and spatial
>> locality, it is worth remembering that PP have two schemes, (1) the fast
>> alloc cache that in certain cases can recycle pages (and it based on a
>> stack approach), (2) normal recycling via the ptr_ring that will have a
>> longer time before page gets reused.
> 
> I read somewhere that Intel IOTLB can be as small as 256 entries.

Is the IOTLB hardware different from the TLB hardware block?

I can find data on TLB sizes, which says there are two levels on Intel,
quote from "248966-Software-Optimization-Manual-R047.pdf":

  Nehalem microarchitecture implements two levels of translation 
lookaside buffer (TLB). The first level consists of separate TLBs for 
data and code. DTLB0 handles address translation for data accesses, it 
provides 64 entries to support 4KB pages and 32 entries for large pages.
The ITLB provides 64 entries (per thread) for 4KB pages and 7 entries 
(per thread) for large pages.

  The second level TLB (STLB) handles both code and data accesses for 
4KB pages. It support 4KB page translation operation that missed DTLB0 
or ITLB. All entries are 4-way associative. Here is a list of entries
in each DTLB:

  • STLB for 4-KByte pages: 512 entries (services both data and 
instruction look-ups).
  • DTLB0 for large pages: 32 entries.
  • DTLB0 for 4-KByte pages: 64 entries.

  An DTLB0 miss and STLB hit causes a penalty of 7cycles. Software only 
pays this penalty if the DTLB0 is used in some dispatch cases. The 
delays associated with a miss to the STLB and PMH are largely nonblocking.


> So it seems pretty much impossible for it to cache accesses to 4k
> pages thru recycling. I thought that even 2M pages will start to
> be problematic for multi queue devices (1k entries on each ring x
> 32 rings == 128MB just sitting on the ring, let alone circulation).
> 

Yes, I'm also worried about how badly these NIC rings and the PP ptr_ring
affect the IOTLB's ability to cache entries.  That is why I suggested testing
out the Eviction Hint (EH), but I have not found a way to use/enable
these as a quick test in your environment.

--Jesper


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-12 12:43       ` Jesper Dangaard Brouer
@ 2023-07-12 17:01         ` Jakub Kicinski
  2023-07-14 13:05           ` Yunsheng Lin
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-12 17:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Yunsheng Lin, brouer, netdev, almasrymina, hawk,
	ilias.apalodimas, edumazet, dsahern, michael.chan, willemb

On Wed, 12 Jul 2023 14:43:32 +0200 Jesper Dangaard Brouer wrote:
> On 12/07/2023 13.47, Yunsheng Lin wrote:
> > On 2023/7/12 8:08, Jakub Kicinski wrote:  
> >> Oh, I split the page into individual 4k pages after DMA mapping.
> >> There's no need for the host memory to be a huge page. I mean,
> >> the actual kernel identity mapping is a huge page AFAIU, and the
> >> struct pages are allocated, anyway. We just need it to be a huge
> >> page at DMA mapping time.
> >>
> >> So the pages from the huge page provider only differ from normal
> >> alloc_page() pages by the fact that they are a part of a 1G DMA
> >> mapping.  
> 
> So, Jakub you are saying the PP refcnt's are still done "as usual" on 
> individual pages.

Yes - other than coming from a specific 1G of physical memory 
the resulting pages are really pretty ordinary 4k pages.

> > If it is about DMA mapping, is it possible to use dma_map_sg()
> > to enable a big continuous dma map for a lot of discontinuous
> > 4k pages to avoid allocating big huge page?
> > 
> > As the comment:
> > "The scatter gather list elements are merged together (if possible)
> > and tagged with the appropriate dma address and length."
> > 
> > https://elixir.free-electrons.com/linux/v4.16.18/source/arch/arm/mm/dma-mapping.c#L1805
> >   
> 
> This is interesting for two reasons.
> 
> (1) if this DMA merging helps IOTLB misses (?)

Maybe I misunderstand how IOMMU / virtual addressing works, but I don't
see how one can merge mappings from physically non-contiguous pages.
IOW we can't get 1G-worth of random 4k pages and hope that thru some
magic they get strung together and share an IOTLB entry (if that's
where Yunsheng's suggestion was going..)

> (2) PP could use dma_map_sg() to amortize dma_map call cost.
> 
> For case (2) __page_pool_alloc_pages_slow() already does bulk allocation
> of pages (alloc_pages_bulk_array_node()), and then loops over the pages
> to DMA map them individually.  It seems like an obvious win to use
> dma_map_sg() here?

That could well be worth investigating!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-12 14:00     ` Jesper Dangaard Brouer
@ 2023-07-12 17:19       ` Jakub Kicinski
  2023-07-13 10:07         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-12 17:19 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: brouer, netdev, almasrymina, hawk, ilias.apalodimas, edumazet,
	dsahern, michael.chan, willemb, Ulrich Drepper

On Wed, 12 Jul 2023 16:00:46 +0200 Jesper Dangaard Brouer wrote:
> On 12/07/2023 02.08, Jakub Kicinski wrote:
> >> Generally the pp_provider's will have to use the refcnt schemes
> >> supported by page_pool.  (Which is why I'm not 100% sure this fits
> >> Mina's use-case).
> >>
> >> [IOTLB details]:
> >>
> >> As mentioned on [RFC 08/12] there are other techniques for reducing
> >> IOTLB misses, described in:
> >>    IOMMU: Strategies for Mitigating the IOTLB Bottleneck
> >>     - https://inria.hal.science/inria-00493752/document
> >>
> >> I took a deeper look at also discovered Intel's documentation:
> >>    - Intel virtualization technology for directed I/O, arch spec
> >>    -
> >> https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html
> >>
> >> One problem that is interesting to notice is how NICs access the packets
> >> via ring-queue, which is likely larger that number of IOTLB entries.
> >> Thus, a high change of IOTLB misses.  They suggest marking pages with
> >> Eviction Hints (EH) that cause pages to be marked as Transient Mappings
> >> (TM) which allows IOMMU to evict these faster (making room for others).
> >> And then combine this with prefetching.  
> > 
> > Interesting, didn't know about EH.
> 
> I was looking for a way to set this Eviction Hint (EH) the article
> talked about, but I'm at a loss.

Could possibly be something that the NIC has to set inside the PCIe
transaction headers? Like the old cache hints that predated DDIO?

> >> In this context of how fast a page is reused by NIC and spatial
> >> locality, it is worth remembering that PP have two schemes, (1) the fast
> >> alloc cache that in certain cases can recycle pages (and it based on a
> >> stack approach), (2) normal recycling via the ptr_ring that will have a
> >> longer time before page gets reused.  
> > 
> > I read somewhere that Intel IOTLB can be as small as 256 entries.  
> 
> Are IOTLB hardware different from the TLB hardware block?
> 
> I can find data on TLB sizes, which says there are two levels on Intel,
> quote from "248966-Software-Optimization-Manual-R047.pdf":
> 
>   Nehalem microarchitecture implements two levels of translation 
> lookaside buffer (TLB). The first level consists of separate TLBs for 
> data and code. DTLB0 handles address translation for data accesses, it 
> provides 64 entries to support 4KB pages and 32 entries for large pages.
> The ITLB provides 64 entries (per thread) for 4KB pages and 7 entries 
> (per thread) for large pages.
> 
>   The second level TLB (STLB) handles both code and data accesses for 
> 4KB pages. It support 4KB page translation operation that missed DTLB0 
> or ITLB. All entries are 4-way associative. Here is a list of entries
> in each DTLB:
> 
>   • STLB for 4-KByte pages: 512 entries (services both data and 
> instruction look-ups).
>   • DTLB0 for large pages: 32 entries.
>   • DTLB0 for 4-KByte pages: 64 entries.
> 
>   An DTLB0 miss and STLB hit causes a penalty of 7cycles. Software only 
> pays this penalty if the DTLB0 is used in some dispatch cases. The 
> delays associated with a miss to the STLB and PMH are largely nonblocking.

No idea :( This is an old paper from Rolf in his Netronome days which
says ~Sandy Bridge had only 64 IOTLB entries:

https://dl.acm.org/doi/pdf/10.1145/3230543.3230560

But it's a pretty old paper.

> > So it seems pretty much impossible for it to cache accesses to 4k
> > pages thru recycling. I thought that even 2M pages will start to
> > be problematic for multi queue devices (1k entries on each ring x
> > 32 rings == 128MB just sitting on the ring, let alone circulation).
> >   
> 
> Yes, I'm also worried about how badly these NIC rings and PP ptr_ring
> affects the IOTLB's ability to cache entries.  Why I suggested testing
> out the Eviction Hint (EH), but I have not found a way to use/enable
> these as a quick test in your environment.

FWIW the first version of the code I wrote actually had the coherent
ring memory also use the huge pages - the MEP allocator underlying the
page pool can be used by the driver directly to allocate memory for
other uses than the page pool.

But I figured that's going to be a nightmare to upstream, and Alex said
that even on x86 coherent DMA memory is just write combining not cached
(which frankly IDK why, possibly yet another thing we could consider
optimizing?!)

So I created two allocators, one for coherent (backed by 2M pages) and
one for non-coherent (backed by 1G pages).

For the ptr_ring I was considering bumping the refcount of pages
allocated from outside the 1G pool, so that they do not get recycled.
I deferred optimizing that until I can get some production results.
The extra CPU cost of loss of recycling could outweigh the IOTLB win.

All very exciting stuff, I wish the days were slightly longer :)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-12 17:19       ` Jakub Kicinski
@ 2023-07-13 10:07         ` Jesper Dangaard Brouer
  2023-07-13 16:27           ` Jakub Kicinski
  0 siblings, 1 reply; 33+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-13 10:07 UTC (permalink / raw)
  To: Jakub Kicinski, Jesper Dangaard Brouer
  Cc: brouer, netdev, almasrymina, hawk, ilias.apalodimas, edumazet,
	dsahern, michael.chan, willemb, Ulrich Drepper, Luigi Rizzo,
	Luigi Rizzo, farshin



On 12/07/2023 19.19, Jakub Kicinski wrote:
> On Wed, 12 Jul 2023 16:00:46 +0200 Jesper Dangaard Brouer wrote:
>> On 12/07/2023 02.08, Jakub Kicinski wrote:
>>>> Generally the pp_provider's will have to use the refcnt schemes
>>>> supported by page_pool.  (Which is why I'm not 100% sure this fits
>>>> Mina's use-case).
>>>>
>>>> [IOTLB details]:
>>>>
>>>> As mentioned on [RFC 08/12] there are other techniques for reducing
>>>> IOTLB misses, described in:
>>>>     IOMMU: Strategies for Mitigating the IOTLB Bottleneck
>>>>      - https://inria.hal.science/inria-00493752/document
>>>>
>>>> I took a deeper look at also discovered Intel's documentation:
>>>>     - Intel virtualization technology for directed I/O, arch spec
>>>>     -
>>>> https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html
>>>>
>>>> One problem that is interesting to notice is how NICs access the packets
>>>> via ring-queue, which is likely larger that number of IOTLB entries.
>>>> Thus, a high change of IOTLB misses.  They suggest marking pages with
>>>> Eviction Hints (EH) that cause pages to be marked as Transient Mappings
>>>> (TM) which allows IOMMU to evict these faster (making room for others).
>>>> And then combine this with prefetching.
>>>
>>> Interesting, didn't know about EH.
>>
>> I was looking for a way to set this Eviction Hint (EH) the article
>> talked about, but I'm at a loss.
> 
> Could possibly be something that the NIC has to set inside the PCIe
> transaction headers? Like the old cache hints that predated DDIO?
> 

Yes, perhaps it is outdated?

>>>> In this context of how fast a page is reused by NIC and spatial
>>>> locality, it is worth remembering that PP have two schemes, (1) the fast
>>>> alloc cache that in certain cases can recycle pages (and it based on a
>>>> stack approach), (2) normal recycling via the ptr_ring that will have a
>>>> longer time before page gets reused.
>>>
>>> I read somewhere that Intel IOTLB can be as small as 256 entries.
>>
>> Are IOTLB hardware different from the TLB hardware block?

Does anyone know this?

>> I can find data on TLB sizes, which says there are two levels on Intel,
>> quote from "248966-Software-Optimization-Manual-R047.pdf":
>>
>>    Nehalem microarchitecture implements two levels of translation
>> lookaside buffer (TLB). The first level consists of separate TLBs for
>> data and code. DTLB0 handles address translation for data accesses, it
>> provides 64 entries to support 4KB pages and 32 entries for large pages.
>> The ITLB provides 64 entries (per thread) for 4KB pages and 7 entries
>> (per thread) for large pages.
>>
>>    The second level TLB (STLB) handles both code and data accesses for
>> 4KB pages. It support 4KB page translation operation that missed DTLB0
>> or ITLB. All entries are 4-way associative. Here is a list of entries
>> in each DTLB:
>>
>>    • STLB for 4-KByte pages: 512 entries (services both data and
>> instruction look-ups).
>>    • DTLB0 for large pages: 32 entries.
>>    • DTLB0 for 4-KByte pages: 64 entries.
>>
>>    An DTLB0 miss and STLB hit causes a penalty of 7cycles. Software only
>> pays this penalty if the DTLB0 is used in some dispatch cases. The
>> delays associated with a miss to the STLB and PMH are largely nonblocking.
> 
> No idea :( This is an old paper from Rolf in his Netronome days which
> says ~Sandy Bridge had only IOTLB 64 entries:
> 
> https://dl.acm.org/doi/pdf/10.1145/3230543.3230560
> 
Title: "Understanding PCIe performance for end host networking"

> But it's a pretty old paper.

I *HIGHLY* recommend this paper, and I've recommended it before [1].

  [1] https://lore.kernel.org/all/b8fa06c4-1074-7b48-6868-4be6fecb4791@redhat.com/

There is a very new (May 2023) publication [2]:
  - Title: Overcoming the IOTLB wall for multi-100-Gbps Linux-based networking
  - By Luigi Rizzo (netmap inventor) and Alireza Farshin (author of the previous paper)
  - [2] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280580/

They are actually benchmarking page_pool and suggesting the use of 2MiB
huge-pages, which you just implemented for page_pool.
p.s. pretty cool to see my page_pool design described in such detail
and with a picture [3].

  [3] https://www.ncbi.nlm.nih.gov/core/lw/2.0/html/tileshop_pmc/tileshop_pmc_inline.html?title=Click%20on%20image%20to%20zoom&p=PMC3&id=10280580_peerj-cs-09-1385-g020.jpg

> 
>>> So it seems pretty much impossible for it to cache accesses to 4k
>>> pages thru recycling. I thought that even 2M pages will start to
>>> be problematic for multi queue devices (1k entries on each ring x
>>> 32 rings == 128MB just sitting on the ring, let alone circulation).
>>>    
>>
>> Yes, I'm also worried about how badly these NIC rings and PP ptr_ring
>> affects the IOTLB's ability to cache entries.  Why I suggested testing
>> out the Eviction Hint (EH), but I have not found a way to use/enable
>> these as a quick test in your environment.
> 
> FWIW the first version of the code I wrote actually had the coherent
> ring memory also use the huge pages - the MEP allocator underlying the
> page pool can be used by the driver directly to allocate memory for
> other uses than the page pool.
> 
> But I figured that's going to be a nightmare to upstream, and Alex said
> that even on x86 coherent DMA memory is just write combining not cached
> (which frankly IDK why, possibly yet another thing we could consider
> optimizing?!)
> 
> So I created two allocators, one for coherent (backed by 2M pages) and
> one for non-coherent (backed by 1G pages).

I think it is called Coherent vs Streaming DMA.
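
For readers less familiar with the two terms, roughly (illustrative only):

	dma_addr_t ring_dma, buf_dma;
	void *ring;

	/* Coherent ("consistent") DMA: one allocation that CPU and device
	 * can both access at any time, typically used for descriptor rings. */
	ring = dma_alloc_coherent(dev, size, &ring_dma, GFP_KERNEL);

	/* Streaming DMA: map an existing page for a transfer and sync/unmap
	 * around device accesses, as page_pool does for data buffers. */
	buf_dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	dma_sync_single_for_cpu(dev, buf_dma, PAGE_SIZE, DMA_FROM_DEVICE);
	dma_unmap_page(dev, buf_dma, PAGE_SIZE, DMA_FROM_DEVICE);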

> 
> For the ptr_ring I was considering bumping the refcount of pages
> allocated from outside the 1G pool, so that they do not get recycled.
> I deferred optimizing that until I can get some production results.
> The extra CPU cost of loss of recycling could outweigh the IOTLB win.
> 
> All very exciting stuff, I wish the days were slightly longer :)
> 

I know the problem of (human) cycles in a day.
Rizzo's article describes a lot of experiments that might save us/you
some time.

--Jesper



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-13 10:07         ` Jesper Dangaard Brouer
@ 2023-07-13 16:27           ` Jakub Kicinski
  0 siblings, 0 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-13 16:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: brouer, netdev, almasrymina, hawk, ilias.apalodimas, edumazet,
	dsahern, michael.chan, willemb, Ulrich Drepper, Luigi Rizzo,
	Luigi Rizzo, farshin

On Thu, 13 Jul 2023 12:07:06 +0200 Jesper Dangaard Brouer wrote:
> > For the ptr_ring I was considering bumping the refcount of pages
> > allocated from outside the 1G pool, so that they do not get recycled.
> > I deferred optimizing that until I can get some production results.
> > The extra CPU cost of loss of recycling could outweigh the IOTLB win.
> > 
> > All very exciting stuff, I wish the days were slightly longer :)
> 
> Know the problem of (human) cycles in a day.
> Rizzo's article describes a lot of experiments, that might save us/you 
> some time.

Speaking of saving time - if anyone has patches to convert mlx4 to page
pool, I'd greatly appreciate it!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/12] net: huge page backed page_pool
  2023-07-12 17:01         ` Jakub Kicinski
@ 2023-07-14 13:05           ` Yunsheng Lin
  0 siblings, 0 replies; 33+ messages in thread
From: Yunsheng Lin @ 2023-07-14 13:05 UTC (permalink / raw)
  To: Jakub Kicinski, Jesper Dangaard Brouer
  Cc: brouer, netdev, almasrymina, hawk, ilias.apalodimas, edumazet,
	dsahern, michael.chan, willemb

On 2023/7/13 1:01, Jakub Kicinski wrote:
> On Wed, 12 Jul 2023 14:43:32 +0200 Jesper Dangaard Brouer wrote:
>> On 12/07/2023 13.47, Yunsheng Lin wrote:
>>> On 2023/7/12 8:08, Jakub Kicinski wrote:  
>>>> Oh, I split the page into individual 4k pages after DMA mapping.
>>>> There's no need for the host memory to be a huge page. I mean,
>>>> the actual kernel identity mapping is a huge page AFAIU, and the
>>>> struct pages are allocated, anyway. We just need it to be a huge
>>>> page at DMA mapping time.
>>>>
>>>> So the pages from the huge page provider only differ from normal
>>>> alloc_page() pages by the fact that they are a part of a 1G DMA
>>>> mapping.  
>>
>> So, Jakub you are saying the PP refcnt's are still done "as usual" on 
>> individual pages.
> 
> Yes - other than coming from a specific 1G of physical memory 
> the resulting pages are really pretty ordinary 4k pages.
> 
>>> If it is about DMA mapping, is it possible to use dma_map_sg()
>>> to enable a big continuous dma map for a lot of discontinuous
>>> 4k pages to avoid allocating big huge page?
>>>
>>> As the comment:
>>> "The scatter gather list elements are merged together (if possible)
>>> and tagged with the appropriate dma address and length."
>>>
>>> https://elixir.free-electrons.com/linux/v4.16.18/source/arch/arm/mm/dma-mapping.c#L1805
>>>   
>>
>> This is interesting for two reasons.
>>
>> (1) if this DMA merging helps IOTLB misses (?)
> 
> Maybe I misunderstand how IOMMU / virtual addressing works, but I don't
> see how one can merge mappings from physically non-contiguous pages.
> IOW we can't get 1G-worth of random 4k pages and hope that thru some
> magic they get strung together and share an IOTLB entry (if that's
> where Yunsheng's suggestion was going..)

From __arm_lpae_map(), it does seems that smmu in arm can install
pte in different level to point to page of different size.

> 
>> (2) PP could use dma_map_sg() to amortize dma_map call cost.
>>
>> For case (2) __page_pool_alloc_pages_slow() already does bulk allocation
>> of pages (alloc_pages_bulk_array_node()), and then loops over the pages
>> to DMA map them individually.  It seems like an obvious win to use
>> dma_map_sg() here?

For mapping, the above should work; the tricky problem is that we need to
ensure all pages belonging to the same big DMA mapping are released before
we can do the DMA unmapping.
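
As a sketch only (names are hypothetical), one way to handle that is a
small refcounted object per mapped batch, unmapping only when the last
page of the batch comes back:

	struct pp_sg_batch {
		struct sg_table	sgt;
		struct device	*dev;
		refcount_t	pages_left;	/* one ref per outstanding page */
	};

	static void pp_sg_batch_put_page(struct pp_sg_batch *b)
	{
		if (!refcount_dec_and_test(&b->pages_left))
			return;
		/* Last page of the batch released: the shared mapping can
		 * now be torn down safely. */
		dma_unmap_sgtable(b->dev, &b->sgt, DMA_BIDIRECTIONAL, 0);
		sg_free_table(&b->sgt);
		kfree(b);
	}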

> 
> That could well be worth investigating!
> .
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 08/12] eth: bnxt: let the page pool manage the DMA mapping
  2023-07-10 10:12   ` Jesper Dangaard Brouer
@ 2023-07-26  6:56     ` Ilias Apalodimas
  0 siblings, 0 replies; 33+ messages in thread
From: Ilias Apalodimas @ 2023-07-26  6:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Jakub Kicinski, netdev, brouer, almasrymina, hawk, edumazet,
	dsahern, michael.chan, willemb

[...]

> > -     *mapping = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, bp->rx_dir,
> > -                                   DMA_ATTR_WEAK_ORDERING);
> > -     if (dma_mapping_error(dev, *mapping)) {
> > -             page_pool_recycle_direct(rxr->page_pool, page);
> > -             return NULL;
> > -     }
> > +     *mapping = page_pool_get_dma_addr(page);
> > +     dma_sync_single_for_device(dev, *mapping, PAGE_SIZE, DMA_BIDIRECTIONAL);
> > +
>
> You can keep this as-is, but I just wanted mention that page_pool
> supports doing the "dma_sync_for_device" via PP_FLAG_DMA_SYNC_DEV.
> Thus, removing more lines from driver code.

+1 to that.  Also, the direction is stored in pp->dma_dir, so it
should automatically do the right thing.

Regards
/Ilias

>
> >       return page;
> >   }
> >
> > @@ -951,6 +948,7 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
> >                                             unsigned int offset_and_len)
> >   {
> >       unsigned int len = offset_and_len & 0xffff;
> > +     struct device *dev = &bp->pdev->dev;
> >       struct page *page = data;
> >       u16 prod = rxr->rx_prod;
> >       struct sk_buff *skb;
> > @@ -962,8 +960,7 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
> >               return NULL;
> >       }
> >       dma_addr -= bp->rx_dma_offset;
> > -     dma_unmap_page_attrs(&bp->pdev->dev, dma_addr, PAGE_SIZE, bp->rx_dir,
> > -                          DMA_ATTR_WEAK_ORDERING);
> > +     dma_sync_single_for_cpu(dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
> >       skb = build_skb(page_address(page), PAGE_SIZE);
> >       if (!skb) {
> >               page_pool_recycle_direct(rxr->page_pool, page);
> > @@ -984,6 +981,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
> >   {
> >       unsigned int payload = offset_and_len >> 16;
> >       unsigned int len = offset_and_len & 0xffff;
> > +     struct device *dev = &bp->pdev->dev;
> >       skb_frag_t *frag;
> >       struct page *page = data;
> >       u16 prod = rxr->rx_prod;
> > @@ -996,8 +994,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
> >               return NULL;
> >       }
> >       dma_addr -= bp->rx_dma_offset;
> > -     dma_unmap_page_attrs(&bp->pdev->dev, dma_addr, PAGE_SIZE, bp->rx_dir,
> > -                          DMA_ATTR_WEAK_ORDERING);
> > +     dma_sync_single_for_cpu(dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
> >
> >       if (unlikely(!payload))
> >               payload = eth_get_headlen(bp->dev, data_ptr, len);
> > @@ -2943,9 +2940,6 @@ static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr)
> >               rx_buf->data = NULL;
> >               if (BNXT_RX_PAGE_MODE(bp)) {
> >                       mapping -= bp->rx_dma_offset;
> > -                     dma_unmap_page_attrs(&pdev->dev, mapping, PAGE_SIZE,
> > -                                          bp->rx_dir,
> > -                                          DMA_ATTR_WEAK_ORDERING);
> >                       page_pool_recycle_direct(rxr->page_pool, data);
> >               } else {
> >                       dma_unmap_single_attrs(&pdev->dev, mapping,
> > @@ -2967,9 +2961,6 @@ static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr)
> >                       continue;
> >
> >               if (BNXT_RX_PAGE_MODE(bp)) {
> > -                     dma_unmap_page_attrs(&pdev->dev, rx_agg_buf->mapping,
> > -                                          BNXT_RX_PAGE_SIZE, bp->rx_dir,
> > -                                          DMA_ATTR_WEAK_ORDERING);
> >                       rx_agg_buf->page = NULL;
> >                       __clear_bit(i, rxr->rx_agg_bmap);
> >
> > @@ -3208,6 +3199,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> >   {
> >       struct page_pool_params pp = { 0 };
> >
> > +     pp.flags = PP_FLAG_DMA_MAP;
> >       pp.pool_size = bp->rx_ring_size;
> >       pp.nid = dev_to_node(&bp->pdev->dev);
> >       pp.napi = &rxr->bnapi->napi;
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2023-07-26  6:56 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
2023-07-07 18:39 ` [RFC 01/12] net: hack together some page sharing Jakub Kicinski
2023-07-07 18:39 ` [RFC 02/12] net: create a 1G-huge-page-backed allocator Jakub Kicinski
2023-07-07 18:39 ` [RFC 03/12] net: page_pool: hide page_pool_release_page() Jakub Kicinski
2023-07-07 18:39 ` [RFC 04/12] net: page_pool: merge page_pool_release_page() with page_pool_return_page() Jakub Kicinski
2023-07-10 16:07   ` Jesper Dangaard Brouer
2023-07-07 18:39 ` [RFC 05/12] net: page_pool: factor out releasing DMA from releasing the page Jakub Kicinski
2023-07-07 18:39 ` [RFC 06/12] net: page_pool: create hooks for custom page providers Jakub Kicinski
2023-07-07 19:50   ` Mina Almasry
2023-07-07 22:28     ` Jakub Kicinski
2023-07-07 18:39 ` [RFC 07/12] net: page_pool: add huge page backed memory providers Jakub Kicinski
2023-07-07 18:39 ` [RFC 08/12] eth: bnxt: let the page pool manage the DMA mapping Jakub Kicinski
2023-07-10 10:12   ` Jesper Dangaard Brouer
2023-07-26  6:56     ` Ilias Apalodimas
2023-07-07 18:39 ` [RFC 09/12] eth: bnxt: use the page pool for data pages Jakub Kicinski
2023-07-10  4:22   ` Michael Chan
2023-07-10 17:04     ` Jakub Kicinski
2023-07-07 18:39 ` [RFC 10/12] eth: bnxt: make sure we make for recycle skbs before freeing them Jakub Kicinski
2023-07-07 18:39 ` [RFC 11/12] eth: bnxt: wrap coherent allocations into helpers Jakub Kicinski
2023-07-07 18:39 ` [RFC 12/12] eth: bnxt: hack in the use of MEP Jakub Kicinski
2023-07-07 19:45 ` [RFC 00/12] net: huge page backed page_pool Mina Almasry
2023-07-07 22:45   ` Jakub Kicinski
2023-07-10 17:31     ` Mina Almasry
2023-07-11 15:49 ` Jesper Dangaard Brouer
2023-07-12  0:08   ` Jakub Kicinski
2023-07-12 11:47     ` Yunsheng Lin
2023-07-12 12:43       ` Jesper Dangaard Brouer
2023-07-12 17:01         ` Jakub Kicinski
2023-07-14 13:05           ` Yunsheng Lin
2023-07-12 14:00     ` Jesper Dangaard Brouer
2023-07-12 17:19       ` Jakub Kicinski
2023-07-13 10:07         ` Jesper Dangaard Brouer
2023-07-13 16:27           ` Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).