* [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users
@ 2021-03-12 15:43 Mel Gorman
  2021-03-12 15:43 ` [PATCH 1/7] mm/page_alloc: Move gfp_allowed_mask enforcement to prepare_alloc_pages Mel Gorman
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Mel Gorman @ 2021-03-12 15:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig,
	Alexander Duyck, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

This series is based on top of Matthew Wilcox's series "Rationalise
__alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
test and are not using Andrew's tree as a baseline, I suggest using the
following git tree

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2

Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
directly to the subsystem maintainers. While sunrpc is low-risk, I'm
vaguely aware that there are other prototype series on netdev that affect
page_pool. The conflict should be obvious in linux-next.

Changelog since v3
o Rebase on top of Matthew's series consolidating the alloc_pages API
o Rename alloced to allocated
o Split out preparation patch for prepare_alloc_pages
o Defensive check for bulk allocation of <= 0 pages
o Call single page allocation path only if no pages were allocated
o Minor cosmetic cleanups
o Reorder patch dependencies by subsystem. As this is a cross-subsystem
  series, the mm patches have to be merged before the sunrpc and net
  users.

Changelog since v2
o Prep new pages with IRQs enabled
o Minor documentation update

Changelog since v1
o Parenthesise binary and boolean comparisons
o Add reviewed-bys
o Rebase to 5.12-rc2

This series introduces a bulk order-0 page allocator with sunrpc and
the network page pool being the first users. The implementation is not
particularly efficient and the intention is to iron out what semantics
the API should have for its users. Once the semantics are ironed out,
it can be made more efficient. Despite that, this is a performance-related
series for users that require multiple pages for an operation without
making multiple round-trips to the page allocator. Quoting the last patch
for the high-speed networking use-case:

    For an XDP-redirect workload with a 100G mlx5 driver (which uses
    page_pool), redirecting xdp_frame packets into a veth that does
    XDP_PASS to create an SKB from the xdp_frame, the page cannot be
    returned to the page_pool. In this case, we saw[1] an improvement of
    18.8% from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

Both users in this series are corner cases (NFS and high-speed networks)
so it is unlikely that most users will see any benefit in the short
term. Potential other users are batch allocations for page cache
readahead, fault around and SLUB allocations when high-order pages are
unavailable. It's unknown how much benefit would be seen by converting
multiple page allocation calls to a single batch or what difference it may
make to headline performance. It's a chicken and egg problem given that
the potential benefit cannot be investigated without an implementation
to test against.

Light testing has passed. I'm relying on Chuck and Jesper to test the
target users more aggressively, but both reported performance improvements
with the initial RFC.

Patch 1 moves GFP flag initialisation to prepare_alloc_pages

Patch 2 renames a variable whose name has proven unpopular

Patch 3 adds a bulk page allocator

Patch 4 is a sunrpc cleanup that is a pre-requisite.

Patch 5 is the sunrpc user. Chuck also has a patch which further caches
	pages but is not included in this series. It's not directly
	related to the bulk allocator and as it caches pages, it might
	have other concerns (e.g. does it need a shrinker?)

Patch 6 is a preparation patch only for the network user

Patch 7 converts the net page pool to the bulk allocator for order-0 pages.

 include/linux/gfp.h   |  12 ++++
 mm/page_alloc.c       | 149 +++++++++++++++++++++++++++++++++++++-----
 net/core/page_pool.c  | 101 +++++++++++++++++-----------
 net/sunrpc/svc_xprt.c |  47 +++++++++----
 4 files changed, 240 insertions(+), 69 deletions(-)

-- 
2.26.2



* [PATCH 1/7] mm/page_alloc: Move gfp_allowed_mask enforcement to prepare_alloc_pages
  2021-03-12 15:43 [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
@ 2021-03-12 15:43 ` Mel Gorman
  2021-03-19 16:11   ` Vlastimil Babka
  2021-03-12 15:43 ` [PATCH 2/7] mm/page_alloc: Rename alloced to allocated Mel Gorman
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2021-03-12 15:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig,
	Alexander Duyck, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

__alloc_pages updates the GFP flags to enforce what flags are allowed
during a global context such as booting or suspend. This patch moves the
enforcement from __alloc_pages to prepare_alloc_pages so the code can be
shared between the single page allocator and a new bulk page allocator.

When moving the code, it becomes obvious that __alloc_pages() and
prepare_alloc_pages() use different names for the same variable. This is
an unnecessary complication, so rename gfp_mask to gfp in
prepare_alloc_pages() so the name is consistent.

No functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 00b67c47ad87..f0c1d74ead6f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4914,15 +4914,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
+static inline bool prepare_alloc_pages(gfp_t gfp, unsigned int order,
 		int preferred_nid, nodemask_t *nodemask,
 		struct alloc_context *ac, gfp_t *alloc_gfp,
 		unsigned int *alloc_flags)
 {
-	ac->highest_zoneidx = gfp_zone(gfp_mask);
-	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
+	gfp &= gfp_allowed_mask;
+	*alloc_gfp = gfp;
+
+	ac->highest_zoneidx = gfp_zone(gfp);
+	ac->zonelist = node_zonelist(preferred_nid, gfp);
 	ac->nodemask = nodemask;
-	ac->migratetype = gfp_migratetype(gfp_mask);
+	ac->migratetype = gfp_migratetype(gfp);
 
 	if (cpusets_enabled()) {
 		*alloc_gfp |= __GFP_HARDWALL;
@@ -4936,18 +4939,18 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 			*alloc_flags |= ALLOC_CPUSET;
 	}
 
-	fs_reclaim_acquire(gfp_mask);
-	fs_reclaim_release(gfp_mask);
+	fs_reclaim_acquire(gfp);
+	fs_reclaim_release(gfp);
 
-	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
+	might_sleep_if(gfp & __GFP_DIRECT_RECLAIM);
 
-	if (should_fail_alloc_page(gfp_mask, order))
+	if (should_fail_alloc_page(gfp, order))
 		return false;
 
-	*alloc_flags = current_alloc_flags(gfp_mask, *alloc_flags);
+	*alloc_flags = current_alloc_flags(gfp, *alloc_flags);
 
 	/* Dirty zone balancing only done in the fast path */
-	ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);
+	ac->spread_dirty_pages = (gfp & __GFP_WRITE);
 
 	/*
 	 * The preferred zone is used for statistics but crucially it is
@@ -4980,8 +4983,6 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
 		return NULL;
 	}
 
-	gfp &= gfp_allowed_mask;
-	alloc_gfp = gfp;
 	if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
 			&alloc_gfp, &alloc_flags))
 		return NULL;
-- 
2.26.2



* [PATCH 2/7] mm/page_alloc: Rename alloced to allocated
  2021-03-12 15:43 [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
  2021-03-12 15:43 ` [PATCH 1/7] mm/page_alloc: Move gfp_allowed_mask enforcement to prepare_alloc_pages Mel Gorman
@ 2021-03-12 15:43 ` Mel Gorman
  2021-03-19 16:22   ` Vlastimil Babka
  2021-03-12 15:43 ` [PATCH 3/7] mm/page_alloc: Add a bulk page allocator Mel Gorman
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2021-03-12 15:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig,
	Alexander Duyck, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

Review feedback on the bulk allocator twice found problems with "alloced"
being used as the counter for pages allocated. The naming was based on the
API name "alloc" and on the observation that verbal communication about
malloc tends to use the made-up word "malloced" rather than "mallocated".
To be consistent, this preparation patch renames alloced to allocated
in rmqueue_bulk so the bulk allocator and per-cpu allocator use similar
names when the bulk allocator is introduced.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f0c1d74ead6f..880b1d6368bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2904,7 +2904,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
 			int migratetype, unsigned int alloc_flags)
 {
-	int i, alloced = 0;
+	int i, allocated = 0;
 
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
@@ -2927,7 +2927,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * pages are ordered properly.
 		 */
 		list_add_tail(&page->lru, list);
-		alloced++;
+		allocated++;
 		if (is_migrate_cma(get_pcppage_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
@@ -2936,12 +2936,12 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	/*
 	 * i pages were removed from the buddy list even if some leak due
 	 * to check_pcp_refill failing so adjust NR_FREE_PAGES based
-	 * on i. Do not confuse with 'alloced' which is the number of
+	 * on i. Do not confuse with 'allocated' which is the number of
 	 * pages added to the pcp list.
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
 	spin_unlock(&zone->lock);
-	return alloced;
+	return allocated;
 }
 
 #ifdef CONFIG_NUMA
-- 
2.26.2



* [PATCH 3/7] mm/page_alloc: Add a bulk page allocator
  2021-03-12 15:43 [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
  2021-03-12 15:43 ` [PATCH 1/7] mm/page_alloc: Move gfp_allowed_mask enforcement to prepare_alloc_pages Mel Gorman
  2021-03-12 15:43 ` [PATCH 2/7] mm/page_alloc: Rename alloced to allocated Mel Gorman
@ 2021-03-12 15:43 ` Mel Gorman
  2021-03-19 18:18   ` Vlastimil Babka
  2021-03-12 15:43 ` [PATCH 4/7] SUNRPC: Set rq_page_end differently Mel Gorman
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2021-03-12 15:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig,
	Alexander Duyck, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

This patch adds a new bulk page allocator interface via alloc_pages_bulk
and __alloc_pages_bulk. A caller requests a number of pages to be
allocated and added to a list. They can be freed in bulk using
free_pages_bulk().

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation, or page debugging decides to fail
an allocation. It is up to the caller to request more pages in batch
if necessary.
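
For illustration only (not part of this patch), a caller that wants a
fixed number of order-0 pages might use the API along these lines; the
GFP flags and the fallback policy here are just an example:

        LIST_HEAD(list);
        unsigned long want = 16, got;

        /* May place anywhere from 0 to 'want' pages on 'list' */
        got = alloc_pages_bulk(GFP_KERNEL, want, &list);
        if (got < want) {
                /*
                 * Caller policy: retry the bulk call, fall back to
                 * alloc_page(), or proceed with fewer pages.
                 */
        }

        /* ... consume pages from 'list' via page->lru ... */

        /* Pages still on the list can be returned in a single call */
        free_pages_bulk(&list);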

Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/gfp.h |  12 +++++
 mm/page_alloc.c     | 116 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 128 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0a88f84b08f4..e2cd98dba72e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -518,6 +518,17 @@ static inline int arch_make_page_accessible(struct page *page)
 struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
 		nodemask_t *nodemask);
 
+int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+				nodemask_t *nodemask, int nr_pages,
+				struct list_head *list);
+
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
+{
+	return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
@@ -581,6 +592,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
+extern void free_pages_bulk(struct list_head *list);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 880b1d6368bd..f48f94375b66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
 	}
 }
 
+/* Drop reference counts and free order-0 pages from a list. */
+void free_pages_bulk(struct list_head *list)
+{
+	struct page *page, *next;
+
+	list_for_each_entry_safe(page, next, list, lru) {
+		trace_mm_page_free_batched(page);
+		if (put_page_testzero(page)) {
+			list_del(&page->lru);
+			__free_pages_ok(page, 0, FPI_NONE);
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(free_pages_bulk);
+
 static inline unsigned int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
@@ -4963,6 +4978,107 @@ static inline bool prepare_alloc_pages(gfp_t gfp, unsigned int order,
 	return true;
 }
 
+/*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ *
+ * Returns the number of pages allocated.
+ */
+int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+			nodemask_t *nodemask, int nr_pages,
+			struct list_head *alloc_list)
+{
+	struct page *page;
+	unsigned long flags;
+	struct zone *zone;
+	struct zoneref *z;
+	struct per_cpu_pages *pcp;
+	struct list_head *pcp_list;
+	struct alloc_context ac;
+	gfp_t alloc_gfp;
+	unsigned int alloc_flags;
+	int allocated = 0;
+
+	if (WARN_ON_ONCE(nr_pages <= 0))
+		return 0;
+
+	if (nr_pages == 1)
+		goto failed;
+
+	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+	if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
+	&alloc_gfp, &alloc_flags))
+		return 0;
+	gfp = alloc_gfp;
+
+	/* Find an allowed local zone that meets the high watermark. */
+	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
+		unsigned long mark;
+
+		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
+		    !__cpuset_zone_allowed(zone, gfp)) {
+			continue;
+		}
+
+		if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
+		    zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
+			goto failed;
+		}
+
+		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
+		if (zone_watermark_fast(zone, 0,  mark,
+				zonelist_zone_idx(ac.preferred_zoneref),
+				alloc_flags, gfp)) {
+			break;
+		}
+	}
+	if (!zone)
+		return 0;
+
+	/* Attempt the batch allocation */
+	local_irq_save(flags);
+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
+	pcp_list = &pcp->lists[ac.migratetype];
+
+	while (allocated < nr_pages) {
+		page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
+								pcp, pcp_list);
+		if (!page) {
+			/* Try and get at least one page */
+			if (!allocated)
+				goto failed_irq;
+			break;
+		}
+
+		list_add(&page->lru, alloc_list);
+		allocated++;
+	}
+
+	__count_zid_vm_events(PGALLOC, zone_idx(zone), allocated);
+	zone_statistics(zone, zone);
+
+	local_irq_restore(flags);
+
+	/* Prep page with IRQs enabled to reduce disabled times */
+	list_for_each_entry(page, alloc_list, lru)
+		prep_new_page(page, 0, gfp, 0);
+
+	return allocated;
+
+failed_irq:
+	local_irq_restore(flags);
+
+failed:
+	page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
+	if (page) {
+		list_add(&page->lru, alloc_list);
+		allocated = 1;
+	}
+
+	return allocated;
+}
+EXPORT_SYMBOL_GPL(__alloc_pages_bulk);
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
-- 
2.26.2



* [PATCH 4/7] SUNRPC: Set rq_page_end differently
  2021-03-12 15:43 [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
                   ` (2 preceding siblings ...)
  2021-03-12 15:43 ` [PATCH 3/7] mm/page_alloc: Add a bulk page allocator Mel Gorman
@ 2021-03-12 15:43 ` Mel Gorman
  2021-03-12 15:43 ` [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator Mel Gorman
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Mel Gorman @ 2021-03-12 15:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig,
	Alexander Duyck, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

From: Chuck Lever <chuck.lever@oracle.com>

Refactor:

I'm about to use the loop variable @i for something else.

As far as the "i++" is concerned, that is a post-increment. The
value of @i is not used subsequently, so the increment operator
is unnecessary and can be removed.

Also note that nfsd_read_actor() was renamed nfsd_splice_actor()
by commit cf8208d0eabd ("sendfile: convert nfsd to
splice_direct_to_actor()").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 net/sunrpc/svc_xprt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index dcc50ae54550..cfa7e4776d0e 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -667,8 +667,8 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 			}
 			rqstp->rq_pages[i] = p;
 		}
-	rqstp->rq_page_end = &rqstp->rq_pages[i];
-	rqstp->rq_pages[i++] = NULL; /* this might be seen in nfs_read_actor */
+	rqstp->rq_page_end = &rqstp->rq_pages[pages];
+	rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
 
 	/* Make arg->head point to first page and arg->pages point to rest */
 	arg = &rqstp->rq_arg;
-- 
2.26.2



* [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator
  2021-03-12 15:43 [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
                   ` (3 preceding siblings ...)
  2021-03-12 15:43 ` [PATCH 4/7] SUNRPC: Set rq_page_end differently Mel Gorman
@ 2021-03-12 15:43 ` Mel Gorman
  2021-03-12 18:44   ` Alexander Duyck
  2021-03-12 15:43 ` [PATCH 6/7] net: page_pool: refactor dma_map into own function page_pool_dma_map Mel Gorman
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2021-03-12 15:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig,
	Alexander Duyck, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

From: Chuck Lever <chuck.lever@oracle.com>

Reduce the rate at which nfsd threads hammer on the page allocator.
This improves throughput scalability by enabling the threads to run
more independently of each other.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 net/sunrpc/svc_xprt.c | 43 +++++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index cfa7e4776d0e..38a8d6283801 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -642,11 +642,12 @@ static void svc_check_conn_limits(struct svc_serv *serv)
 static int svc_alloc_arg(struct svc_rqst *rqstp)
 {
 	struct svc_serv *serv = rqstp->rq_server;
+	unsigned long needed;
 	struct xdr_buf *arg;
+	struct page *page;
 	int pages;
 	int i;
 
-	/* now allocate needed pages.  If we get a failure, sleep briefly */
 	pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
 	if (pages > RPCSVC_MAXPAGES) {
 		pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
@@ -654,19 +655,28 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 		/* use as many pages as possible */
 		pages = RPCSVC_MAXPAGES;
 	}
-	for (i = 0; i < pages ; i++)
-		while (rqstp->rq_pages[i] == NULL) {
-			struct page *p = alloc_page(GFP_KERNEL);
-			if (!p) {
-				set_current_state(TASK_INTERRUPTIBLE);
-				if (signalled() || kthread_should_stop()) {
-					set_current_state(TASK_RUNNING);
-					return -EINTR;
-				}
-				schedule_timeout(msecs_to_jiffies(500));
+
+	for (needed = 0, i = 0; i < pages ; i++)
+		if (!rqstp->rq_pages[i])
+			needed++;
+	if (needed) {
+		LIST_HEAD(list);
+
+retry:
+		alloc_pages_bulk(GFP_KERNEL, needed, &list);
+		for (i = 0; i < pages; i++) {
+			if (!rqstp->rq_pages[i]) {
+				page = list_first_entry_or_null(&list,
+								struct page,
+								lru);
+				if (unlikely(!page))
+					goto empty_list;
+				list_del(&page->lru);
+				rqstp->rq_pages[i] = page;
+				needed--;
 			}
-			rqstp->rq_pages[i] = p;
 		}
+	}
 	rqstp->rq_page_end = &rqstp->rq_pages[pages];
 	rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
 
@@ -681,6 +691,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 	arg->len = (pages-1)*PAGE_SIZE;
 	arg->tail[0].iov_len = 0;
 	return 0;
+
+empty_list:
+	set_current_state(TASK_INTERRUPTIBLE);
+	if (signalled() || kthread_should_stop()) {
+		set_current_state(TASK_RUNNING);
+		return -EINTR;
+	}
+	schedule_timeout(msecs_to_jiffies(500));
+	goto retry;
 }
 
 static bool
-- 
2.26.2



* [PATCH 6/7] net: page_pool: refactor dma_map into own function page_pool_dma_map
  2021-03-12 15:43 [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
                   ` (4 preceding siblings ...)
  2021-03-12 15:43 ` [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator Mel Gorman
@ 2021-03-12 15:43 ` Mel Gorman
  2021-03-12 15:43 ` [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path Mel Gorman
  2021-03-17 16:31 ` [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Alexander Lobakin
  7 siblings, 0 replies; 28+ messages in thread
From: Mel Gorman @ 2021-03-12 15:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig,
	Alexander Duyck, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

From: Jesper Dangaard Brouer <brouer@redhat.com>

In preparation for the next patch, move the DMA mapping into its own
function, as this will make the following changes easier to follow.

V2: make page_pool_dma_map return boolean (Ilias)

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
---
 net/core/page_pool.c | 45 +++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..40e1b2beaa6c 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -180,14 +180,37 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
 					 pool->p.dma_dir);
 }
 
+static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
+{
+	dma_addr_t dma;
+
+	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
+	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
+	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
+	 * This mapping is kept for lifetime of page, until leaving pool.
+	 */
+	dma = dma_map_page_attrs(pool->p.dev, page, 0,
+				 (PAGE_SIZE << pool->p.order),
+				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
+	if (dma_mapping_error(pool->p.dev, dma))
+		return false;
+
+	page->dma_addr = dma;
+
+	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
+
+	return true;
+}
+
 /* slow path */
 noinline
 static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 						 gfp_t _gfp)
 {
+	unsigned int pp_flags = pool->p.flags;
 	struct page *page;
 	gfp_t gfp = _gfp;
-	dma_addr_t dma;
 
 	/* We could always set __GFP_COMP, and avoid this branch, as
 	 * prep_new_page() can handle order-0 with __GFP_COMP.
@@ -211,30 +234,14 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	if (!page)
 		return NULL;
 
-	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
-		goto skip_dma_map;
-
-	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
-	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
-	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
-	 * This mapping is kept for lifetime of page, until leaving pool.
-	 */
-	dma = dma_map_page_attrs(pool->p.dev, page, 0,
-				 (PAGE_SIZE << pool->p.order),
-				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
-	if (dma_mapping_error(pool->p.dev, dma)) {
+	if ((pp_flags & PP_FLAG_DMA_MAP) &&
+	    unlikely(!page_pool_dma_map(pool, page))) {
 		put_page(page);
 		return NULL;
 	}
-	page->dma_addr = dma;
 
-	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
-		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
-
-skip_dma_map:
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
-
 	trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
 
 	/* When page just alloc'ed is should/must have refcnt 1. */
-- 
2.26.2



* [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path
  2021-03-12 15:43 [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
                   ` (5 preceding siblings ...)
  2021-03-12 15:43 ` [PATCH 6/7] net: page_pool: refactor dma_map into own function page_pool_dma_map Mel Gorman
@ 2021-03-12 15:43 ` Mel Gorman
  2021-03-12 19:44   ` Alexander Duyck
  2021-03-17 16:31 ` [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Alexander Lobakin
  7 siblings, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2021-03-12 15:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig,
	Alexander Duyck, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

From: Jesper Dangaard Brouer <brouer@redhat.com>

There are cases where the page_pool needs to refill with pages from the
page allocator. Some workloads cause the page_pool to release pages
instead of recycling them.

For these workloads it can improve performance to bulk allocate pages
from the page allocator to refill the alloc cache.

For an XDP-redirect workload with a 100G mlx5 driver (which uses
page_pool), redirecting xdp_frame packets into a veth that does XDP_PASS
to create an SKB from the xdp_frame, the page cannot be returned to the
page_pool. In this case, we saw[1] an improvement of 18.8% from using
the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
---
 net/core/page_pool.c | 62 ++++++++++++++++++++++++++++----------------
 1 file changed, 39 insertions(+), 23 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 40e1b2beaa6c..a5889f1b86aa 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -208,44 +208,60 @@ noinline
 static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 						 gfp_t _gfp)
 {
+	const int bulk = PP_ALLOC_CACHE_REFILL;
+	struct page *page, *next, *first_page;
 	unsigned int pp_flags = pool->p.flags;
-	struct page *page;
+	unsigned int pp_order = pool->p.order;
+	int pp_nid = pool->p.nid;
+	LIST_HEAD(page_list);
 	gfp_t gfp = _gfp;
 
-	/* We could always set __GFP_COMP, and avoid this branch, as
-	 * prep_new_page() can handle order-0 with __GFP_COMP.
-	 */
-	if (pool->p.order)
+	/* Don't support bulk alloc for high-order pages */
+	if (unlikely(pp_order)) {
 		gfp |= __GFP_COMP;
+		first_page = alloc_pages_node(pp_nid, gfp, pp_order);
+		if (unlikely(!first_page))
+			return NULL;
+		goto out;
+	}
 
-	/* FUTURE development:
-	 *
-	 * Current slow-path essentially falls back to single page
-	 * allocations, which doesn't improve performance.  This code
-	 * need bulk allocation support from the page allocator code.
-	 */
-
-	/* Cache was empty, do real allocation */
-#ifdef CONFIG_NUMA
-	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
-#else
-	page = alloc_pages(gfp, pool->p.order);
-#endif
-	if (!page)
+	if (unlikely(!__alloc_pages_bulk(gfp, pp_nid, NULL, bulk, &page_list)))
 		return NULL;
 
+	/* First page is extracted and returned to caller */
+	first_page = list_first_entry(&page_list, struct page, lru);
+	list_del(&first_page->lru);
+
+	/* Remaining pages store in alloc.cache */
+	list_for_each_entry_safe(page, next, &page_list, lru) {
+		list_del(&page->lru);
+		if ((pp_flags & PP_FLAG_DMA_MAP) &&
+		    unlikely(!page_pool_dma_map(pool, page))) {
+			put_page(page);
+			continue;
+		}
+		if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
+			pool->alloc.cache[pool->alloc.count++] = page;
+			pool->pages_state_hold_cnt++;
+			trace_page_pool_state_hold(pool, page,
+						   pool->pages_state_hold_cnt);
+		} else {
+			put_page(page);
+		}
+	}
+out:
 	if ((pp_flags & PP_FLAG_DMA_MAP) &&
-	    unlikely(!page_pool_dma_map(pool, page))) {
-		put_page(page);
+	    unlikely(!page_pool_dma_map(pool, first_page))) {
+		put_page(first_page);
 		return NULL;
 	}
 
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
-	trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
+	trace_page_pool_state_hold(pool, first_page, pool->pages_state_hold_cnt);
 
 	/* When page just alloc'ed is should/must have refcnt 1. */
-	return page;
+	return first_page;
 }
 
 /* For using page_pool replace: alloc_pages() API calls, but provide
-- 
2.26.2



* Re: [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator
  2021-03-12 15:43 ` [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator Mel Gorman
@ 2021-03-12 18:44   ` Alexander Duyck
  2021-03-12 19:22     ` Chuck Lever III
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Duyck @ 2021-03-12 18:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer,
	Christoph Hellwig, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS

On Fri, Mar 12, 2021 at 7:43 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Reduce the rate at which nfsd threads hammer on the page allocator.
> This improves throughput scalability by enabling the threads to run
> more independently of each other.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  net/sunrpc/svc_xprt.c | 43 +++++++++++++++++++++++++++++++------------
>  1 file changed, 31 insertions(+), 12 deletions(-)
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index cfa7e4776d0e..38a8d6283801 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -642,11 +642,12 @@ static void svc_check_conn_limits(struct svc_serv *serv)
>  static int svc_alloc_arg(struct svc_rqst *rqstp)
>  {
>         struct svc_serv *serv = rqstp->rq_server;
> +       unsigned long needed;
>         struct xdr_buf *arg;
> +       struct page *page;
>         int pages;
>         int i;
>
> -       /* now allocate needed pages.  If we get a failure, sleep briefly */
>         pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
>         if (pages > RPCSVC_MAXPAGES) {
>                 pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
> @@ -654,19 +655,28 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>                 /* use as many pages as possible */
>                 pages = RPCSVC_MAXPAGES;
>         }
> -       for (i = 0; i < pages ; i++)
> -               while (rqstp->rq_pages[i] == NULL) {
> -                       struct page *p = alloc_page(GFP_KERNEL);
> -                       if (!p) {
> -                               set_current_state(TASK_INTERRUPTIBLE);
> -                               if (signalled() || kthread_should_stop()) {
> -                                       set_current_state(TASK_RUNNING);
> -                                       return -EINTR;
> -                               }
> -                               schedule_timeout(msecs_to_jiffies(500));
> +

> +       for (needed = 0, i = 0; i < pages ; i++)
> +               if (!rqstp->rq_pages[i])
> +                       needed++;

I would use opening and closing braces for the for loop since
technically the if is a multiline statement. It will make this more
readable.

> +       if (needed) {
> +               LIST_HEAD(list);
> +
> +retry:

Rather than kind of open code a while loop why not just make this
"while (needed)"? Then all you have to do is break out of the for loop
and you will automatically return here instead of having to jump to
two different labels.

> +               alloc_pages_bulk(GFP_KERNEL, needed, &list);

Rather than not using the return value would it make sense here to
perhaps subtract it from needed? Then you would know if any of the
allocation requests weren't fulfilled.

> +               for (i = 0; i < pages; i++) {

It is probably optimizing for the exception case, but I don't think
you want the "i = 0" here. If you are having to stop because the list
is empty it probably makes sense to resume where you left off. So you
should probably be initializing i to 0 before we check for needed.

> +                       if (!rqstp->rq_pages[i]) {

It might be cleaner here to just do a "continue" if rq_pages[i] is populated.

> +                               page = list_first_entry_or_null(&list,
> +                                                               struct page,
> +                                                               lru);
> +                               if (unlikely(!page))
> +                                       goto empty_list;

I think I preferred the original code that wasn't jumping away from
the loop here. With the change I suggested above that would switch the
if(needed) to while(needed) you could have it just break out of the
for loop to place itself back in the while loop.
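
Putting those suggestions together, the allocation loop could take a
shape roughly like this (an untested sketch reusing the existing
svc_alloc_arg() variables, not a drop-in replacement):

        for (needed = 0, i = 0; i < pages; i++) {
                if (!rqstp->rq_pages[i])
                        needed++;
        }

        i = 0;
        while (needed) {
                LIST_HEAD(list);

                /* Subtract what was actually allocated this pass */
                needed -= alloc_pages_bulk(GFP_KERNEL, needed, &list);
                for (; i < pages; i++) {
                        if (rqstp->rq_pages[i])
                                continue;
                        page = list_first_entry_or_null(&list, struct page, lru);
                        if (!page)
                                break;  /* drops back into the while loop */
                        list_del(&page->lru);
                        rqstp->rq_pages[i] = page;
                }
                if (needed) {
                        /* list ran dry; sleep briefly as before */
                        set_current_state(TASK_INTERRUPTIBLE);
                        if (signalled() || kthread_should_stop()) {
                                set_current_state(TASK_RUNNING);
                                return -EINTR;
                        }
                        schedule_timeout(msecs_to_jiffies(500));
                }
        }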

> +                               list_del(&page->lru);
> +                               rqstp->rq_pages[i] = page;
> +                               needed--;
>                         }
> -                       rqstp->rq_pages[i] = p;
>                 }
> +       }
>         rqstp->rq_page_end = &rqstp->rq_pages[pages];
>         rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
>
> @@ -681,6 +691,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>         arg->len = (pages-1)*PAGE_SIZE;
>         arg->tail[0].iov_len = 0;
>         return 0;
> +
> +empty_list:
> +       set_current_state(TASK_INTERRUPTIBLE);
> +       if (signalled() || kthread_should_stop()) {
> +               set_current_state(TASK_RUNNING);
> +               return -EINTR;
> +       }
> +       schedule_timeout(msecs_to_jiffies(500));
> +       goto retry;
>  }
>
>  static bool
> --
> 2.26.2
>


* Re: [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator
  2021-03-12 18:44   ` Alexander Duyck
@ 2021-03-12 19:22     ` Chuck Lever III
  2021-03-13 12:59       ` Mel Gorman
  0 siblings, 1 reply; 28+ messages in thread
From: Chuck Lever III @ 2021-03-12 19:22 UTC (permalink / raw)
  To: Alexander Duyck, Mel Gorman
  Cc: Andrew Morton, Jesper Dangaard Brouer, Christoph Hellwig,
	Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux NFS Mailing List

Mel, I can send you a tidied and tested update to this patch,
or you can drop the two NFSD patches and I can submit them via
the NFSD tree when alloc_pages_bulk() is merged.

> On Mar 12, 2021, at 1:44 PM, Alexander Duyck <alexander.duyck@gmail.com> wrote:
> 
> On Fri, Mar 12, 2021 at 7:43 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>> 
>> From: Chuck Lever <chuck.lever@oracle.com>
>> 
>> Reduce the rate at which nfsd threads hammer on the page allocator.
>> This improves throughput scalability by enabling the threads to run
>> more independently of each other.
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> ---
>> net/sunrpc/svc_xprt.c | 43 +++++++++++++++++++++++++++++++------------
>> 1 file changed, 31 insertions(+), 12 deletions(-)
>> 
>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> index cfa7e4776d0e..38a8d6283801 100644
>> --- a/net/sunrpc/svc_xprt.c
>> +++ b/net/sunrpc/svc_xprt.c
>> @@ -642,11 +642,12 @@ static void svc_check_conn_limits(struct svc_serv *serv)
>> static int svc_alloc_arg(struct svc_rqst *rqstp)
>> {
>>        struct svc_serv *serv = rqstp->rq_server;
>> +       unsigned long needed;
>>        struct xdr_buf *arg;
>> +       struct page *page;
>>        int pages;
>>        int i;
>> 
>> -       /* now allocate needed pages.  If we get a failure, sleep briefly */
>>        pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
>>        if (pages > RPCSVC_MAXPAGES) {
>>                pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
>> @@ -654,19 +655,28 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>>                /* use as many pages as possible */
>>                pages = RPCSVC_MAXPAGES;
>>        }
>> -       for (i = 0; i < pages ; i++)
>> -               while (rqstp->rq_pages[i] == NULL) {
>> -                       struct page *p = alloc_page(GFP_KERNEL);
>> -                       if (!p) {
>> -                               set_current_state(TASK_INTERRUPTIBLE);
>> -                               if (signalled() || kthread_should_stop()) {
>> -                                       set_current_state(TASK_RUNNING);
>> -                                       return -EINTR;
>> -                               }
>> -                               schedule_timeout(msecs_to_jiffies(500));
>> +
> 
>> +       for (needed = 0, i = 0; i < pages ; i++)
>> +               if (!rqstp->rq_pages[i])
>> +                       needed++;
> 
> I would use an opening and closing braces for the for loop since
> technically the if is a multiline statement. It will make this more
> readable.
> 
>> +       if (needed) {
>> +               LIST_HEAD(list);
>> +
>> +retry:
> 
> Rather than kind of open code a while loop why not just make this
> "while (needed)"? Then all you have to do is break out of the for loop
> and you will automatically return here instead of having to jump to
> two different labels.
> 
>> +               alloc_pages_bulk(GFP_KERNEL, needed, &list);
> 
> Rather than not using the return value would it make sense here to
> perhaps subtract it from needed? Then you would know if any of the
> allocation requests weren't fulfilled.
> 
>> +               for (i = 0; i < pages; i++) {
> 
> It is probably optimizing for the exception case, but I don't think
> you want the "i = 0" here. If you are having to stop because the list
> is empty it probably makes sense to resume where you left off. So you
> should probably be initializing i to 0 before we check for needed.
> 
>> +                       if (!rqstp->rq_pages[i]) {
> 
> It might be cleaner here to just do a "continue" if rq_pages[i] is populated.
> 
>> +                               page = list_first_entry_or_null(&list,
>> +                                                               struct page,
>> +                                                               lru);
>> +                               if (unlikely(!page))
>> +                                       goto empty_list;
> 
> I think I preferred the original code that wasn't jumping away from
> the loop here. With the change I suggested above that would switch the
> if(needed) to while(needed) you could have it just break out of the
> for loop to place itself back in the while loop.
> 
>> +                               list_del(&page->lru);
>> +                               rqstp->rq_pages[i] = page;
>> +                               needed--;
>>                        }
>> -                       rqstp->rq_pages[i] = p;
>>                }
>> +       }
>>        rqstp->rq_page_end = &rqstp->rq_pages[pages];
>>        rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
>> 
>> @@ -681,6 +691,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>>        arg->len = (pages-1)*PAGE_SIZE;
>>        arg->tail[0].iov_len = 0;
>>        return 0;
>> +
>> +empty_list:
>> +       set_current_state(TASK_INTERRUPTIBLE);
>> +       if (signalled() || kthread_should_stop()) {
>> +               set_current_state(TASK_RUNNING);
>> +               return -EINTR;
>> +       }
>> +       schedule_timeout(msecs_to_jiffies(500));
>> +       goto retry;
>> }
>> 
>> static bool
>> --
>> 2.26.2

--
Chuck Lever





* Re: [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path
  2021-03-12 15:43 ` [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path Mel Gorman
@ 2021-03-12 19:44   ` Alexander Duyck
  2021-03-12 20:05     ` Ilias Apalodimas
  2021-03-13 13:30     ` Mel Gorman
  0 siblings, 2 replies; 28+ messages in thread
From: Alexander Duyck @ 2021-03-12 19:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer,
	Christoph Hellwig, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS

On Fri, Mar 12, 2021 at 7:43 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> From: Jesper Dangaard Brouer <brouer@redhat.com>
>
> There are cases where the page_pool need to refill with pages from the
> page allocator. Some workloads cause the page_pool to release pages
> instead of recycling these pages.
>
> For these workload it can improve performance to bulk alloc pages from
> the page-allocator to refill the alloc cache.
>
> For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> redirecting xdp_frame packets into a veth, that does XDP_PASS to create
> an SKB from the xdp_frame, which then cannot return the page to the
> page_pool. In this case, we saw[1] an improvement of 18.8% from using
> the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
>
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> ---
>  net/core/page_pool.c | 62 ++++++++++++++++++++++++++++----------------
>  1 file changed, 39 insertions(+), 23 deletions(-)
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 40e1b2beaa6c..a5889f1b86aa 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -208,44 +208,60 @@ noinline
>  static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>                                                  gfp_t _gfp)
>  {
> +       const int bulk = PP_ALLOC_CACHE_REFILL;
> +       struct page *page, *next, *first_page;
>         unsigned int pp_flags = pool->p.flags;
> -       struct page *page;
> +       unsigned int pp_order = pool->p.order;
> +       int pp_nid = pool->p.nid;
> +       LIST_HEAD(page_list);
>         gfp_t gfp = _gfp;
>
> -       /* We could always set __GFP_COMP, and avoid this branch, as
> -        * prep_new_page() can handle order-0 with __GFP_COMP.
> -        */
> -       if (pool->p.order)
> +       /* Don't support bulk alloc for high-order pages */
> +       if (unlikely(pp_order)) {
>                 gfp |= __GFP_COMP;
> +               first_page = alloc_pages_node(pp_nid, gfp, pp_order);
> +               if (unlikely(!first_page))
> +                       return NULL;
> +               goto out;
> +       }
>
> -       /* FUTURE development:
> -        *
> -        * Current slow-path essentially falls back to single page
> -        * allocations, which doesn't improve performance.  This code
> -        * need bulk allocation support from the page allocator code.
> -        */
> -
> -       /* Cache was empty, do real allocation */
> -#ifdef CONFIG_NUMA
> -       page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
> -#else
> -       page = alloc_pages(gfp, pool->p.order);
> -#endif
> -       if (!page)
> +       if (unlikely(!__alloc_pages_bulk(gfp, pp_nid, NULL, bulk, &page_list)))
>                 return NULL;
>
> +       /* First page is extracted and returned to caller */
> +       first_page = list_first_entry(&page_list, struct page, lru);
> +       list_del(&first_page->lru);
> +

This seems kind of broken to me. If you pull the first page and then
cannot map it you end up returning NULL even if you placed a number of
pages in the cache.

It might make more sense to have the loop below record a pointer to
the last page you processed and handle things in two stages so that on
the first iteration you map one page.

So something along the lines of:
1. Initialize last_page to NULL

for each page in the list
  2. Map page
  3. If last_page is non-NULL, move to cache
  4. Assign page to last_page
  5. Return to step 2 for each page in list

6. return last_page
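
In page_pool terms that might look roughly like the following (an
untested sketch reusing the names from the patch; the DMA-unmap question
raised below still applies to the put_page() leg):

        struct page *page, *next, *last_page = NULL;

        list_for_each_entry_safe(page, next, &page_list, lru) {
                list_del(&page->lru);

                if ((pp_flags & PP_FLAG_DMA_MAP) &&
                    unlikely(!page_pool_dma_map(pool, page))) {
                        put_page(page);
                        continue;
                }

                /* The previously mapped page goes into the cache... */
                if (last_page) {
                        if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
                                pool->alloc.cache[pool->alloc.count++] = last_page;
                                pool->pages_state_hold_cnt++;
                                trace_page_pool_state_hold(pool, last_page,
                                                pool->pages_state_hold_cnt);
                        } else {
                                put_page(last_page);
                        }
                }
                /* ...and the newest mapped page becomes the return candidate */
                last_page = page;
        }

        if (last_page) {
                /* Track the returned page as in-flight too */
                pool->pages_state_hold_cnt++;
                trace_page_pool_state_hold(pool, last_page,
                                pool->pages_state_hold_cnt);
        }

        /* NULL only if every allocation or mapping attempt failed */
        return last_page;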

> +       /* Remaining pages store in alloc.cache */
> +       list_for_each_entry_safe(page, next, &page_list, lru) {
> +               list_del(&page->lru);
> +               if ((pp_flags & PP_FLAG_DMA_MAP) &&
> +                   unlikely(!page_pool_dma_map(pool, page))) {
> +                       put_page(page);
> +                       continue;
> +               }

So if you added a last_page pointer what you could do is check for it
here and assign it to the alloc cache. If last_page is not set the
block would be skipped.

> +               if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
> +                       pool->alloc.cache[pool->alloc.count++] = page;
> +                       pool->pages_state_hold_cnt++;
> +                       trace_page_pool_state_hold(pool, page,
> +                                                  pool->pages_state_hold_cnt);
> +               } else {
> +                       put_page(page);

If you are just calling put_page here aren't you leaking DMA mappings?
Wouldn't you need to potentially unmap the page before you call
put_page on it?

> +               }
> +       }
> +out:
>         if ((pp_flags & PP_FLAG_DMA_MAP) &&
> -           unlikely(!page_pool_dma_map(pool, page))) {
> -               put_page(page);
> +           unlikely(!page_pool_dma_map(pool, first_page))) {
> +               put_page(first_page);

I would probably move this block up and make it a part of the pp_order
block above. Also since you are doing this in 2 spots it might make
sense to look at possibly making this an inline function.

>                 return NULL;
>         }
>
>         /* Track how many pages are held 'in-flight' */
>         pool->pages_state_hold_cnt++;
> -       trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
> +       trace_page_pool_state_hold(pool, first_page, pool->pages_state_hold_cnt);
>
>         /* When page just alloc'ed is should/must have refcnt 1. */
> -       return page;
> +       return first_page;
>  }
>
>  /* For using page_pool replace: alloc_pages() API calls, but provide
> --
> 2.26.2
>


* Re: [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path
  2021-03-12 19:44   ` Alexander Duyck
@ 2021-03-12 20:05     ` Ilias Apalodimas
  2021-03-15 13:39       ` Jesper Dangaard Brouer
  2021-03-13 13:30     ` Mel Gorman
  1 sibling, 1 reply; 28+ messages in thread
From: Ilias Apalodimas @ 2021-03-12 20:05 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Mel Gorman, Andrew Morton, Chuck Lever, Jesper Dangaard Brouer,
	Christoph Hellwig, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS

[...]
> 6. return last_page
> 
> > +       /* Remaining pages store in alloc.cache */
> > +       list_for_each_entry_safe(page, next, &page_list, lru) {
> > +               list_del(&page->lru);
> > +               if ((pp_flags & PP_FLAG_DMA_MAP) &&
> > +                   unlikely(!page_pool_dma_map(pool, page))) {
> > +                       put_page(page);
> > +                       continue;
> > +               }
> 
> So if you added a last_page pointer what you could do is check for it
> here and assign it to the alloc cache. If last_page is not set the
> block would be skipped.
> 
> > +               if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
> > +                       pool->alloc.cache[pool->alloc.count++] = page;
> > +                       pool->pages_state_hold_cnt++;
> > +                       trace_page_pool_state_hold(pool, page,
> > +                                                  pool->pages_state_hold_cnt);
> > +               } else {
> > +                       put_page(page);
> 
> If you are just calling put_page here aren't you leaking DMA mappings?
> Wouldn't you need to potentially unmap the page before you call
> put_page on it?

Oops, I completely missed that. Alexander is right here.

> 
> > +               }
> > +       }
> > +out:
> >         if ((pp_flags & PP_FLAG_DMA_MAP) &&
> > -           unlikely(!page_pool_dma_map(pool, page))) {
> > -               put_page(page);
> > +           unlikely(!page_pool_dma_map(pool, first_page))) {
> > +               put_page(first_page);
> 
> I would probably move this block up and make it a part of the pp_order
> block above. Also since you are doing this in 2 spots it might make
> sense to look at possibly making this an inline function.
> 
> >                 return NULL;
> >         }
> >
> >         /* Track how many pages are held 'in-flight' */
> >         pool->pages_state_hold_cnt++;
> > -       trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
> > +       trace_page_pool_state_hold(pool, first_page, pool->pages_state_hold_cnt);
> >
> >         /* When page just alloc'ed is should/must have refcnt 1. */
> > -       return page;
> > +       return first_page;
> >  }
> >
> >  /* For using page_pool replace: alloc_pages() API calls, but provide
> > --
> > 2.26.2
> >

Cheers
/Ilias


* Re: [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator
  2021-03-12 19:22     ` Chuck Lever III
@ 2021-03-13 12:59       ` Mel Gorman
  0 siblings, 0 replies; 28+ messages in thread
From: Mel Gorman @ 2021-03-13 12:59 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Alexander Duyck, Andrew Morton, Jesper Dangaard Brouer,
	Christoph Hellwig, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux NFS Mailing List

On Fri, Mar 12, 2021 at 07:22:43PM +0000, Chuck Lever III wrote:
> Mel, I can send you a tidied and tested update to this patch,
> or you can drop the two NFSD patches and I can submit them via
> the NFSD tree when alloc_pages_bulk() is merged.
> 

Send me a tidied version anyway. I'm happy enough to include them in the
series even if it ultimately gets merged via the NFSD tree. It'll need
to be kept as a separate pull request to avoid delaying unrelated NFSD
patches until Andrew merges the mm parts.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path
  2021-03-12 19:44   ` Alexander Duyck
  2021-03-12 20:05     ` Ilias Apalodimas
@ 2021-03-13 13:30     ` Mel Gorman
  2021-03-15  8:40       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2021-03-13 13:30 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer,
	Christoph Hellwig, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS

On Fri, Mar 12, 2021 at 11:44:09AM -0800, Alexander Duyck wrote:
> > -       /* FUTURE development:
> > -        *
> > -        * Current slow-path essentially falls back to single page
> > -        * allocations, which doesn't improve performance.  This code
> > -        * need bulk allocation support from the page allocator code.
> > -        */
> > -
> > -       /* Cache was empty, do real allocation */
> > -#ifdef CONFIG_NUMA
> > -       page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
> > -#else
> > -       page = alloc_pages(gfp, pool->p.order);
> > -#endif
> > -       if (!page)
> > +       if (unlikely(!__alloc_pages_bulk(gfp, pp_nid, NULL, bulk, &page_list)))
> >                 return NULL;
> >
> > +       /* First page is extracted and returned to caller */
> > +       first_page = list_first_entry(&page_list, struct page, lru);
> > +       list_del(&first_page->lru);
> > +
> 
> This seems kind of broken to me. If you pull the first page and then
> cannot map it you end up returning NULL even if you placed a number of
> pages in the cache.
> 

I think you're right but I'm punting this to Jesper to fix. He's more
familiar with this particular code and can verify the performance is
still ok for high speed networks.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path
  2021-03-13 13:30     ` Mel Gorman
@ 2021-03-15  8:40       ` Jesper Dangaard Brouer
  2021-03-15 19:33         ` [PATCH mel-git] Followup: Update [PATCH 7/7] in Mel's series Jesper Dangaard Brouer
  0 siblings, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2021-03-15  8:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Alexander Duyck, Andrew Morton, Chuck Lever, Christoph Hellwig,
	Matthew Wilcox, LKML, Linux-Net, Linux-MM, Linux-NFS, brouer

On Sat, 13 Mar 2021 13:30:58 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Fri, Mar 12, 2021 at 11:44:09AM -0800, Alexander Duyck wrote:
> > > -       /* FUTURE development:
> > > -        *
> > > -        * Current slow-path essentially falls back to single page
> > > -        * allocations, which doesn't improve performance.  This code
> > > -        * need bulk allocation support from the page allocator code.
> > > -        */
> > > -
> > > -       /* Cache was empty, do real allocation */
> > > -#ifdef CONFIG_NUMA
> > > -       page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
> > > -#else
> > > -       page = alloc_pages(gfp, pool->p.order);
> > > -#endif
> > > -       if (!page)
> > > +       if (unlikely(!__alloc_pages_bulk(gfp, pp_nid, NULL, bulk, &page_list)))
> > >                 return NULL;
> > >
> > > +       /* First page is extracted and returned to caller */
> > > +       first_page = list_first_entry(&page_list, struct page, lru);
> > > +       list_del(&first_page->lru);
> > > +  
> > 
> > This seems kind of broken to me. If you pull the first page and then
> > cannot map it you end up returning NULL even if you placed a number of
> > pages in the cache.
> >   
> 
> I think you're right but I'm punting this to Jesper to fix. He's more
> familiar with this particular code and can verify the performance is
> still ok for high speed networks.

Yes, I'll take a look at this, update the patch accordingly, and re-run
the performance tests.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path
  2021-03-12 20:05     ` Ilias Apalodimas
@ 2021-03-15 13:39       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2021-03-15 13:39 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: Alexander Duyck, Mel Gorman, Andrew Morton, Chuck Lever,
	Christoph Hellwig, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS, brouer

On Fri, 12 Mar 2021 22:05:45 +0200
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> [...]
> > 6. return last_page
> >   
> > > +       /* Remaining pages store in alloc.cache */
> > > +       list_for_each_entry_safe(page, next, &page_list, lru) {
> > > +               list_del(&page->lru);
> > > +               if ((pp_flags & PP_FLAG_DMA_MAP) &&
> > > +                   unlikely(!page_pool_dma_map(pool, page))) {
> > > +                       put_page(page);
> > > +                       continue;
> > > +               }  
> > 
> > So if you added a last_page pointer what you could do is check for it
> > here and assign it to the alloc cache. If last_page is not set the
> > block would be skipped.
> >   
> > > +               if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
> > > +                       pool->alloc.cache[pool->alloc.count++] = page;
> > > +                       pool->pages_state_hold_cnt++;
> > > +                       trace_page_pool_state_hold(pool, page,
> > > +                                                  pool->pages_state_hold_cnt);
> > > +               } else {
> > > +                       put_page(page);  
> > 
> > If you are just calling put_page here aren't you leaking DMA mappings?
> > Wouldn't you need to potentially unmap the page before you call
> > put_page on it?  
> 
> Oops, I completely missed that. Alexander is right here.

Well, the put_page() case can never happen, as pool->alloc.cache[]
is known to be empty when this function is called.  I do agree that the
code looks cumbersome and that it should free the DMA mapping if that
case could ever happen.
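
For reference, if that overflow branch ever became reachable, a minimal
sketch of freeing the mapping before dropping the page could look like
the below.  It mirrors the unmap page_pool already does when releasing
a mapped page; the size and attrs here are assumptions for the order-0
bulk path, so treat it as illustrative only:

	if (pp_flags & PP_FLAG_DMA_MAP)
		/* drop the mapping set up by page_pool_dma_map() */
		dma_unmap_page_attrs(pool->p.dev,
				     page_pool_get_dma_addr(page),
				     PAGE_SIZE, pool->p.dma_dir,
				     DMA_ATTR_SKIP_CPU_SYNC);
	put_page(page);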

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH mel-git] Followup: Update [PATCH 7/7] in Mel's series
  2021-03-15  8:40       ` Jesper Dangaard Brouer
@ 2021-03-15 19:33         ` Jesper Dangaard Brouer
  2021-03-15 19:33           ` [PATCH mel-git] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
  0 siblings, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2021-03-15 19:33 UTC (permalink / raw)
  To: Mel Gorman, linux-mm
  Cc: Jesper Dangaard Brouer, chuck.lever, Alexander Duyck, netdev,
	linux-nfs, linux-kernel

This patch is against Mel's git-tree:
 git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git

Using branch mm-bulk-rebase-v4r2, but replacing the last patch, the one
that converts page_pool to use __alloc_pages_bulk().

 https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/log/?h=mm-bulk-rebase-v4r2

While implementing suggestions from Alexander Duyck, I realised that I
could simplify the code further and simply take the last page from
pool->alloc.cache, as this avoids special-casing the last page.

I re-ran the performance tests and the improvement has dropped to 13%
from the earlier 18%, but I don't think the rewrite of this specific
patch has anything to do with that.

Notes on tests:
 https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#test-on-mel-git-tree

---

Jesper Dangaard Brouer (1):
      net: page_pool: use alloc_pages_bulk in refill code path


 net/core/page_pool.c |   73 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 47 insertions(+), 26 deletions(-)

--


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH mel-git] net: page_pool: use alloc_pages_bulk in refill code path
  2021-03-15 19:33         ` [PATCH mel-git] Followup: Update [PATCH 7/7] in Mel's series Jesper Dangaard Brouer
@ 2021-03-15 19:33           ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2021-03-15 19:33 UTC (permalink / raw)
  To: Mel Gorman, linux-mm
  Cc: Jesper Dangaard Brouer, chuck.lever, Alexander Duyck, netdev,
	linux-nfs, linux-kernel

There are cases where the page_pool needs to refill with pages from the
page allocator. Some workloads cause the page_pool to release pages
instead of recycling them.

For these workloads it can improve performance to bulk-allocate pages
from the page allocator to refill the alloc cache.

For an XDP-redirect workload with the 100G mlx5 driver (which uses
page_pool), redirecting xdp_frame packets into a veth that does XDP_PASS
to create an SKB from the xdp_frame, the page cannot be returned to the
page_pool. In this case, we saw[1] an improvement of 13% from using
the alloc_pages_bulk API (3,810,013 pps -> 4,308,208 pps).

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 net/core/page_pool.c |   73 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 40e1b2beaa6c..7c194335c066 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -203,38 +203,17 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 	return true;
 }
 
-/* slow path */
-noinline
-static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
-						 gfp_t _gfp)
+static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
+						 gfp_t gfp)
 {
-	unsigned int pp_flags = pool->p.flags;
 	struct page *page;
-	gfp_t gfp = _gfp;
-
-	/* We could always set __GFP_COMP, and avoid this branch, as
-	 * prep_new_page() can handle order-0 with __GFP_COMP.
-	 */
-	if (pool->p.order)
-		gfp |= __GFP_COMP;
-
-	/* FUTURE development:
-	 *
-	 * Current slow-path essentially falls back to single page
-	 * allocations, which doesn't improve performance.  This code
-	 * need bulk allocation support from the page allocator code.
-	 */
 
-	/* Cache was empty, do real allocation */
-#ifdef CONFIG_NUMA
+	gfp |= __GFP_COMP;
 	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
-#else
-	page = alloc_pages(gfp, pool->p.order);
-#endif
-	if (!page)
+	if (unlikely(!page))
 		return NULL;
 
-	if ((pp_flags & PP_FLAG_DMA_MAP) &&
+	if ((pool->p.flags & PP_FLAG_DMA_MAP) &&
 	    unlikely(!page_pool_dma_map(pool, page))) {
 		put_page(page);
 		return NULL;
@@ -243,6 +222,48 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
 	trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
+	return page;
+}
+
+/* slow path */
+noinline
+static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
+						 gfp_t gfp)
+{
+	const int bulk = PP_ALLOC_CACHE_REFILL;
+	unsigned int pp_flags = pool->p.flags;
+	unsigned int pp_order = pool->p.order;
+	int pp_nid = pool->p.nid;
+	struct page *page, *next;
+	LIST_HEAD(page_list);
+
+	/* Don't support bulk alloc for high-order pages */
+	if (unlikely(pp_order))
+		return __page_pool_alloc_page_order(pool, gfp);
+
+	if (unlikely(!__alloc_pages_bulk(gfp, pp_nid, NULL, bulk, &page_list)))
+		return NULL;
+
+	list_for_each_entry_safe(page, next, &page_list, lru) {
+		list_del(&page->lru);
+		if ((pp_flags & PP_FLAG_DMA_MAP) &&
+		    unlikely(!page_pool_dma_map(pool, page))) {
+			put_page(page);
+			continue;
+		}
+		/* Alloc cache have room as it is empty on function call */
+		pool->alloc.cache[pool->alloc.count++] = page;
+		/* Track how many pages are held 'in-flight' */
+		pool->pages_state_hold_cnt++;
+		trace_page_pool_state_hold(pool, page,
+					   pool->pages_state_hold_cnt);
+	}
+
+	/* Return last page */
+	if (likely(pool->alloc.count > 0))
+		page = pool->alloc.cache[--pool->alloc.count];
+	else
+		page = NULL;
 
 	/* When page just alloc'ed is should/must have refcnt 1. */
 	return page;



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users
  2021-03-12 15:43 [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
                   ` (6 preceding siblings ...)
  2021-03-12 15:43 ` [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path Mel Gorman
@ 2021-03-17 16:31 ` Alexander Lobakin
  2021-03-17 16:38   ` Jesper Dangaard Brouer
  7 siblings, 1 reply; 28+ messages in thread
From: Alexander Lobakin @ 2021-03-17 16:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Alexander Lobakin, Andrew Morton, Chuck Lever,
	Jesper Dangaard Brouer, Christoph Hellwig, Alexander Duyck,
	Matthew Wilcox, LKML, Linux-Net, Linux-MM, Linux-NFS

From: Mel Gorman <mgorman@techsingularity.net>
Date: Fri, 12 Mar 2021 15:43:24 +0000

Hi there,

> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2

I gave this series a go on my setup; it showed a bump of 10 Mbps on
UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.

(4 core 1.2GHz MIPS32 R2, page size of 16 KB, Page Pool order-0
allocations with MTU of 1508 bytes, linear frames via build_skb(),
GRO + TSO/USO)

I didn't have time to drill into the code, so for now I can't provide
any additional details. You can request anything you need though and
I'll try to find a window to collect it.

> Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> vaguely aware that there are other prototype series on netdev that affect
> page_pool. The conflict should be obvious in linux-next.
>
> Changelog since v3
> o Rebase on top of Matthew's series consolidating the alloc_pages API
> o Rename alloced to allocated
> o Split out preparation patch for prepare_alloc_pages
> o Defensive check for bulk allocation or <= 0 pages
> o Call single page allocation path only if no pages were allocated
> o Minor cosmetic cleanups
> o Reorder patch dependencies by subsystem. As this is a cross-subsystem
>   series, the mm patches have to be merged before the sunrpc and net
>   users.
>
> Changelog since v2
> o Prep new pages with IRQs enabled
> o Minor documentation update
>
> Changelog since v1
> o Parenthesise binary and boolean comparisons
> o Add reviewed-bys
> o Rebase to 5.12-rc2
>
> This series introduces a bulk order-0 page allocator with sunrpc and
> the network page pool being the first users. The implementation is not
> particularly efficient and the intention is to iron out what the semantics
> of the API should have for users. Once the semantics are ironed out, it
> can be made more efficient. Despite that, this is a performance-related
> for users that require multiple pages for an operation without multiple
> round-trips to the page allocator. Quoting the last patch for the
> high-speed networking use-case.
>
>     For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
>     redirecting xdp_frame packets into a veth, that does XDP_PASS to
>     create an SKB from the xdp_frame, which then cannot return the page
>     to the page_pool. In this case, we saw[1] an improvement of 18.8%
>     from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
>
> Both users in this series are corner cases (NFS and high-speed networks)
> so it is unlikely that most users will see any benefit in the short
> term. Potential other users are batch allocations for page cache
> readahead, fault around and SLUB allocations when high-order pages are
> unavailable. It's unknown how much benefit would be seen by converting
> multiple page allocation calls to a single batch or what difference it may
> make to headline performance. It's a chicken and egg problem given that
> the potential benefit cannot be investigated without an implementation
> to test against.
>
> Light testing passed, I'm relying on Chuck and Jesper to test the target
> users more aggressively but both report performance improvements with
> the initial RFC.
>
> Patch 1 moves GFP flag initialision to prepare_alloc_pages
>
> Patch 2 renames a variable name that is particularly unpopular
>
> Patch 3 adds a bulk page allocator
>
> Patch 4 is a sunrpc cleanup that is a pre-requisite.
>
> Patch 5 is the sunrpc user. Chuck also has a patch which further caches
> 	pages but is not included in this series. It's not directly
> 	related to the bulk allocator and as it caches pages, it might
> 	have other concerns (e.g. does it need a shrinker?)
>
> Patch 6 is a preparation patch only for the network user
>
> Patch 7 converts the net page pool to the bulk allocator for order-0 pages.
>
>  include/linux/gfp.h   |  12 ++++
>  mm/page_alloc.c       | 149 +++++++++++++++++++++++++++++++++++++-----
>  net/core/page_pool.c  | 101 +++++++++++++++++-----------
>  net/sunrpc/svc_xprt.c |  47 +++++++++----
>  4 files changed, 240 insertions(+), 69 deletions(-)
>
> --
> 2.26.2

Thanks,
Al


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users
  2021-03-17 16:31 ` [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Alexander Lobakin
@ 2021-03-17 16:38   ` Jesper Dangaard Brouer
  2021-03-17 16:52     ` Alexander Lobakin
  0 siblings, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2021-03-17 16:38 UTC (permalink / raw)
  To: Alexander Lobakin, Mel Gorman
  Cc: Andrew Morton, Chuck Lever, Christoph Hellwig, Alexander Duyck,
	Matthew Wilcox, LKML, Linux-Net, Linux-MM, Linux-NFS, brouer

On Wed, 17 Mar 2021 16:31:07 +0000
Alexander Lobakin <alobakin@pm.me> wrote:

> From: Mel Gorman <mgorman@techsingularity.net>
> Date: Fri, 12 Mar 2021 15:43:24 +0000
> 
> Hi there,
> 
> > This series is based on top of Matthew Wilcox's series "Rationalise
> > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > test and are not using Andrew's tree as a baseline, I suggest using the
> > following git tree
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2  
> 
> I gave this series a go on my setup, it showed a bump of 10 Mbps on
> UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> 
> (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> allocations with MTU of 1508 bytes, linear frames via build_skb(),
> GRO + TSO/USO)

What NIC driver is this?

> I didn't have time to drill into the code, so for now can't provide
> any additional details. You can request anything you need though and
> I'll try to find a window to collect it.
> 
> > Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> > want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> > directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> > vaguely aware that there are other prototype series on netdev that affect
> > page_pool. The conflict should be obvious in linux-next.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users
  2021-03-17 16:38   ` Jesper Dangaard Brouer
@ 2021-03-17 16:52     ` Alexander Lobakin
  2021-03-17 17:19       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Lobakin @ 2021-03-17 16:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Lobakin, Mel Gorman, Andrew Morton, Chuck Lever,
	Christoph Hellwig, Alexander Duyck, Matthew Wilcox, LKML,
	Linux-Net, Linux-MM, Linux-NFS

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 17 Mar 2021 17:38:44 +0100

> On Wed, 17 Mar 2021 16:31:07 +0000
> Alexander Lobakin <alobakin@pm.me> wrote:
>
> > From: Mel Gorman <mgorman@techsingularity.net>
> > Date: Fri, 12 Mar 2021 15:43:24 +0000
> >
> > Hi there,
> >
> > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > following git tree
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> >
> > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> >
> > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > GRO + TSO/USO)
>
> What NIC driver is this?

Ah, I forgot to mention. It's a WIP driver, not yet mainlined.
The NIC itself is basically an on-SoC 1G chip.

> > I didn't have time to drill into the code, so for now can't provide
> > any additional details. You can request anything you need though and
> > I'll try to find a window to collect it.
> >
> > > Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> > > want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> > > directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> > > vaguely aware that there are other prototype series on netdev that affect
> > > page_pool. The conflict should be obvious in linux-next.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Al


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users
  2021-03-17 16:52     ` Alexander Lobakin
@ 2021-03-17 17:19       ` Jesper Dangaard Brouer
  2021-03-17 22:25         ` Alexander Lobakin
  0 siblings, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2021-03-17 17:19 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Mel Gorman, Andrew Morton, Chuck Lever, Christoph Hellwig,
	Alexander Duyck, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS, brouer

On Wed, 17 Mar 2021 16:52:32 +0000
Alexander Lobakin <alobakin@pm.me> wrote:

> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Wed, 17 Mar 2021 17:38:44 +0100
> 
> > On Wed, 17 Mar 2021 16:31:07 +0000
> > Alexander Lobakin <alobakin@pm.me> wrote:
> >  
> > > From: Mel Gorman <mgorman@techsingularity.net>
> > > Date: Fri, 12 Mar 2021 15:43:24 +0000
> > >
> > > Hi there,
> > >  
> > > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > > following git tree
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2  
> > >
> > > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> > >
> > > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > > GRO + TSO/USO)  
> >
> > What NIC driver is this?  
> 
> Ah, forgot to mention. It's a WIP driver, not yet mainlined.
> The NIC itself is basically on-SoC 1G chip.

Hmm, then it is really hard to check if your driver is doing something
else that could cause this.

Well, can you try lowering the page_pool bulking size to test the
theory from Wilcox that we should do smaller bulking to avoid pushing
cachelines into L2 when walking the LRU list?  You might have to go as
low as bulk=8 (to match the N-way associativity of the L1 cache).

In the function __page_pool_alloc_pages_slow(), adjust the variable:
  const int bulk = PP_ALLOC_CACHE_REFILL;
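
For example, such a test could be a throwaway change as simple as this
(illustrative only, not a proposed patch):

	-	const int bulk = PP_ALLOC_CACHE_REFILL;
	+	const int bulk = 8;	/* temporary test value */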


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users
  2021-03-17 17:19       ` Jesper Dangaard Brouer
@ 2021-03-17 22:25         ` Alexander Lobakin
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Lobakin @ 2021-03-17 22:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Lobakin, Mel Gorman, Andrew Morton, Chuck Lever,
	Christoph Hellwig, Alexander Duyck, Matthew Wilcox, LKML,
	Linux-Net, Linux-MM, Linux-NFS

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 17 Mar 2021 18:19:43 +0100

> On Wed, 17 Mar 2021 16:52:32 +0000
> Alexander Lobakin <alobakin@pm.me> wrote:
>
> > From: Jesper Dangaard Brouer <brouer@redhat.com>
> > Date: Wed, 17 Mar 2021 17:38:44 +0100
> >
> > > On Wed, 17 Mar 2021 16:31:07 +0000
> > > Alexander Lobakin <alobakin@pm.me> wrote:
> > >
> > > > From: Mel Gorman <mgorman@techsingularity.net>
> > > > Date: Fri, 12 Mar 2021 15:43:24 +0000
> > > >
> > > > Hi there,
> > > >
> > > > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > > > following git tree
> > > > >
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> > > >
> > > > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > > > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> > > >
> > > > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > > > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > > > GRO + TSO/USO)
> > >
> > > What NIC driver is this?
> >
> > Ah, forgot to mention. It's a WIP driver, not yet mainlined.
> > The NIC itself is basically on-SoC 1G chip.
>
> Hmm, then it is really hard to check if your driver is doing something
> else that could cause this.
>
> Well, can you try to lower the page_pool bulking size, to test the
> theory from Wilcox that we should do smaller bulking to avoid pushing
> cachelines into L2 when walking the LRU list.  You might have to go as
> low as bulk=8 (for N-way associative level of L1 cache).

It turned out that it suffered from GCC's inlining decisions.
All of the following was taken on GCC 10.2.0 with -O2 in the dotconfig.

vmlinux differences between baseline and this series:

(I used your follow-up instead of the last patch from the tree)

Function                                     old     new   delta
__rmqueue_pcplist                              -    2024   +2024
__alloc_pages_bulk                             -    1456   +1456
__page_pool_alloc_pages_slow                 284     600    +316
page_pool_dma_map                              -     164    +164
get_page_from_freelist                      5676    3760   -1916

The uninlining of __rmqueue_pcplist() hurts a lot. It slightly slows
down the "regular" page allocator, but makes __alloc_pages_bulk()
much slower than per-page allocation (in my case at least) because the
function is now called from inside the loop.

One possible solution is to mark __rmqueue_pcplist() and
rmqueue_bulk() as __always_inline. It has to be both of them, and
specifically __always_inline, or GCC will emit rmqueue_bulk.constprop
and make the numbers even poorer.
This nearly doubles the size of the bulk allocator, but eliminates
all the performance hits (a sketch of the change follows the tables
below).

Function                                     old     new   delta
__alloc_pages_bulk                          1456    3512   +2056
get_page_from_freelist                      3760    5744   +1984
find_suitable_fallback.part                    -     160    +160
min_free_kbytes_sysctl_handler                96     128     +32
find_suitable_fallback                       164      28    -136
__rmqueue_pcplist                           2024       -   -2024

Between baseline and this series with __always_inline hints:

Function                                     old     new   delta
__alloc_pages_bulk                             -    3512   +3512
find_suitable_fallback.part                    -     160    +160
get_page_from_freelist                      5676    5744     +68
min_free_kbytes_sysctl_handler                96     128     +32
find_suitable_fallback                       164      28    -136

Another suboptimal place I've found is two functions in Page Pool
code which are marked as 'noinline'.
Maybe there's a reason behind this, but removing the annotations
and additionally marking page_pool_dma_map() as inline simplifies
the object code and in fact improves the performance (+15 Mbps on
my setup):

add/remove: 0/3 grow/shrink: 1/0 up/down: 1024/-1096 (-72)
Function                                     old     new   delta
page_pool_alloc_pages                        100    1124   +1024
page_pool_dma_map                            164       -    -164
page_pool_refill_alloc_cache                 332       -    -332
__page_pool_alloc_pages_slow                 600       -    -600

1124 is a normal size for a hotpath function.
The fragmentation and jumps between page_pool_alloc_pages(),
__page_pool_alloc_pages_slow() and page_pool_refill_alloc_cache()
are really excessive and unhealthy for performance, as is leaving
page_pool_dma_map() uninlined by GCC.

So the best results I got so far were with these additional changes
(sketched after the list below):
 - mark __rmqueue_pcplist() as __always_inline;
 - mark rmqueue_bulk() as __always_inline;
 - drop 'noinline' from page_pool_refill_alloc_cache();
 - drop 'noinline' from __page_pool_alloc_pages_slow();
 - mark page_pool_dma_map() as inline.

(inlines in C files aren't generally recommended, but well, GCC
 is far from perfect)
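
For completeness, the net/core/page_pool.c side of that list would look
roughly like the below.  This is a sketch only; the signature of
page_pool_refill_alloc_cache() is assumed from context, while the other
two are visible in the patch earlier in the thread:

	-noinline
	 static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)

	-static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
	+static inline bool page_pool_dma_map(struct page_pool *pool, struct page *page)

	-noinline
	 static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
	 						 gfp_t gfp)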

> In function: __page_pool_alloc_pages_slow() adjust variable:
>   const int bulk = PP_ALLOC_CACHE_REFILL;

Regarding the bulk size, it makes no difference on my machine. I tried
{ 8, 16, 32, 64 } and they differed by 1-2 Mbps at most, within the
standard deviation.
Most of the bulk operations I've seen use 16 as a "golden ratio",
though.

>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Thanks,
Al


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/7] mm/page_alloc: Move gfp_allowed_mask enforcement to prepare_alloc_pages
  2021-03-12 15:43 ` [PATCH 1/7] mm/page_alloc: Move gfp_allowed_mask enforcement to prepare_alloc_pages Mel Gorman
@ 2021-03-19 16:11   ` Vlastimil Babka
  2021-03-19 17:49     ` Mel Gorman
  0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2021-03-19 16:11 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig,
	Alexander Duyck, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS

On 3/12/21 4:43 PM, Mel Gorman wrote:
> __alloc_pages updates GFP flags to enforce what flags are allowed
> during a global context such as booting or suspend. This patch moves the
> enforcement from __alloc_pages to prepare_alloc_pages so the code can be
> shared between the single page allocator and a new bulk page allocator.
> 
> When moving, it is obvious that __alloc_pages() and __alloc_pages
> use different names for the same variable. This is an unnecessary
> complication so rename gfp_mask to gfp in prepare_alloc_pages() so the
> name is consistent.
> 
> No functional change.

Hm, I have some doubts.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/page_alloc.c | 25 +++++++++++++------------
>  1 file changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 00b67c47ad87..f0c1d74ead6f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4914,15 +4914,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	return page;
>  }
>  
> -static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> +static inline bool prepare_alloc_pages(gfp_t gfp, unsigned int order,
>  		int preferred_nid, nodemask_t *nodemask,
>  		struct alloc_context *ac, gfp_t *alloc_gfp,
>  		unsigned int *alloc_flags)
>  {
> -	ac->highest_zoneidx = gfp_zone(gfp_mask);
> -	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> +	gfp &= gfp_allowed_mask;
> +	*alloc_gfp = gfp;
> +

...

> @@ -4980,8 +4983,6 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
>  		return NULL;
>  	}
>  
> -	gfp &= gfp_allowed_mask;
> -	alloc_gfp = gfp;
>  	if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
>  			&alloc_gfp, &alloc_flags))
>  		return NULL;

As a result, "gfp" doesn't have the restrictions by gfp_allowed_mask applied,
only alloc_gfp does. But in case we go to slowpath, before
going there we throw away the current alloc_gfp:

    alloc_gfp = current_gfp_context(gfp);
    ...
    page = __alloc_pages_slowpath(alloc_gfp, ...);

So we lost the gfp_allowed_mask restrictions here?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] mm/page_alloc: Rename alloced to allocated
  2021-03-12 15:43 ` [PATCH 2/7] mm/page_alloc: Rename alloced to allocated Mel Gorman
@ 2021-03-19 16:22   ` Vlastimil Babka
  0 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2021-03-19 16:22 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig,
	Alexander Duyck, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS

On 3/12/21 4:43 PM, Mel Gorman wrote:
> Review feedback of the bulk allocator twice found problems with "alloced"
> being a counter for pages allocated. The naming was based on the API name
> "alloc" and was based on the idea that verbal communication about malloc
> tends to use the fake word "malloced" instead of the fake word mallocated.
> To be consistent, this preparation patch renames alloced to allocated
> in rmqueue_bulk so the bulk allocator and per-cpu allocator use similar
> names when the bulk allocator is introduced.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f0c1d74ead6f..880b1d6368bd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2904,7 +2904,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  			unsigned long count, struct list_head *list,
>  			int migratetype, unsigned int alloc_flags)
>  {
> -	int i, alloced = 0;
> +	int i, allocated = 0;
>  
>  	spin_lock(&zone->lock);
>  	for (i = 0; i < count; ++i) {
> @@ -2927,7 +2927,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 * pages are ordered properly.
>  		 */
>  		list_add_tail(&page->lru, list);
> -		alloced++;
> +		allocated++;
>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>  					      -(1 << order));
> @@ -2936,12 +2936,12 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  	/*
>  	 * i pages were removed from the buddy list even if some leak due
>  	 * to check_pcp_refill failing so adjust NR_FREE_PAGES based
> -	 * on i. Do not confuse with 'alloced' which is the number of
> +	 * on i. Do not confuse with 'allocated' which is the number of
>  	 * pages added to the pcp list.
>  	 */
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>  	spin_unlock(&zone->lock);
> -	return alloced;
> +	return allocated;
>  }
>  
>  #ifdef CONFIG_NUMA
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/7] mm/page_alloc: Move gfp_allowed_mask enforcement to prepare_alloc_pages
  2021-03-19 16:11   ` Vlastimil Babka
@ 2021-03-19 17:49     ` Mel Gorman
  0 siblings, 0 replies; 28+ messages in thread
From: Mel Gorman @ 2021-03-19 17:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer,
	Christoph Hellwig, Alexander Duyck, Matthew Wilcox, LKML,
	Linux-Net, Linux-MM, Linux-NFS

On Fri, Mar 19, 2021 at 05:11:39PM +0100, Vlastimil Babka wrote:
> On 3/12/21 4:43 PM, Mel Gorman wrote:
> > __alloc_pages updates GFP flags to enforce what flags are allowed
> > during a global context such as booting or suspend. This patch moves the
> > enforcement from __alloc_pages to prepare_alloc_pages so the code can be
> > shared between the single page allocator and a new bulk page allocator.
> > 
> > When moving, it is obvious that __alloc_pages() and __alloc_pages
> > use different names for the same variable. This is an unnecessary
> > complication so rename gfp_mask to gfp in prepare_alloc_pages() so the
> > name is consistent.
> > 
> > No functional change.
> 
> Hm, I have some doubts.
> 

And you were right, I'll drop the patch and apply the same mask to the
bulk allocator.
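
Presumably something along these lines in the bulk allocator itself; a
sketch against the function as quoted later in the thread, not the
actual follow-up patch:

	 	if (WARN_ON_ONCE(nr_pages <= 0))
	 		return 0;

	+	/* enforce the same restrictions as the single-page path */
	+	gfp &= gfp_allowed_mask;
	+
	 	if (nr_pages == 1)
	 		goto failed;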

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/7] mm/page_alloc: Add a bulk page allocator
  2021-03-12 15:43 ` [PATCH 3/7] mm/page_alloc: Add a bulk page allocator Mel Gorman
@ 2021-03-19 18:18   ` Vlastimil Babka
  2021-03-22  8:30     ` Mel Gorman
  0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2021-03-19 18:18 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig,
	Alexander Duyck, Matthew Wilcox, LKML, Linux-Net, Linux-MM,
	Linux-NFS

On 3/12/21 4:43 PM, Mel Gorman wrote:
> This patch adds a new page allocator interface via alloc_pages_bulk,
> and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> to be allocated and added to a list. They can be freed in bulk using
> free_pages_bulk().
> 
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch
> if necessary.
> 
> Note that this implementation is not very efficient and could be improved
> but it would require refactoring. The intent is to make it available early
> to determine what semantics are required by different callers. Once the
> full semantics are nailed down, it can be refactored.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Although maybe premature, if it changes significantly due to the users'
performance feedback, let's see :)

Some nits below:

...

> @@ -4963,6 +4978,107 @@ static inline bool prepare_alloc_pages(gfp_t gfp, unsigned int order,
>  	return true;
>  }
>  
> +/*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + *
> + * Returns the number of pages allocated.
> + */
> +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> +			nodemask_t *nodemask, int nr_pages,
> +			struct list_head *alloc_list)
> +{
> +	struct page *page;
> +	unsigned long flags;
> +	struct zone *zone;
> +	struct zoneref *z;
> +	struct per_cpu_pages *pcp;
> +	struct list_head *pcp_list;
> +	struct alloc_context ac;
> +	gfp_t alloc_gfp;
> +	unsigned int alloc_flags;
> +	int allocated = 0;
> +
> +	if (WARN_ON_ONCE(nr_pages <= 0))
> +		return 0;
> +
> +	if (nr_pages == 1)
> +		goto failed;
> +
> +	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> +	if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
> +	&alloc_gfp, &alloc_flags))

Unusual indentation here.

> +		return 0;
> +	gfp = alloc_gfp;
> +
> +	/* Find an allowed local zone that meets the high watermark. */
> +	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> +		unsigned long mark;
> +
> +		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> +		    !__cpuset_zone_allowed(zone, gfp)) {
> +			continue;
> +		}
> +
> +		if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> +		    zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
> +			goto failed;
> +		}
> +
> +		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
> +		if (zone_watermark_fast(zone, 0,  mark,
> +				zonelist_zone_idx(ac.preferred_zoneref),
> +				alloc_flags, gfp)) {
> +			break;
> +		}
> +	}
> +	if (!zone)
> +		return 0;

Why not also "goto failed;" here?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/7] mm/page_alloc: Add a bulk page allocator
  2021-03-19 18:18   ` Vlastimil Babka
@ 2021-03-22  8:30     ` Mel Gorman
  0 siblings, 0 replies; 28+ messages in thread
From: Mel Gorman @ 2021-03-22  8:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer,
	Christoph Hellwig, Alexander Duyck, Matthew Wilcox, LKML,
	Linux-Net, Linux-MM, Linux-NFS

On Fri, Mar 19, 2021 at 07:18:32PM +0100, Vlastimil Babka wrote:
> On 3/12/21 4:43 PM, Mel Gorman wrote:
> > This patch adds a new page allocator interface via alloc_pages_bulk,
> > and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> > to be allocated and added to a list. They can be freed in bulk using
> > free_pages_bulk().
> > 
> > The API is not guaranteed to return the requested number of pages and
> > may fail if the preferred allocation zone has limited free memory, the
> > cpuset changes during the allocation or page debugging decides to fail
> > an allocation. It's up to the caller to request more pages in batch
> > if necessary.
> > 
> > Note that this implementation is not very efficient and could be improved
> > but it would require refactoring. The intent is to make it available early
> > to determine what semantics are required by different callers. Once the
> > full semantics are nailed down, it can be refactored.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Although maybe premature, if it changes significantly due to the users'
> performance feedback, let's see :)
> 

Indeed. The next version will have no users so that Jesper and Chuck
can check whether an array-based or LRU-based (list) version is better.
There were also bugs, such as broken accounting of stats, that had to
be fixed, which increases the overhead.
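
For reference, the two shapes being compared would look roughly like
this; the array-based prototype below is a hypothetical illustration,
not something posted in this thread:

	/* list-based, as in this series */
	int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, nodemask_t *nodemask,
			       int nr_pages, struct list_head *alloc_list);

	/* hypothetical array-based variant */
	int __alloc_pages_bulk_array(gfp_t gfp, int preferred_nid, nodemask_t *nodemask,
				     int nr_pages, struct page **page_array);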

> Some nits below:
> 
> ...
> 
> > @@ -4963,6 +4978,107 @@ static inline bool prepare_alloc_pages(gfp_t gfp, unsigned int order,
> >  	return true;
> >  }
> >  
> > +/*
> > + * This is a batched version of the page allocator that attempts to
> > + * allocate nr_pages quickly from the preferred zone and add them to list.
> > + *
> > + * Returns the number of pages allocated.
> > + */
> > +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> > +			nodemask_t *nodemask, int nr_pages,
> > +			struct list_head *alloc_list)
> > +{
> > +	struct page *page;
> > +	unsigned long flags;
> > +	struct zone *zone;
> > +	struct zoneref *z;
> > +	struct per_cpu_pages *pcp;
> > +	struct list_head *pcp_list;
> > +	struct alloc_context ac;
> > +	gfp_t alloc_gfp;
> > +	unsigned int alloc_flags;
> > +	int allocated = 0;
> > +
> > +	if (WARN_ON_ONCE(nr_pages <= 0))
> > +		return 0;
> > +
> > +	if (nr_pages == 1)
> > +		goto failed;
> > +
> > +	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> > +	if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
> > +	&alloc_gfp, &alloc_flags))
> 
> Unusual indentation here.
> 

Fixed

> > +		return 0;
> > +	gfp = alloc_gfp;
> > +
> > +	/* Find an allowed local zone that meets the high watermark. */
> > +	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> > +		unsigned long mark;
> > +
> > +		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> > +		    !__cpuset_zone_allowed(zone, gfp)) {
> > +			continue;
> > +		}
> > +
> > +		if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> > +		    zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
> > +			goto failed;
> > +		}
> > +
> > +		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
> > +		if (zone_watermark_fast(zone, 0,  mark,
> > +				zonelist_zone_idx(ac.preferred_zoneref),
> > +				alloc_flags, gfp)) {
> > +			break;
> > +		}
> > +	}
> > +	if (!zone)
> > +		return 0;
> 
> Why not also "goto failed;" here?

Good question. When this was first written, it was because the zone
search was almost certainly going to fail for the normal allocator as
well, and it was expected that callers would prefer to fail fast over
blocking. Now we know that sunrpc can sleep on a failing allocation,
so it would be better to enter the single page allocator and reclaim
pages instead of "sleep and hope for the best".

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2021-03-22  8:31 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-12 15:43 [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
2021-03-12 15:43 ` [PATCH 1/7] mm/page_alloc: Move gfp_allowed_mask enforcement to prepare_alloc_pages Mel Gorman
2021-03-19 16:11   ` Vlastimil Babka
2021-03-19 17:49     ` Mel Gorman
2021-03-12 15:43 ` [PATCH 2/7] mm/page_alloc: Rename alloced to allocated Mel Gorman
2021-03-19 16:22   ` Vlastimil Babka
2021-03-12 15:43 ` [PATCH 3/7] mm/page_alloc: Add a bulk page allocator Mel Gorman
2021-03-19 18:18   ` Vlastimil Babka
2021-03-22  8:30     ` Mel Gorman
2021-03-12 15:43 ` [PATCH 4/7] SUNRPC: Set rq_page_end differently Mel Gorman
2021-03-12 15:43 ` [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator Mel Gorman
2021-03-12 18:44   ` Alexander Duyck
2021-03-12 19:22     ` Chuck Lever III
2021-03-13 12:59       ` Mel Gorman
2021-03-12 15:43 ` [PATCH 6/7] net: page_pool: refactor dma_map into own function page_pool_dma_map Mel Gorman
2021-03-12 15:43 ` [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path Mel Gorman
2021-03-12 19:44   ` Alexander Duyck
2021-03-12 20:05     ` Ilias Apalodimas
2021-03-15 13:39       ` Jesper Dangaard Brouer
2021-03-13 13:30     ` Mel Gorman
2021-03-15  8:40       ` Jesper Dangaard Brouer
2021-03-15 19:33         ` [PATCH mel-git] Followup: Update [PATCH 7/7] in Mel's series Jesper Dangaard Brouer
2021-03-15 19:33           ` [PATCH mel-git] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
2021-03-17 16:31 ` [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Alexander Lobakin
2021-03-17 16:38   ` Jesper Dangaard Brouer
2021-03-17 16:52     ` Alexander Lobakin
2021-03-17 17:19       ` Jesper Dangaard Brouer
2021-03-17 22:25         ` Alexander Lobakin
