linux-mm.kvack.org archive mirror
* [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc
@ 2021-02-24 10:26 Mel Gorman
  2021-02-24 10:26 ` [PATCH 1/3] SUNRPC: Set rq_page_end differently Mel Gorman
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Mel Gorman @ 2021-02-24 10:26 UTC (permalink / raw)
  To: Chuck Lever, Jesper Dangaard Brouer
  Cc: LKML, Linux-Net, Linux-MM, Linux-NFS, Mel Gorman

This is a prototype series that introduces a bulk order-0 page allocator
with sunrpc being the first user. The implementation is not particularly
efficient and the intention is to iron out what the semantics of the API
should be. That said, sunrpc was reported to have reduced allocation
latency when refilling a pool.

As a side-note, while the implementation could be more efficient, it
would require fairly deep surgery in numerous places. The lock scope would
need to be significantly reduced, particularly as vmstat, per-cpu and the
buddy allocator have different locking protocols that overlap -- e.g. all
partially depend on irqs being disabled at various points. Secondly,
the core of the allocator deals with single pages whereas both the bulk
allocator and per-cpu allocator operate in batches. All of that has to
be reconciled with all the existing users and their constraints (memory
offline, CMA and cpusets being the trickiest).

In terms of semantics required by new users, my preference is that a pair
of patches be applied -- the first which adds the required semantic to
the bulk allocator and the second which adds the new user.

Patch 1 of this series is a cleanup to sunrpc, it could be merged
	separately but is included here for convenience.

Patch 2 is the prototype bulk allocator

Patch 3 is the sunrpc user. Chuck also has a patch which further caches
	pages but is not included in this series. It's not directly
	related to the bulk allocator and as it caches pages, it might
	have other concerns (e.g. does it need a shrinker?)

This has only been lightly tested on a low-end NFS server. It did not break
but would benefit from an evaluation to see how much, if any, the headline
performance changes. The biggest concern is that a light test case showed
that there are a *lot* of bulk requests for 1 page, which get delegated to
the normal allocator.  The same criteria should apply to any other users.

 include/linux/gfp.h   |  13 +++++
 mm/page_alloc.c       | 113 +++++++++++++++++++++++++++++++++++++++++-
 net/sunrpc/svc_xprt.c |  47 ++++++++++++------
 3 files changed, 157 insertions(+), 16 deletions(-)

-- 
2.26.2




* [PATCH 1/3] SUNRPC: Set rq_page_end differently
  2021-02-24 10:26 [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Mel Gorman
@ 2021-02-24 10:26 ` Mel Gorman
  2021-02-24 10:26 ` [PATCH 2/3] mm, page_alloc: Add a bulk page allocator Mel Gorman
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2021-02-24 10:26 UTC (permalink / raw)
  To: Chuck Lever, Jesper Dangaard Brouer
  Cc: LKML, Linux-Net, Linux-MM, Linux-NFS, Mel Gorman

From: Chuck Lever <chuck.lever@oracle.com>

Refactor:

I'm about to use the loop variable @i for something else.

As far as the "i++" is concerned, that is a post-increment. The
value of @i is not used subsequently, so the increment operator
is unnecessary and can be removed.

Also note that nfsd_read_actor() was renamed nfsd_splice_actor()
by commit cf8208d0eabd ("sendfile: convert nfsd to
splice_direct_to_actor()").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 net/sunrpc/svc_xprt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index dcc50ae54550..cfa7e4776d0e 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -667,8 +667,8 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 			}
 			rqstp->rq_pages[i] = p;
 		}
-	rqstp->rq_page_end = &rqstp->rq_pages[i];
-	rqstp->rq_pages[i++] = NULL; /* this might be seen in nfs_read_actor */
+	rqstp->rq_page_end = &rqstp->rq_pages[pages];
+	rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
 
 	/* Make arg->head point to first page and arg->pages point to rest */
 	arg = &rqstp->rq_arg;
-- 
2.26.2




* [PATCH 2/3] mm, page_alloc: Add a bulk page allocator
  2021-02-24 10:26 [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Mel Gorman
  2021-02-24 10:26 ` [PATCH 1/3] SUNRPC: Set rq_page_end differently Mel Gorman
@ 2021-02-24 10:26 ` Mel Gorman
  2021-02-24 10:26 ` [PATCH 3/3] SUNRPC: Refresh rq_pages using " Mel Gorman
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2021-02-24 10:26 UTC (permalink / raw)
  To: Chuck Lever, Jesper Dangaard Brouer
  Cc: LKML, Linux-Net, Linux-MM, Linux-NFS, Mel Gorman

This patch adds a new page allocator interface, alloc_pages_bulk and
__alloc_pages_bulk_nodemask. A caller requests a number of pages to be
allocated and added to a list. They can be freed in bulk using
free_pages_bulk().

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation, or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch
if necessary.
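
To make the caller-side contract concrete, here is a minimal userspace
sketch. This is a hypothetical illustration only, not kernel code:
mock_alloc_bulk() and fill_pool() are invented names standing in for
alloc_pages_bulk() and a caller that retries for the shortfall, on the
assumption that each call may return fewer pages than requested.

```c
#include <assert.h>

/* Fake per-call limit, simulating a nearly-empty per-cpu free list. */
#define PER_CALL_CAP 3

/* Stand-in for alloc_pages_bulk(): returns up to nr_requested items,
 * possibly fewer, mirroring the "no guarantee" semantic of the API. */
static int mock_alloc_bulk(int nr_requested, int *out, int *next_id)
{
	int alloced = 0;

	while (alloced < nr_requested && alloced < PER_CALL_CAP)
		out[alloced++] = (*next_id)++;
	return alloced;	/* may be < nr_requested */
}

/* Caller-side pattern: keep requesting the remaining shortfall. */
static int fill_pool(int *pool, int needed)
{
	int next_id = 0, have = 0;

	while (have < needed) {
		int got = mock_alloc_bulk(needed - have, pool + have, &next_id);

		if (got == 0)
			break;	/* a real caller would sleep and retry */
		have += got;
	}
	return have;
}
```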

Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/gfp.h |  13 +++++
 mm/page_alloc.c     | 113 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6e479e9c48ce..f2a1ae4b95b9 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -501,6 +501,10 @@ static inline int arch_make_page_accessible(struct page *page)
 }
 #endif
 
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+				nodemask_t *nodemask, int nr_pages,
+				struct list_head *list);
+
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 							nodemask_t *nodemask);
@@ -511,6 +515,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
 	return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
 }
 
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct list_head *list)
+{
+	return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
+							nr_pages, list);
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
@@ -580,6 +592,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
+extern void free_pages_bulk(struct list_head *list);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d5b6f7..a36344bc1045 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4435,6 +4435,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
 	}
 }
 
+/* Drop reference counts and free order-0 pages from a list. */
+void free_pages_bulk(struct list_head *list)
+{
+	struct page *page, *next;
+
+	list_for_each_entry_safe(page, next, list, lru) {
+		trace_mm_page_free_batched(page);
+		if (put_page_testzero(page)) {
+			list_del(&page->lru);
+			__free_pages_ok(page, 0, FPI_NONE);
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(free_pages_bulk);
+
 static inline unsigned int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
@@ -4918,6 +4933,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct alloc_context *ac, gfp_t *alloc_mask,
 		unsigned int *alloc_flags)
 {
+	gfp_mask &= gfp_allowed_mask;
+	*alloc_mask = gfp_mask;
+
 	ac->highest_zoneidx = gfp_zone(gfp_mask);
 	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
 	ac->nodemask = nodemask;
@@ -4959,6 +4977,99 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	return true;
 }
 
+/*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ */
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+			nodemask_t *nodemask, int nr_pages,
+			struct list_head *alloc_list)
+{
+	struct page *page;
+	unsigned long flags;
+	struct zone *zone;
+	struct zoneref *z;
+	struct per_cpu_pages *pcp;
+	struct list_head *pcp_list;
+	struct alloc_context ac;
+	gfp_t alloc_mask;
+	unsigned int alloc_flags;
+	int alloced = 0;
+
+	if (nr_pages == 1)
+		goto failed;
+
+	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+	if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
+		return 0;
+	gfp_mask = alloc_mask;
+
+	/* Find an allowed local zone that meets the high watermark. */
+	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
+		unsigned long mark;
+
+		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
+		    !__cpuset_zone_allowed(zone, gfp_mask)) {
+			continue;
+		}
+
+		if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
+		    zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
+			goto failed;
+		}
+
+		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
+		if (zone_watermark_fast(zone, 0,  mark,
+				zonelist_zone_idx(ac.preferred_zoneref),
+				alloc_flags, gfp_mask)) {
+			break;
+		}
+	}
+	if (!zone)
+		return 0;
+
+	/* Attempt the batch allocation */
+	local_irq_save(flags);
+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
+	pcp_list = &pcp->lists[ac.migratetype];
+
+	while (alloced < nr_pages) {
+		page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
+								pcp, pcp_list);
+		if (!page)
+			break;
+
+		prep_new_page(page, 0, gfp_mask, 0);
+		list_add(&page->lru, alloc_list);
+		alloced++;
+	}
+
+	if (!alloced)
+		goto failed_irq;
+
+	if (alloced) {
+		__count_zid_vm_events(PGALLOC, zone_idx(zone), alloced);
+		zone_statistics(zone, zone);
+	}
+
+	local_irq_restore(flags);
+
+	return alloced;
+
+failed_irq:
+	local_irq_restore(flags);
+
+failed:
+	page = __alloc_pages_nodemask(gfp_mask, 0, preferred_nid, nodemask);
+	if (page) {
+		alloced++;
+		list_add(&page->lru, alloc_list);
+	}
+
+	return alloced;
+}
+EXPORT_SYMBOL_GPL(__alloc_pages_bulk_nodemask);
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -4980,8 +5091,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 		return NULL;
 	}
 
-	gfp_mask &= gfp_allowed_mask;
-	alloc_mask = gfp_mask;
 	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
 		return NULL;
 
-- 
2.26.2




* [PATCH 3/3] SUNRPC: Refresh rq_pages using a bulk page allocator
  2021-02-24 10:26 [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Mel Gorman
  2021-02-24 10:26 ` [PATCH 1/3] SUNRPC: Set rq_page_end differently Mel Gorman
  2021-02-24 10:26 ` [PATCH 2/3] mm, page_alloc: Add a bulk page allocator Mel Gorman
@ 2021-02-24 10:26 ` Mel Gorman
  2021-02-24 11:27 ` [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2021-02-24 10:26 UTC (permalink / raw)
  To: Chuck Lever, Jesper Dangaard Brouer
  Cc: LKML, Linux-Net, Linux-MM, Linux-NFS, Mel Gorman

From: Chuck Lever <chuck.lever@oracle.com>

Reduce the rate at which nfsd threads hammer on the page allocator.
This improves throughput scalability by enabling the threads to run
more independently of each other.
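
The refill pattern used by the patch (count the holes in rq_pages, bulk
allocate that many pages, then place one page per hole) can be sketched
in userspace C. This is an illustrative mock, not the actual sunrpc
code; integers stand in for struct page pointers and 0 marks an empty
slot:

```c
#include <assert.h>

/* First pass: count how many slots need a page, like the "needed"
 * loop in svc_alloc_arg(). */
static int count_needed(const int *slots, int n)
{
	int i, needed = 0;

	for (i = 0; i < n; i++)
		if (!slots[i])
			needed++;
	return needed;
}

/* Second pass: walk the array again, consuming one bulk-allocated
 * item per hole. Returns how many items were placed; if the bulk
 * allocation came up short, the caller retries for the remainder. */
static int refill_slots(int *slots, int n, const int *bulk, int bulk_count)
{
	int i, used = 0;

	for (i = 0; i < n && used < bulk_count; i++)
		if (!slots[i])
			slots[i] = bulk[used++];
	return used;
}
```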

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 net/sunrpc/svc_xprt.c | 43 +++++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index cfa7e4776d0e..38a8d6283801 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -642,11 +642,12 @@ static void svc_check_conn_limits(struct svc_serv *serv)
 static int svc_alloc_arg(struct svc_rqst *rqstp)
 {
 	struct svc_serv *serv = rqstp->rq_server;
+	unsigned long needed;
 	struct xdr_buf *arg;
+	struct page *page;
 	int pages;
 	int i;
 
-	/* now allocate needed pages.  If we get a failure, sleep briefly */
 	pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
 	if (pages > RPCSVC_MAXPAGES) {
 		pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
@@ -654,19 +655,28 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 		/* use as many pages as possible */
 		pages = RPCSVC_MAXPAGES;
 	}
-	for (i = 0; i < pages ; i++)
-		while (rqstp->rq_pages[i] == NULL) {
-			struct page *p = alloc_page(GFP_KERNEL);
-			if (!p) {
-				set_current_state(TASK_INTERRUPTIBLE);
-				if (signalled() || kthread_should_stop()) {
-					set_current_state(TASK_RUNNING);
-					return -EINTR;
-				}
-				schedule_timeout(msecs_to_jiffies(500));
+
+	for (needed = 0, i = 0; i < pages ; i++)
+		if (!rqstp->rq_pages[i])
+			needed++;
+	if (needed) {
+		LIST_HEAD(list);
+
+retry:
+		alloc_pages_bulk(GFP_KERNEL, needed, &list);
+		for (i = 0; i < pages; i++) {
+			if (!rqstp->rq_pages[i]) {
+				page = list_first_entry_or_null(&list,
+								struct page,
+								lru);
+				if (unlikely(!page))
+					goto empty_list;
+				list_del(&page->lru);
+				rqstp->rq_pages[i] = page;
+				needed--;
 			}
-			rqstp->rq_pages[i] = p;
 		}
+	}
 	rqstp->rq_page_end = &rqstp->rq_pages[pages];
 	rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
 
@@ -681,6 +691,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 	arg->len = (pages-1)*PAGE_SIZE;
 	arg->tail[0].iov_len = 0;
 	return 0;
+
+empty_list:
+	set_current_state(TASK_INTERRUPTIBLE);
+	if (signalled() || kthread_should_stop()) {
+		set_current_state(TASK_RUNNING);
+		return -EINTR;
+	}
+	schedule_timeout(msecs_to_jiffies(500));
+	goto retry;
 }
 
 static bool
-- 
2.26.2




* Re: [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc
  2021-02-24 10:26 [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Mel Gorman
                   ` (2 preceding siblings ...)
  2021-02-24 10:26 ` [PATCH 3/3] SUNRPC: Refresh rq_pages using " Mel Gorman
@ 2021-02-24 11:27 ` Jesper Dangaard Brouer
  2021-02-24 11:55   ` Mel Gorman
  2021-02-24 13:20 ` Chuck Lever
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2021-02-24 11:27 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Chuck Lever, LKML, Linux-Net, Linux-MM, Linux-NFS, brouer

On Wed, 24 Feb 2021 10:26:00 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> This is a prototype series that introduces a bulk order-0 page allocator
> with sunrpc being the first user. The implementation is not particularly
> efficient and the intention is to iron out what the semantics of the API
> should be. That said, sunrpc was reported to have reduced allocation
> latency when refilling a pool.

I also have a use-case in page_pool, and I've been testing with the
earlier patches, results are here[1]

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org

Awesome to see this newer patchset! Thanks a lot for working on this!
I'll run some new tests based on this.

> As a side-note, while the implementation could be more efficient, it
> would require fairly deep surgery in numerous places. The lock scope would
> need to be significantly reduced, particularly as vmstat, per-cpu and the
> buddy allocator have different locking protocols that overlap -- e.g. all
> partially depend on irqs being disabled at various points. Secondly,
> the core of the allocator deals with single pages whereas both the bulk
> allocator and per-cpu allocator operate in batches. All of that has to
> be reconciled with all the existing users and their constraints (memory
> offline, CMA and cpusets being the trickiest).

As you can see in[1], I'm getting a significant speedup from this.  I
guess that the cost of finding the "zone" is higher than I expected, as
this is basically what we/you amortize over the bulk.

 
> In terms of semantics required by new users, my preference is that a pair
> of patches be applied -- the first which adds the required semantic to
> the bulk allocator and the second which adds the new user.
> 
> Patch 1 of this series is a cleanup to sunrpc, it could be merged
> 	separately but is included here for convenience.
> 
> Patch 2 is the prototype bulk allocator
> 
> Patch 3 is the sunrpc user. Chuck also has a patch which further caches
> 	pages but is not included in this series. It's not directly
> 	related to the bulk allocator and as it caches pages, it might
> 	have other concerns (e.g. does it need a shrinker?)
> 
> This has only been lightly tested on a low-end NFS server. It did not break
> but would benefit from an evaluation to see how much, if any, the headline
> performance changes. The biggest concern is that a light test case showed
> that there are a *lot* of bulk requests for 1 page, which get delegated to
> the normal allocator.  The same criteria should apply to any other users.

If you change local_irq_save(flags) to local_irq_disable() then you can
likely get better performance for 1 page requests via this API.  This
limits the API to be used in cases where IRQs are enabled (which is
most cases).  (For my use-case I will not do 1 page requests).


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer




* Re: [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc
  2021-02-24 11:27 ` [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Jesper Dangaard Brouer
@ 2021-02-24 11:55   ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2021-02-24 11:55 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Chuck Lever, LKML, Linux-Net, Linux-MM, Linux-NFS

On Wed, Feb 24, 2021 at 12:27:23PM +0100, Jesper Dangaard Brouer wrote:
> On Wed, 24 Feb 2021 10:26:00 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > This is a prototype series that introduces a bulk order-0 page allocator
> > with sunrpc being the first user. The implementation is not particularly
> > efficient and the intention is to iron out what the semantics of the API
> > should be. That said, sunrpc was reported to have reduced allocation
> > latency when refilling a pool.
> 
> I also have a use-case in page_pool, and I've been testing with the
> earlier patches, results are here[1]
> 
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
> 
> Awesome to see this newer patchset! Thanks a lot for working on this!
> I'll run some new tests based on this.
> 

Thanks and if they get finalised, a patch on top for review would be
nice with the results included in the changelog. Obviously any change
that would need to be made to the allocator would happen first.

> > As a side-note, while the implementation could be more efficient, it
> > would require fairly deep surgery in numerous places. The lock scope would
> > need to be significantly reduced, particularly as vmstat, per-cpu and the
> > buddy allocator have different locking protocols that overlap -- e.g. all
> > partially depend on irqs being disabled at various points. Secondly,
> > the core of the allocator deals with single pages whereas both the bulk
> > allocator and per-cpu allocator operate in batches. All of that has to
> > be reconciled with all the existing users and their constraints (memory
> > offline, CMA and cpusets being the trickiest).
> 
> As you can see in[1], I'm getting a significant speedup from this.  I
> guess that the cost of finding the "zone" is higher than I expected, as
> this is basically what we/you amortize over the bulk.
> 

The obvious goal would be that if a refactoring did happen that the
performance would be at least neutral but hopefully improved.

> > In terms of semantics required by new users, my preference is that a pair
> > of patches be applied -- the first which adds the required semantic to
> > the bulk allocator and the second which adds the new user.
> > 
> > Patch 1 of this series is a cleanup to sunrpc, it could be merged
> > 	separately but is included here for convenience.
> > 
> > Patch 2 is the prototype bulk allocator
> > 
> > Patch 3 is the sunrpc user. Chuck also has a patch which further caches
> > 	pages but is not included in this series. It's not directly
> > 	related to the bulk allocator and as it caches pages, it might
> > 	have other concerns (e.g. does it need a shrinker?)
> > 
> > This has only been lightly tested on a low-end NFS server. It did not break
> > but would benefit from an evaluation to see how much, if any, the headline
> > performance changes. The biggest concern is that a light test case showed
> > that there are a *lot* of bulk requests for 1 page, which get delegated to
> > the normal allocator.  The same criteria should apply to any other users.
> 
> If you change local_irq_save(flags) to local_irq_disable() then you can
> likely get better performance for 1 page requests via this API.  This
> limits the API to be used in cases where IRQs are enabled (which is
> most cases).  (For my use-case I will not do 1 page requests).
> 

I do not want to constrain the API to being IRQ-only prematurely. An
obvious alternative use case is the SLUB allocation path when a high-order
allocation fails. It's known that if the SLUB order is reduced that it has
an impact on hackbench communicating over sockets.  It would be interesting
to see what happens if order-0 pages are bulk allocated when s->min == 0
and that can be called from a blocking context. Tricky to test but could
be fudged by forcing all high-order allocations to fail when s->min ==
0 to evaluate the worst case scenario.  In addition, it would constrain
any potential refactoring if the lower levels have to choose between
local_irq_disable() vs local_irq_save() depending on the caller context.

-- 
Mel Gorman
SUSE Labs



* Re: [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc
  2021-02-24 10:26 [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Mel Gorman
                   ` (3 preceding siblings ...)
  2021-02-24 11:27 ` [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Jesper Dangaard Brouer
@ 2021-02-24 13:20 ` Chuck Lever
  2021-02-24 18:56 ` [PATCH RFC net-next 0/3] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
  2021-03-01 13:29 ` [PATCH RFC V2 net-next 0/2] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
  6 siblings, 0 replies; 21+ messages in thread
From: Chuck Lever @ 2021-02-24 13:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jesper Dangaard Brouer, LKML, Linux-Net, Linux-MM,
	Linux NFS Mailing List



> On Feb 24, 2021, at 5:26 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> This is a prototype series that introduces a bulk order-0 page allocator
> with sunrpc being the first user. The implementation is not particularly
> efficient and the intention is to iron out what the semantics of the API
> should be. That said, sunrpc was reported to have reduced allocation
> latency when refilling a pool.
> 
> As a side-note, while the implementation could be more efficient, it
> would require fairly deep surgery in numerous places. The lock scope would
> need to be significantly reduced, particularly as vmstat, per-cpu and the
> buddy allocator have different locking protocols that overlap -- e.g. all
> partially depend on irqs being disabled at various points. Secondly,
> the core of the allocator deals with single pages whereas both the bulk
> allocator and per-cpu allocator operate in batches. All of that has to
> be reconciled with all the existing users and their constraints (memory
> offline, CMA and cpusets being the trickiest).
> 
> In terms of semantics required by new users, my preference is that a pair
> of patches be applied -- the first which adds the required semantic to
> the bulk allocator and the second which adds the new user.
> 
> Patch 1 of this series is a cleanup to sunrpc, it could be merged
> 	separately but is included here for convenience.
> 
> Patch 2 is the prototype bulk allocator
> 
> Patch 3 is the sunrpc user. Chuck also has a patch which further caches
> 	pages but is not included in this series. It's not directly
> 	related to the bulk allocator and as it caches pages, it might
> 	have other concerns (e.g. does it need a shrinker?)
> 
> This has only been lightly tested on a low-end NFS server. It did not break
> but would benefit from an evaluation to see how much, if any, the headline
> performance changes. The biggest concern is that a light test case showed
> that there are a *lot* of bulk requests for 1 page, which get delegated to
> the normal allocator.  The same criteria should apply to any other users.
> 
> include/linux/gfp.h   |  13 +++++
> mm/page_alloc.c       | 113 +++++++++++++++++++++++++++++++++++++++++-
> net/sunrpc/svc_xprt.c |  47 ++++++++++++------
> 3 files changed, 157 insertions(+), 16 deletions(-)

Hi Mel-

Thank you for carrying the torch!


--
Chuck Lever






* [PATCH RFC net-next 0/3] Use bulk order-0 page allocator API for page_pool
  2021-02-24 10:26 [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Mel Gorman
                   ` (4 preceding siblings ...)
  2021-02-24 13:20 ` Chuck Lever
@ 2021-02-24 18:56 ` Jesper Dangaard Brouer
  2021-02-24 18:56   ` [PATCH RFC net-next 1/3] net: page_pool: refactor dma_map into own function page_pool_dma_map Jesper Dangaard Brouer
                     ` (2 more replies)
  2021-03-01 13:29 ` [PATCH RFC V2 net-next 0/2] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
  6 siblings, 3 replies; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2021-02-24 18:56 UTC (permalink / raw)
  To: Mel Gorman, linux-mm
  Cc: Jesper Dangaard Brouer, chuck.lever, netdev, linux-nfs, linux-kernel

This is a followup to Mel Gorman's patchset:
 - Message-Id: <20210224102603.19524-1-mgorman@techsingularity.net>
 - https://lore.kernel.org/netdev/20210224102603.19524-1-mgorman@techsingularity.net/

Showing page_pool usage of the API for alloc_pages_bulk().

---

Jesper Dangaard Brouer (3):
      net: page_pool: refactor dma_map into own function page_pool_dma_map
      net: page_pool: use alloc_pages_bulk in refill code path
      mm: make zone->free_area[order] access faster


 include/linux/mmzone.h |    6 ++-
 net/core/page_pool.c   |   96 +++++++++++++++++++++++++++++++-----------------
 2 files changed, 65 insertions(+), 37 deletions(-)

--




* [PATCH RFC net-next 1/3] net: page_pool: refactor dma_map into own function page_pool_dma_map
  2021-02-24 18:56 ` [PATCH RFC net-next 0/3] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
@ 2021-02-24 18:56   ` Jesper Dangaard Brouer
  2021-02-24 20:11     ` Ilias Apalodimas
  2021-02-24 18:56   ` [PATCH RFC net-next 2/3] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
  2021-02-24 18:56   ` [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster Jesper Dangaard Brouer
  2 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2021-02-24 18:56 UTC (permalink / raw)
  To: Mel Gorman, linux-mm
  Cc: Jesper Dangaard Brouer, chuck.lever, netdev, linux-nfs, linux-kernel

In preparation for the next patch, move the DMA mapping into its own
function, as this will make it easier to follow the changes.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/page_pool.c |   49 +++++++++++++++++++++++++++++--------------------
 1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..50d52aa6fbeb 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -180,6 +180,31 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
 					 pool->p.dma_dir);
 }
 
+static struct page * page_pool_dma_map(struct page_pool *pool,
+				       struct page *page)
+{
+	dma_addr_t dma;
+
+	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
+	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
+	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
+	 * This mapping is kept for lifetime of page, until leaving pool.
+	 */
+	dma = dma_map_page_attrs(pool->p.dev, page, 0,
+				 (PAGE_SIZE << pool->p.order),
+				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
+	if (dma_mapping_error(pool->p.dev, dma)) {
+		put_page(page);
+		return NULL;
+	}
+	page->dma_addr = dma;
+
+	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
+
+	return page;
+}
+
 /* slow path */
 noinline
 static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
@@ -187,7 +212,6 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 {
 	struct page *page;
 	gfp_t gfp = _gfp;
-	dma_addr_t dma;
 
 	/* We could always set __GFP_COMP, and avoid this branch, as
 	 * prep_new_page() can handle order-0 with __GFP_COMP.
@@ -211,27 +235,12 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	if (!page)
 		return NULL;
 
-	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
-		goto skip_dma_map;
-
-	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
-	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
-	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
-	 * This mapping is kept for lifetime of page, until leaving pool.
-	 */
-	dma = dma_map_page_attrs(pool->p.dev, page, 0,
-				 (PAGE_SIZE << pool->p.order),
-				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
-	if (dma_mapping_error(pool->p.dev, dma)) {
-		put_page(page);
-		return NULL;
+	if (pool->p.flags & PP_FLAG_DMA_MAP) {
+		page = page_pool_dma_map(pool, page);
+		if (!page)
+			return NULL;
 	}
-	page->dma_addr = dma;
-
-	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
-		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
 
-skip_dma_map:
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
 





* [PATCH RFC net-next 2/3] net: page_pool: use alloc_pages_bulk in refill code path
  2021-02-24 18:56 ` [PATCH RFC net-next 0/3] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
  2021-02-24 18:56   ` [PATCH RFC net-next 1/3] net: page_pool: refactor dma_map into own function page_pool_dma_map Jesper Dangaard Brouer
@ 2021-02-24 18:56   ` Jesper Dangaard Brouer
  2021-02-24 20:15     ` Ilias Apalodimas
  2021-02-24 18:56   ` [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster Jesper Dangaard Brouer
  2 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2021-02-24 18:56 UTC (permalink / raw)
  To: Mel Gorman, linux-mm
  Cc: Jesper Dangaard Brouer, chuck.lever, netdev, linux-nfs, linux-kernel

There are cases where the page_pool needs to refill with pages from the
page allocator. Some workloads cause the page_pool to release pages
instead of recycling them.

For these workloads it can improve performance to bulk-allocate pages
from the page allocator to refill the alloc cache.

In an XDP-redirect workload with a 100G mlx5 driver (which uses
page_pool), xdp_frame packets are redirected into a veth that does
XDP_PASS to create an SKB from the xdp_frame; the SKB path cannot
return the page to the page_pool. In this case, we saw[1] an
improvement of 18.8% from using the alloc_pages_bulk API
(3,677,958 pps -> 4,368,926 pps).

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/page_pool.c |   65 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 41 insertions(+), 24 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 50d52aa6fbeb..e0ae95fc59f0 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -210,44 +210,61 @@ noinline
 static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 						 gfp_t _gfp)
 {
-	struct page *page;
+	const int bulk = PP_ALLOC_CACHE_REFILL;
+	struct page *page, *next, *first_page;
+	unsigned int pp_flags = pool->p.flags;
+	unsigned int pp_order = pool->p.order;
+	int pp_nid = pool->p.nid;
+	LIST_HEAD(page_list);
 	gfp_t gfp = _gfp;
 
-	/* We could always set __GFP_COMP, and avoid this branch, as
-	 * prep_new_page() can handle order-0 with __GFP_COMP.
-	 */
-	if (pool->p.order)
+	/* Don't support bulk alloc for high-order pages */
+	if (unlikely(pp_order)) {
 		gfp |= __GFP_COMP;
+		first_page = alloc_pages_node(pp_nid, gfp, pp_order);
+		if (unlikely(!first_page))
+			return NULL;
+		goto out;
+	}
 
-	/* FUTURE development:
-	 *
-	 * Current slow-path essentially falls back to single page
-	 * allocations, which doesn't improve performance.  This code
-	 * need bulk allocation support from the page allocator code.
-	 */
-
-	/* Cache was empty, do real allocation */
-#ifdef CONFIG_NUMA
-	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
-#else
-	page = alloc_pages(gfp, pool->p.order);
-#endif
-	if (!page)
+	if (unlikely(!__alloc_pages_bulk_nodemask(gfp, pp_nid, NULL,
+						  bulk, &page_list)))
 		return NULL;
 
+	/* First page is extracted and returned to caller */
+	first_page = list_first_entry(&page_list, struct page, lru);
+	list_del(&first_page->lru);
+
+	/* Remaining pages store in alloc.cache */
+	list_for_each_entry_safe(page, next, &page_list, lru) {
+		list_del(&page->lru);
+		if (pp_flags & PP_FLAG_DMA_MAP) {
+			page = page_pool_dma_map(pool, page);
+			if (!page)
+				continue;
+		}
+		if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
+			pool->alloc.cache[pool->alloc.count++] = page;
+			pool->pages_state_hold_cnt++;
+			trace_page_pool_state_hold(pool, page,
+						   pool->pages_state_hold_cnt);
+		} else {
+			put_page(page);
+		}
+	}
+out:
 	if (pool->p.flags & PP_FLAG_DMA_MAP) {
-		page = page_pool_dma_map(pool, page);
-		if (!page)
+		first_page = page_pool_dma_map(pool, first_page);
+		if (!first_page)
 			return NULL;
 	}
 
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
-
-	trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
+	trace_page_pool_state_hold(pool, first_page, pool->pages_state_hold_cnt);
 
 	/* When page just alloc'ed is should/must have refcnt 1. */
-	return page;
+	return first_page;
 }
 
 /* For using page_pool replace: alloc_pages() API calls, but provide




^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster
  2021-02-24 18:56 ` [PATCH RFC net-next 0/3] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
  2021-02-24 18:56   ` [PATCH RFC net-next 1/3] net: page_pool: refactor dma_map into own function page_pool_dma_map Jesper Dangaard Brouer
  2021-02-24 18:56   ` [PATCH RFC net-next 2/3] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
@ 2021-02-24 18:56   ` Jesper Dangaard Brouer
  2021-02-25 11:28     ` Mel Gorman
  2 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2021-02-24 18:56 UTC (permalink / raw)
  To: Mel Gorman, linux-mm
  Cc: Jesper Dangaard Brouer, chuck.lever, netdev, linux-nfs, linux-kernel

Avoid multiplication (imul) operations when accessing:
 zone->free_area[order].nr_free

This was really tricky to find. I was puzzled why perf reported that
rmqueue_bulk was spending 44% of its time in an imul operation:

       │     del_page_from_free_list():
 44,54 │ e2:   imul   $0x58,%rax,%rax

The compiler generated this operation because struct free_area has a
size of 88 bytes, or 0x58 hex. As 88 is not a power of two, the
compiler cannot use a shift operation and instead chooses the more
expensive imul to find the offset into the free_area[] array.

The patch aligns struct free_area to a cache line, which causes the
compiler to avoid the imul operation. The imul operation itself is
fairly fast on modern Intel CPUs, but a shift is still cheaper. To
help the fast path that decrements 'nr_free', the member 'nr_free' is
also moved to be the first element, which saves one 'add' operation.

Looking up instruction latencies, this exchanges a 3-cycle imul for a
1-cycle shl, saving 2 cycles. It trades some space to do this.

Used: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/mmzone.h |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b593316bff3d..4d83201717e1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -93,10 +93,12 @@ extern int page_group_by_mobility_disabled;
 #define get_pageblock_migratetype(page)					\
 	get_pfnblock_flags_mask(page, page_to_pfn(page), MIGRATETYPE_MASK)
 
+/* Aligned struct to make zone->free_area[order] access faster */
 struct free_area {
-	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
-};
+	unsigned long		__pad_to_align_free_list;
+	struct list_head	free_list[MIGRATE_TYPES];
+}  ____cacheline_aligned_in_smp;
 
 static inline struct page *get_page_from_free_area(struct free_area *area,
 					    int migratetype)




^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC net-next 1/3] net: page_pool: refactor dma_map into own function page_pool_dma_map
  2021-02-24 18:56   ` [PATCH RFC net-next 1/3] net: page_pool: refactor dma_map into own function page_pool_dma_map Jesper Dangaard Brouer
@ 2021-02-24 20:11     ` Ilias Apalodimas
  0 siblings, 0 replies; 21+ messages in thread
From: Ilias Apalodimas @ 2021-02-24 20:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, linux-mm, chuck.lever, netdev, linux-nfs, linux-kernel

On Wed, Feb 24, 2021 at 07:56:41PM +0100, Jesper Dangaard Brouer wrote:
> In preparation for next patch, move the dma mapping into its own
> function, as this will make it easier to follow the changes.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  net/core/page_pool.c |   49 +++++++++++++++++++++++++++++--------------------
>  1 file changed, 29 insertions(+), 20 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ad8b0707af04..50d52aa6fbeb 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -180,6 +180,31 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>  					 pool->p.dma_dir);
>  }
>  
> +static struct page * page_pool_dma_map(struct page_pool *pool,
> +				       struct page *page)
> +{

Why return a struct page* ?
boolean maybe?

> +	dma_addr_t dma;
> +
> +	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> +	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
> +	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
> +	 * This mapping is kept for lifetime of page, until leaving pool.
> +	 */
> +	dma = dma_map_page_attrs(pool->p.dev, page, 0,
> +				 (PAGE_SIZE << pool->p.order),
> +				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> +	if (dma_mapping_error(pool->p.dev, dma)) {
> +		put_page(page);

This is a bit confusing when reading it.
The function name suggests it should try to map the page and report a
yes/no, instead of calling put_page() as well.
Can't we explicitly ask the caller to call put_page() if the mapping failed?

A clear example is in patch 2/3, where on first read I was convinced
there was a memory leak.

> +		return NULL;
> +	}
> +	page->dma_addr = dma;
> +
> +	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> +		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> +
> +	return page;
> +}
> +
>  /* slow path */
>  noinline
>  static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> @@ -187,7 +212,6 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>  {
>  	struct page *page;
>  	gfp_t gfp = _gfp;
> -	dma_addr_t dma;
>  
>  	/* We could always set __GFP_COMP, and avoid this branch, as
>  	 * prep_new_page() can handle order-0 with __GFP_COMP.
> @@ -211,27 +235,12 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>  	if (!page)
>  		return NULL;
>  
> -	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
> -		goto skip_dma_map;
> -
> -	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> -	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
> -	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
> -	 * This mapping is kept for lifetime of page, until leaving pool.
> -	 */
> -	dma = dma_map_page_attrs(pool->p.dev, page, 0,
> -				 (PAGE_SIZE << pool->p.order),
> -				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> -	if (dma_mapping_error(pool->p.dev, dma)) {
> -		put_page(page);
> -		return NULL;
> +	if (pool->p.flags & PP_FLAG_DMA_MAP) {
> +		page = page_pool_dma_map(pool, page);
> +		if (!page)
> +			return NULL;
>  	}
> -	page->dma_addr = dma;
> -
> -	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> -		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>  
> -skip_dma_map:
>  	/* Track how many pages are held 'in-flight' */
>  	pool->pages_state_hold_cnt++;
>  
> 
> 

Thanks!
/Ilias


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC net-next 2/3] net: page_pool: use alloc_pages_bulk in refill code path
  2021-02-24 18:56   ` [PATCH RFC net-next 2/3] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
@ 2021-02-24 20:15     ` Ilias Apalodimas
  2021-02-26 14:31       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 21+ messages in thread
From: Ilias Apalodimas @ 2021-02-24 20:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, linux-mm, chuck.lever, netdev, linux-nfs, linux-kernel

Hi Jesper, 

On Wed, Feb 24, 2021 at 07:56:46PM +0100, Jesper Dangaard Brouer wrote:
> There are cases where the page_pool need to refill with pages from the
> page allocator. Some workloads cause the page_pool to release pages
> instead of recycling these pages.
> 
> For these workload it can improve performance to bulk alloc pages from
> the page-allocator to refill the alloc cache.
> 
> For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> redirecting xdp_frame packets into a veth, that does XDP_PASS to create
> an SKB from the xdp_frame, which then cannot return the page to the
> page_pool. In this case, we saw[1] an improvement of 18.8% from using
> the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
> 
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

[...]

> +	/* Remaining pages store in alloc.cache */
> +	list_for_each_entry_safe(page, next, &page_list, lru) {
> +		list_del(&page->lru);
> +		if (pp_flags & PP_FLAG_DMA_MAP) {
> +			page = page_pool_dma_map(pool, page);
> +			if (!page)

As I commented on the previous patch, i'd prefer the put_page() here to be
explicitly called, instead of hiding in the page_pool_dma_map()

> +				continue;
> +		}
> +		if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
> +			pool->alloc.cache[pool->alloc.count++] = page;
> +			pool->pages_state_hold_cnt++;
> +			trace_page_pool_state_hold(pool, page,
> +						   pool->pages_state_hold_cnt);
> +		} else {
> +			put_page(page);
> +		}
> +	}
> +out:
>  	if (pool->p.flags & PP_FLAG_DMA_MAP) {
> -		page = page_pool_dma_map(pool, page);
> -		if (!page)
> +		first_page = page_pool_dma_map(pool, first_page);
> +		if (!first_page)
>  			return NULL;
>  	}
>  
[...]

Cheers
/Ilias


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster
  2021-02-24 18:56   ` [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster Jesper Dangaard Brouer
@ 2021-02-25 11:28     ` Mel Gorman
  2021-02-25 15:16       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2021-02-25 11:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, chuck.lever, netdev, linux-nfs, linux-kernel

As a side-note, I didn't pick up the other patches as there is review
feedback and I didn't have strong opinions either way. Patch 3 is curious
though; it probably should be split out and sent separately, but still:

On Wed, Feb 24, 2021 at 07:56:51PM +0100, Jesper Dangaard Brouer wrote:
> Avoid multiplication (imul) operations when accessing:
>  zone->free_area[order].nr_free
> 
> This was really tricky to find. I was puzzled why perf reported that
> rmqueue_bulk was using 44% of the time in an imul operation:
> 
>        │     del_page_from_free_list():
>  44,54 │ e2:   imul   $0x58,%rax,%rax
> 
> This operation was generated (by compiler) because the struct free_area have
> size 88 bytes or 0x58 hex. The compiler cannot find a shift operation to use
> and instead choose to use a more expensive imul, to find the offset into the
> array free_area[].
> 
> The patch align struct free_area to a cache-line, which cause the
> compiler avoid the imul operation. The imul operation is very fast on
> modern Intel CPUs. To help fast-path that decrement 'nr_free' move the
> member 'nr_free' to be first element, which saves one 'add' operation.
> 
> Looking up instruction latency this exchange a 3-cycle imul with a
> 1-cycle shl, saving 2-cycles. It does trade some space to do this.
> 
> Used: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
> 

I'm having some trouble parsing this and matching it to the patch itself.

First off, on my system (x86-64), the size of struct free_area is 72,
not 88 bytes. For either size, cache-aligning the structure is a big
increase in the struct size.

struct free_area {
        struct list_head           free_list[4];         /*     0    64 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        long unsigned int          nr_free;              /*    64     8 */

        /* size: 72, cachelines: 2, members: 2 */
        /* last cacheline: 8 bytes */
};

Are there other patches in the tree? What does pahole say?

With gcc-9, I'm also not seeing the imul instruction outputted like you
described in rmqueue_pcplist which inlines rmqueue_bulk. At the point
where it calls get_page_from_free_area, it's using shl for the page list
operation. This might be a compiler glitch but given that free_area is a
different size, I'm less certain and wonder if something else is going on.

Finally, moving nr_free and cache-aligning the structure will make the
start of each free_list cache-aligned because of its location in
struct zone, so what purpose does __pad_to_align_free_list serve?

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster
  2021-02-25 11:28     ` Mel Gorman
@ 2021-02-25 15:16       ` Jesper Dangaard Brouer
  2021-02-25 15:38         ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2021-02-25 15:16 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, chuck.lever, netdev, linux-nfs, linux-kernel, brouer

On Thu, 25 Feb 2021 11:28:49 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> As a side-node, I didn't pick up the other patches as there is review
> feedback and I didn't have strong opinions either way. Patch 3 is curious
> though, it probably should be split out and sent separetly but still;
> 
> On Wed, Feb 24, 2021 at 07:56:51PM +0100, Jesper Dangaard Brouer wrote:
> > Avoid multiplication (imul) operations when accessing:
> >  zone->free_area[order].nr_free
> > 
> > This was really tricky to find. I was puzzled why perf reported that
> > rmqueue_bulk was using 44% of the time in an imul operation:
> > 
> >        │     del_page_from_free_list():
> >  44,54 │ e2:   imul   $0x58,%rax,%rax
> > 
> > This operation was generated (by compiler) because the struct free_area have
> > size 88 bytes or 0x58 hex. The compiler cannot find a shift operation to use
> > and instead choose to use a more expensive imul, to find the offset into the
> > array free_area[].
> > 
> > The patch align struct free_area to a cache-line, which cause the
> > compiler avoid the imul operation. The imul operation is very fast on
> > modern Intel CPUs. To help fast-path that decrement 'nr_free' move the
> > member 'nr_free' to be first element, which saves one 'add' operation.
> > 
> > Looking up instruction latency this exchange a 3-cycle imul with a
> > 1-cycle shl, saving 2-cycles. It does trade some space to do this.
> > 
> > Used: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
> >   
> 
> I'm having some trouble parsing this and matching it to the patch itself.
> 
> First off, on my system (x86-64), the size of struct free area is 72,
> not 88 bytes. For either size, cache-aligning the structure is a big
> increase in the struct size.

Yes, the increase in size is big: 40 bytes for struct free_area in my
case and 56 bytes in yours.  The real problem is that this is
multiplied by 11 (MAX_ORDER) and by the number of zone structs
(is it 5?).  Thus, 56*11*5 = 3080 bytes.

Thus, I'm not sure it is worth it!  I'm only saving 2 cycles on
something that depends on the compiler generating specific code.  The
compiler can easily change, and "fix" this on its own in a later
release, and then we are just wasting memory.

I did notice this imul happens 45 times in mm/page_alloc.o with this
offset 0x58, but most of those are likely not on the hot path.

> struct free_area {
>         struct list_head           free_list[4];         /*     0    64 */
>         /* --- cacheline 1 boundary (64 bytes) --- */
>         long unsigned int          nr_free;              /*    64     8 */
> 
>         /* size: 72, cachelines: 2, members: 2 */
>         /* last cacheline: 8 bytes */
> };
> 
> Are there other patches in the tree? What does pahole say?

The size of struct free_area varies based on config settings, as the
free_list[] array size is determined by MIGRATE_TYPES, which on my
system is 5, and not 4 as on your system.

  struct list_head	free_list[MIGRATE_TYPES];

CONFIG_CMA and CONFIG_MEMORY_ISOLATION each increase MIGRATE_TYPES by
one.  Thus, the array size can vary from 4 to 6.


> With gcc-9, I'm also not seeing the imul instruction outputted like you
> described in rmqueue_pcplist which inlines rmqueue_bulk. At the point
> where it calls get_page_from_free_area, it's using shl for the page list
> operation. This might be a compiler glitch but given that free_area is a
> different size, I'm less certain and wonder if something else is going on.

I think it is the size variation.

> Finally, moving nr_free to the end and cache aligning it will make the
> started of each free_list cache-aligned because of its location in the
> struct zone so what purpose does __pad_to_align_free_list serve?

The purpose of __pad_to_align_free_list is that struct list_head is
16 bytes, so I wanted to align free_list to 16 given that we have
already wasted the space.

Notice I added some more detailed notes in[1]:

 [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-optimisations

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster
  2021-02-25 15:16       ` Jesper Dangaard Brouer
@ 2021-02-25 15:38         ` Mel Gorman
  2021-02-26 14:34           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2021-02-25 15:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, chuck.lever, netdev, linux-nfs, linux-kernel

On Thu, Feb 25, 2021 at 04:16:33PM +0100, Jesper Dangaard Brouer wrote:
> > On Wed, Feb 24, 2021 at 07:56:51PM +0100, Jesper Dangaard Brouer wrote:
> > > Avoid multiplication (imul) operations when accessing:
> > >  zone->free_area[order].nr_free
> > > 
> > > This was really tricky to find. I was puzzled why perf reported that
> > > rmqueue_bulk was using 44% of the time in an imul operation:
> > > 
> > >        │     del_page_from_free_list():
> > >  44,54 │ e2:   imul   $0x58,%rax,%rax
> > > 
> > > This operation was generated (by compiler) because the struct free_area have
> > > size 88 bytes or 0x58 hex. The compiler cannot find a shift operation to use
> > > and instead choose to use a more expensive imul, to find the offset into the
> > > array free_area[].
> > > 
> > > The patch align struct free_area to a cache-line, which cause the
> > > compiler avoid the imul operation. The imul operation is very fast on
> > > modern Intel CPUs. To help fast-path that decrement 'nr_free' move the
> > > member 'nr_free' to be first element, which saves one 'add' operation.
> > > 
> > > Looking up instruction latency this exchange a 3-cycle imul with a
> > > 1-cycle shl, saving 2-cycles. It does trade some space to do this.
> > > 
> > > Used: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
> > >   
> > 
> > I'm having some trouble parsing this and matching it to the patch itself.
> > 
> > First off, on my system (x86-64), the size of struct free area is 72,
> > not 88 bytes. For either size, cache-aligning the structure is a big
> > increase in the struct size.
> 
> Yes, the increase in size is big. For the struct free_area 40 bytes for
> my case and 56 bytes for your case.  The real problem is that this is
> multiplied by 11 (MAX_ORDER) and multiplied by number of zone structs
> (is it 5?).  Thus, 56*11*5 = 3080 bytes.
> 
> Thus, I'm not sure it is worth it!  As I'm only saving 2-cycles, for
> something that depends on the compiler generating specific code.  And
> the compiler can easily change, and "fix" this on-its-own in a later
> release, and then we are just wasting memory.
> 
> I did notice this imul happens 45 times in mm/page_alloc.o, with this
> offset 0x58, but still this is likely not on hot-path.
> 

Yeah, I'm not convinced it's worth it. The benefit of 2 cycles is small and
it's config-dependent. While some configurations will benefit, others will
not, but the increased memory consumption is universal. I think there are
better ways to save 2 cycles in the page allocator and this seems like a
costly micro-optimisation.

> > <SNIP>
> >
> > With gcc-9, I'm also not seeing the imul instruction outputted like you
> > described in rmqueue_pcplist which inlines rmqueue_bulk. At the point
> > where it calls get_page_from_free_area, it's using shl for the page list
> > operation. This might be a compiler glitch but given that free_area is a
> > different size, I'm less certain and wonder if something else is going on.
> 
> I think it is the size variation.
> 

Yes.

> > Finally, moving nr_free to the end and cache aligning it will make the
> > started of each free_list cache-aligned because of its location in the
> > struct zone so what purpose does __pad_to_align_free_list serve?
> 
> The purpose of purpose of __pad_to_align_free_list is because struct
> list_head is 16 bytes, thus I wanted to align free_list to 16, given we
> already have wasted the space.
> 

Ok, that's fair enough but it's also somewhat of a micro-optimisation as
whether it helps or not depends on the architecture.

I don't think I'll pick this up, certainly in the context of the bulk
allocator but it's worth keeping in mind. It's an interesting corner case
at least.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC net-next 2/3] net: page_pool: use alloc_pages_bulk in refill code path
  2021-02-24 20:15     ` Ilias Apalodimas
@ 2021-02-26 14:31       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2021-02-26 14:31 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: Mel Gorman, linux-mm, chuck.lever, netdev, linux-nfs,
	linux-kernel, brouer

On Wed, 24 Feb 2021 22:15:22 +0200
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> Hi Jesper, 
> 
> On Wed, Feb 24, 2021 at 07:56:46PM +0100, Jesper Dangaard Brouer wrote:
> > There are cases where the page_pool need to refill with pages from the
> > page allocator. Some workloads cause the page_pool to release pages
> > instead of recycling these pages.
> > 
> > For these workload it can improve performance to bulk alloc pages from
> > the page-allocator to refill the alloc cache.
> > 
> > For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> > redirecting xdp_frame packets into a veth, that does XDP_PASS to create
> > an SKB from the xdp_frame, which then cannot return the page to the
> > page_pool. In this case, we saw[1] an improvement of 18.8% from using
> > the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
> > 
> > [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>  
> 
> [...]
> 
> > +	/* Remaining pages store in alloc.cache */
> > +	list_for_each_entry_safe(page, next, &page_list, lru) {
> > +		list_del(&page->lru);
> > +		if (pp_flags & PP_FLAG_DMA_MAP) {
> > +			page = page_pool_dma_map(pool, page);
> > +			if (!page)  
> 
> As I commented on the previous patch, i'd prefer the put_page() here to be
> explicitly called, instead of hiding in the page_pool_dma_map()

I fully agree.  I will fix up the code.

> > +				continue;
> > +		}
> > +		if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
> > +			pool->alloc.cache[pool->alloc.count++] = page;
> > +			pool->pages_state_hold_cnt++;
> > +			trace_page_pool_state_hold(pool, page,
> > +						   pool->pages_state_hold_cnt);
> > +		} else {
> > +			put_page(page);
> > +		}
> > +	}
> > +out:
> >  	if (pool->p.flags & PP_FLAG_DMA_MAP) {
> > -		page = page_pool_dma_map(pool, page);
> > -		if (!page)
> > +		first_page = page_pool_dma_map(pool, first_page);
> > +		if (!first_page)
> >  			return NULL;
> >  	}
> >    
> [...]

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster
  2021-02-25 15:38         ` Mel Gorman
@ 2021-02-26 14:34           ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2021-02-26 14:34 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, chuck.lever, netdev, linux-nfs, linux-kernel, brouer

On Thu, 25 Feb 2021 15:38:15 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Thu, Feb 25, 2021 at 04:16:33PM +0100, Jesper Dangaard Brouer wrote:
> > > On Wed, Feb 24, 2021 at 07:56:51PM +0100, Jesper Dangaard Brouer wrote:  
> > > > Avoid multiplication (imul) operations when accessing:
> > > >  zone->free_area[order].nr_free
> > > > 
> > > > This was really tricky to find. I was puzzled why perf reported that
> > > > rmqueue_bulk was using 44% of the time in an imul operation:
> > > > 
> > > >        │     del_page_from_free_list():
> > > >  44,54 │ e2:   imul   $0x58,%rax,%rax
> > > > 
> > > > This operation was generated (by compiler) because the struct free_area have
> > > > size 88 bytes or 0x58 hex. The compiler cannot find a shift operation to use
> > > > and instead choose to use a more expensive imul, to find the offset into the
> > > > array free_area[].
> > > > 
> > > > The patch align struct free_area to a cache-line, which cause the
> > > > compiler avoid the imul operation. The imul operation is very fast on
> > > > modern Intel CPUs. To help fast-path that decrement 'nr_free' move the
> > > > member 'nr_free' to be first element, which saves one 'add' operation.
> > > > 
> > > > Looking up instruction latency this exchange a 3-cycle imul with a
> > > > 1-cycle shl, saving 2-cycles. It does trade some space to do this.
> > > > 
> > > > Used: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
> > > >     
> > > 
> > > I'm having some trouble parsing this and matching it to the patch itself.
> > > 
> > > First off, on my system (x86-64), the size of struct free area is 72,
> > > not 88 bytes. For either size, cache-aligning the structure is a big
> > > increase in the struct size.  
> > 
> > Yes, the increase in size is big. For the struct free_area 40 bytes for
> > my case and 56 bytes for your case.  The real problem is that this is
> > multiplied by 11 (MAX_ORDER) and multiplied by number of zone structs
> > (is it 5?).  Thus, 56*11*5 = 3080 bytes.
> > 
> > Thus, I'm not sure it is worth it!  As I'm only saving 2-cycles, for
> > something that depends on the compiler generating specific code.  And
> > the compiler can easily change, and "fix" this on-its-own in a later
> > release, and then we are just wasting memory.
> > 
> > I did notice this imul happens 45 times in mm/page_alloc.o, with this
> > offset 0x58, but still this is likely not on hot-path.
> >   
> 
> Yeah, I'm not convinced it's worth it. The benefit of 2 cycles is small and
> it's config-dependant. While some configurations will benefit, others do
> not but the increased consumption is universal. I think there are better
> ways to save 2 cycles in the page allocator and this seems like a costly
> micro-optimisation.
> 
> > > <SNIP>
> > >
> > > With gcc-9, I'm also not seeing the imul instruction outputted like you
> > > described in rmqueue_pcplist which inlines rmqueue_bulk. At the point
> > > where it calls get_page_from_free_area, it's using shl for the page list
> > > operation. This might be a compiler glitch but given that free_area is a
> > > different size, I'm less certain and wonder if something else is going on.  
> > 
> > I think it is the size variation.
> >   
> 
> Yes.
> 
> > > Finally, moving nr_free to the end and cache aligning it will make the
> > > started of each free_list cache-aligned because of its location in the
> > > struct zone so what purpose does __pad_to_align_free_list serve?  
> > 
> > The purpose of purpose of __pad_to_align_free_list is because struct
> > list_head is 16 bytes, thus I wanted to align free_list to 16, given we
> > already have wasted the space.
> >   
> 
> Ok, that's fair enough, but it's also somewhat of a micro-optimisation,
> as whether it helps depends on the architecture.
> 
> I don't think I'll pick this up, certainly not in the context of the bulk
> allocator, but it's worth keeping in mind. It's an interesting corner case
> at least.

I fully agree. Let's drop this patch.
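For anyone following the size/alignment trade-off above, it can be sketched
in plain userspace C. All names and counts here are illustrative stand-ins,
not the real kernel definitions: padding a free_area-like struct to a
power-of-two size lets the compiler index the per-order array with a shift
instead of an imul, at the cost of extra memory per entry.

```c
#include <stddef.h>

/* Illustrative sketch only; NOT the kernel's struct definitions. */
struct list_head { void *next, *prev; };	/* 16 bytes on 64-bit */

#define MIGRATE_TYPES	5	/* illustrative value */
#define MAX_ORDER	11
#define NR_ZONES	5	/* illustrative value */

struct free_area_packed {
	struct list_head free_list[MIGRATE_TYPES];
	unsigned long nr_free;
};			/* 5*16 + 8 = 88 bytes: indexing needs an imul */

struct free_area_padded {
	struct list_head free_list[MIGRATE_TYPES];
	unsigned long nr_free;
} __attribute__((aligned(128)));  /* power-of-two size: index is a shift */

/* Total extra memory the padding costs across all orders and zones. */
static size_t padded_waste(void)
{
	return (sizeof(struct free_area_padded) -
		sizeof(struct free_area_packed)) * MAX_ORDER * NR_ZONES;
}
```

With these illustrative numbers the padding costs (128 - 88) * 11 * 5 =
2200 bytes; with the 56-byte growth mentioned in the mail the same
arithmetic gives the quoted 56 * 11 * 5 = 3080 bytes.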

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH RFC V2 net-next 0/2] Use bulk order-0 page allocator API for page_pool
  2021-02-24 10:26 [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Mel Gorman
                   ` (5 preceding siblings ...)
  2021-02-24 18:56 ` [PATCH RFC net-next 0/3] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
@ 2021-03-01 13:29 ` Jesper Dangaard Brouer
  2021-03-01 13:29   ` [PATCH RFC V2 net-next 1/2] net: page_pool: refactor dma_map into own function page_pool_dma_map Jesper Dangaard Brouer
  2021-03-01 13:29   ` [PATCH RFC V2 net-next 2/2] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
  6 siblings, 2 replies; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2021-03-01 13:29 UTC (permalink / raw)
  To: Mel Gorman, linux-mm
  Cc: Jesper Dangaard Brouer, chuck.lever, netdev, linux-nfs, linux-kernel

This is a followup to Mel Gorman's patchset:
 - Message-Id: <20210224102603.19524-1-mgorman@techsingularity.net>
 - https://lore.kernel.org/netdev/20210224102603.19524-1-mgorman@techsingularity.net/

Showing page_pool usage of the API for alloc_pages_bulk().

Maybe Mel Gorman will/can carry these patches?
(to keep it together with the alloc_pages_bulk API)

---

Jesper Dangaard Brouer (2):
      net: page_pool: refactor dma_map into own function page_pool_dma_map
      net: page_pool: use alloc_pages_bulk in refill code path


 net/core/page_pool.c |  102 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 63 insertions(+), 39 deletions(-)

--




* [PATCH RFC V2 net-next 1/2] net: page_pool: refactor dma_map into own function page_pool_dma_map
  2021-03-01 13:29 ` [PATCH RFC V2 net-next 0/2] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
@ 2021-03-01 13:29   ` Jesper Dangaard Brouer
  2021-03-01 13:29   ` [PATCH RFC V2 net-next 2/2] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
  1 sibling, 0 replies; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2021-03-01 13:29 UTC (permalink / raw)
  To: Mel Gorman, linux-mm
  Cc: Jesper Dangaard Brouer, chuck.lever, netdev, linux-nfs, linux-kernel

In preparation for the next patch, move the DMA mapping into its own
function, as this will make it easier to follow the changes.

V2: make page_pool_dma_map return boolean (Ilias)

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/page_pool.c |   45 ++++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..a26f2ceb6a87 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -180,14 +180,37 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
 					 pool->p.dma_dir);
 }
 
+static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
+{
+	dma_addr_t dma;
+
+	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
+	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
+	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
+	 * This mapping is kept for lifetime of page, until leaving pool.
+	 */
+	dma = dma_map_page_attrs(pool->p.dev, page, 0,
+				 (PAGE_SIZE << pool->p.order),
+				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
+	if (dma_mapping_error(pool->p.dev, dma))
+		return false;
+
+	page->dma_addr = dma;
+
+	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
+
+	return true;
+}
+
 /* slow path */
 noinline
 static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 						 gfp_t _gfp)
 {
+	unsigned int pp_flags = pool->p.flags;
 	struct page *page;
 	gfp_t gfp = _gfp;
-	dma_addr_t dma;
 
 	/* We could always set __GFP_COMP, and avoid this branch, as
 	 * prep_new_page() can handle order-0 with __GFP_COMP.
@@ -211,30 +234,14 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	if (!page)
 		return NULL;
 
-	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
-		goto skip_dma_map;
-
-	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
-	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
-	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
-	 * This mapping is kept for lifetime of page, until leaving pool.
-	 */
-	dma = dma_map_page_attrs(pool->p.dev, page, 0,
-				 (PAGE_SIZE << pool->p.order),
-				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
-	if (dma_mapping_error(pool->p.dev, dma)) {
+	if (pp_flags & PP_FLAG_DMA_MAP &&
+	    unlikely(!page_pool_dma_map(pool, page))) {
 		put_page(page);
 		return NULL;
 	}
-	page->dma_addr = dma;
 
-	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
-		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
-
-skip_dma_map:
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
-
 	trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
 
 	/* When page just alloc'ed is should/must have refcnt 1. */





* [PATCH RFC V2 net-next 2/2] net: page_pool: use alloc_pages_bulk in refill code path
  2021-03-01 13:29 ` [PATCH RFC V2 net-next 0/2] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
  2021-03-01 13:29   ` [PATCH RFC V2 net-next 1/2] net: page_pool: refactor dma_map into own function page_pool_dma_map Jesper Dangaard Brouer
@ 2021-03-01 13:29   ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2021-03-01 13:29 UTC (permalink / raw)
  To: Mel Gorman, linux-mm
  Cc: Jesper Dangaard Brouer, chuck.lever, netdev, linux-nfs, linux-kernel

There are cases where the page_pool needs to refill with pages from the
page allocator. Some workloads cause the page_pool to release pages
instead of recycling them.

For these workloads it can improve performance to bulk-allocate pages
from the page allocator to refill the alloc cache.

For an XDP-redirect workload with a 100G mlx5 driver (which uses
page_pool), redirecting xdp_frame packets into a veth that does XDP_PASS
to create an SKB from the xdp_frame, the page cannot be returned to the
page_pool. In this case, we saw[1] an improvement of 18.8% from using
the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
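The refill pattern this commit message describes can be sketched in
userspace C. All names here are illustrative; the actual patch below uses
__alloc_pages_bulk_nodemask() and the page_pool alloc cache.

```c
#include <stdlib.h>

#define CACHE_SIZE	128	/* illustrative cache capacity */
#define REFILL		64	/* illustrative bulk batch size */

struct pool {
	void *cache[CACHE_SIZE];
	int count;
};

/* Stand-in for a bulk allocator: fills 'out' and returns how many
 * objects it managed to allocate (may be fewer than requested). */
static int bulk_alloc(void **out, int want)
{
	int i;

	for (i = 0; i < want; i++) {
		out[i] = malloc(4096);
		if (!out[i])
			break;
	}
	return i;
}

/* Slow path: one bulk call refills the cache. The first object goes
 * straight to the caller; the remainder are stored for later
 * fast-path hits, so subsequent allocations avoid the allocator. */
static void *pool_alloc_slow(struct pool *p)
{
	void *batch[REFILL];
	int i, n = bulk_alloc(batch, REFILL);

	if (n == 0)
		return NULL;
	for (i = 1; i < n && p->count < CACHE_SIZE; i++)
		p->cache[p->count++] = batch[i];
	return batch[0];
}
```

The point of the design is that the allocator's fixed per-call overhead is
paid once per batch instead of once per page, which is where the measured
speedup comes from.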

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/page_pool.c |   63 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 23 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a26f2ceb6a87..567680bd91c4 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -208,44 +208,61 @@ noinline
 static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 						 gfp_t _gfp)
 {
+	const int bulk = PP_ALLOC_CACHE_REFILL;
+	struct page *page, *next, *first_page;
 	unsigned int pp_flags = pool->p.flags;
-	struct page *page;
+	unsigned int pp_order = pool->p.order;
+	int pp_nid = pool->p.nid;
+	LIST_HEAD(page_list);
 	gfp_t gfp = _gfp;
 
-	/* We could always set __GFP_COMP, and avoid this branch, as
-	 * prep_new_page() can handle order-0 with __GFP_COMP.
-	 */
-	if (pool->p.order)
+	/* Don't support bulk alloc for high-order pages */
+	if (unlikely(pp_order)) {
 		gfp |= __GFP_COMP;
+		first_page = alloc_pages_node(pp_nid, gfp, pp_order);
+		if (unlikely(!first_page))
+			return NULL;
+		goto out;
+	}
 
-	/* FUTURE development:
-	 *
-	 * Current slow-path essentially falls back to single page
-	 * allocations, which doesn't improve performance.  This code
-	 * need bulk allocation support from the page allocator code.
-	 */
-
-	/* Cache was empty, do real allocation */
-#ifdef CONFIG_NUMA
-	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
-#else
-	page = alloc_pages(gfp, pool->p.order);
-#endif
-	if (!page)
+	if (unlikely(!__alloc_pages_bulk_nodemask(gfp, pp_nid, NULL,
+						  bulk, &page_list)))
 		return NULL;
 
+	/* First page is extracted and returned to caller */
+	first_page = list_first_entry(&page_list, struct page, lru);
+	list_del(&first_page->lru);
+
+	/* Remaining pages store in alloc.cache */
+	list_for_each_entry_safe(page, next, &page_list, lru) {
+		list_del(&page->lru);
+		if (pp_flags & PP_FLAG_DMA_MAP &&
+		    unlikely(!page_pool_dma_map(pool, page))) {
+			put_page(page);
+			continue;
+		}
+		if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
+			pool->alloc.cache[pool->alloc.count++] = page;
+			pool->pages_state_hold_cnt++;
+			trace_page_pool_state_hold(pool, page,
+						   pool->pages_state_hold_cnt);
+		} else {
+			put_page(page);
+		}
+	}
+out:
 	if (pp_flags & PP_FLAG_DMA_MAP &&
-	    unlikely(!page_pool_dma_map(pool, page))) {
-		put_page(page);
+	    unlikely(!page_pool_dma_map(pool, first_page))) {
+		put_page(first_page);
 		return NULL;
 	}
 
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
-	trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
+	trace_page_pool_state_hold(pool, first_page, pool->pages_state_hold_cnt);
 
 	/* When page just alloc'ed is should/must have refcnt 1. */
-	return page;
+	return first_page;
 }
 
 /* For using page_pool replace: alloc_pages() API calls, but provide





end of thread, other threads:[~2021-03-01 13:31 UTC | newest]

Thread overview: 21+ messages
2021-02-24 10:26 [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Mel Gorman
2021-02-24 10:26 ` [PATCH 1/3] SUNRPC: Set rq_page_end differently Mel Gorman
2021-02-24 10:26 ` [PATCH 2/3] mm, page_alloc: Add a bulk page allocator Mel Gorman
2021-02-24 10:26 ` [PATCH 3/3] SUNRPC: Refresh rq_pages using " Mel Gorman
2021-02-24 11:27 ` [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Jesper Dangaard Brouer
2021-02-24 11:55   ` Mel Gorman
2021-02-24 13:20 ` Chuck Lever
2021-02-24 18:56 ` [PATCH RFC net-next 0/3] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
2021-02-24 18:56   ` [PATCH RFC net-next 1/3] net: page_pool: refactor dma_map into own function page_pool_dma_map Jesper Dangaard Brouer
2021-02-24 20:11     ` Ilias Apalodimas
2021-02-24 18:56   ` [PATCH RFC net-next 2/3] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
2021-02-24 20:15     ` Ilias Apalodimas
2021-02-26 14:31       ` Jesper Dangaard Brouer
2021-02-24 18:56   ` [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster Jesper Dangaard Brouer
2021-02-25 11:28     ` Mel Gorman
2021-02-25 15:16       ` Jesper Dangaard Brouer
2021-02-25 15:38         ` Mel Gorman
2021-02-26 14:34           ` Jesper Dangaard Brouer
2021-03-01 13:29 ` [PATCH RFC V2 net-next 0/2] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
2021-03-01 13:29   ` [PATCH RFC V2 net-next 1/2] net: page_pool: refactor dma_map into own function page_pool_dma_map Jesper Dangaard Brouer
2021-03-01 13:29   ` [PATCH RFC V2 net-next 2/2] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
