* [RFC PATCH 0/4] Fast noirq bulk page allocator v2r7
From: Mel Gorman @ 2017-01-09 16:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Linux Kernel, Linux-MM, Hillf Danton, Mel Gorman

The biggest changes are in the final patch. In v1 it was a rough, untested
prototype. This version corrects a number of issues, has been tested and
includes a comparison between bulk allocating pages and allocating them one
at a time. While there are still no in-kernel users, it is hoped that the
bulk API will convince network drivers to avoid using high-order allocations.
One
slight caveat is that there still may be an advantage to doing the coherent
setup on a high-order page instead of a list of order-0 pages. If that is the
case, it would need to be covered by Jesper's generic page pool allocator.

Changelog since v1
o Remove a scheduler point from the allocation path
o Finalise the bulk allocator and test it

This series is motivated by a conversation led by Jesper Dangaard Brouer at
the last LSF/MM proposing a generic page pool for DMA-coherent pages. Part of
his motivation was the overhead of allocating multiple order-0 pages, which
led some drivers to use high-order allocations and split them; splitting
can be very slow if high-order pages are unavailable. This long-overdue
series aims to show that raw bulk page allocation can be achieved relatively
easily without introducing a completely new allocator. A new generic page
pool allocator would then ideally focus on just the DMA-coherent part.

The first two patches in the series restructure the allocator such that
it's relatively easy to build a bulk page allocator. The third patch
alters the per-cpu allocator to make it exclusive to !irq requests. This
cuts allocation/free overhead by roughly 30% but it may not be noticeable
to anyone other than users of high-speed networks (I'm not one). The
fourth patch introduces a bulk page allocator with no in-kernel users as
an example for Jesper and others who want to build a page allocator for
DMA-coherent pages. It is hopefully relatively easy to modify this API
and the one core function to get the semantics they require. Note that
patch 3 is not required for patch 4, but it may be desirable if the bulk
allocations happen from !IRQ context.
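
To make the cost concrete, the sketch below (illustrative only, not code
from this series; the helper name is made up) shows the pattern a driver
that wants N order-0 pages is currently limited to: one allocator call per
page, each of which disables/enables IRQs and repeats the zonelist
preparation work.

#include <linux/gfp.h>
#include <linux/mm.h>

/* Allocate nr order-0 pages one call at a time (illustrative only) */
static int alloc_pages_one_at_a_time(struct page **pages, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		/* Every iteration toggles IRQs and redoes the setup work */
		pages[i] = alloc_page(GFP_KERNEL);
		if (!pages[i])
			goto out_free;
	}
	return 0;

out_free:
	while (i--)
		__free_page(pages[i]);
	return -ENOMEM;
}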

A comparison of the cost of allocating one page at a time on the vanilla
kernel vs the bulk allocator, which forces the per-cpu allocator to be
used from !IRQ context, is as follows:

pagealloc
                                          4.10.0-rc2                 4.10.0-rc2
                                             vanilla                  bulk-v2r7
Amean    alloc-odr0-1               302.85 (  0.00%)           106.62 ( 64.80%)
Amean    alloc-odr0-2               227.85 (  0.00%)            76.38 ( 66.48%)
Amean    alloc-odr0-4               191.23 (  0.00%)            57.23 ( 70.07%)
Amean    alloc-odr0-8               167.54 (  0.00%)            48.77 ( 70.89%)
Amean    alloc-odr0-16              158.54 (  0.00%)            45.38 ( 71.37%)
Amean    alloc-odr0-32              150.46 (  0.00%)            42.77 ( 71.57%)
Amean    alloc-odr0-64              148.23 (  0.00%)            41.00 ( 72.34%)
Amean    alloc-odr0-128             145.00 (  0.00%)            40.08 ( 72.36%)
Amean    alloc-odr0-256             157.00 (  0.00%)            56.00 ( 64.33%)
Amean    alloc-odr0-512             170.00 (  0.00%)            69.00 ( 59.41%)
Amean    alloc-odr0-1024            181.00 (  0.00%)            76.23 ( 57.88%)
Amean    alloc-odr0-2048            186.00 (  0.00%)            81.15 ( 56.37%)
Amean    alloc-odr0-4096            192.92 (  0.00%)            85.92 ( 55.46%)
Amean    alloc-odr0-8192            194.00 (  0.00%)            88.00 ( 54.64%)
Amean    alloc-odr0-16384           202.15 (  0.00%)            89.00 ( 55.97%)
Amean    free-odr0-1                154.92 (  0.00%)            55.69 ( 64.05%)
Amean    free-odr0-2                115.31 (  0.00%)            49.38 ( 57.17%)
Amean    free-odr0-4                 93.31 (  0.00%)            45.38 ( 51.36%)
Amean    free-odr0-8                 82.62 (  0.00%)            44.23 ( 46.46%)
Amean    free-odr0-16                79.00 (  0.00%)            45.00 ( 43.04%)
Amean    free-odr0-32                75.15 (  0.00%)            43.92 ( 41.56%)
Amean    free-odr0-64                74.00 (  0.00%)            43.00 ( 41.89%)
Amean    free-odr0-128               73.00 (  0.00%)            43.00 ( 41.10%)
Amean    free-odr0-256               91.00 (  0.00%)            60.46 ( 33.56%)
Amean    free-odr0-512              108.00 (  0.00%)            76.00 ( 29.63%)
Amean    free-odr0-1024             119.00 (  0.00%)            85.38 ( 28.25%)
Amean    free-odr0-2048             125.08 (  0.00%)            91.23 ( 27.06%)
Amean    free-odr0-4096             130.00 (  0.00%)            95.62 ( 26.45%)
Amean    free-odr0-8192             130.00 (  0.00%)            97.00 ( 25.38%)
Amean    free-odr0-16384            134.46 (  0.00%)            97.46 ( 27.52%)
Amean    total-odr0-1               457.77 (  0.00%)           162.31 ( 64.54%)
Amean    total-odr0-2               343.15 (  0.00%)           125.77 ( 63.35%)
Amean    total-odr0-4               284.54 (  0.00%)           102.62 ( 63.94%)
Amean    total-odr0-8               250.15 (  0.00%)            93.00 ( 62.82%)
Amean    total-odr0-16              237.54 (  0.00%)            90.38 ( 61.95%)
Amean    total-odr0-32              225.62 (  0.00%)            86.69 ( 61.58%)
Amean    total-odr0-64              222.23 (  0.00%)            84.00 ( 62.20%)
Amean    total-odr0-128             218.00 (  0.00%)            83.08 ( 61.89%)
Amean    total-odr0-256             248.00 (  0.00%)           116.46 ( 53.04%)
Amean    total-odr0-512             278.00 (  0.00%)           145.00 ( 47.84%)
Amean    total-odr0-1024            300.00 (  0.00%)           161.62 ( 46.13%)
Amean    total-odr0-2048            311.08 (  0.00%)           172.38 ( 44.58%)
Amean    total-odr0-4096            322.92 (  0.00%)           181.54 ( 43.78%)
Amean    total-odr0-8192            324.00 (  0.00%)           185.00 ( 42.90%)
Amean    total-odr0-16384           336.62 (  0.00%)           186.46 ( 44.61%)

This is roughly a 50-70% reduction in allocation cost and roughly a halving
of the overall cost of allocating and freeing batches of pages.

 include/linux/gfp.h |  24 ++++
 mm/page_alloc.c     | 353 +++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 278 insertions(+), 99 deletions(-)

-- 
2.11.0

* [PATCH 1/4] mm, page_alloc: Split buffered_rmqueue
From: Mel Gorman @ 2017-01-09 16:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Linux Kernel, Linux-MM, Hillf Danton, Mel Gorman

buffered_rmqueue removes a page from a given zone and uses the per-cpu
list for order-0 allocations. This is fine, but a hypothetical caller that
wanted multiple order-0 pages would have to disable/re-enable interrupts
multiple times. This patch restructures buffered_rmqueue such that it's
relatively easy to build a bulk order-0 page allocator. There is no
functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 126 ++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 77 insertions(+), 49 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c6d5f64feca..d8798583eaf8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2610,68 +2610,96 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 #endif
 }
 
+/* Remove a page from the per-cpu list, caller must protect the list */
+static struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
+			gfp_t gfp_flags, int migratetype, bool cold,
+			struct per_cpu_pages *pcp, struct list_head *list)
+{
+	struct page *page;
+
+	do {
+		if (list_empty(list)) {
+			pcp->count += rmqueue_bulk(zone, 0,
+					pcp->batch, list,
+					migratetype, cold);
+			if (unlikely(list_empty(list)))
+				return NULL;
+		}
+
+		if (cold)
+			page = list_last_entry(list, struct page, lru);
+		else
+			page = list_first_entry(list, struct page, lru);
+
+		list_del(&page->lru);
+		pcp->count--;
+	} while (check_new_pcp(page));
+
+	return page;
+}
+
+/* Lock and remove page from the per-cpu list */
+static struct page *rmqueue_pcplist(struct zone *preferred_zone,
+			struct zone *zone, unsigned int order,
+			gfp_t gfp_flags, int migratetype)
+{
+	struct per_cpu_pages *pcp;
+	struct list_head *list;
+	bool cold = ((gfp_flags & __GFP_COLD) != 0);
+	struct page *page;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
+	list = &pcp->lists[migratetype];
+	page = __rmqueue_pcplist(zone,  order, gfp_flags, migratetype,
+							cold, pcp, list);
+	if (page) {
+		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
+		zone_statistics(preferred_zone, zone, gfp_flags);
+	}
+	local_irq_restore(flags);
+	return page;
+}
+
 /*
  * Allocate a page from the given zone. Use pcplists for order-0 allocations.
  */
 static inline
-struct page *buffered_rmqueue(struct zone *preferred_zone,
+struct page *rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
 			gfp_t gfp_flags, unsigned int alloc_flags,
 			int migratetype)
 {
 	unsigned long flags;
 	struct page *page;
-	bool cold = ((gfp_flags & __GFP_COLD) != 0);
-
-	if (likely(order == 0)) {
-		struct per_cpu_pages *pcp;
-		struct list_head *list;
-
-		local_irq_save(flags);
-		do {
-			pcp = &this_cpu_ptr(zone->pageset)->pcp;
-			list = &pcp->lists[migratetype];
-			if (list_empty(list)) {
-				pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, list,
-						migratetype, cold);
-				if (unlikely(list_empty(list)))
-					goto failed;
-			}
-
-			if (cold)
-				page = list_last_entry(list, struct page, lru);
-			else
-				page = list_first_entry(list, struct page, lru);
 
-			list_del(&page->lru);
-			pcp->count--;
+	if (likely(order == 0))
+		return rmqueue_pcplist(preferred_zone, zone, order,
+				gfp_flags, migratetype);
 
-		} while (check_new_pcp(page));
-	} else {
-		/*
-		 * We most definitely don't want callers attempting to
-		 * allocate greater than order-1 page units with __GFP_NOFAIL.
-		 */
-		WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
-		spin_lock_irqsave(&zone->lock, flags);
+	/*
+	 * We most definitely don't want callers attempting to
+	 * allocate greater than order-1 page units with __GFP_NOFAIL.
+	 */
+	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
+	spin_lock_irqsave(&zone->lock, flags);
 
-		do {
-			page = NULL;
-			if (alloc_flags & ALLOC_HARDER) {
-				page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
-				if (page)
-					trace_mm_page_alloc_zone_locked(page, order, migratetype);
-			}
-			if (!page)
-				page = __rmqueue(zone, order, migratetype);
-		} while (page && check_new_pages(page, order));
-		spin_unlock(&zone->lock);
+	do {
+		page = NULL;
+		if (alloc_flags & ALLOC_HARDER) {
+			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+			if (page)
+				trace_mm_page_alloc_zone_locked(page, order, migratetype);
+		}
 		if (!page)
-			goto failed;
-		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_pcppage_migratetype(page));
-	}
+			page = __rmqueue(zone, order, migratetype);
+	} while (page && check_new_pages(page, order));
+	spin_unlock(&zone->lock);
+	if (!page)
+		goto failed;
+	__mod_zone_freepage_state(zone, -(1 << order),
+				  get_pcppage_migratetype(page));
 
 	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
@@ -2982,7 +3010,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 
 try_this_zone:
-		page = buffered_rmqueue(ac->preferred_zoneref->zone, zone, order,
+		page = rmqueue(ac->preferred_zoneref->zone, zone, order,
 				gfp_mask, alloc_flags, ac->migratetype);
 		if (page) {
 			prep_new_page(page, order, gfp_mask, alloc_flags);
-- 
2.11.0

* [PATCH 2/4] mm, page_alloc: Split alloc_pages_nodemask
From: Mel Gorman @ 2017-01-09 16:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Linux Kernel, Linux-MM, Hillf Danton, Mel Gorman

__alloc_pages_nodemask does a number of preparation steps that determine
which zones can be used for the allocation, depending on a variety of
factors. This is fine, but a hypothetical caller that wanted multiple
order-0 pages would have to repeat the preparation steps each time. This
patch restructures __alloc_pages_nodemask such that it's relatively easy
to build a bulk order-0 page allocator. There is no functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 81 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 49 insertions(+), 32 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d8798583eaf8..4a602b7f258d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3762,64 +3762,81 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-/*
- * This is the 'heart' of the zoned buddy allocator.
- */
-struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-			struct zonelist *zonelist, nodemask_t *nodemask)
+static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist, nodemask_t *nodemask,
+		struct alloc_context *ac, gfp_t *alloc_mask,
+		unsigned int *alloc_flags)
 {
-	struct page *page;
-	unsigned int cpuset_mems_cookie;
-	unsigned int alloc_flags = ALLOC_WMARK_LOW;
-	gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
-	struct alloc_context ac = {
-		.high_zoneidx = gfp_zone(gfp_mask),
-		.zonelist = zonelist,
-		.nodemask = nodemask,
-		.migratetype = gfpflags_to_migratetype(gfp_mask),
-	};
+	ac->high_zoneidx = gfp_zone(gfp_mask);
+	ac->zonelist = zonelist;
+	ac->nodemask = nodemask;
+	ac->migratetype = gfpflags_to_migratetype(gfp_mask);
 
 	if (cpusets_enabled()) {
-		alloc_mask |= __GFP_HARDWALL;
-		alloc_flags |= ALLOC_CPUSET;
-		if (!ac.nodemask)
-			ac.nodemask = &cpuset_current_mems_allowed;
+		*alloc_mask |= __GFP_HARDWALL;
+		*alloc_flags |= ALLOC_CPUSET;
+		if (!ac->nodemask)
+			ac->nodemask = &cpuset_current_mems_allowed;
 	}
 
-	gfp_mask &= gfp_allowed_mask;
-
 	lockdep_trace_alloc(gfp_mask);
 
 	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
 	if (should_fail_alloc_page(gfp_mask, order))
-		return NULL;
+		return false;
 
 	/*
 	 * Check the zones suitable for the gfp_mask contain at least one
 	 * valid zone. It's possible to have an empty zonelist as a result
 	 * of __GFP_THISNODE and a memoryless node
 	 */
-	if (unlikely(!zonelist->_zonerefs->zone))
-		return NULL;
+	if (unlikely(!ac->zonelist->_zonerefs->zone))
+		return false;
 
-	if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
-		alloc_flags |= ALLOC_CMA;
+	if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
+		*alloc_flags |= ALLOC_CMA;
 
-retry_cpuset:
-	cpuset_mems_cookie = read_mems_allowed_begin();
+	return true;
+}
 
+/* Determine whether to spread dirty pages and find the first usable zone */
+static inline void finalise_ac(gfp_t gfp_mask,
+		unsigned int order, struct alloc_context *ac)
+{
 	/* Dirty zone balancing only done in the fast path */
-	ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);
+	ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);
 
 	/*
 	 * The preferred zone is used for statistics but crucially it is
 	 * also used as the starting point for the zonelist iterator. It
 	 * may get reset for allocations that ignore memory policies.
 	 */
-	ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
-					ac.high_zoneidx, ac.nodemask);
+	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
+					ac->high_zoneidx, ac->nodemask);
+}
+
+/*
+ * This is the 'heart' of the zoned buddy allocator.
+ */
+struct page *
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	struct page *page;
+	unsigned int cpuset_mems_cookie;
+	unsigned int alloc_flags = ALLOC_WMARK_LOW;
+	gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
+	struct alloc_context ac = { };
+
+	gfp_mask &= gfp_allowed_mask;
+	if (!prepare_alloc_pages(gfp_mask, order, zonelist, nodemask, &ac, &alloc_mask, &alloc_flags))
+		return NULL;
+
+retry_cpuset:
+	cpuset_mems_cookie = read_mems_allowed_begin();
+
+	finalise_ac(gfp_mask, order, &ac);
 	if (!ac.preferred_zoneref) {
 		page = NULL;
 		goto no_zone;
-- 
2.11.0

* [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
From: Mel Gorman @ 2017-01-09 16:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Linux Kernel, Linux-MM, Hillf Danton, Mel Gorman

Many workloads that allocate pages are not handling an interrupt at the
time of allocation. However, because allocation requests may come from IRQ
context, IRQs must be disabled/enabled around every page allocation. This
cost is the bulk of the free path and a significant percentage of the
allocation path.

This patch alters the locking and checks such that only irq-safe allocation
requests use the per-cpu allocator. All others acquire the irq-safe
zone->lock and allocate from the buddy allocator. It relies on disabling
preemption to safely access the per-cpu structures. The approach could be
modified slightly to prevent soft IRQs from using the per-cpu allocator,
but it's not clear that would be worthwhile.

This modification may slow allocations from IRQ context slightly but the main
gain from the per-cpu allocator is that it scales better for allocations
from multiple contexts. There is an implicit assumption that intensive
allocations from IRQ contexts on multiple CPUs from a single NUMA node are
rare and that the vast majority of scaling issues are encountered in !IRQ
contexts such as page faulting. It's worth noting that this patch is not
required for a bulk page allocator but it significantly reduces the overhead.

The following are results from a page allocator micro-benchmark. Only
order-0 is interesting, as higher orders do not use the per-cpu allocator:

                                          4.10.0-rc2                 4.10.0-rc2
                                             vanilla               irqsafe-v1r5
Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)

This is the alloc, free and total overhead of allocating order-0 pages in
batches of 1 page up to 16384 pages. Avoiding the IRQ disable/enable work
massively reduces overhead. Allocation overhead is reduced by roughly
14-20% in most cases, the free path is reduced by 26-46%, and the total
reduction is significant.

Many users require zeroing of pages from the page allocator, which is the
bulk of the allocation cost. Hence, the impact on a basic page-faulting
benchmark is not that significant

                              4.10.0-rc2            4.10.0-rc2
                                 vanilla          irqsafe-v1r5
Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)

These results are from aim9 and the most notable outcome is that fault
variability is reduced by the patch. The headline improvement is small
because the overall fault cost (zeroing, page table insertion, etc.)
dominates relative to disabling/enabling IRQs in the per-cpu allocator.

Similarly, little benefit was seen on networking benchmarks, both over
localhost and between physical server/clients, where other costs dominate.
It's possible that this will only be noticeable on very high-speed networks.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/page_alloc.c | 38 +++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a602b7f258d..232cadbe9231 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1087,10 +1087,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
-	unsigned long nr_scanned;
+	unsigned long nr_scanned, flags;
 	bool isolated_pageblocks;
 
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
@@ -1139,7 +1139,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			trace_mm_page_pcpu_drain(page, 0, mt);
 		} while (--count && --batch_free && !list_empty(list));
 	}
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void free_one_page(struct zone *zone,
@@ -1147,8 +1147,8 @@ static void free_one_page(struct zone *zone,
 				unsigned int order,
 				int migratetype)
 {
-	unsigned long nr_scanned;
-	spin_lock(&zone->lock);
+	unsigned long nr_scanned, flags;
+	spin_lock_irqsave(&zone->lock, flags);
 	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
 		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
@@ -1158,7 +1158,7 @@ static void free_one_page(struct zone *zone,
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
 	__free_one_page(page, pfn, zone, order, migratetype);
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
@@ -1236,7 +1236,6 @@ void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
 
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
-	unsigned long flags;
 	int migratetype;
 	unsigned long pfn = page_to_pfn(page);
 
@@ -1244,10 +1243,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
-	local_irq_save(flags);
-	__count_vm_events(PGFREE, 1 << order);
+	count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, pfn, order, migratetype);
-	local_irq_restore(flags);
 }
 
 static void __init __free_pages_boot_core(struct page *page, unsigned int order)
@@ -2219,8 +2216,9 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			int migratetype, bool cold)
 {
 	int i, alloced = 0;
+	unsigned long flags;
 
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
@@ -2256,7 +2254,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	 * pages added to the pcp list.
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 	return alloced;
 }
 
@@ -2444,7 +2442,6 @@ void free_hot_cold_page(struct page *page, bool cold)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
-	unsigned long flags;
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
@@ -2453,8 +2450,8 @@ void free_hot_cold_page(struct page *page, bool cold)
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_pcppage_migratetype(page, migratetype);
-	local_irq_save(flags);
-	__count_vm_event(PGFREE);
+	preempt_disable();
+	count_vm_event(PGFREE);
 
 	/*
 	 * We only track unmovable, reclaimable and movable on pcp lists.
@@ -2484,7 +2481,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 	}
 
 out:
-	local_irq_restore(flags);
+	preempt_enable_no_resched();
 }
 
 /*
@@ -2647,9 +2644,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct list_head *list;
 	bool cold = ((gfp_flags & __GFP_COLD) != 0);
 	struct page *page;
-	unsigned long flags;
 
-	local_irq_save(flags);
+	preempt_disable();
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  order, gfp_flags, migratetype,
@@ -2658,7 +2654,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 		zone_statistics(preferred_zone, zone, gfp_flags);
 	}
-	local_irq_restore(flags);
+	preempt_enable_no_resched();
 	return page;
 }
 
@@ -2674,7 +2670,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 	unsigned long flags;
 	struct page *page;
 
-	if (likely(order == 0))
+	if (likely(order == 0) && !in_interrupt())
 		return rmqueue_pcplist(preferred_zone, zone, order,
 				gfp_flags, migratetype);
 
@@ -3919,7 +3915,7 @@ EXPORT_SYMBOL(get_zeroed_page);
 void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
-		if (order == 0)
+		if (order == 0 && !in_interrupt())
 			free_hot_cold_page(page, false);
 		else
 			__free_pages_ok(page, order);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
@ 2017-01-09 16:35   ` Mel Gorman
  0 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-01-09 16:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Linux Kernel, Linux-MM, Hillf Danton, Mel Gorman

Many workloads that allocate pages are not handling an interrupt at a
time. As allocation requests may be from IRQ context, it's necessary to
disable/enable IRQs for every page allocation. This cost is the bulk
of the free path but also a significant percentage of the allocation
path.

This patch alters the locking and checks such that only irq-safe allocation
requests use the per-cpu allocator. All others acquire the irq-safe
zone->lock and allocate from the buddy allocator. It relies on disabling
preemption to safely access the per-cpu structures. It could be slightly
modified to avoid soft IRQs using it but it's not clear it's worthwhile.

This modification may slow allocations from IRQ context slightly but the main
gain from the per-cpu allocator is that it scales better for allocations
from multiple contexts. There is an implicit assumption that intensive
allocations from IRQ contexts on multiple CPUs from a single NUMA node are
rare and that the fast majority of scaling issues are encountered in !IRQ
contexts such as page faulting. It's worth noting that this patch is not
required for a bulk page allocator but it significantly reduces the overhead.

The following is results from a page allocator micro-benchmark. Only
order-0 is interesting as higher orders do not use the per-cpu allocator

                                          4.10.0-rc2                 4.10.0-rc2
                                             vanilla               irqsafe-v1r5
Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)

This is the alloc, free and total overhead of allocating order-0 pages in
batches of 1 page up to 16384 pages. Avoiding disabling/enabling overhead
massively reduces overhead. Alloc overhead is roughly reduced by 14-20% in
most cases. The free path is reduced by 26-46% and the total reduction
is significant.

Many users require zeroing of pages from the page allocator which is the
vast cost of allocation. Hence, the impact on a basic page faulting benchmark
is not that significant

                              4.10.0-rc2            4.10.0-rc2
                                 vanilla          irqsafe-v1r5
Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)

This is from aim9 and the most notable outcome is that fault variability
is reduced by the patch. The headline improvement is small as the overall
fault cost, zeroing, page table insertion etc dominate relative to
disabling/enabling IRQs in the per-cpu allocator.

Similarly, little benefit was seen on networking benchmarks both localhost
and between physical server/clients where other costs dominate. It's
possible that this will only be noticable on very high speed networks.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/page_alloc.c | 38 +++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a602b7f258d..232cadbe9231 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1087,10 +1087,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
-	unsigned long nr_scanned;
+	unsigned long nr_scanned, flags;
 	bool isolated_pageblocks;
 
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
@@ -1139,7 +1139,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			trace_mm_page_pcpu_drain(page, 0, mt);
 		} while (--count && --batch_free && !list_empty(list));
 	}
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void free_one_page(struct zone *zone,
@@ -1147,8 +1147,8 @@ static void free_one_page(struct zone *zone,
 				unsigned int order,
 				int migratetype)
 {
-	unsigned long nr_scanned;
-	spin_lock(&zone->lock);
+	unsigned long nr_scanned, flags;
+	spin_lock_irqsave(&zone->lock, flags);
 	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
 		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
@@ -1158,7 +1158,7 @@ static void free_one_page(struct zone *zone,
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
 	__free_one_page(page, pfn, zone, order, migratetype);
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
@@ -1236,7 +1236,6 @@ void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
 
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
-	unsigned long flags;
 	int migratetype;
 	unsigned long pfn = page_to_pfn(page);
 
@@ -1244,10 +1243,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
-	local_irq_save(flags);
-	__count_vm_events(PGFREE, 1 << order);
+	count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, pfn, order, migratetype);
-	local_irq_restore(flags);
 }
 
 static void __init __free_pages_boot_core(struct page *page, unsigned int order)
@@ -2219,8 +2216,9 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			int migratetype, bool cold)
 {
 	int i, alloced = 0;
+	unsigned long flags;
 
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
@@ -2256,7 +2254,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	 * pages added to the pcp list.
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 	return alloced;
 }
 
@@ -2444,7 +2442,6 @@ void free_hot_cold_page(struct page *page, bool cold)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
-	unsigned long flags;
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
@@ -2453,8 +2450,8 @@ void free_hot_cold_page(struct page *page, bool cold)
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_pcppage_migratetype(page, migratetype);
-	local_irq_save(flags);
-	__count_vm_event(PGFREE);
+	preempt_disable();
+	count_vm_event(PGFREE);
 
 	/*
 	 * We only track unmovable, reclaimable and movable on pcp lists.
@@ -2484,7 +2481,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 	}
 
 out:
-	local_irq_restore(flags);
+	preempt_enable_no_resched();
 }
 
 /*
@@ -2647,9 +2644,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct list_head *list;
 	bool cold = ((gfp_flags & __GFP_COLD) != 0);
 	struct page *page;
-	unsigned long flags;
 
-	local_irq_save(flags);
+	preempt_disable();
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  order, gfp_flags, migratetype,
@@ -2658,7 +2654,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 		zone_statistics(preferred_zone, zone, gfp_flags);
 	}
-	local_irq_restore(flags);
+	preempt_enable_no_resched();
 	return page;
 }
 
@@ -2674,7 +2670,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 	unsigned long flags;
 	struct page *page;
 
-	if (likely(order == 0))
+	if (likely(order == 0) && !in_interrupt())
 		return rmqueue_pcplist(preferred_zone, zone, order,
 				gfp_flags, migratetype);
 
@@ -3919,7 +3915,7 @@ EXPORT_SYMBOL(get_zeroed_page);
 void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
-		if (order == 0)
+		if (order == 0 && !in_interrupt())
 			free_hot_cold_page(page, false);
 		else
 			__free_pages_ok(page, order);
-- 
2.11.0

* [PATCH 4/4] mm, page_alloc: Add a bulk page allocator
  2017-01-09 16:35 ` Mel Gorman
@ 2017-01-09 16:35   ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-01-09 16:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Linux Kernel, Linux-MM, Hillf Danton, Mel Gorman

This patch adds a new page allocator interface via alloc_pages_bulk,
__alloc_pages_bulk and __alloc_pages_bulk_nodemask. A caller requests a
number of pages to be allocated and added to a list. They can be freed in
bulk using free_pages_bulk(). Note that it would theoretically be possible
to use free_hot_cold_page_list for faster frees if the symbol was exported,
the refcounts were 0 and the caller guaranteed it was not in an interrupt.
This would be significantly faster in the free path but also less safe
and a harder API to use.

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch if
necessary.
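
For illustration only (not part of this patch), a process-context caller
might use the interface roughly as follows. The batch size and the top-up
policy are arbitrary assumptions; note that the bulk call returns 0 if it
is handed a non-empty list, so any shortfall is made up here with
single-page allocations rather than a second bulk call:

    LIST_HEAD(pages);
    struct page *page;
    unsigned long want = 64;	/* arbitrary batch size */
    unsigned long got;

    got = alloc_pages_bulk(GFP_KERNEL, 0, want, &pages);

    /* The bulk call may return fewer pages than requested */
    while (got < want) {
        page = alloc_page(GFP_KERNEL);
        if (!page)
            break;
        list_add(&page->lru, &pages);
        got++;
    }

    list_for_each_entry(page, &pages, lru) {
        /* use page_address(page) on each order-0 page */
    }

    /* Drops the allocation references and frees what remains on the list */
    free_pages_bulk(&pages);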

The following compares the allocation cost per page for different batch
sizes. The baseline is allocating them one at a time and it compares with
the performance when using the new allocation interface.

pagealloc
                                          4.10.0-rc2                 4.10.0-rc2
                                       one-at-a-time                    bulk-v2
Amean    alloc-odr0-1               259.54 (  0.00%)           106.62 ( 58.92%)
Amean    alloc-odr0-2               193.38 (  0.00%)            76.38 ( 60.50%)
Amean    alloc-odr0-4               162.38 (  0.00%)            57.23 ( 64.76%)
Amean    alloc-odr0-8               144.31 (  0.00%)            48.77 ( 66.20%)
Amean    alloc-odr0-16              134.08 (  0.00%)            45.38 ( 66.15%)
Amean    alloc-odr0-32              128.62 (  0.00%)            42.77 ( 66.75%)
Amean    alloc-odr0-64              126.00 (  0.00%)            41.00 ( 67.46%)
Amean    alloc-odr0-128             125.00 (  0.00%)            40.08 ( 67.94%)
Amean    alloc-odr0-256             136.62 (  0.00%)            56.00 ( 59.01%)
Amean    alloc-odr0-512             152.00 (  0.00%)            69.00 ( 54.61%)
Amean    alloc-odr0-1024            158.00 (  0.00%)            76.23 ( 51.75%)
Amean    alloc-odr0-2048            163.00 (  0.00%)            81.15 ( 50.21%)
Amean    alloc-odr0-4096            169.77 (  0.00%)            85.92 ( 49.39%)
Amean    alloc-odr0-8192            170.00 (  0.00%)            88.00 ( 48.24%)
Amean    alloc-odr0-16384           170.00 (  0.00%)            89.00 ( 47.65%)
Amean    free-odr0-1                 88.69 (  0.00%)            55.69 ( 37.21%)
Amean    free-odr0-2                 66.00 (  0.00%)            49.38 ( 25.17%)
Amean    free-odr0-4                 54.23 (  0.00%)            45.38 ( 16.31%)
Amean    free-odr0-8                 48.23 (  0.00%)            44.23 (  8.29%)
Amean    free-odr0-16                47.00 (  0.00%)            45.00 (  4.26%)
Amean    free-odr0-32                44.77 (  0.00%)            43.92 (  1.89%)
Amean    free-odr0-64                44.00 (  0.00%)            43.00 (  2.27%)
Amean    free-odr0-128               43.00 (  0.00%)            43.00 (  0.00%)
Amean    free-odr0-256               60.69 (  0.00%)            60.46 (  0.38%)
Amean    free-odr0-512               79.23 (  0.00%)            76.00 (  4.08%)
Amean    free-odr0-1024              86.00 (  0.00%)            85.38 (  0.72%)
Amean    free-odr0-2048              91.00 (  0.00%)            91.23 ( -0.25%)
Amean    free-odr0-4096              94.85 (  0.00%)            95.62 ( -0.81%)
Amean    free-odr0-8192              97.00 (  0.00%)            97.00 (  0.00%)
Amean    free-odr0-16384             98.00 (  0.00%)            97.46 (  0.55%)
Amean    total-odr0-1               348.23 (  0.00%)           162.31 ( 53.39%)
Amean    total-odr0-2               259.38 (  0.00%)           125.77 ( 51.51%)
Amean    total-odr0-4               216.62 (  0.00%)           102.62 ( 52.63%)
Amean    total-odr0-8               192.54 (  0.00%)            93.00 ( 51.70%)
Amean    total-odr0-16              181.08 (  0.00%)            90.38 ( 50.08%)
Amean    total-odr0-32              173.38 (  0.00%)            86.69 ( 50.00%)
Amean    total-odr0-64              170.00 (  0.00%)            84.00 ( 50.59%)
Amean    total-odr0-128             168.00 (  0.00%)            83.08 ( 50.55%)
Amean    total-odr0-256             197.31 (  0.00%)           116.46 ( 40.97%)
Amean    total-odr0-512             231.23 (  0.00%)           145.00 ( 37.29%)
Amean    total-odr0-1024            244.00 (  0.00%)           161.62 ( 33.76%)
Amean    total-odr0-2048            254.00 (  0.00%)           172.38 ( 32.13%)
Amean    total-odr0-4096            264.62 (  0.00%)           181.54 ( 31.40%)
Amean    total-odr0-8192            267.00 (  0.00%)           185.00 ( 30.71%)
Amean    total-odr0-16384           268.00 (  0.00%)           186.46 ( 30.42%)

It shows a roughly 50-60% reduction in the cost of allocating pages.
The free paths are not improved as much but relatively little can be batched
there. It's not quite as fast as it could be but taking further shortcuts
would require making a lot of assumptions about the state of the page and
the context of the caller.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/gfp.h |  24 +++++++++++
 mm/page_alloc.c     | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 139 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..b2fe171ee1c4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -433,6 +433,29 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
 	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
 }
 
+unsigned long
+__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask,
+			unsigned long nr_pages, struct list_head *alloc_list);
+
+static inline unsigned long
+__alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist, unsigned long nr_pages,
+		struct list_head *list)
+{
+	return __alloc_pages_bulk_nodemask(gfp_mask, order, zonelist, NULL,
+						nr_pages, list);
+}
+
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
+		unsigned long nr_pages, struct list_head *list)
+{
+	int nid = numa_mem_id();
+	return __alloc_pages_bulk(gfp_mask, order,
+			node_zonelist(nid, gfp_mask), nr_pages, list);
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
@@ -504,6 +527,7 @@ extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
 extern void free_hot_cold_page(struct page *page, bool cold);
 extern void free_hot_cold_page_list(struct list_head *list, bool cold);
+extern void free_pages_bulk(struct list_head *list);
 
 struct page_frag_cache;
 extern void __page_frag_drain(struct page *page, unsigned int order,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 232cadbe9231..4f142270fbf0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2485,7 +2485,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 }
 
 /*
- * Free a list of 0-order pages
+ * Free a list of 0-order pages whose reference count is already zero.
  */
 void free_hot_cold_page_list(struct list_head *list, bool cold)
 {
@@ -2495,7 +2495,28 @@ void free_hot_cold_page_list(struct list_head *list, bool cold)
 		trace_mm_page_free_batched(page, cold);
 		free_hot_cold_page(page, cold);
 	}
+
+	INIT_LIST_HEAD(list);
+}
+
+/* Drop reference counts and free pages from a list */
+void free_pages_bulk(struct list_head *list)
+{
+	struct page *page, *next;
+	bool free_percpu = !in_interrupt();
+
+	list_for_each_entry_safe(page, next, list, lru) {
+		trace_mm_page_free_batched(page, 0);
+		if (put_page_testzero(page)) {
+			list_del(&page->lru);
+			if (free_percpu)
+				free_hot_cold_page(page, false);
+			else
+				__free_pages_ok(page, 0);
+		}
+	}
 }
+EXPORT_SYMBOL_GPL(free_pages_bulk);
 
 /*
  * split_page takes a non-compound higher-order page, and splits it into
@@ -3887,6 +3908,99 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 EXPORT_SYMBOL(__alloc_pages_nodemask);
 
 /*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ * Note that there is no guarantee that nr_pages will be allocated although
+ * every effort will be made to allocate at least one. Unlike the core
+ * allocator, no special effort is made to recover from transient
+ * failures caused by changes in cpusets. It should only be used from !IRQ
+ * context. An attempt to allocate a batch of pages from an interrupt
+ * will allocate a single page.
+ */
+unsigned long
+__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask,
+			unsigned long nr_pages, struct list_head *alloc_list)
+{
+	struct page *page;
+	unsigned long alloced = 0;
+	unsigned int alloc_flags = ALLOC_WMARK_LOW;
+	struct zone *zone;
+	struct per_cpu_pages *pcp;
+	struct list_head *pcp_list;
+	int migratetype;
+	gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
+	struct alloc_context ac = { };
+	bool cold = ((gfp_mask & __GFP_COLD) != 0);
+
+	/* If there are already pages on the list, don't bother */
+	if (!list_empty(alloc_list))
+		return 0;
+
+	/* Only handle bulk allocation of order-0 */
+	if (order || in_interrupt())
+		goto failed;
+
+	gfp_mask &= gfp_allowed_mask;
+	if (!prepare_alloc_pages(gfp_mask, order, zonelist, nodemask, &ac, &alloc_mask, &alloc_flags))
+		return 0;
+
+	finalise_ac(gfp_mask, order, &ac);
+	if (!ac.preferred_zoneref)
+		return 0;
+
+	/*
+	 * Only attempt a batch allocation if watermarks on the preferred zone
+	 * are safe.
+	 */
+	zone = ac.preferred_zoneref->zone;
+	if (!zone_watermark_fast(zone, order, zone->watermark[ALLOC_WMARK_HIGH] + nr_pages,
+				 zonelist_zone_idx(ac.preferred_zoneref), alloc_flags))
+		goto failed;
+
+	/* Attempt the batch allocation */
+	migratetype = ac.migratetype;
+
+	preempt_disable();
+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
+	pcp_list = &pcp->lists[migratetype];
+
+	while (nr_pages) {
+		page = __rmqueue_pcplist(zone, order, gfp_mask, migratetype,
+								cold, pcp, pcp_list);
+		if (!page)
+			break;
+
+		prep_new_page(page, order, gfp_mask, 0);
+		nr_pages--;
+		alloced++;
+		list_add(&page->lru, alloc_list);
+	}
+
+	if (!alloced) {
+		preempt_enable_no_resched();
+		goto failed;
+	}
+
+	__count_zid_vm_events(PGALLOC, zone_idx(zone), alloced);
+	zone_statistics(zone, zone, gfp_mask);
+
+	preempt_enable_no_resched();
+
+	return alloced;
+
+failed:
+	page = __alloc_pages_nodemask(gfp_mask, order, zonelist, nodemask);
+	if (page) {
+		alloced++;
+		list_add(&page->lru, alloc_list);
+	}
+
+	return alloced;
+}
+EXPORT_SYMBOL(__alloc_pages_bulk_nodemask);
+
+/*
  * Common helper functions.
  */
 unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
-- 
2.11.0

* Re: [PATCH 4/4] mm, page_alloc: Add a bulk page allocator
  2017-01-09 16:35   ` Mel Gorman
@ 2017-01-10  4:00     ` Hillf Danton
  -1 siblings, 0 replies; 48+ messages in thread
From: Hillf Danton @ 2017-01-10  4:00 UTC (permalink / raw)
  To: 'Mel Gorman', 'Jesper Dangaard Brouer'
  Cc: 'Linux Kernel', 'Linux-MM'

On Tuesday, January 10, 2017 12:35 AM Mel Gorman wrote: 
> 
> This patch adds a new page allocator interface via alloc_pages_bulk,
> __alloc_pages_bulk and __alloc_pages_bulk_nodemask. A caller requests a
> number of pages to be allocated and added to a list. They can be freed in
> bulk using free_pages_bulk(). Note that it would theoretically be possible
> to use free_hot_cold_page_list for faster frees if the symbol was exported,
> the refcounts were 0 and the caller guaranteed it was not in an interrupt.
> This would be significantly faster in the free path but also less safe
> and a harder API to use.
> 
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch if
> necessary.
> 
> The following compares the allocation cost per page for different batch
> sizes. The baseline is allocating them one at a time and it compares with
> the performance when using the new allocation interface.
> 
> pagealloc
>                                           4.10.0-rc2                 4.10.0-rc2
>                                        one-at-a-time                    bulk-v2
> Amean    alloc-odr0-1               259.54 (  0.00%)           106.62 ( 58.92%)
> Amean    alloc-odr0-2               193.38 (  0.00%)            76.38 ( 60.50%)
> Amean    alloc-odr0-4               162.38 (  0.00%)            57.23 ( 64.76%)
> Amean    alloc-odr0-8               144.31 (  0.00%)            48.77 ( 66.20%)
> Amean    alloc-odr0-16              134.08 (  0.00%)            45.38 ( 66.15%)
> Amean    alloc-odr0-32              128.62 (  0.00%)            42.77 ( 66.75%)
> Amean    alloc-odr0-64              126.00 (  0.00%)            41.00 ( 67.46%)
> Amean    alloc-odr0-128             125.00 (  0.00%)            40.08 ( 67.94%)
> Amean    alloc-odr0-256             136.62 (  0.00%)            56.00 ( 59.01%)
> Amean    alloc-odr0-512             152.00 (  0.00%)            69.00 ( 54.61%)
> Amean    alloc-odr0-1024            158.00 (  0.00%)            76.23 ( 51.75%)
> Amean    alloc-odr0-2048            163.00 (  0.00%)            81.15 ( 50.21%)
> Amean    alloc-odr0-4096            169.77 (  0.00%)            85.92 ( 49.39%)
> Amean    alloc-odr0-8192            170.00 (  0.00%)            88.00 ( 48.24%)
> Amean    alloc-odr0-16384           170.00 (  0.00%)            89.00 ( 47.65%)
> Amean    free-odr0-1                 88.69 (  0.00%)            55.69 ( 37.21%)
> Amean    free-odr0-2                 66.00 (  0.00%)            49.38 ( 25.17%)
> Amean    free-odr0-4                 54.23 (  0.00%)            45.38 ( 16.31%)
> Amean    free-odr0-8                 48.23 (  0.00%)            44.23 (  8.29%)
> Amean    free-odr0-16                47.00 (  0.00%)            45.00 (  4.26%)
> Amean    free-odr0-32                44.77 (  0.00%)            43.92 (  1.89%)
> Amean    free-odr0-64                44.00 (  0.00%)            43.00 (  2.27%)
> Amean    free-odr0-128               43.00 (  0.00%)            43.00 (  0.00%)
> Amean    free-odr0-256               60.69 (  0.00%)            60.46 (  0.38%)
> Amean    free-odr0-512               79.23 (  0.00%)            76.00 (  4.08%)
> Amean    free-odr0-1024              86.00 (  0.00%)            85.38 (  0.72%)
> Amean    free-odr0-2048              91.00 (  0.00%)            91.23 ( -0.25%)
> Amean    free-odr0-4096              94.85 (  0.00%)            95.62 ( -0.81%)
> Amean    free-odr0-8192              97.00 (  0.00%)            97.00 (  0.00%)
> Amean    free-odr0-16384             98.00 (  0.00%)            97.46 (  0.55%)
> Amean    total-odr0-1               348.23 (  0.00%)           162.31 ( 53.39%)
> Amean    total-odr0-2               259.38 (  0.00%)           125.77 ( 51.51%)
> Amean    total-odr0-4               216.62 (  0.00%)           102.62 ( 52.63%)
> Amean    total-odr0-8               192.54 (  0.00%)            93.00 ( 51.70%)
> Amean    total-odr0-16              181.08 (  0.00%)            90.38 ( 50.08%)
> Amean    total-odr0-32              173.38 (  0.00%)            86.69 ( 50.00%)
> Amean    total-odr0-64              170.00 (  0.00%)            84.00 ( 50.59%)
> Amean    total-odr0-128             168.00 (  0.00%)            83.08 ( 50.55%)
> Amean    total-odr0-256             197.31 (  0.00%)           116.46 ( 40.97%)
> Amean    total-odr0-512             231.23 (  0.00%)           145.00 ( 37.29%)
> Amean    total-odr0-1024            244.00 (  0.00%)           161.62 ( 33.76%)
> Amean    total-odr0-2048            254.00 (  0.00%)           172.38 ( 32.13%)
> Amean    total-odr0-4096            264.62 (  0.00%)           181.54 ( 31.40%)
> Amean    total-odr0-8192            267.00 (  0.00%)           185.00 ( 30.71%)
> Amean    total-odr0-16384           268.00 (  0.00%)           186.46 ( 30.42%)
> 
> It shows a roughly 50-60% reduction in the cost of allocating pages.
> The free paths are not improved as much but relatively little can be batched
> there. It's not quite as fast as it could be but taking further shortcuts
> would require making a lot of assumptions about the state of the page and
> the context of the caller.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  include/linux/gfp.h |  24 +++++++++++
>  mm/page_alloc.c     | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 139 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 4175dca4ac39..b2fe171ee1c4 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -433,6 +433,29 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
>  	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
>  }
> 
> +unsigned long
> +__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> +			struct zonelist *zonelist, nodemask_t *nodemask,
> +			unsigned long nr_pages, struct list_head *alloc_list);
> +
> +static inline unsigned long
> +__alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
> +		struct zonelist *zonelist, unsigned long nr_pages,
> +		struct list_head *list)
> +{
> +	return __alloc_pages_bulk_nodemask(gfp_mask, order, zonelist, NULL,
> +						nr_pages, list);
> +}
> +
> +static inline unsigned long
> +alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
> +		unsigned long nr_pages, struct list_head *list)
> +{
> +	int nid = numa_mem_id();
> +	return __alloc_pages_bulk(gfp_mask, order,
> +			node_zonelist(nid, gfp_mask), nr_pages, list);
> +}
> +
>  /*
>   * Allocate pages, preferring the node given as nid. The node must be valid and
>   * online. For more general interface, see alloc_pages_node().
> @@ -504,6 +527,7 @@ extern void __free_pages(struct page *page, unsigned int order);
>  extern void free_pages(unsigned long addr, unsigned int order);
>  extern void free_hot_cold_page(struct page *page, bool cold);
>  extern void free_hot_cold_page_list(struct list_head *list, bool cold);
> +extern void free_pages_bulk(struct list_head *list);
> 
>  struct page_frag_cache;
>  extern void __page_frag_drain(struct page *page, unsigned int order,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 232cadbe9231..4f142270fbf0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2485,7 +2485,7 @@ void free_hot_cold_page(struct page *page, bool cold)
>  }
> 
>  /*
> - * Free a list of 0-order pages
> + * Free a list of 0-order pages whose reference count is already zero.
>   */
>  void free_hot_cold_page_list(struct list_head *list, bool cold)
>  {
> @@ -2495,7 +2495,28 @@ void free_hot_cold_page_list(struct list_head *list, bool cold)
>  		trace_mm_page_free_batched(page, cold);
>  		free_hot_cold_page(page, cold);
>  	}
> +
> +	INIT_LIST_HEAD(list);

Nit: can we cut this overhead off?
> +}
> +
> +/* Drop reference counts and free pages from a list */
> +void free_pages_bulk(struct list_head *list)
> +{
> +	struct page *page, *next;
> +	bool free_percpu = !in_interrupt();
> +
> +	list_for_each_entry_safe(page, next, list, lru) {
> +		trace_mm_page_free_batched(page, 0);
> +		if (put_page_testzero(page)) {
> +			list_del(&page->lru);
> +			if (free_percpu)
> +				free_hot_cold_page(page, false);
> +			else
> +				__free_pages_ok(page, 0);
> +		}
> +	}
>  }
> +EXPORT_SYMBOL_GPL(free_pages_bulk);
> 
>  /*
>   * split_page takes a non-compound higher-order page, and splits it into
> @@ -3887,6 +3908,99 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  EXPORT_SYMBOL(__alloc_pages_nodemask);
> 
>  /*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + * Note that there is no guarantee that nr_pages will be allocated although
> + * every effort will be made to allocate at least one. Unlike the core
> + * allocator, no special effort is made to recover from transient
> + * failures caused by changes in cpusets. It should only be used from !IRQ
> + * context. An attempt to allocate a batch of pages from an interrupt
> + * will allocate a single page.
> + */
> +unsigned long
> +__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> +			struct zonelist *zonelist, nodemask_t *nodemask,
> +			unsigned long nr_pages, struct list_head *alloc_list)
> +{
> +	struct page *page;
> +	unsigned long alloced = 0;
> +	unsigned int alloc_flags = ALLOC_WMARK_LOW;
> +	struct zone *zone;
> +	struct per_cpu_pages *pcp;
> +	struct list_head *pcp_list;
> +	int migratetype;
> +	gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
> +	struct alloc_context ac = { };
> +	bool cold = ((gfp_mask & __GFP_COLD) != 0);
> +
> +	/* If there are already pages on the list, don't bother */
> +	if (!list_empty(alloc_list))
> +		return 0;

Nit: can we move the check to the call site?
> +
> +	/* Only handle bulk allocation of order-0 */
> +	if (order || in_interrupt())
> +		goto failed;

Ditto

Hillf

* Re: [PATCH 4/4] mm, page_alloc: Add a bulk page allocator
  2017-01-10  4:00     ` Hillf Danton
@ 2017-01-10  8:34       ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-01-10  8:34 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Jesper Dangaard Brouer', 'Linux Kernel',
	'Linux-MM'

On Tue, Jan 10, 2017 at 12:00:27PM +0800, Hillf Danton wrote:
> > It shows a roughly 50-60% reduction in the cost of allocating pages.
> > The free paths are not improved as much but relatively little can be batched
> > there. It's not quite as fast as it could be but taking further shortcuts
> > would require making a lot of assumptions about the state of the page and
> > the context of the caller.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
> 

Thanks.

> > @@ -2485,7 +2485,7 @@ void free_hot_cold_page(struct page *page, bool cold)
> >  }
> > 
> >  /*
> > - * Free a list of 0-order pages
> > + * Free a list of 0-order pages whose reference count is already zero.
> >   */
> >  void free_hot_cold_page_list(struct list_head *list, bool cold)
> >  {
> > @@ -2495,7 +2495,28 @@ void free_hot_cold_page_list(struct list_head *list, bool cold)
> >  		trace_mm_page_free_batched(page, cold);
> >  		free_hot_cold_page(page, cold);
> >  	}
> > +
> > +	INIT_LIST_HEAD(list);
> 
> Nit: can we cut this overhead off?

Yes, but note that any caller of free_hot_cold_page_list() would then be
required to reinitialise the list themselves or it'll cause corruption.
It's unlikely that a user of the bulk interface will handle the refcounts
and be able to use this interface properly, but if they do, they need to
either reinitialise the list or add the hunk back in.

As it happens, none of the current callers care.
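
Purely as an illustration of what that would require (this snippet is not
from the series and the list contents are hypothetical), such a caller
would then have to do roughly:

    LIST_HEAD(pages);

    /* ... order-0 pages gathered here, references already dropped ... */

    free_hot_cold_page_list(&pages, false);

    /*
     * free_hot_cold_page_list() does not unlink the entries, so without
     * the INIT_LIST_HEAD() hunk the list head still points at pages that
     * have been returned to the allocator. Reinitialise before reuse.
     */
    INIT_LIST_HEAD(&pages);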

> >  /*
> >   * split_page takes a non-compound higher-order page, and splits it into
> > @@ -3887,6 +3908,99 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> >  EXPORT_SYMBOL(__alloc_pages_nodemask);
> > 
> >  /*
> > + * This is a batched version of the page allocator that attempts to
> > + * allocate nr_pages quickly from the preferred zone and add them to list.
> > + * Note that there is no guarantee that nr_pages will be allocated although
> > + * every effort will be made to allocate at least one. Unlike the core
> > + * allocator, no special effort is made to recover from transient
> > + * failures caused by changes in cpusets. It should only be used from !IRQ
> > + * context. An attempt to allocate a batch of pages from an interrupt
> > + * will allocate a single page.
> > + */
> > +unsigned long
> > +__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> > +			struct zonelist *zonelist, nodemask_t *nodemask,
> > +			unsigned long nr_pages, struct list_head *alloc_list)
> > +{
> > +	struct page *page;
> > +	unsigned long alloced = 0;
> > +	unsigned int alloc_flags = ALLOC_WMARK_LOW;
> > +	struct zone *zone;
> > +	struct per_cpu_pages *pcp;
> > +	struct list_head *pcp_list;
> > +	int migratetype;
> > +	gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
> > +	struct alloc_context ac = { };
> > +	bool cold = ((gfp_mask & __GFP_COLD) != 0);
> > +
> > +	/* If there are already pages on the list, don't bother */
> > +	if (!list_empty(alloc_list))
> > +		return 0;
> 
> Nit: can we move the check to the call site?

Yes, but it makes the API slightly more hazardous to use.

> > +
> > +	/* Only handle bulk allocation of order-0 */
> > +	if (order || in_interrupt())
> > +		goto failed;
> 
> Ditto
> 

Same here: if the caller is in interrupt context, there is a slight risk
that they'll corrupt the list in a manner that would be tricky to catch.
The checks are there to minimise the risk of surprises.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 1/4] mm, page_alloc: Split buffered_rmqueue
  2017-01-09 16:35   ` Mel Gorman
@ 2017-01-11 12:31     ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 48+ messages in thread
From: Jesper Dangaard Brouer @ 2017-01-11 12:31 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Kernel, Linux-MM, Hillf Danton, brouer

On Mon,  9 Jan 2017 16:35:15 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> buffered_rmqueue removes a page from a given zone and uses the per-cpu
> list for order-0. This is fine but a hypothetical caller that wanted
> multiple order-0 pages has to disable/reenable interrupts multiple
> times. This patch structures buffered_rmqueue such that it's relatively
> easy to build a bulk order-0 page allocator. There is no functional
> change.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [PATCH 2/4] mm, page_alloc: Split alloc_pages_nodemask
  2017-01-09 16:35   ` Mel Gorman
@ 2017-01-11 12:32     ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 48+ messages in thread
From: Jesper Dangaard Brouer @ 2017-01-11 12:32 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Kernel, Linux-MM, Hillf Danton, brouer

On Mon,  9 Jan 2017 16:35:16 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> alloc_pages_nodemask does a number of preparation steps that determine
> what zones can be used for the allocation depending on a variety of
> factors. This is fine but a hypothetical caller that wanted multiple
> order-0 pages has to do the preparation steps multiple times. This patch
> structures __alloc_pages_nodemask such that it's relatively easy to build
> a bulk order-0 page allocator. There is no functional change.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
  2017-01-09 16:35   ` Mel Gorman
@ 2017-01-11 12:44     ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 48+ messages in thread
From: Jesper Dangaard Brouer @ 2017-01-11 12:44 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Kernel, Linux-MM, Hillf Danton, brouer


On Mon,  9 Jan 2017 16:35:17 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
 
> The following is results from a page allocator micro-benchmark. Only
> order-0 is interesting as higher orders do not use the per-cpu allocator

Micro-benchmarked with [1] page_bench02:
 modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
  rmmod page_bench02 ; dmesg --notime | tail -n 4

Compared to baseline: 213 cycles(tsc) 53.417 ns
 - against this     : 184 cycles(tsc) 46.056 ns
 - Saving           : -29 cycles
 - Very close to expected 27 cycles saving [see below [2]]


> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

[1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
-
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[2] Expected saving comes from Mel removing a local_irq_{save,restore}
and adding a preempt_{disable,enable} instead.

Micro benchmarking via time_bench_sample[3], we get the cost of these
operations:

 time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
 time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
 time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
 time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
 time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
 time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
 time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
 time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
 time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
 [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
 time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
 [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
 time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
 time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
 time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)

Thus, expected improvement is: 38-11 = 27 cycles.

[3] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

CPU used: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz

Config options of interest:
 CONFIG_NUMA=y
 CONFIG_DEBUG_LIST=n
 CONFIG_VM_EVENT_COUNTERS=y

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
  2017-01-11 12:44     ` Jesper Dangaard Brouer
@ 2017-01-11 13:27       ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 48+ messages in thread
From: Jesper Dangaard Brouer @ 2017-01-11 13:27 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Kernel, Linux-MM, Hillf Danton, brouer

On Wed, 11 Jan 2017 13:44:20 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Mon,  9 Jan 2017 16:35:17 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
>  
> > The following is results from a page allocator micro-benchmark. Only
> > order-0 is interesting as higher orders do not use the per-cpu allocator  
> 
> Micro-benchmarked with [1] page_bench02:
>  modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
>   rmmod page_bench02 ; dmesg --notime | tail -n 4
> 
> Compared to baseline: 213 cycles(tsc) 53.417 ns
>  - against this     : 184 cycles(tsc) 46.056 ns
>  - Saving           : -29 cycles
>  - Very close to expected 27 cycles saving [see below [2]]

When perf benchmarking I noticed that the summed "children" perf overhead
from calling alloc_pages_current() is 65.05%, compared to a summed 28.28%
for the free path under __free_pages().

This is caused by CONFIG_NUMA=y, as the call path is longer with NUMA
(and other helpers are also non-inlined calls):

 alloc_pages
  -> alloc_pages_current
      -> __alloc_pages_nodemask
          -> get_page_from_freelist

Without NUMA the call levels get compacted by inlining to:

 __alloc_pages_nodemask
  -> get_page_from_freelist

After disabling NUMA, the split between the alloc (48.80%) and free (42.67%)
sides is more balanced.

Saving by disabling CONFIG_NUMA of:
 - CONFIG_NUMA=y : 184 cycles(tsc) 46.056 ns
 - CONFIG_NUMA=n : 143 cycles(tsc) 35.913 ns
 - Saving:       :  41 cycles (approx 22%)

I would conclude that there is room for improvement in the CONFIG_NUMA code
path. Let's follow up on that in a later patch series...
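
For reference, the children-overhead split above comes from perf call-graph
sampling around the benchmark module run; the invocation below is a
reconstruction of how it could be gathered, not a command copied from the
original run:

 perf record -g -a -- modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8))
 perf report --children
 rmmod page_bench02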


> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>  
> 
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
> -
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 
> [2] Expected saving comes from Mel removing a local_irq_{save,restore}
> and adding a preempt_{disable,enable} instead.
> 
> Micro benchmarking via time_bench_sample[3], we get the cost of these
> operations:
> 
>  time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
>  time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
>  time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
>  time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
>  time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
>  time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
>  time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
>  time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
>  time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
>  [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
>  time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
>  [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
>  time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
>  time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
>  time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)
> 
> Thus, expected improvement is: 38-11 = 27 cycles.
> 
> [3] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
> 
> CPU used: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
> 
> Config options of interest:
>  CONFIG_NUMA=y
>  CONFIG_DEBUG_LIST=n
>  CONFIG_VM_EVENT_COUNTERS=y



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/4] mm, page_alloc: Split buffered_rmqueue
  2017-01-09 16:35   ` Mel Gorman
@ 2017-01-12  3:09     ` Hillf Danton
  -1 siblings, 0 replies; 48+ messages in thread
From: Hillf Danton @ 2017-01-12  3:09 UTC (permalink / raw)
  To: 'Mel Gorman', 'Jesper Dangaard Brouer'
  Cc: 'Linux Kernel', 'Linux-MM'

On Tuesday, January 10, 2017 12:35 AM Mel Gorman wrote: 
> 
> buffered_rmqueue removes a page from a given zone and uses the per-cpu
> list for order-0. This is fine but a hypothetical caller that wanted
> multiple order-0 pages has to disable/reenable interrupts multiple
> times. This patch structures buffered_rmqueue such that it's relatively
> easy to build a bulk order-0 page allocator. There is no functional
> change.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] mm, page_alloc: Split alloc_pages_nodemask
  2017-01-09 16:35   ` Mel Gorman
@ 2017-01-12  3:11     ` Hillf Danton
  -1 siblings, 0 replies; 48+ messages in thread
From: Hillf Danton @ 2017-01-12  3:11 UTC (permalink / raw)
  To: 'Mel Gorman', 'Jesper Dangaard Brouer'
  Cc: 'Linux Kernel', 'Linux-MM'

On Tuesday, January 10, 2017 12:35 AM Mel Gorman wrote: 
> 
> alloc_pages_nodemask does a number of preparation steps that determine
> what zones can be used for the allocation depending on a variety of
> factors. This is fine but a hypothetical caller that wanted multiple
> order-0 pages has to do the preparation steps multiple times. This patch
> structures __alloc_pages_nodemask such that it's relatively easy to build
> a bulk order-0 page allocator. There is no functional change.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
  2017-01-11 13:27       ` Jesper Dangaard Brouer
@ 2017-01-12 10:47         ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-01-12 10:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Linux Kernel, Linux-MM, Hillf Danton

On Wed, Jan 11, 2017 at 02:27:12PM +0100, Jesper Dangaard Brouer wrote:
> On Wed, 11 Jan 2017 13:44:20 +0100
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > On Mon,  9 Jan 2017 16:35:17 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> >  
> > > The following is results from a page allocator micro-benchmark. Only
> > > order-0 is interesting as higher orders do not use the per-cpu allocator  
> > 
> > Micro-benchmarked with [1] page_bench02:
> >  modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
> >   rmmod page_bench02 ; dmesg --notime | tail -n 4
> > 
> > Compared to baseline: 213 cycles(tsc) 53.417 ns
> >  - against this     : 184 cycles(tsc) 46.056 ns
> >  - Saving           : -29 cycles
> >  - Very close to expected 27 cycles saving [see below [2]]
> 
> When perf benchmarking I noticed that the "summed" children perf
> overhead from calling alloc_pages_current() is 65.05%. Compared to
> "free-path" of summed 28.28% of calls "under" __free_pages().
> 
> This is caused by CONFIG_NUMA=y, as call path is long with NUMA
> (and other helpers are also non-inlined calls):
> 
>  alloc_pages
>   -> alloc_pages_current
>       -> __alloc_pages_nodemask
>           -> get_page_from_freelist
> 
> Without NUMA the call levels gets compacted by inlining to:
> 
>  __alloc_pages_nodemask
>   -> get_page_from_freelist
> 
> After disabling NUMA, the split between alloc(48.80%) vs. free(42.67%)
> side is more balanced.
> 
> Saving by disabling CONFIG_NUMA of:
>  - CONFIG_NUMA=y : 184 cycles(tsc) 46.056 ns
>  - CONFIG_NUMA=n : 143 cycles(tsc) 35.913 ns
>  - Saving:       :  41 cycles (approx 22%)
> 
> I would conclude, there is room for improvements with CONFIG_NUMA code
> path case. Lets followup on that in a later patch series...
> 

Potentially. The NUMA paths do memory policy work and have more
complexity in the statistics path.
of it. There were not many safe options when I last looked but that was
a long time ago. Most of the focus has been on the core allocator
itself and not the wrappers around it.
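
For anyone curious where the extra cycles go, the NUMA entry point does
roughly the following before reaching the common path (heavily simplified
sketch from memory, interleave handling omitted; not the exact mainline
code):

	struct page *alloc_pages_current(gfp_t gfp, unsigned order)
	{
		struct mempolicy *pol = &default_policy;

		/* Per-task policy lookup is an extra, non-inlined step
		 * on every allocation when CONFIG_NUMA=y. */
		if (!in_interrupt() && !(gfp & __GFP_THISNODE))
			pol = get_task_policy(current);

		return __alloc_pages_nodemask(gfp, order,
				policy_zonelist(gfp, pol, numa_node_id()),
				policy_nodemask(gfp, pol));
	}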

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] mm, page_alloc: Add a bulk page allocator
  2017-01-09 16:35   ` Mel Gorman
@ 2017-01-16 14:25     ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 48+ messages in thread
From: Jesper Dangaard Brouer @ 2017-01-16 14:25 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Kernel, Linux-MM, Hillf Danton, brouer

On Mon,  9 Jan 2017 16:35:18 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> This patch adds a new page allocator interface via alloc_pages_bulk,
> __alloc_pages_bulk and __alloc_pages_bulk_nodemask. A caller requests a
> number of pages to be allocated and added to a list. They can be freed in
> bulk using free_pages_bulk(). Note that it would theoretically be possible
> to use free_hot_cold_page_list for faster frees if the symbol was exported,
> the refcounts were 0 and the caller guaranteed it was not in an interrupt.
> This would be significantly faster in the free path but also less safe
> and a harder API to use.
> 
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch if
> necessary.
> 
> The following compares the allocation cost per page for different batch
> sizes. The baseline is allocating them one at a time and it compares with
> the performance when using the new allocation interface.

I've also played with testing the bulking API here:
 [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c

My baseline for a single (order-0) page shows: 158 cycles(tsc) 39.593 ns

Using bulking API:
 Bulk:   1 cycles: 128 nanosec: 32.134
 Bulk:   2 cycles: 107 nanosec: 26.783
 Bulk:   3 cycles: 100 nanosec: 25.047
 Bulk:   4 cycles:  95 nanosec: 23.988
 Bulk:   8 cycles:  91 nanosec: 22.823
 Bulk:  16 cycles:  88 nanosec: 22.093
 Bulk:  32 cycles:  85 nanosec: 21.338
 Bulk:  64 cycles:  85 nanosec: 21.315
 Bulk: 128 cycles:  84 nanosec: 21.214
 Bulk: 256 cycles: 115 nanosec: 28.979

This bulk API (and the other improvements that are part of the patchset)
definitely moves the speed of the page allocator closer to my (crazy) time
budget target of between 201 and 269 cycles per packet[1].  Remember I was
reporting[2] an order-0 cost of between 231 and 277 cycles at MM-summit
2016, so this is a huge improvement since then.

The bulk numbers are great, but they still cannot compete with the
recycling tricks used by drivers.  Looking at the code (and as Mel also
mentions), there is room for improvement, especially on the bulk free side.


[1] http://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf
[2] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

> pagealloc
>                                           4.10.0-rc2                 4.10.0-rc2
>                                        one-at-a-time                    bulk-v2
> Amean    alloc-odr0-1               259.54 (  0.00%)           106.62 ( 58.92%)
> Amean    alloc-odr0-2               193.38 (  0.00%)            76.38 ( 60.50%)
> Amean    alloc-odr0-4               162.38 (  0.00%)            57.23 ( 64.76%)
> Amean    alloc-odr0-8               144.31 (  0.00%)            48.77 ( 66.20%)
> Amean    alloc-odr0-16              134.08 (  0.00%)            45.38 ( 66.15%)
> Amean    alloc-odr0-32              128.62 (  0.00%)            42.77 ( 66.75%)
> Amean    alloc-odr0-64              126.00 (  0.00%)            41.00 ( 67.46%)
> Amean    alloc-odr0-128             125.00 (  0.00%)            40.08 ( 67.94%)
> Amean    alloc-odr0-256             136.62 (  0.00%)            56.00 ( 59.01%)
> Amean    alloc-odr0-512             152.00 (  0.00%)            69.00 ( 54.61%)
> Amean    alloc-odr0-1024            158.00 (  0.00%)            76.23 ( 51.75%)
> Amean    alloc-odr0-2048            163.00 (  0.00%)            81.15 ( 50.21%)
> Amean    alloc-odr0-4096            169.77 (  0.00%)            85.92 ( 49.39%)
> Amean    alloc-odr0-8192            170.00 (  0.00%)            88.00 ( 48.24%)
> Amean    alloc-odr0-16384           170.00 (  0.00%)            89.00 ( 47.65%)
> Amean    free-odr0-1                 88.69 (  0.00%)            55.69 ( 37.21%)
> Amean    free-odr0-2                 66.00 (  0.00%)            49.38 ( 25.17%)
> Amean    free-odr0-4                 54.23 (  0.00%)            45.38 ( 16.31%)
> Amean    free-odr0-8                 48.23 (  0.00%)            44.23 (  8.29%)
> Amean    free-odr0-16                47.00 (  0.00%)            45.00 (  4.26%)
> Amean    free-odr0-32                44.77 (  0.00%)            43.92 (  1.89%)
> Amean    free-odr0-64                44.00 (  0.00%)            43.00 (  2.27%)
> Amean    free-odr0-128               43.00 (  0.00%)            43.00 (  0.00%)
> Amean    free-odr0-256               60.69 (  0.00%)            60.46 (  0.38%)
> Amean    free-odr0-512               79.23 (  0.00%)            76.00 (  4.08%)
> Amean    free-odr0-1024              86.00 (  0.00%)            85.38 (  0.72%)
> Amean    free-odr0-2048              91.00 (  0.00%)            91.23 ( -0.25%)
> Amean    free-odr0-4096              94.85 (  0.00%)            95.62 ( -0.81%)
> Amean    free-odr0-8192              97.00 (  0.00%)            97.00 (  0.00%)
> Amean    free-odr0-16384             98.00 (  0.00%)            97.46 (  0.55%)
> Amean    total-odr0-1               348.23 (  0.00%)           162.31 ( 53.39%)
> Amean    total-odr0-2               259.38 (  0.00%)           125.77 ( 51.51%)
> Amean    total-odr0-4               216.62 (  0.00%)           102.62 ( 52.63%)
> Amean    total-odr0-8               192.54 (  0.00%)            93.00 ( 51.70%)
> Amean    total-odr0-16              181.08 (  0.00%)            90.38 ( 50.08%)
> Amean    total-odr0-32              173.38 (  0.00%)            86.69 ( 50.00%)
> Amean    total-odr0-64              170.00 (  0.00%)            84.00 ( 50.59%)
> Amean    total-odr0-128             168.00 (  0.00%)            83.08 ( 50.55%)
> Amean    total-odr0-256             197.31 (  0.00%)           116.46 ( 40.97%)
> Amean    total-odr0-512             231.23 (  0.00%)           145.00 ( 37.29%)
> Amean    total-odr0-1024            244.00 (  0.00%)           161.62 ( 33.76%)
> Amean    total-odr0-2048            254.00 (  0.00%)           172.38 ( 32.13%)
> Amean    total-odr0-4096            264.62 (  0.00%)           181.54 ( 31.40%)
> Amean    total-odr0-8192            267.00 (  0.00%)           185.00 ( 30.71%)
> Amean    total-odr0-16384           268.00 (  0.00%)           186.46 ( 30.42%)
> 
> It shows a roughly 50-60% reduction in the cost of allocating pages.
> The free paths are not improved as much but relatively little can be batched
> there. It's not quite as fast as it could be but taking further shortcuts
> would require making a lot of assumptions about the state of the page and
> the context of the caller.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] mm, page_alloc: Add a bulk page allocator
  2017-01-16 14:25     ` Jesper Dangaard Brouer
@ 2017-01-16 15:01       ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-01-16 15:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Linux Kernel, Linux-MM, Hillf Danton

On Mon, Jan 16, 2017 at 03:25:18PM +0100, Jesper Dangaard Brouer wrote:
> On Mon,  9 Jan 2017 16:35:18 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > This patch adds a new page allocator interface via alloc_pages_bulk,
> > __alloc_pages_bulk and __alloc_pages_bulk_nodemask. A caller requests a
> > number of pages to be allocated and added to a list. They can be freed in
> > bulk using free_pages_bulk(). Note that it would theoretically be possible
> > to use free_hot_cold_page_list for faster frees if the symbol was exported,
> > the refcounts were 0 and the caller guaranteed it was not in an interrupt.
> > This would be significantly faster in the free path but also less safe
> > and a harder API to use.
> > 
> > The API is not guaranteed to return the requested number of pages and
> > may fail if the preferred allocation zone has limited free memory, the
> > cpuset changes during the allocation or page debugging decides to fail
> > an allocation. It's up to the caller to request more pages in batch if
> > necessary.
> > 
> > The following compares the allocation cost per page for different batch
> > sizes. The baseline is allocating them one at a time and it compares with
> > the performance when using the new allocation interface.
> 
> I've also played with testing the bulking API here:
>  [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c
> 
> My baseline single (order-0 page) show: 158 cycles(tsc) 39.593 ns
> 
> Using bulking API:
>  Bulk:   1 cycles: 128 nanosec: 32.134
>  Bulk:   2 cycles: 107 nanosec: 26.783
>  Bulk:   3 cycles: 100 nanosec: 25.047
>  Bulk:   4 cycles:  95 nanosec: 23.988
>  Bulk:   8 cycles:  91 nanosec: 22.823
>  Bulk:  16 cycles:  88 nanosec: 22.093
>  Bulk:  32 cycles:  85 nanosec: 21.338
>  Bulk:  64 cycles:  85 nanosec: 21.315
>  Bulk: 128 cycles:  84 nanosec: 21.214
>  Bulk: 256 cycles: 115 nanosec: 28.979
> 
> This bulk API (and other improvements part of patchset) definitely
> moves the speed of the page allocator closer to my (crazy) time budget
> target of between 201 to 269 cycles per packet[1].  Remember I was
> reporting[2] order-0 cost between 231 to 277 cycles, at MM-summit
> 2016, so this is a huge improvement since then.
> 

Good to hear.

> The bulk numbers are great, but it still cannot compete with the
> recycles tricks used by drivers.  Looking at the code (and as Mel also
> mentions) there is room for improvements especially on the bulk free-side.
> 

A major component there is how the ref handling is done and the safety
checks. If necessary, you could mandate that callers drop the reference
count or allow pages to be freed with an elevated count to avoid the atomic
ops. In an early prototype, I made the refcount "mistake" and freeing was
half the cost. I restored it in the final version to have an API that was
almost identical to the existing allocator other than the bulking aspects.
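
As a sketch of that first option (a hypothetical calling convention, not
what the posted patch implements), the caller would guarantee it holds the
only reference and drop it before handing the pages back, so a bulk free
variant could skip the atomic put_page_testzero() per page:

	/* Hypothetical contract: refcount already dropped by the caller */
	list_for_each_entry(page, &pagelist, lru) {
		VM_BUG_ON_PAGE(page_ref_count(page) != 1, page);
		page_ref_dec(page);
	}
	free_pages_bulk(&pagelist);	/* a variant expecting refcount == 0 */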

You could also disable all the other safety checks and flag that the bulk
alloc/free potentially frees pages in inconsistent state.  That would
increase the performance at the cost of safety but that may be acceptable
given that driver recycling of pages also avoids the same checks.

You could also consider disabling the statistics updates to avoid a bunch
of per-cpu stat operations, particularly if the pages were mostly recycled
by the generic pool allocator.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 0/4] Fast noirq bulk page allocator v2r7
  2017-01-09 16:35 ` Mel Gorman
@ 2017-01-29  4:00   ` Andy Lutomirski
  -1 siblings, 0 replies; 48+ messages in thread
From: Andy Lutomirski @ 2017-01-29  4:00 UTC (permalink / raw)
  To: Mel Gorman, Jesper Dangaard Brouer; +Cc: Linux Kernel, Linux-MM, Hillf Danton

On 01/09/2017 08:35 AM, Mel Gorman wrote:
> The
> fourth patch introduces a bulk page allocator with no in-kernel users as
> an example for Jesper and others who want to build a page allocator for
> DMA-coherent pages.

If you want an in-kernel user as a test, to validate the API's sanity, 
and to improve performance, how about __vmalloc_area_node()?  :)
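
For concreteness, the per-page allocation loop there could consume the bulk
API roughly like this (untested sketch; the alloc_pages_bulk() signature is
assumed from the patch description and the variable names only loosely
follow the existing function):

	LIST_HEAD(pagelist);
	unsigned long nr, i = 0;
	struct page *page, *tmp;

	/* Grab as many order-0 pages as possible in one call... */
	nr = alloc_pages_bulk(alloc_mask, 0, area->nr_pages, &pagelist);
	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
		list_del(&page->lru);
		area->pages[i++] = page;
	}
	/* ...and fall back to single-page allocations for any shortfall. */
	for (; i < area->nr_pages; i++) {
		page = alloc_page(alloc_mask);
		if (!page)
			break;	/* error handling omitted in this sketch */
		area->pages[i] = page;
	}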

--Andy

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
  2017-01-09  9:48           ` Mel Gorman
@ 2017-01-09  9:55             ` Hillf Danton
  -1 siblings, 0 replies; 48+ messages in thread
From: Hillf Danton @ 2017-01-09  9:55 UTC (permalink / raw)
  To: 'Mel Gorman'
  Cc: 'Jesper Dangaard Brouer', 'Linux Kernel',
	'Linux-MM'

On Monday, January 09, 2017 5:48 PM Mel Gorman wrote:
> On Mon, Jan 09, 2017 at 11:14:29AM +0800, Hillf Danton wrote:
> > > On Friday, January 06, 2017 6:16 PM Mel Gorman wrote:
> > >
> > > On Fri, Jan 06, 2017 at 11:26:46AM +0800, Hillf Danton wrote:
> > > >
> > > > On Wednesday, January 04, 2017 7:11 PM Mel Gorman wrote:
> > > > > @@ -2647,9 +2644,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> > > > >  	struct list_head *list;
> > > > >  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
> > > > >  	struct page *page;
> > > > > -	unsigned long flags;
> > > > >
> > > > > -	local_irq_save(flags);
> > > > > +	preempt_disable();
> > > > >  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > > >  	list = &pcp->lists[migratetype];
> > > > >  	page = __rmqueue_pcplist(zone,  order, gfp_flags, migratetype,
> > > > > @@ -2658,7 +2654,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> > > > >  		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> > > > >  		zone_statistics(preferred_zone, zone, gfp_flags);
> > > > >  	}
> > > > > -	local_irq_restore(flags);
> > > > > +	preempt_enable();
> > > > >  	return page;
> > > > >  }
> > > > >
> > > > With PREEMPT configured, preempt_enable() adds entry point to schedule().
> > > > Is that needed when we try to allocate a page?
> > > >
> > >
> > > Not necessarily but what are you proposing as an alternative?
> >
> > preempt_enable_no_resched() looks at first glance a choice for us to
> > avoid flipping interrupts.
> >
> 
> Ok, I wasn't sure if you were proposing something more drastic. I can
> make it this although I have no reason to believe it will really matter.
> The path should be short enough that it's unlikely a scheduler event
> would ever occur at that point. Still, no harm in doing what you
> suggest.
> 
If you spin a new version, feel free to add

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

to the patchset.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
  2017-01-09  3:14         ` Hillf Danton
@ 2017-01-09  9:48           ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-01-09  9:48 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Jesper Dangaard Brouer', 'Linux Kernel',
	'Linux-MM'

On Mon, Jan 09, 2017 at 11:14:29AM +0800, Hillf Danton wrote:
> > Sent: Friday, January 06, 2017 6:16 PM Mel Gorman wrote: 
> > 
> > On Fri, Jan 06, 2017 at 11:26:46AM +0800, Hillf Danton wrote:
> > >
> > > On Wednesday, January 04, 2017 7:11 PM Mel Gorman wrote:
> > > > @@ -2647,9 +2644,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> > > >  	struct list_head *list;
> > > >  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
> > > >  	struct page *page;
> > > > -	unsigned long flags;
> > > >
> > > > -	local_irq_save(flags);
> > > > +	preempt_disable();
> > > >  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > >  	list = &pcp->lists[migratetype];
> > > >  	page = __rmqueue_pcplist(zone,  order, gfp_flags, migratetype,
> > > > @@ -2658,7 +2654,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> > > >  		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> > > >  		zone_statistics(preferred_zone, zone, gfp_flags);
> > > >  	}
> > > > -	local_irq_restore(flags);
> > > > +	preempt_enable();
> > > >  	return page;
> > > >  }
> > > >
> > > With PREEMPT configured, preempt_enable() adds an entry point to schedule().
> > > Is that needed when we try to allocate a page?
> > >
> > 
> > Not necessarily but what are you proposing as an alternative? 
> 
> preempt_enable_no_resched() looks at first glance like a choice for us to 
> avoid flipping interrupts.
> 

Ok, I wasn't sure if you were proposing something more drastic. I can
make this change, although I have no reason to believe it will really matter.
The path should be short enough that it's unlikely a scheduler event
would ever occur at that point. Still, no harm in doing what you
suggest.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
  2017-01-06 10:15       ` Mel Gorman
@ 2017-01-09  3:14         ` Hillf Danton
  -1 siblings, 0 replies; 48+ messages in thread
From: Hillf Danton @ 2017-01-09  3:14 UTC (permalink / raw)
  To: 'Mel Gorman'
  Cc: 'Jesper Dangaard Brouer', 'Linux Kernel',
	'Linux-MM'

> Sent: Friday, January 06, 2017 6:16 PM Mel Gorman wrote: 
> 
> On Fri, Jan 06, 2017 at 11:26:46AM +0800, Hillf Danton wrote:
> >
> > On Wednesday, January 04, 2017 7:11 PM Mel Gorman wrote:
> > > @@ -2647,9 +2644,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> > >  	struct list_head *list;
> > >  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
> > >  	struct page *page;
> > > -	unsigned long flags;
> > >
> > > -	local_irq_save(flags);
> > > +	preempt_disable();
> > >  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > >  	list = &pcp->lists[migratetype];
> > >  	page = __rmqueue_pcplist(zone,  order, gfp_flags, migratetype,
> > > @@ -2658,7 +2654,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> > >  		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> > >  		zone_statistics(preferred_zone, zone, gfp_flags);
> > >  	}
> > > -	local_irq_restore(flags);
> > > +	preempt_enable();
> > >  	return page;
> > >  }
> > >
> > With PREEMPT configured, preempt_enable() adds an entry point to schedule().
> > Is that needed when we try to allocate a page?
> >
> 
> Not necessarily but what are you proposing as an alternative? 

preempt_enable_no_resched() looks at first glance like a choice for us to 
avoid flipping interrupts.
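
For reference, on a CONFIG_PREEMPT kernel of this era the two helpers differ
roughly as below (a simplified sketch of the include/linux/preempt.h macros,
not the exact source):

/* Simplified sketch of the CONFIG_PREEMPT variants, not the exact source */
#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)

#define preempt_enable_no_resched() \
do { \
	barrier(); \
	preempt_count_dec(); \
} while (0)

The only difference is the resched check on the way out, which is what opens
the scheduling point in question.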

> get_cpu()
> is not an alternative and the point is to avoid disabling interrupts
> which is a much more expensive operation.
> 
Agree with every word.

Hillf

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
  2017-01-06  3:26     ` Hillf Danton
@ 2017-01-06 10:15       ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-01-06 10:15 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Jesper Dangaard Brouer', 'Linux Kernel',
	'Linux-MM'

On Fri, Jan 06, 2017 at 11:26:46AM +0800, Hillf Danton wrote:
> 
> On Wednesday, January 04, 2017 7:11 PM Mel Gorman wrote: 
> > @@ -2647,9 +2644,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> >  	struct list_head *list;
> >  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
> >  	struct page *page;
> > -	unsigned long flags;
> > 
> > -	local_irq_save(flags);
> > +	preempt_disable();
> >  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> >  	list = &pcp->lists[migratetype];
> >  	page = __rmqueue_pcplist(zone,  order, gfp_flags, migratetype,
> > @@ -2658,7 +2654,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> >  		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> >  		zone_statistics(preferred_zone, zone, gfp_flags);
> >  	}
> > -	local_irq_restore(flags);
> > +	preempt_enable();
> >  	return page;
> >  }
> > 
> With PREEMPT configured, preempt_enable() adds an entry point to schedule().
> Is that needed when we try to allocate a page?
> 

Not necessarily but what are you proposing as an alternative? get_cpu()
is not an alternative and the point is to avoid disabling interrupts
which is a much more expensive operation.
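
For what it's worth, get_cpu()/put_cpu() boil down to the same preempt
machinery anyway, roughly (a sketch, not the exact include/linux/smp.h
definitions):

#define get_cpu()	({ preempt_disable(); smp_processor_id(); })
#define put_cpu()	preempt_enable()

so switching to get_cpu() would only add an unused CPU id lookup on top of
preempt_disable(); the saving in this patch comes purely from not touching
the IRQ flags.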

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
  2017-01-04 11:10   ` Mel Gorman
@ 2017-01-06  3:26     ` Hillf Danton
  -1 siblings, 0 replies; 48+ messages in thread
From: Hillf Danton @ 2017-01-06  3:26 UTC (permalink / raw)
  To: 'Mel Gorman', 'Jesper Dangaard Brouer'
  Cc: 'Linux Kernel', 'Linux-MM'


On Wednesday, January 04, 2017 7:11 PM Mel Gorman wrote: 
> @@ -2647,9 +2644,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  	struct list_head *list;
>  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
>  	struct page *page;
> -	unsigned long flags;
> 
> -	local_irq_save(flags);
> +	preempt_disable();
>  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  	list = &pcp->lists[migratetype];
>  	page = __rmqueue_pcplist(zone,  order, gfp_flags, migratetype,
> @@ -2658,7 +2654,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
>  		zone_statistics(preferred_zone, zone, gfp_flags);
>  	}
> -	local_irq_restore(flags);
> +	preempt_enable();
>  	return page;
>  }
> 
With PREEMPT configured, preempt_enable() adds an entry point to schedule().
Is that needed when we try to allocate a page?

Hillf

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
  2017-01-04 11:10   ` Mel Gorman
@ 2017-01-04 14:20     ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 48+ messages in thread
From: Jesper Dangaard Brouer @ 2017-01-04 14:20 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Kernel, Linux-MM, brouer

On Wed,  4 Jan 2017 11:10:48 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

> Many workloads that allocate pages are not handling an interrupt at the
> time. As allocation requests may be from IRQ context, it's necessary to
> disable/enable IRQs for every page allocation. This cost is the bulk
> of the free path but also a significant percentage of the allocation
> path.
> 
> This patch alters the locking and checks such that only irq-safe allocation
> requests use the per-cpu allocator. All others acquire the irq-safe
> zone->lock and allocate from the buddy allocator. It relies on disabling
> preemption to safely access the per-cpu structures. 

I love this idea and patch :-)

> It could be slightly
> modified to avoid soft IRQs using it but it's not clear it's worthwhile.

NICs usually refill their RX-ring from SoftIRQ context (NAPI).
Thus, we do want this optimization to work in softirq.
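
That matters because the patch gates the fast path on !in_interrupt(), and
in_interrupt() is also true while serving softirqs, roughly (a sketch of the
preempt-count helpers, not the exact source):

#define irq_count()	(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \
				| NMI_MASK))
#define in_interrupt()	(irq_count())

so as posted, a NAPI RX-ring refill would be pushed to the buddy path rather
than the per-cpu lists.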

 
> This modification may slow allocations from IRQ context slightly but the main
> gain from the per-cpu allocator is that it scales better for allocations
> from multiple contexts. There is an implicit assumption that intensive
> allocations from IRQ contexts on multiple CPUs from a single NUMA node are
> rare and that the vast majority of scaling issues are encountered in !IRQ
> contexts such as page faulting. 

IMHO, I agree with this implicit assumption.


> It's worth noting that this patch is not
> required for a bulk page allocator but it significantly reduces the overhead.
> 
> The following are results from a page allocator micro-benchmark. Only
> order-0 is interesting as higher orders do not use the per-cpu allocator

I'm seeing approx 34% reduction in an order-0 micro-benchmark! amazing! :-)
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/

>                                           4.10.0-rc2                 4.10.0-rc2
>                                              vanilla               irqsafe-v1r5
> Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
> Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
> Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
> Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
> Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
> Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
> Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
> Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
> Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
> Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
> Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
> Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
> Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
> Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
> Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
> Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
> Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
> Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
> Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
> Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
> Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
> Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
> Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
> Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
> Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
> Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
> Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
> Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
> Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
> Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
> Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
> Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
> Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
> Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
> Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
> Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
> Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
> Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
> Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
> Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
> Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
> Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
> Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
> Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
> Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)
> 
> This is the alloc, free and total overhead of allocating order-0 pages in
> batches of 1 page up to 16384 pages. Avoiding disabling/enabling overhead
> massively reduces overhead. Alloc overhead is roughly reduced by 14-20% in
> most cases. The free path is reduced by 26-46% and the total reduction
> is significant.
> 
[...]
> 
> Similarly, little benefit was seen on networking benchmarks both localhost
> and between physical server/clients where other costs dominate. It's
> possible that this will only be noticeable on very high speed networks.

The networking results highly depend on NIC drivers.  As you mention in
the cover-letter, (1) some drivers (e.g. mlx4) allocate high-order pages to
work around order-0 page allocation and DMA mapping being too slow (for
their HW use-case), (2) drivers that do use order-0 pages have
driver-specific page-recycling tricks (e.g. mlx5 and ixgbe).  The page_pool
targets making a more generic recycle mechanism for drivers to use.
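
To illustrate the kind of driver-private trick meant above, a minimal sketch
(purely illustrative; the names are made up and this is not the mlx5/ixgbe
code):

/* Purely illustrative per-RX-ring page cache; not real driver code. */
#include <linux/gfp.h>
#include <linux/mm.h>

struct rx_page_cache {
	struct page *pages[64];
	unsigned int count;
};

static struct page *rx_page_get(struct rx_page_cache *c, gfp_t gfp)
{
	if (c->count)
		return c->pages[--c->count];
	return alloc_page(gfp);		/* fall back to the page allocator */
}

static void rx_page_put(struct rx_page_cache *c, struct page *page)
{
	/* Recycle only if we are the last user and there is room. */
	if (c->count < ARRAY_SIZE(c->pages) && page_ref_count(page) == 1)
		c->pages[c->count++] = page;
	else
		put_page(page);
}

A generic page_pool would lift this sort of cache (plus the DMA-mapping
handling) out of each individual driver.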

I'm very excited to see improvements in this area! :-)))
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
  2017-01-04 11:10 [RFC PATCH 0/4] Fast noirq bulk page allocator Mel Gorman
@ 2017-01-04 11:10   ` Mel Gorman
  0 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-01-04 11:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Linux Kernel, Linux-MM, Mel Gorman

Many workloads that allocate pages are not handling an interrupt at the
time. As allocation requests may be from IRQ context, it's necessary to
disable/enable IRQs for every page allocation. This cost is the bulk
of the free path but also a significant percentage of the allocation
path.

This patch alters the locking and checks such that only irq-safe allocation
requests use the per-cpu allocator. All others acquire the irq-safe
zone->lock and allocate from the buddy allocator. It relies on disabling
preemption to safely access the per-cpu structures. It could be slightly
modified to avoid soft IRQs using it but it's not clear it's worthwhile.

This modification may slow allocations from IRQ context slightly but the main
gain from the per-cpu allocator is that it scales better for allocations
from multiple contexts. There is an implicit assumption that intensive
allocations from IRQ contexts on multiple CPUs from a single NUMA node are
rare and that the vast majority of scaling issues are encountered in !IRQ
contexts such as page faulting. It's worth noting that this patch is not
required for a bulk page allocator but it significantly reduces the overhead.

The following are results from a page allocator micro-benchmark. Only
order-0 is interesting as higher orders do not use the per-cpu allocator

                                          4.10.0-rc2                 4.10.0-rc2
                                             vanilla               irqsafe-v1r5
Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)

This is the alloc, free and total overhead of allocating order-0 pages in
batches of 1 page up to 16384 pages. Avoiding disabling/enabling overhead
massively reduces overhead. Alloc overhead is roughly reduced by 14-20% in
most cases. The free path is reduced by 26-46% and the total reduction
is significant.
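
For context, the per-batch figures come from a loop of roughly this shape (a
minimal sketch of such an order-0 timing loop, not the actual test harness
behind the table above):

/* Illustrative only: time one batch of order-0 allocations and frees. */
#include <linux/gfp.h>
#include <linux/ktime.h>
#include <linux/mm.h>
#include <linux/printk.h>

static void time_batch(unsigned int nr_pages)
{
	/* Caller passes batch sizes up to 16384, matching the table. */
	static struct page *pages[16384];
	u64 t0, t1, t2;
	unsigned int i;

	t0 = ktime_get_ns();
	for (i = 0; i < nr_pages; i++) {
		pages[i] = alloc_pages(GFP_KERNEL, 0);
		if (!pages[i])
			break;
	}
	t1 = ktime_get_ns();
	nr_pages = i;
	for (i = 0; i < nr_pages; i++)
		__free_pages(pages[i], 0);
	t2 = ktime_get_ns();

	pr_info("odr0-%u: alloc %llu ns, free %llu ns, total %llu ns\n",
		nr_pages,
		(unsigned long long)(t1 - t0),
		(unsigned long long)(t2 - t1),
		(unsigned long long)(t2 - t0));
}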

Many users require zeroing of pages from the page allocator, which is the
bulk of the allocation cost. Hence, the impact on a basic page faulting
benchmark is not that significant

                              4.10.0-rc2            4.10.0-rc2
                                 vanilla          irqsafe-v1r5
Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)

This is from aim9 and the most notable outcome is that fault variability
is reduced by the patch. The headline improvement is small as the overall
fault cost, zeroing, page table insertion etc dominate relative to
disabling/enabling IRQs in the per-cpu allocator.

Similarly, little benefit was seen on networking benchmarks both localhost
and between physical server/clients where other costs dominate. It's
possible that this will only be noticeable on very high speed networks.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 38 +++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a602b7f258d..01b09f9da288 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1087,10 +1087,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
-	unsigned long nr_scanned;
+	unsigned long nr_scanned, flags;
 	bool isolated_pageblocks;
 
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
@@ -1139,7 +1139,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			trace_mm_page_pcpu_drain(page, 0, mt);
 		} while (--count && --batch_free && !list_empty(list));
 	}
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void free_one_page(struct zone *zone,
@@ -1147,8 +1147,8 @@ static void free_one_page(struct zone *zone,
 				unsigned int order,
 				int migratetype)
 {
-	unsigned long nr_scanned;
-	spin_lock(&zone->lock);
+	unsigned long nr_scanned, flags;
+	spin_lock_irqsave(&zone->lock, flags);
 	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
 		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
@@ -1158,7 +1158,7 @@ static void free_one_page(struct zone *zone,
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
 	__free_one_page(page, pfn, zone, order, migratetype);
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
@@ -1236,7 +1236,6 @@ void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
 
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
-	unsigned long flags;
 	int migratetype;
 	unsigned long pfn = page_to_pfn(page);
 
@@ -1244,10 +1243,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
-	local_irq_save(flags);
-	__count_vm_events(PGFREE, 1 << order);
+	count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, pfn, order, migratetype);
-	local_irq_restore(flags);
 }
 
 static void __init __free_pages_boot_core(struct page *page, unsigned int order)
@@ -2219,8 +2216,9 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			int migratetype, bool cold)
 {
 	int i, alloced = 0;
+	unsigned long flags;
 
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
@@ -2256,7 +2254,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	 * pages added to the pcp list.
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 	return alloced;
 }
 
@@ -2444,7 +2442,6 @@ void free_hot_cold_page(struct page *page, bool cold)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
-	unsigned long flags;
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
@@ -2453,8 +2450,8 @@ void free_hot_cold_page(struct page *page, bool cold)
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_pcppage_migratetype(page, migratetype);
-	local_irq_save(flags);
-	__count_vm_event(PGFREE);
+	preempt_disable();
+	count_vm_event(PGFREE);
 
 	/*
 	 * We only track unmovable, reclaimable and movable on pcp lists.
@@ -2484,7 +2481,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 	}
 
 out:
-	local_irq_restore(flags);
+	preempt_enable();
 }
 
 /*
@@ -2647,9 +2644,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct list_head *list;
 	bool cold = ((gfp_flags & __GFP_COLD) != 0);
 	struct page *page;
-	unsigned long flags;
 
-	local_irq_save(flags);
+	preempt_disable();
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  order, gfp_flags, migratetype,
@@ -2658,7 +2654,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 		zone_statistics(preferred_zone, zone, gfp_flags);
 	}
-	local_irq_restore(flags);
+	preempt_enable();
 	return page;
 }
 
@@ -2674,7 +2670,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 	unsigned long flags;
 	struct page *page;
 
-	if (likely(order == 0))
+	if (likely(order == 0) && !in_interrupt())
 		return rmqueue_pcplist(preferred_zone, zone, order,
 				gfp_flags, migratetype);
 
@@ -3919,7 +3915,7 @@ EXPORT_SYMBOL(get_zeroed_page);
 void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
-		if (order == 0)
+		if (order == 0 && !in_interrupt())
 			free_hot_cold_page(page, false);
 		else
 			__free_pages_ok(page, order);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2017-01-29  4:00 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-09 16:35 [RFC PATCH 0/4] Fast noirq bulk page allocator v2r7 Mel Gorman
2017-01-09 16:35 ` Mel Gorman
2017-01-09 16:35 ` [PATCH 1/4] mm, page_alloc: Split buffered_rmqueue Mel Gorman
2017-01-09 16:35   ` Mel Gorman
2017-01-11 12:31   ` Jesper Dangaard Brouer
2017-01-11 12:31     ` Jesper Dangaard Brouer
2017-01-12  3:09   ` Hillf Danton
2017-01-12  3:09     ` Hillf Danton
2017-01-09 16:35 ` [PATCH 2/4] mm, page_alloc: Split alloc_pages_nodemask Mel Gorman
2017-01-09 16:35   ` Mel Gorman
2017-01-11 12:32   ` Jesper Dangaard Brouer
2017-01-11 12:32     ` Jesper Dangaard Brouer
2017-01-12  3:11   ` Hillf Danton
2017-01-12  3:11     ` Hillf Danton
2017-01-09 16:35 ` [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests Mel Gorman
2017-01-09 16:35   ` Mel Gorman
2017-01-11 12:44   ` Jesper Dangaard Brouer
2017-01-11 12:44     ` Jesper Dangaard Brouer
2017-01-11 13:27     ` Jesper Dangaard Brouer
2017-01-11 13:27       ` Jesper Dangaard Brouer
2017-01-12 10:47       ` Mel Gorman
2017-01-12 10:47         ` Mel Gorman
2017-01-09 16:35 ` [PATCH 4/4] mm, page_alloc: Add a bulk page allocator Mel Gorman
2017-01-09 16:35   ` Mel Gorman
2017-01-10  4:00   ` Hillf Danton
2017-01-10  4:00     ` Hillf Danton
2017-01-10  8:34     ` Mel Gorman
2017-01-10  8:34       ` Mel Gorman
2017-01-16 14:25   ` Jesper Dangaard Brouer
2017-01-16 14:25     ` Jesper Dangaard Brouer
2017-01-16 15:01     ` Mel Gorman
2017-01-16 15:01       ` Mel Gorman
2017-01-29  4:00 ` [RFC PATCH 0/4] Fast noirq bulk page allocator v2r7 Andy Lutomirski
2017-01-29  4:00   ` Andy Lutomirski
  -- strict thread matches above, loose matches on Subject: below --
2017-01-04 11:10 [RFC PATCH 0/4] Fast noirq bulk page allocator Mel Gorman
2017-01-04 11:10 ` [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests Mel Gorman
2017-01-04 11:10   ` Mel Gorman
2017-01-04 14:20   ` Jesper Dangaard Brouer
2017-01-04 14:20     ` Jesper Dangaard Brouer
2017-01-06  3:26   ` Hillf Danton
2017-01-06  3:26     ` Hillf Danton
2017-01-06 10:15     ` Mel Gorman
2017-01-06 10:15       ` Mel Gorman
2017-01-09  3:14       ` Hillf Danton
2017-01-09  3:14         ` Hillf Danton
2017-01-09  9:48         ` Mel Gorman
2017-01-09  9:48           ` Mel Gorman
2017-01-09  9:55           ` Hillf Danton
2017-01-09  9:55             ` Hillf Danton
